author kkonganti
date Mon, 15 Jul 2024 12:01:00 -0400
# bettercallsal

`bettercallsal` is an automated workflow that assigns Salmonella serotypes based on the [NCBI Pathogens Database](https://www.ncbi.nlm.nih.gov/pathogens). It uses `MASH` to reduce the search space, followed by additional genome filtering with `sourmash`. It then performs genome-based alignment with `kma`, followed by count generation with `salmon`. This workflow is especially useful when a sample contains a mixture of multiple serovars.

\
&nbsp;

<!-- TOC -->

- [Minimum Requirements](#minimum-requirements)
- [CFSAN GalaxyTrakr](#cfsan-galaxytrakr)
- [Usage and Examples](#usage-and-examples)
    - [Database](#database)
    - [Input](#input)
    - [Output](#output)
    - [Computational resources](#computational-resources)
    - [Runtime profiles](#runtime-profiles)
    - [your_institution.config](#your_institutionconfig)
    - [Cloud computing](#cloud-computing)
    - [Example data](#example-data)
- [Using sourmash](#using-sourmash)
- [bettercallsal CLI Help](#bettercallsal-cli-help)

<!-- /TOC -->

\
&nbsp;

## Minimum Requirements

1. [Nextflow version 23.04.3](https://github.com/nextflow-io/nextflow/releases/download/v23.04.3/nextflow).
    - Make the `nextflow` binary executable (`chmod 755 nextflow`) and make sure that it is available in your `$PATH`.
    - If your existing `JAVA` install does not support the newest **Nextflow** version, you can try **Amazon**'s `JAVA` (OpenJDK): [Corretto](https://corretto.aws/downloads/latest/amazon-corretto-17-x64-linux-jdk.tar.gz).
2. One of `micromamba` (version `1.0.0`), `docker`, or `singularity` installed and available in your `$PATH`.
    - Running the workflow via `micromamba` software provisioning is **preferred** as it does not require any `sudo` or `admin` privileges or any other configuration with respect to the various container providers.
    - To install `micromamba` for your system type, please follow these [installation steps](https://mamba.readthedocs.io/en/latest/micromamba-installation.html#manual-installation) and make sure that the `micromamba` binary is available in your `$PATH`.
        - Just the `curl` step is sufficient to download the binary as far as running the workflows is concerned.
    - Once you have finished the installation, **it is important that you downgrade `micromamba` to version `1.0.0`**.

    ```bash
    micromamba self-update --version 1.0.0
    ```

3. A minimum of 10 CPU cores and about 16 GB of memory for the main workflow steps. More memory may be required if your **FASTQ** files are large.

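Before a first run, a quick pre-flight check along the following lines can confirm which of the tools listed above are on your `$PATH`. This is only a sketch: the helper name `have_cmd` is hypothetical and is not part of `bettercallsal` or `cpipes`, and the tool list simply mirrors the requirements above.

```bash
# Hypothetical pre-flight helper; not part of bettercallsal or cpipes.

# Return 0 if the given command can be found on $PATH, non-zero otherwise.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report the status of each tool named in the requirements above.
for tool in nextflow micromamba docker singularity; do
    if have_cmd "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done
```

Only one of `micromamba`, `docker` or `singularity` needs to be reported as found, in addition to `nextflow`.
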
\
&nbsp;

## CFSAN GalaxyTrakr

The `bettercallsal` pipeline is also available on the [Galaxy instance supported by CFSAN, FDA](https://galaxytrakr.org/). If you wish to run the analysis using **Galaxy**, please register for an account; you can then run the workflow on some test data by following the instructions in [this PDF](https://research.foodsafetyrisk.org/bettercallsal/galaxytrakr/bettercallsal_on_cfsan_galaxytrakr.pdf).

Please note that the pipeline on [CFSAN GalaxyTrakr](https://galaxytrakr.org) may, in most cases, be an older version than the one on **GitHub** because of testing prioritization.

\
&nbsp;

## Usage and Examples

Clone or download this repository and then call `cpipes`.

```bash
cpipes --pipeline bettercallsal [options]
```

Alternatively, you can use `nextflow` to directly pull and run the pipeline.

```bash
nextflow pull CFSAN-Biostatistics/bettercallsal
nextflow list
nextflow info CFSAN-Biostatistics/bettercallsal
nextflow run CFSAN-Biostatistics/bettercallsal --pipeline bettercallsal_db --help
nextflow run CFSAN-Biostatistics/bettercallsal --pipeline bettercallsal --help
```

\
&nbsp;

**Example**: Run the default `bettercallsal` pipeline in single-end mode.

```bash
cd /data/scratch/$USER
mkdir nf-cpipes
cd nf-cpipes
cpipes \
    --pipeline bettercallsal \
    --input /path/to/illumina/fastq/dir \
    --output /path/to/output \
    --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876
```

\
&nbsp;

**Example**: Run the `bettercallsal` pipeline in paired-end mode. In this mode, the `R1` and `R2` files are concatenated. We have found that concatenated reads yield better calling rates. Please refer to the **Methods** and **Results** sections of our [paper](https://www.frontiersin.org/articles/10.3389/fmicb.2023.1200983/full) for more information. Users can still choose to use `bbmerge.sh` by adding the following options on the command line: `--bbmerge_run true --bcs_concat_pe false`.

```bash
cd /data/scratch/$USER
mkdir nf-cpipes
cd nf-cpipes
cpipes \
    --pipeline bettercallsal \
    --input /path/to/illumina/fastq/dir \
    --output /path/to/output \
    --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876 \
    --fq_single_end false \
    --fq_suffix '_R1_001.fastq.gz'
```

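Because paired-end mode works with both mates of every sample, it can help to confirm that each `R1` file has an `R2` partner before launching. The sketch below is illustrative only: `r2_for` and `missing_mates` are hypothetical helpers, and they assume the `_R1_`/`_R2_` naming implied by the `--fq_suffix '_R1_001.fastq.gz'` example above. They say nothing about how the pipeline itself pairs files.

```bash
# Hypothetical helper: derive the expected R2 file name from an R1 file name.
# Assumes the name contains a single "_R1_" token (e.g. *_R1_001.fastq.gz).
r2_for() {
    printf '%s\n' "$1" | sed 's/_R1_/_R2_/'
}

# Hypothetical helper: list R1 files in a directory whose R2 mate is missing.
missing_mates() {
    for r1 in "$1"/*_R1_001.fastq.gz; do
        [ -e "$r1" ] || continue    # skip the unexpanded glob when nothing matches
        mate="$1/$(r2_for "$(basename "$r1")")"
        [ -e "$mate" ] || printf '%s\n' "$r1"
    done
}

# Usage (the path is a placeholder):
# missing_mates /path/to/illumina/fastq/dir
```
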
\
&nbsp;

### Database

---

A successful run of the workflow requires certain database flat files that are specific to the workflow.

Please refer to the `bettercallsal_db` [README](./bettercallsal_db.md) if you would like to run the workflow on the latest version of the **PDG** release.

&nbsp;

### Input

---

The input to the workflow is a folder containing compressed (`.gz`) FASTQ files. Please note that sample grouping happens automatically, based on the FASTQ file names. If, for example, a single sample is sequenced across multiple sequencing lanes, you can group those FASTQ files into one sample by using the `--fq_filename_delim` and `--fq_filename_delim_idx` options. By default, `--fq_filename_delim` is set to `_` (underscore) and `--fq_filename_delim_idx` is set to 1.

For example, if the directory contains FASTQ files as shown below:

- KB-01_apple_L001_R1.fastq.gz
- KB-01_apple_L001_R2.fastq.gz
- KB-01_apple_L002_R1.fastq.gz
- KB-01_apple_L002_R2.fastq.gz
- KB-02_mango_L001_R1.fastq.gz
- KB-02_mango_L001_R2.fastq.gz
- KB-02_mango_L002_R1.fastq.gz
- KB-02_mango_L002_R2.fastq.gz

Then, to create 2 sample groups, `apple` and `mango`, we split the file name by the delimiter (underscore in this case, which is the default) and group by the first 2 words (`--fq_filename_delim_idx 2`), so that the files sharing `KB-01_apple` form one group and those sharing `KB-02_mango` form the other.

It goes without saying that all the FASTQ files should follow a uniform naming pattern so that the `--fq_filename_delim` and `--fq_filename_delim_idx` options do not have any adverse effect on collecting the files and creating the sample metadata sheet.

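The grouping rule described above can be previewed on the command line before a run. The sketch below only mimics the idea of keeping the first N delimiter-separated fields with `cut`; the function name `sample_group` is hypothetical and this is not the pipeline's own implementation.

```bash
# Illustration only: keep the first N delimiter-separated fields of a file name.
# Usage: sample_group <file name> <delimiter> <index>
sample_group() {
    printf '%s\n' "$1" | cut -d "$2" -f "1-$3"
}

# Preview the groups for the example files with delimiter "_" and index 2.
for fq in KB-01_apple_L001_R1.fastq.gz KB-01_apple_L002_R1.fastq.gz \
          KB-02_mango_L001_R1.fastq.gz KB-02_mango_L002_R1.fastq.gz; do
    sample_group "$fq" _ 2
done | sort -u
# prints:
# KB-01_apple
# KB-02_mango
```

With an index of 1 (the default), the same files would instead be grouped by `KB-01` and `KB-02`.
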
\
&nbsp;

### Output

---

All outputs from each step are stored inside the folder specified with the `--output` option. The `multiqc_report.html` file inside the `bettercallsal-multiqc` folder contains a brief consolidated report and can be opened in any browser on your local workstation.

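As a convenience, the report location described above can be checked from a script once a run has finished. This is a minimal sketch: `multiqc_report` is a hypothetical helper, and it assumes only the folder and file names stated in this section.

```bash
# Hypothetical helper: print the path of the MultiQC report under an output
# folder, or return non-zero if the report is not there (yet).
multiqc_report() {
    report="$1/bettercallsal-multiqc/multiqc_report.html"
    [ -f "$report" ] && printf '%s\n' "$report"
}

# Usage (the path is a placeholder):
# multiqc_report /path/to/output || echo "report not found" >&2
```
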
\
&nbsp;

### Computational resources

---

The `bettercallsal` workflow requires a minimum of 16 GB of memory to finish successfully. By default, `bettercallsal` uses 10 CPU cores where possible. You can change this behavior and adjust the number of CPU cores with the `--max_cpus` option.

\
&nbsp;

Example:

```bash
cpipes \
    --pipeline bettercallsal \
    --input /path/to/bettercallsal_sim_reads \
    --output /path/to/bettercallsal_sim_reads_output \
    --bcs_root_dbdir /path/to/PDG000000002.2876 \
    --kmaalign_ignorequals \
    --max_cpus 5 \
    -profile stdkondagac \
    -resume
```

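If you are unsure what to pass to `--max_cpus`, one simple approach is to use the number of cores available on the machine, capped at the workflow default of 10. The helper below is a hypothetical sketch of that rule, not something shipped with `cpipes`; it assumes `nproc` from GNU coreutils is available for the usage line.

```bash
# Hypothetical helper: choose a value for --max_cpus that never exceeds the
# cores available and is capped at the workflow's default of 10.
pick_max_cpus() {
    avail="$1"
    if [ "$avail" -lt 10 ]; then
        echo "$avail"
    else
        echo 10
    fi
}

# Usage: nproc reports the number of cores available to the current process.
# cpipes --pipeline bettercallsal ... --max_cpus "$(pick_max_cpus "$(nproc)")"
```
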
\
&nbsp;

### Runtime profiles

---

You can use different runtime profiles to suit your specific compute environment, i.e., you can run the workflow locally on your machine or on a grid computing infrastructure.

\
&nbsp;

Example:

```bash
cd /data/scratch/$USER
mkdir nf-cpipes
cd nf-cpipes
cpipes \
    --pipeline bettercallsal \
    --input /path/to/fastq_pass_dir \
    --output /path/to/where/output/should/go \
    -profile your_institution
```

The above command runs the pipeline and stores the output at the location given by the `--output` flag. The **NEXTFLOW** reports are always stored in the current working directory from which `cpipes` is run. For example, for the above command, a directory called `CPIPES-bettercallsal` would hold all the **NEXTFLOW**-related logs, reports and trace files.

\
&nbsp;

### `your_institution.config`

---

In the above example, the runtime profile is specified as `your_institution`. For this to work, add the following lines at the end of the [`computeinfra.config`](../conf/computeinfra.config) file, which is located inside the `conf` folder. For example, if your institution uses **SGE** or **UNIVA** for grid computing instead of **SLURM** and has a job queue named `normal.q`, then add these lines:

\
&nbsp;

```groovy
your_institution {
    process.executor = 'sge'
    process.queue = 'normal.q'
    singularity.enabled = false
    singularity.autoMounts = true
    docker.enabled = false
    params.enable_conda = true
    conda.enabled = true
    conda.useMicromamba = true
    params.enable_module = false
}
```

In the above example, all software provisioning choices are disabled except `conda`. You can also remove the `process.queue` line altogether, and the `bettercallsal` workflow will automatically request the appropriate memory and number of CPU cores, ranging from 1 CPU core, 1 GB and 1 hour for job completion up to 10 CPU cores, 1 TB and 120 hours for job completion.

\
&nbsp;

### Cloud computing

---

You can run the workflow in the cloud (this works only with properly set up AWS resources). Add new runtime profiles with the required parameters per the [Nextflow docs](https://www.nextflow.io/docs/latest/executor.html):

\
&nbsp;

Example:

```groovy
my_aws_batch {
    executor = 'awsbatch'
    queue = 'my-batch-queue'
    aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
    aws.batch.region = 'us-east-1'
    singularity.enabled = false
    singularity.autoMounts = true
    docker.enabled = true
    params.enable_conda = false
    params.enable_module = false
}
```

\
&nbsp;

### Example data

---

After you make sure that you have all the [minimum requirements](#minimum-requirements) to run the workflow, you can try the `bettercallsal` pipeline on some simulated reads. The following input dataset contains simulated reads for `Montevideo` and `I 4,[5],12:i:-` in roughly equal proportions.

- Download simulated reads: [S3](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/bettercallsal_sim_reads.tar.bz2) (~ 3 GB).
- Download pre-formatted test database: [S3](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/PDG000000002.2491.test-db.tar.bz2) (~ 75 MB). This test database works only with the simulated reads.
- Download pre-formatted full database (**Optional**): If you would like to do a complete run with your own **FASTQ** datasets, you can either create your own [database](./bettercallsal_db.md) or use the [PDG000000002.2727](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/PDG000000002.2727.tar.bz2) version of the database (~ 42 GB).
- After a successful run of the workflow, your **MultiQC** report should look something like [this](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/bettercallsal_sim_reads_mqc.html).
- It is always best practice to use absolute UNIX paths and the real destinations of symbolic links during pipeline execution. For example, find the real path(s) of your absolute UNIX path(s) and use them for the `--input` and `--output` options of the pipeline.

```bash
realpath /hpc/scratch/user/input
```

Now run the workflow, ignoring base quality values because the reads are simulated:

\
&nbsp;

```bash
cpipes \
    --pipeline bettercallsal \
    --input /path/to/bettercallsal_sim_reads \
    --output /path/to/bettercallsal_sim_reads_output \
    --bcs_root_dbdir /path/to/PDG000000002.2876 \
    --kmaalign_ignorequals \
    -profile stdkondagac \
    -resume
```

Please note that the runtime profile `stdkondagac` runs jobs locally using `micromamba` for software provisioning. The first time you run the command, a new folder called `kondagac_cache` will be created, and subsequent runs should use this `conda` cache.

\
&nbsp;


## Using `sourmash`

Beginning with `v0.3.0` of the `bettercallsal` workflow, `sourmash` sketching is used to further narrow down possible serotype hits. It is **ON** by default. This enables the generation of an **ANI Containment** matrix for **Samples** vs **Genomes**. There may be multiple hits for the same serotype in the final **MultiQC** report, as multiple genome accessions can belong to a single serotype.

You can turn this feature **OFF** with the `--sourmashsketch_run false` option.

\
&nbsp;

## `bettercallsal` CLI Help

```text
[Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal --help
N E X T F L O W ~ version 23.04.3
Launching `./bettercallsal/cpipes` [awesome_chandrasekhar] DSL2 - revision: 8da4e11078
================================================================================
             (o)
  ___  _ __   _  _ __    ___  ___
 / __|| '_ \ | || '_ \  / _ \/ __|
| (__ | |_) || || |_) ||  __/\__ \
 \___|| .__/ |_|| .__/  \___||___/
      | |       | |
      |_|       |_|
--------------------------------------------------------------------------------
A collection of modular pipelines at CFSAN, FDA.
--------------------------------------------------------------------------------
Name    : bettercallsal
Author  : Kranti Konganti
Version : 0.7.0
Center  : CFSAN, FDA.
================================================================================


--------------------------------------------------------------------------------
Show configurable CLI options for each tool within bettercallsal
--------------------------------------------------------------------------------
Ex: cpipes --pipeline bettercallsal --help
Ex: cpipes --pipeline bettercallsal --help fastp
Ex: cpipes --pipeline bettercallsal --help fastp,mash
--------------------------------------------------------------------------------
--help bbmerge        : Show bbmerge.sh CLI options
--help fastp          : Show fastp CLI options
--help mash           : Show mash `screen` CLI options
--help tuspy          : Show get_top_unique_mash_hit_genomes.py CLI
                        options
--help sourmashsketch : Show sourmash `sketch` CLI options
--help sourmashgather : Show sourmash `gather` CLI options
--help sourmashsearch : Show sourmash `search` CLI options
--help sfhpy          : Show sourmash_filter_hits.py CLI options
--help kmaindex       : Show kma `index` CLI options
--help kmaalign       : Show kma CLI options
--help megahit        : Show megahit CLI options
--help mlst           : Show mlst CLI options
--help abricate       : Show abricate CLI options
--help salmon         : Show salmon `index` CLI options
--help gsrpy          : Show gen_salmon_res_table.py CLI options

```