comparison 1.0.0/readme/bettercallsal.md @ 0:801b85b03a17 draft default tip

planemo upload
author galaxytrakr
date Thu, 28 May 2026 20:31:42 +0000
-1:000000000000 0:801b85b03a17
1 # bettercallsal
2
3 `bettercallsal` is an automated workflow that assigns *Salmonella* serotypes based on the [NCBI Pathogens Database](https://www.ncbi.nlm.nih.gov/pathogens). It uses `MASH` to reduce the search space, followed by additional genome filtering with `sourmash`. It then performs genome-based alignment with `kma`, followed by count generation using `salmon`. This workflow is especially useful when a sample contains a mixture of multiple serovars.
4
5 \
6  
7
8 <!-- TOC -->
9
10 - [Minimum Requirements](#minimum-requirements)
11 - [CFSAN GalaxyTrakr](#cfsan-galaxytrakr)
12 - [Usage and Examples](#usage-and-examples)
13 - [Database](#database)
14 - [Input](#input)
15 - [Output](#output)
16 - [Computational resources](#computational-resources)
17 - [Runtime profiles](#runtime-profiles)
18 - [your_institution.config](#your_institutionconfig)
19 - [Cloud computing](#cloud-computing)
20 - [Example data](#example-data)
21 - [ONT long reads](#ont-long-reads)
22 - [Using sourmash](#using-sourmash)
23 - [bettercallsal CLI Help](#bettercallsal-cli-help)
24 - [bettercallsal_lr CLI Help](#bettercallsal_lr-cli-help)
25
26 <!-- /TOC -->
27
28 \
29 &nbsp;
30
31 ## Minimum Requirements
32
33 1. [Nextflow version 24.04.3](https://github.com/nextflow-io/nextflow/releases/download/v24.04.3/nextflow).
34 - Make the `nextflow` binary executable (`chmod 755 nextflow`) and also make sure that it is made available in your `$PATH`.
35 - If your existing `JAVA` install does not support the newest **Nextflow** version, you can try **Amazon**'s `JAVA` (OpenJDK): [Corretto](https://docs.aws.amazon.com/corretto/latest/corretto-21-ug/downloads-list.html).
36 2. One of `micromamba` (version `1.5.9`), `docker`, or `singularity` installed and made available in your `$PATH`.
37 - Running the workflow via `micromamba` software provisioning is **preferred** as it does not require any `sudo` or `admin` privileges or any other configurations with respect to the various container providers.
38 - To install `micromamba` for your system type, please follow these [installation steps](https://mamba.readthedocs.io/en/latest/installation/micromamba-installation.html#linux-and-macos) and make sure that the `micromamba` binary is made available in your `$PATH`.
39 - Just the `curl` step is sufficient to download the binary as far as running the workflows is concerned.
40 - Once you have finished the installation, **it is important that you downgrade `micromamba` to version `1.5.9`**.
41 - First check your version and, if it is anything other than `1.5.9`, do the downgrade.
42
43 ```bash
44 micromamba --version
45 micromamba self-update --version 1.5.9 -c conda-forge
46 ```
47
48 3. A minimum of 10 CPU cores and about 16 GB of memory for the main workflow steps. More memory may be required if your **FASTQ** files are large.
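
The `micromamba` version check from requirement 2 can be scripted. The following is a minimal sketch, assuming that `micromamba --version` prints a bare version string such as `1.5.9`. The helper names `needs_downgrade` and `ensure_micromamba_version` are hypothetical and are not part of the workflow.

```bash
#!/usr/bin/env bash
# Version of micromamba that this README asks for.
REQUIRED_MICROMAMBA="1.5.9"

# Hypothetical helper: succeeds (exit 0) when the given version string
# differs from the required one, i.e. when a version switch is needed.
needs_downgrade() {
    [ "$1" != "$REQUIRED_MICROMAMBA" ]
}

# Hypothetical helper: check the installed version and switch only if needed.
ensure_micromamba_version() {
    if ! command -v micromamba >/dev/null 2>&1; then
        echo "micromamba not found in PATH" >&2
        return 1
    fi
    local current
    # Assumption: the output is just the version number.
    current="$(micromamba --version)"
    if needs_downgrade "$current"; then
        echo "Found micromamba ${current}; switching to ${REQUIRED_MICROMAMBA}"
        micromamba self-update --version "$REQUIRED_MICROMAMBA" -c conda-forge
    else
        echo "micromamba ${REQUIRED_MICROMAMBA} is already installed"
    fi
}
```

Calling `ensure_micromamba_version` once after installation should be enough; it leaves an install that already reports `1.5.9` untouched.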
49
50 \
51 &nbsp;
52
53 ## CFSAN GalaxyTrakr
54
55 The `bettercallsal` pipeline is also available for use on the [Galaxy instance supported by CFSAN, FDA](https://galaxytrakr.org/). If you wish to run the analysis using **Galaxy**, please register for an account, after which you can run the workflow using some test data by following the instructions
56 [from this PDF](https://research.foodsafetyrisk.org/bettercallsal/galaxytrakr/bettercallsal_on_cfsan_galaxytrakr.pdf).
57
58 Please note that the pipeline on [CFSAN GalaxyTrakr](https://galaxytrakr.org) in most cases may be a version older than the one on **GitHub** due to testing prioritization.
59
60 \
61 &nbsp;
62
63 ## Usage and Examples
64
65 Clone or download this repository and then call `cpipes`.
66
67 ```bash
68 cpipes --pipeline bettercallsal [options]
69 ```
70
71 Alternatively, you can use `nextflow` to directly pull and run the pipeline.
72
73 ```bash
74 nextflow pull CFSAN-Biostatistics/bettercallsal
75 nextflow list
76 nextflow info CFSAN-Biostatistics/bettercallsal
77 nextflow run CFSAN-Biostatistics/bettercallsal --pipeline bettercallsal_db --help
78 nextflow run CFSAN-Biostatistics/bettercallsal --pipeline bettercallsal --help
79 ```
80
81 \
82 &nbsp;
83
84 **Example**: Run the default `bettercallsal` pipeline in single-end mode.
85
86 ```bash
87 cd /data/scratch/$USER
88 mkdir nf-cpipes
89 cd nf-cpipes
90 cpipes \
91 --pipeline bettercallsal \
92 --input /path/to/illumina/fastq/dir \
93 --output /path/to/output \
94 --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db/PDG000000002.3082
95 ```
96
97 \
98 &nbsp;
99
100 **Example**: Run the `bettercallsal` pipeline in paired-end mode. In this mode, the `R1` and `R2` files are concatenated. We have found that concatenating the reads yields better calling rates. Please refer to the **Methods** and **Results** sections of our [paper](https://www.frontiersin.org/articles/10.3389/fmicb.2023.1200983/full) for more information. Users can still choose to use `bbmerge.sh` by adding the following options on the command line: `--bbmerge_run true --bcs_concat_pe false`.
101
102 ```bash
103 cd /data/scratch/$USER
104 mkdir nf-cpipes
105 cd nf-cpipes
106 cpipes \
107 --pipeline bettercallsal \
108 --input /path/to/illumina/fastq/dir \
109 --output /path/to/output \
110 --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db/PDG000000002.3082 \
111 --fq_single_end false \
112 --fq_suffix '_R1_001.fastq.gz'
113 ```
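
Before a paired-end run, it can help to confirm that every `R1` file has a mate in the input folder. The sketch below is a hypothetical pre-flight check, not part of the workflow. It assumes the mate's name is obtained by replacing `_R1_` with `_R2_` in the file name, which matches the `--fq_suffix '_R1_001.fastq.gz'` convention used above but may not hold for other naming schemes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the name of every R2 file that is expected
# (one per R1 file ending in <suffix>) but missing from <dir>.
missing_mates() {
    local dir="$1" suffix="$2" r1 base mate
    for r1 in "$dir"/*"$suffix"; do
        [ -e "$r1" ] || continue          # the glob matched nothing
        base="$(basename "$r1")"
        # Assumption: the mate differs only in the _R1_ / _R2_ token.
        mate="${base/_R1_/_R2_}"
        [ -e "$dir/$mate" ] || printf '%s\n' "$mate"
    done
}
```

For example, `missing_mates /path/to/illumina/fastq/dir '_R1_001.fastq.gz'` prints nothing when all pairs are complete.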
114
115 \
116 &nbsp;
117
118 ### Database
119
120 ---
121
122 A successful run of the workflow requires certain database flat files specific to the workflow.
123
124 Please refer to `bettercallsal_db` [README](./bettercallsal_db.md) if you would like to run the workflow on the latest version of the **PDG** release.
125
126 &nbsp;
127
128 ### Input
129
130 ---
131
132 The input to the workflow is a folder containing compressed (`.gz`) FASTQ files. Please note that sample grouping happens automatically based on the FASTQ file name. If, for example, a single sample is sequenced across multiple sequencing lanes, you can choose to group those FASTQ files into one sample by using the `--fq_filename_delim` and `--fq_filename_delim_idx` options. By default, `--fq_filename_delim` is set to `_` (underscore) and `--fq_filename_delim_idx` is set to 1.
133
134 For example, if the directory contains FASTQ files as shown below:
135
136 - KB-01_apple_L001_R1.fastq.gz
137 - KB-01_apple_L001_R2.fastq.gz
138 - KB-01_apple_L002_R1.fastq.gz
139 - KB-01_apple_L002_R2.fastq.gz
140 - KB-02_mango_L001_R1.fastq.gz
141 - KB-02_mango_L001_R2.fastq.gz
142 - KB-02_mango_L002_R1.fastq.gz
143 - KB-02_mango_L002_R2.fastq.gz
144
145 Then, to create 2 sample groups, `apple` and `mango`, we split the file name by the delimiter (underscore in this case, which is the default) and group by the first 2 words (`--fq_filename_delim_idx 2`).
146
147 It goes without saying that all the FASTQ files should have uniform naming patterns so that the `--fq_filename_delim` and `--fq_filename_delim_idx` options do not have any adverse effect on collecting and creating the sample metadata sheet.
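
To preview how file names would fall into sample groups, the delimiter-and-index rule described above can be emulated with standard shell tools. This is only a sketch: `sample_group` is a hypothetical helper, and it assumes the grouping key is simply the first N delimiter-separated fields of the file name. The workflow's own implementation may differ in detail.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the first <idx> fields of a file name split on
# <delim>, mimicking --fq_filename_delim and --fq_filename_delim_idx.
sample_group() {
    local fname="$1" delim="$2" idx="$3"
    printf '%s\n' "$fname" | cut -d "$delim" -f "1-${idx}"
}

files=(
    KB-01_apple_L001_R1.fastq.gz
    KB-01_apple_L002_R1.fastq.gz
    KB-02_mango_L001_R1.fastq.gz
    KB-02_mango_L002_R1.fastq.gz
)

# List the distinct groups produced with delimiter "_" and index 2.
for f in "${files[@]}"; do
    sample_group "$f" "_" 2
done | sort -u
```

With the file names listed above, this prints two group keys, `KB-01_apple` and `KB-02_mango`, so both lanes of each sample end up in the same group.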
148
149 \
150 &nbsp;
151
152 ### ONT long reads
153
154 ---
155
156 Beginning with `v1.0.0`, `bettercallsal` supports **ONT** long reads. Use `--pipeline bettercallsal_lr` to activate this feature. The `bettercallsal_lr` variant of the pipeline uses `filtlong` to perform quality filtering of **ONT** long reads and `flye` to perform long read assembly. **FastQC** is run before and after quality filtering so that read quality can be inspected in the **MultiQC** report.
157
158 \
159 &nbsp;
160
161 ### Output
162
163 ---
164
165 All the outputs for each step are stored inside the folder specified with the `--output` option. The `multiqc_report.html` file inside the `bettercallsal-multiqc` folder contains a brief consolidated report and can be opened in any browser on your local workstation.
166
167 \
168 &nbsp;
169
170 ### Computational resources
171
172 ---
173
174 The `bettercallsal` workflow requires a minimum of 16 GB of memory to finish successfully. By default, `bettercallsal` uses 10 CPU cores where possible. You can change this behavior and adjust the number of CPU cores with the `--max_cpus` option.
175
176 \
177 &nbsp;
178
179 Example:
180
181 ```bash
182 cpipes \
183 --pipeline bettercallsal \
184 --input /path/to/bettercallsal_sim_reads \
185 --output /path/to/bettercallsal_sim_reads_output \
186 --bcs_root_dbdir /path/to/PDG000000002.3082 \
187 --kmaalign_ignorequals \
188 --max_cpus 5 \
189 -profile stdkondagac \
190 -resume
191 ```
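
On machines with fewer cores than the default of 10, a value for `--max_cpus` can be derived from what is available. The helper below is a hypothetical sketch, not part of the workflow. It returns the smaller of the available core count and a cap, where the cap defaults to 10 to mirror the workflow default.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print min(<available>, <cap>); cap defaults to 10.
pick_max_cpus() {
    local available="$1" cap="${2:-10}"
    if [ "$available" -lt "$cap" ]; then
        echo "$available"
    else
        echo "$cap"
    fi
}
```

Assuming `nproc` from GNU coreutils is present, the result can be passed along as `--max_cpus "$(pick_max_cpus "$(nproc)")"`.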
192
193 \
194 &nbsp;
195
196 ### Runtime profiles
197
198 ---
199
200 You can use different runtime profiles to suit your specific compute environment, i.e., you can run the workflow locally on your machine or in a grid computing infrastructure.
201
202 \
203 &nbsp;
204
205 Example:
206
207 ```bash
208 cd /data/scratch/$USER
209 mkdir nf-cpipes
210 cd nf-cpipes
211 cpipes \
212 --pipeline bettercallsal \
213 --input /path/to/fastq_pass_dir \
214 --output /path/to/where/output/should/go \
215 -profile your_institution
216 ```
217
218 The above command runs the pipeline and stores the output at the location given by the `--output` flag. The **NEXTFLOW** reports are always stored in the current working directory from which `cpipes` is run. For example, for the above command, a directory called `CPIPES-bettercallsal` would hold all the **NEXTFLOW** related logs, reports and trace files.
219
220 \
221 &nbsp;
222
223 ### `your_institution.config`
224
225 ---
226
227 In the above example, the runtime profile is set to `your_institution`. For this to work, add the following lines at the end of the [`computeinfra.config`](../conf/computeinfra.config) file, which should be located inside the `conf` folder. For example, if your institution uses **SGE** or **UNIVA** for grid computing instead of **SLURM** and has a job queue named `normal.q`, then add these lines:
228
229 \
230 &nbsp;
231
232 ```groovy
233 your_institution {
234 process.executor = 'sge'
235 process.queue = 'normal.q'
236 singularity.enabled = false
237 singularity.autoMounts = true
238 docker.enabled = false
239 params.enable_conda = true
240 conda.enabled = true
241 conda.useMicromamba = true
242 params.enable_module = false
243 }
244 ```
245
246 In the above example, all the software provisioning choices are disabled except `conda`. You can also choose to remove the `process.queue` line altogether, and the `bettercallsal` workflow will request the appropriate memory and number of CPU cores automatically, ranging from 1 CPU core, 1 GB and 1 hour for job completion up to 10 CPU cores, 1 TB and 120 hours for job completion.
247
248 \
249 &nbsp;
250
251 ### Cloud computing
252
253 ---
254
255 You can run the workflow in the cloud (this works only with a proper setup of AWS resources). Add new runtime profiles with the required parameters per the [Nextflow docs](https://www.nextflow.io/docs/latest/executor.html):
256
257 \
258 &nbsp;
259
260 Example:
261
262 ```groovy
263 my_aws_batch {
264 executor = 'awsbatch'
265 queue = 'my-batch-queue'
266 aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
267 aws.batch.region = 'us-east-1'
268 singularity.enabled = false
269 singularity.autoMounts = true
270 docker.enabled = true
271 params.conda_enabled = false
272 params.enable_module = false
273 }
274 ```
275
276 \
277 &nbsp;
278
279 ### Example data
280
281 ---
282
283 After you make sure that you have all the [minimum requirements](#minimum-requirements) to run the workflow, you can try the `bettercallsal` pipeline on some simulated reads. The following input dataset contains simulated reads for `Montevideo` and `I 4,[5],12:i:-` in roughly equal proportions.
284
285 - Download simulated reads: [S3](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/bettercallsal_sim_reads.tar.bz2) (~ 3 GB).
286 - Download pre-formatted test database: [S3](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/PDG000000002.2491.test-db.tar.bz2) (~ 75 MB). This test database works only with the simulated reads.
287 - Download pre-formatted full database (**Optional**): If you would like to do a complete run with your own **FASTQ** datasets, you can either create your own [database](./bettercallsal_db.md) or use [PDG000000002.3082](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/PDG000000002.3082.tar.gz) version of the database (~ 48 GB).
288 - After a successful run of the workflow, your **MultiQC** report should look something like [this](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/bettercallsal_sim_reads_mqc.html).
289 - It is always best practice to use absolute UNIX paths and the real destinations of symbolic links during pipeline execution. For example, find out the real path(s) of your absolute UNIX path(s) and use those for the `--input` and `--output` options of the pipeline.
290
291 ```bash
292 realpath /hpc/scratch/user/input
293 ```
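
The resolved paths can also be captured in shell variables and reused on the command line. The following is a minimal sketch, assuming GNU `realpath` is available. `resolve_dir` is a hypothetical helper, and the temporary directory exists only to demonstrate that a symbolic link is resolved to its real destination.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the absolute, symlink-free form of a path and
# fail if the path does not exist (-e is a GNU realpath option).
resolve_dir() {
    realpath -e -- "$1"
}

# Demo with a throwaway directory and a symbolic link pointing at it.
workdir="$(mktemp -d)"
mkdir "$workdir/real_input"
ln -s "$workdir/real_input" "$workdir/input_link"

# Prints a path ending in /real_input, not /input_link.
resolve_dir "$workdir/input_link"

rm -rf "$workdir"
```

In a real run, the same helper could supply the option values, for example `--input "$(resolve_dir /hpc/scratch/user/input)"`.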
294
295 Now run the workflow, ignoring quality values since the base qualities are simulated:
296
297 \
298 &nbsp;
299
300 ```bash
301 cpipes \
302 --pipeline bettercallsal \
303 --input /path/to/bettercallsal_sim_reads \
304 --output /path/to/bettercallsal_sim_reads_output \
305 --bcs_root_dbdir /path/to/PDG000000002.3082 \
306 --kmaalign_ignorequals \
307 -profile stdkondagac \
308 -resume
309 ```
310
311 Please note that the run time profile `stdkondagac` will run jobs locally using `micromamba` for software provisioning. The first time you run the command, a new folder called `kondagac_cache` will be created and subsequent runs should use this `conda` cache.
312
313 \
314 &nbsp;
315
316 ## Using `sourmash`
317
318 Beginning with `v0.3.0` of the `bettercallsal` workflow, `sourmash` sketching is used to further narrow down possible serotype hits. It is **ON** by default. This enables the generation of an **ANI Containment** matrix for **Samples** vs **Genomes**. There may be multiple hits for the same serotype in the final **MultiQC** report, as multiple genome accessions can belong to a single serotype.
319
320 You can turn this feature **OFF** with the `--sourmashsketch_run false` option.
321
322 \
323 &nbsp;
324
325 ## `bettercallsal` CLI Help
326
327 ```text
328 [Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal --help
329
330 N E X T F L O W ~ version 24.04.3
331
332 Launching `~/apps/bettercallsal/1.0.0/cpipes` [loving_curry] DSL2 - revision: d9b4be42be
333
334 ================================================================================
335              (o)
336   ___  _ __   _  _ __    ___  ___
337  / __|| '_ \ | || '_ \  / _ \/ __|
338 | (__ | |_) || || |_) ||  __/\__ \
339  \___|| .__/ |_|| .__/  \___||___/
340       | |       | |
341       |_|       |_|
342 --------------------------------------------------------------------------------
343 A collection of modular pipelines at CFSAN, FDA.
344 --------------------------------------------------------------------------------
345 Name : bettercallsal
346 Author : Kranti Konganti
347 Version : 0.9.0
348 Center : CFSAN, FDA.
349 ================================================================================
350
351
352 --------------------------------------------------------------------------------
353 Show configurable CLI options for each tool within bettercallsal
354 --------------------------------------------------------------------------------
355 Ex: cpipes --pipeline bettercallsal --help
356 Ex: cpipes --pipeline bettercallsal --help fastp
357 Ex: cpipes --pipeline bettercallsal --help fastp,mash
358 --------------------------------------------------------------------------------
359 --help bbmerge : Show bbmerge.sh CLI options
360 --help fastp : Show fastp CLI options
361 --help mash : Show mash `screen` CLI options
362 --help tuspy : Show get_top_unique_mash_hit_genomes.py CLI
363 options
364 --help sourmashsketch : Show sourmash `sketch` CLI options
365 --help sourmashgather : Show sourmash `gather` CLI options
366 --help sourmashsearch : Show sourmash `search` CLI options
367 --help sfhpy : Show sourmash_filter_hits.py CLI options
368 --help kmaindex : Show kma `index` CLI options
369 --help kmaalign : Show kma CLI options
370 --help megahit : Show megahit CLI options
371 --help mlst : Show mlst CLI options
372 --help abricate : Show abricate CLI options
373 --help salmon : Show salmon `index` CLI options
374 --help gsrpy : Show gen_salmon_res_table.py CLI options
375
376 ```
377
378 \
379 &nbsp;
380
381 ## `bettercallsal_lr` CLI Help
382
383 ```text
384 [Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal_lr --help
385
386 N E X T F L O W ~ version 24.04.3
387
388 Launching `~/apps/bettercallsal/1.0.0/cpipes` [friendly_sax] DSL2 - revision: d9b4be42be
389
390 ================================================================================
391              (o)
392   ___  _ __   _  _ __    ___  ___
393  / __|| '_ \ | || '_ \  / _ \/ __|
394 | (__ | |_) || || |_) ||  __/\__ \
395  \___|| .__/ |_|| .__/  \___||___/
396       | |       | |
397       |_|       |_|
398 --------------------------------------------------------------------------------
399 A collection of modular pipelines at CFSAN, FDA.
400 --------------------------------------------------------------------------------
401 Name : bettercallsal
402 Author : Kranti Konganti
403 Version : 0.9.0
404 Center : CFSAN, FDA.
405 ================================================================================
406
407
408 --------------------------------------------------------------------------------
409 Show configurable CLI options for each tool within bettercallsal_lr
410 --------------------------------------------------------------------------------
411 Ex: cpipes --pipeline bettercallsal_lr --help
412 Ex: cpipes --pipeline bettercallsal_lr --help fastp
413 Ex: cpipes --pipeline bettercallsal_lr --help fastp,mash
414 --------------------------------------------------------------------------------
415 --help filtlong : Show filtlong CLI options
416 --help mash : Show mash `screen` CLI options
417 --help tuspy : Show get_top_unique_mash_hit_genomes.py CLI
418 options
419 --help sourmashsketch : Show sourmash `sketch` CLI options
420 --help sourmashgather : Show sourmash `gather` CLI options
421 --help sourmashsearch : Show sourmash `search` CLI options
422 --help sfhpy : Show sourmash_filter_hits.py CLI options
423 --help flye : Show flye CLI options
424 --help mlst : Show mlst CLI options
425 --help abricate : Show abricate CLI options
426 --help gsrpy : Show gen_salmon_res_table.py CLI options
427 ```