comparison 0.4.2/readme/centriflaken.md @ 0:082e0091e813 draft default tip

planemo upload
author galaxytrakr
date Fri, 29 May 2026 13:27:47 +0000
1 # centriflaken
2
3 `centriflaken` is an automated precision metagenomics workflow for assembly and _in silico_ analyses of food-borne pathogens. Although primarily fine-tuned for detecting and classifying Shiga toxin-producing **_Escherichia coli_** (**STEC**), `centriflaken` can also be used to analyze other food-borne pathogens such as **_Salmonella enterica_**. It takes as input a UNIX path to a directory of FASTQ files, generates metagenome-assembled genomes (MAGs), and performs _in silico_ analyses for STEC as described in [Maguire et al. 2021](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0245172).
4
5 `centriflaken` works on both **Illumina** short reads and **Oxford Nanopore** long reads.
6
7 It is written in **Nextflow** and is part of the modular data analysis pipelines at **HFP**.
8
9 \
10  
11
12 <!-- TOC -->
13
14 - [Minimum Requirements](#minimum-requirements)
15 - [HFP GalaxyTrakr](#hfp-galaxytrakr)
16 - [Usage and Examples](#usage-and-examples)
17 - [Databases](#databases)
18 - [Input](#input)
19 - [Illumina short reads](#illumina-short-reads)
20 - [Output](#output)
21 - [Computational resources](#computational-resources)
22 - [Runtime profiles](#runtime-profiles)
23 - [your_institution.config](#your_institutionconfig)
24 - [Test run](#test-run)
25 - [centriflaken CLI Help](#centriflaken-cli-help)
26 - [centriflaken_hy CLI Help](#centriflaken_hy-cli-help)
27
28 <!-- /TOC -->
29
30 \
31 &nbsp;
32
33 ## Minimum Requirements
34
35 1. [Nextflow version 24.10.4](https://github.com/nextflow-io/nextflow/releases/download/v24.10.4/nextflow).
36 - Make the `nextflow` binary executable (`chmod 755 nextflow`) and also make sure that it is made available in your `$PATH`.
37 - If your existing `JAVA` install does not support the newest **Nextflow** version, you can try **Amazon**'s `JAVA` (OpenJDK): [Corretto](https://docs.aws.amazon.com/corretto/latest/corretto-21-ug/downloads-list.html).
38 2. One of `micromamba` (version `1.5.9`), `docker` or `singularity`, installed and made available in your `$PATH`.
39 - Running the workflow via `micromamba` software provisioning is **preferred** as it does not require any `sudo` or `admin` privileges or any other configurations with respect to the various container providers.
40 - To install `micromamba` for your system type, please follow these [installation steps](https://mamba.readthedocs.io/en/latest/installation/micromamba-installation.html#linux-and-macos) and make sure that the `micromamba` binary is made available in your `$PATH`.
41 - Just the `curl` step is sufficient to download the binary as far as running the workflows is concerned.
42 - Once you have finished the installation, **it is important that you downgrade `micromamba` to version `1.5.9`**.
43 - First check your installed version and, if it is anything other than `1.5.9`, run the downgrade.
44
45 ```bash
46 micromamba --version
47 micromamba self-update --version 1.5.9 -c conda-forge
48 ```
49
50 3. A minimum of 10 CPU cores and about 60 GB of memory for the main workflow steps. More memory may be required if your **FASTQ** files are large.
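
The version check described in requirement 2 can be scripted so that the downgrade only runs when it is needed. This is a minimal sketch, not part of the workflow: the helper name `needs_downgrade` is hypothetical, and it assumes `micromamba --version` prints only the version string. The `self-update` command is the one shown above.

```bash
# Sketch: switch micromamba to 1.5.9 only when the installed version differs.
REQUIRED_VERSION="1.5.9"

# Hypothetical helper: returns 0 (true) when the two version strings differ.
needs_downgrade() {
    [ "$1" != "$2" ]
}

if command -v micromamba >/dev/null 2>&1; then
    # Assumption: the output is the bare version string, e.g. 1.5.9
    current="$(micromamba --version)"
    if needs_downgrade "$current" "$REQUIRED_VERSION"; then
        echo "Found micromamba $current; switching to $REQUIRED_VERSION"
        micromamba self-update --version "$REQUIRED_VERSION" -c conda-forge
    else
        echo "micromamba $current is already the required version"
    fi
else
    echo "micromamba not found in PATH; install it first"
fi
```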
51
52 \
53 &nbsp;
54
55 ## HFP GalaxyTrakr
56
57 The `centriflaken` pipeline is also available for use on the [Galaxy instance supported by HFP, FDA](https://galaxytrakr.org/). If you wish to run the analysis using **Galaxy**, please register for an account, after which [you can run the workflow using this protocol](https://www.protocols.io/view/centriflaken-an-automated-data-analysis-pipeline-f-kxygxzdbwv8j/v5).
58
59 Please note that the pipeline on [HFP GalaxyTrakr](https://galaxytrakr.org) may, in most cases, be a version older than the one on **GitHub** due to testing prioritization.
60
61 \
62 &nbsp;
63
64 ## Usage and Examples
65
66 Clone or download this repository and then call `cpipes`.
67
68 ```bash
69 cpipes --pipeline centriflaken [options]
70 ```
71
72 Alternatively, you can use `nextflow` to directly pull and run the pipeline.
73
74 ```bash
75 nextflow pull CFSAN-Biostatistics/centriflaken
76 nextflow list
77 nextflow info CFSAN-Biostatistics/centriflaken
78 nextflow run CFSAN-Biostatistics/centriflaken --pipeline centriflaken --help
79 nextflow run CFSAN-Biostatistics/centriflaken --pipeline centriflaken_hy --help
80 ```
81
82 \
83 &nbsp;
84
85 ### Databases
86
87 ---
88
89 A successful run of the workflow requires all of the following databases:
90
91 - `kraken2`, `centrifuge`, `serotypefinder` and `abricate`: [Download](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/centriflaken/centriflaken_dbs.tar.bz2).
92
93 Once you have downloaded the databases, uncompress them and set the **UNIX** paths in the configuration files as follows:
94
95 - [Line no. 4](../workflows/conf/centriflaken.config#L4): `centrifuge_x = /path/to/centriflaken_dbs/centrifuge/ab`. The `ab` prefix is necessary.
96 - [Line no. 11](../workflows/conf/centriflaken_hy.config#L11): `centrifuge_x = /path/to/centriflaken_dbs/centrifuge/ab`. The `ab` prefix is necessary.
97 - [Line no. 10](../workflows/conf/centriflaken.config#L10): `kraken2_db = /path/to/centriflaken_dbs/kraken2`.
98 - [Line no. 17](../workflows/conf/centriflaken_hy.config#L17): `kraken2_db = /path/to/centriflaken_dbs/kraken2`.
99 - [Line no. 36](../workflows/conf/centriflaken.config#L36): `serotypefinder_db = /path/to/centriflaken_dbs/serotypefinder`.
100 - [Line no. 64](../workflows/conf/centriflaken_hy.config#L64): `serotypefinder_db = /path/to/centriflaken_dbs/serotypefinder`.
101 - [Line no. 53](../workflows/conf/centriflaken.config#L53): `abricate_datadir = /path/to/centriflaken_dbs/abricate`.
102 - [Line no. 81](../workflows/conf/centriflaken_hy.config#L81): `abricate_datadir = /path/to/centriflaken_dbs/abricate`.
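
Before editing the configuration files, it can help to confirm that the uncompressed archive has the layout the paths above expect. The following is a minimal sketch under assumptions: `DB_ROOT` and `check_centriflaken_dbs` are hypothetical names, and the four folder names are taken from the paths listed above.

```bash
# Sketch: verify that the uncompressed database folder has the expected layout.
# DB_ROOT is a placeholder; point it at your own centriflaken_dbs folder.
DB_ROOT="${DB_ROOT:-/path/to/centriflaken_dbs}"

# Hypothetical helper: prints one line per database folder and
# returns non-zero if any of them is missing.
check_centriflaken_dbs() {
    root="$1"
    missing=0
    for db in centrifuge kraken2 serotypefinder abricate; do
        if [ -d "$root/$db" ]; then
            echo "OK       $root/$db"
        else
            echo "MISSING  $root/$db"
            missing=1
        fi
    done
    return "$missing"
}

check_centriflaken_dbs "$DB_ROOT" || echo "Fix the paths above before editing the config files."
```

Remember that `centrifuge_x` additionally needs the `ab` prefix appended to the `centrifuge` folder path.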
103
104 \
105 &nbsp;
106
107 ### Input
108
109 ---
110
111 The input to the workflow is a folder containing compressed (`.gz`) FASTQ files of long reads or short reads. Please note that samples are grouped automatically based on the FASTQ file names. If, for example, a single sample is sequenced across multiple sequencing lanes, you can group those FASTQ files into one sample by using the `--fq_filename_delim` and `--fq_filename_delim_idx` options. By default, `--fq_filename_delim` is set to `_` (underscore) and `--fq_filename_delim_idx` is set to 1.
112
113 For example, if the directory contains FASTQ files as shown below:
114
115 - KB-01_apple_L001_R1.fastq.gz
116 - KB-01_apple_L001_R2.fastq.gz
117 - KB-01_apple_L002_R1.fastq.gz
118 - KB-01_apple_L002_R2.fastq.gz
119 - KB-02_mango_L001_R1.fastq.gz
120 - KB-02_mango_L001_R2.fastq.gz
121 - KB-02_mango_L002_R1.fastq.gz
122 - KB-02_mango_L002_R2.fastq.gz
123
124 Then, to create 2 sample groups, `apple` and `mango`, we split the file name by the delimiter (underscore in this case, which is the default) and group by the first 2 words (`--fq_filename_delim_idx 2`).
125
126 It goes without saying that all the FASTQ files should follow a uniform naming pattern so that the `--fq_filename_delim` and `--fq_filename_delim_idx` options do not adversely affect the collection of files and the creation of the sample metadata sheet.
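
The grouping rule can be previewed on a list of file names before running the pipeline. This sketch only imitates the rule with `cut`; it is not the pipeline's own code, the helper `sample_name` is hypothetical, and it assumes the kept fields are re-joined with the same delimiter.

```bash
# Sketch: derive a group key by keeping the first N delimiter-separated fields.
sample_name() {
    # $1 = file name, $2 = delimiter, $3 = number of leading fields to keep
    printf '%s\n' "$1" | cut -d "$2" -f "1-$3"
}

# With delimiter "_" and index 2, both lanes of a sample collapse to one key.
for fq in KB-01_apple_L001_R1.fastq.gz KB-01_apple_L002_R1.fastq.gz \
          KB-02_mango_L001_R1.fastq.gz KB-02_mango_L002_R1.fastq.gz; do
    sample_name "$fq" "_" 2
done | sort -u
```

For the files listed above, this prints two keys, `KB-01_apple` and `KB-02_mango`, which correspond to the `apple` and `mango` groups.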
127
128 \
129 &nbsp;
130
131 ### Illumina short reads
132
133 ---
134
135 `centriflaken` was primarily developed for **ONT** long reads but also supports **Illumina** short reads. Use `--pipeline centriflaken_hy` instead of `--pipeline centriflaken` to activate this feature. The `centriflaken_hy` variant of the pipeline uses `megahit` instead of `flye` to perform short read assembly. No other change is needed from the user apart from using the `--pipeline centriflaken_hy` parameter for Illumina short reads.
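
As an illustration, a short-read run on files named like the examples in the [Input](#input) section might look as follows. This is a sketch under assumptions: the paths and the `your_institution` profile are placeholders, and the `run` wrapper is a hypothetical dry-run helper that only prints the command unless `DRY_RUN=0` is set. The suffix options are included because the `centriflaken_hy` defaults are `_R1_001.fastq.gz` and `_R2_001.fastq.gz`, which would not match files ending in `_R1.fastq.gz` and `_R2.fastq.gz`.

```bash
# Sketch: a centriflaken_hy invocation for paired-end Illumina reads.
# DRY_RUN=1 (the default here) only prints the command; set DRY_RUN=0 to execute it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run cpipes \
    --pipeline centriflaken_hy \
    --input /path/to/illumina_fastq_dir \
    --output /path/to/centriflaken_hy_output \
    --fq_suffix _R1.fastq.gz \
    --fq2_suffix _R2.fastq.gz \
    --fq_filename_delim_idx 2 \
    -profile your_institution
```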
136
137 \
138 &nbsp;
139
140 ### Output
141
142 ---
143
144 All the outputs for each step are stored inside the folder specified with the `--output` option. The `multiqc_report.html` file inside the `centriflaken-multiqc` folder contains a brief consolidated report and can be opened in any browser on your local workstation.
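
To locate the report after a run, something like the following may be enough. It is a sketch: `find_multiqc_report` is a hypothetical helper and the output path is a placeholder. It searches with `find` rather than assuming how deep the `centriflaken-multiqc` folder sits below the output folder.

```bash
# Sketch: print the path of the consolidated MultiQC report, if one exists.
find_multiqc_report() {
    # $1 = folder that was passed to --output
    find "$1" -type f -path '*/centriflaken-multiqc/multiqc_report.html' 2>/dev/null
}

find_multiqc_report /path/to/where/output/should/go
```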
145
146 \
147 &nbsp;
148
149 ### Computational resources
150
151 ---
152
153 The workflows `centriflaken` and `centriflaken_hy` require at least 60 GB of memory to finish successfully.
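
A quick way to compare a machine against the stated minimums (10 CPU cores, 60 GB of memory) is sketched below. It is a coarse check under assumptions: the memory lookup reads `/proc/meminfo` and therefore only works on Linux, the helper name `meets_minimum` is hypothetical, and the reported total is usually slightly below the nominal installed memory.

```bash
# Sketch: warn when the local machine is below the stated minimums.
MIN_CPUS=10
MIN_MEM_GB=60

# Hypothetical helper: returns 0 when the available value meets the minimum.
meets_minimum() {
    [ "$1" -ge "$2" ]
}

cpus="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 0)"
# MemTotal is reported in kB on Linux; elsewhere this falls back to 0.
mem_kb="$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null || true)"
mem_gb=$(( ${mem_kb:-0} / 1024 / 1024 ))

meets_minimum "$cpus" "$MIN_CPUS" || echo "Warning: $cpus CPU cores found, $MIN_CPUS recommended"
meets_minimum "$mem_gb" "$MIN_MEM_GB" || echo "Warning: ${mem_gb} GB memory found, $MIN_MEM_GB GB recommended"
```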
154
155 \
156 &nbsp;
157
158 ### Runtime profiles
159
160 ---
161
162 You can use different run time profiles that suit your specific compute environment, i.e., you can run the workflow locally on your machine or on a grid computing infrastructure.
163
164 \
165 &nbsp;
166
167 Example:
168
169 ```bash
170 cd /data/scratch/$USER
171 mkdir nf-cpipes
172 cd nf-cpipes
173 cpipes \
174 --pipeline centriflaken \
175 --input /path/to/fastq_pass_dir \
176 --output /path/to/where/output/should/go \
177 -profile your_institution
178 ```
179
180 The above command would run the pipeline and store the output at the location given by the `--output` flag. The **NEXTFLOW** reports are always stored in the current working directory from which `cpipes` is run. For example, for the above command, a directory called `CPIPES-centriflaken` would hold all the **NEXTFLOW** related logs, reports and trace files.
181
182 \
183 &nbsp;
184
185 ### `your_institution.config`
186
187 ---
188
189 The above example sets the run time profile to `your_institution`. For this to work, add the following lines at the end of the [`computeinfra.config`](../conf/computeinfra.config) file, which is located inside the `conf` folder. For example, if your institution uses **SGE** or **UNIVA** for grid computing instead of **SLURM** and has a job queue named `normal.q`, then add these lines:
190
191 \
192 &nbsp;
193
194 ```groovy
195 your_institution {
196 process.executor = 'sge'
197 process.queue = 'normal.q'
198 singularity.enabled = false
199 singularity.autoMounts = true
200 docker.enabled = false
201 params.enable_conda = true
202 conda.enabled = true
203 conda.useMicromamba = true
204 params.enable_module = false
205 }
206 ```
207
208 In the above example, all the software provisioning choices are disabled except `conda`. You can also remove the `process.queue` line altogether, and the `centriflaken` workflow will automatically request the appropriate memory and number of CPU cores, ranging from 1 CPU core, 1 GB and 1 hour up to 10 CPU cores, 1 TB and 120 hours for job completion.
209
210 \
211 &nbsp;
212
213 ### Cloud computing
214
215 ---
216
217 You can run the workflow in the cloud (this works only with a proper setup of AWS resources). Add new run time profiles with the required parameters per the [Nextflow docs](https://www.nextflow.io/docs/latest/executor.html):
218
219 \
220 &nbsp;
221
222 Example:
223
224 ```groovy
225 my_aws_batch {
226 process.executor = 'awsbatch'
227 process.queue = 'my-batch-queue'
228 aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
229 aws.batch.region = 'us-east-1'
230 singularity.enabled = false
231 singularity.autoMounts = true
232 docker.enabled = true
233 params.enable_conda = false
234 params.enable_module = false
235 }
236 ```
237
238 \
239 &nbsp;
240
241 ### Test run
242
243 ---
244
245 After you make sure that you have all the [minimum requirements](#minimum-requirements) to run the workflow, you can try the `centriflaken` pipeline on some subsampled reads belonging to the NCBI BioProject `PRJNA639799` as discussed in [Maguire _et al_](https://pmc.ncbi.nlm.nih.gov/articles/PMC10500926/).
246
247 - Please note that the input reads are subsampled to validate the software install.
248 - Download them [from S3](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/centriflaken/macguire_et_al_subsampled_reads.tar.bz2) (~ 20 GB).
249
250 | Samples | Biosample | SRA accession | Flowcell |
251 |:---------------------------------------------------------------|:-------------|:--------------|:---------|
252 | FAL00958 | SAMN46790801 | SRR32346290 | FAL00958 |
253 | FAL01198 | SAMN46793213 | SRR32346289 | FAL01198 |
254 | FAL01556 | SAMN46793220 | SRR32346278 | FAL01556 |
255 | ZymoBIOMICS Microbial Community DNA Standard R1 | SAMN46793392 | SRR32381322 | FAL11413 |
256 | ZymoBIOMICS Microbial Community DNA Standard R2 | SAMN46793393 | SRR32381321 | FAL01565 |
257 | ZymoBIOMICS Microbial Community Standard II - log distribution | SAMN46793397 | SRR32381320 | FAL01514 |
258
259 - Download pre-formatted databases (**MANDATORY**) [from S3](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/centriflaken/centriflaken_dbs.tar.bz2) (~ 47 GB).
260 - One of the assembly jobs is expected to fail to assemble the reads; the pipeline will ignore the failed assembly and run to completion.
261 - After a successful download, untar the archive and change the paths to the databases in **BOTH** the [long reads conf file](../workflows/conf/centriflaken.config) and [short reads conf file](../workflows/conf/centriflaken_hy.config) as described in the [Databases](#databases) section.
262 - The following values should point to the UNIX paths of the downloaded databases.
263
264 ```groovy
265 centrifuge_x = '/path/to/centrifuge/ab' // The /ab suffix SHOULD NOT change. Only /path/to/centrifuge changes to your specific UNIX path.
266 kraken2_db = '/path/to/kraken2'
267 serotypefinder_db = '/path/to/serotypefinder'
268 abricate_datadir = '/path/to/abricate'
269 amrfinderplus_db = '/hpc/db/amrfinderplus/3.10.24/latest' // IGNORE THIS PATH SINCE AMRFINDERPLUS SHOULD NOT BE RUN.
270 ```
271
272 - It is always a best practice to use absolute UNIX paths and the real destinations of symbolic links during pipeline execution. For example, find out the real path(s) of your absolute UNIX path(s) and use those for the `--input` and `--output` options of the pipeline.
273
274 ```bash
275 realpath /hpc/scratch/user/input/srr
276 ```
277
278 - Now run the workflow by ignoring quality values since these are simulated base qualities:
279
280 ```bash
281 cpipes \
282 --pipeline centriflaken \
283 --input /path/to/macguire_et_al_subsampled_reads \
284 --output /path/to/centriflaken_test_output \
285 -profile stdkondagac \
286 -resume
287 ```
288
289 - After a successful run of the workflow, your **MultiQC** report should look something like [this](https://cfsan-pub-xfer.s3.us-east-1.amazonaws.com/Kranti.Konganti/centriflaken/macquire_et_al_test_report.html).
290
291 Please note that the run time profile `stdkondagac` will run jobs locally using `micromamba` for software provisioning. The first time you run the command, a new folder called `kondagac_cache` will be created and subsequent runs should use this `conda` cache.
292
293 \
294 &nbsp;
295
296 ## `centriflaken` CLI Help
297
298 ```text
299 cpipes --pipeline centriflaken --help
300
301 N E X T F L O W ~ version 24.10.4
302
303 Launching `/home/user/centriflaken/cpipes` [sleepy_pauling] DSL2 - revision: 55d6f63710
304
305 ================================================================================
306 (o)
307 ___ _ __ _ _ __ ___ ___
308 / __|| '_ \ | || '_ \ / _ \/ __|
309 | (__ | |_) || || |_) || __/\__ \
310 \___|| .__/ |_|| .__/ \___||___/
311 | | | |
312 |_| |_|
313 --------------------------------------------------------------------------------
314 A collection of modular pipelines at CFSAN, FDA.
315 --------------------------------------------------------------------------------
316 Name : CPIPES
317 Author : Kranti.Konganti@fda.hhs.gov
318 Version : 0.4.1
319 Center : CFSAN, FDA.
320 ================================================================================
321
322 Workflow : centriflaken
323
324 Author : Kranti.Konganti@fda.hhs.gov
325
326 Version : 0.4.2
327
328
329 Usage : cpipes --pipeline centriflaken [options]
330
331
332 Required :
333
334 --input : Absolute path to directory containing FASTQ
335 files. The directory should contain only
336 FASTQ files as all the files within the
337 mentioned directory will be read. Ex: --
338 input /path/to/fastq_pass
339
340 --output : Absolute path to directory where all the
341 pipeline outputs should be stored. Ex: --
342 output /path/to/output
343
344 Other options :
345
346 --metadata : Absolute path to metadata CSV file
347 containing five mandatory columns: sample,
348 fq1,fq2,strandedness,single_end. The fq1
349 and fq2 columns contain absolute paths to
350 the FASTQ files. This option can be used in
351 place of --input option. This is rare. Ex: --
352 metadata samplesheet.csv
353
354 --fq_suffix : The suffix of FASTQ files (Unpaired reads
355 or R1 reads or Long reads) if an input
356 directory is mentioned via --input option.
357 Default: .fastq.gz
358
359 --fq2_suffix : The suffix of FASTQ files (Paired-end reads
360 or R2 reads) if an input directory is
361 mentioned via --input option. Default:
362 false
363
364 --fq_filter_by_len : Remove FASTQ reads that are less than this
365 many bases. Default: 4000
366
367 --fq_strandedness : The strandedness of the sequencing run.
368 This is mostly needed if your sequencing
369 run is RNA-SEQ. For most of the other runs,
370 it is probably safe to use unstranded for
371 the option. Default: unstranded
372
373 --fq_single_end : SINGLE-END information will be auto-
374 detected but this option forces PAIRED-END
375 FASTQ files to be treated as SINGLE-END so
376 only read 1 information is included in auto-
377 generated samplesheet. Default: false
378
379 --fq_filename_delim : Delimiter by which the file name is split
380 to obtain sample name. Default: _
381
382 --fq_filename_delim_idx : After splitting FASTQ file name by using
383 the --fq_filename_delim option, all
384 elements before this index (1-based) will
385 be joined to create final sample name.
386 Default: 1
387
388 --kraken2_db : Absolute path to kraken database. Default: /
389 hpc/db/kraken2/standard-210914
390
391 --kraken2_confidence : Confidence score threshold which must be
392 between 0 and 1. Default: 0.0
393
394 --kraken2_quick : Quick operation (use first hit or hits).
395 Default: false
396
397 --kraken2_use_mpa_style : Report output like Kraken 1's kraken-mpa-
398 report. Default: false
399
400 --kraken2_minimum_base_quality : Minimum base quality used in classification
401 which is only effective with FASTQ input.
402 Default: 0
403
404 --kraken2_report_zero_counts : Report counts for ALL taxa, even if counts
405 are zero. Default: false
406
407 --kraken2_report_minmizer_data : Report minimizer and distinct minimizer
408 count information in addition to normal
409 Kraken report. Default: false
410
411 --kraken2_use_names : Print scientific names instead of just
412 taxids. Default: true
413
414 --kraken2_extract_bug : Extract the reads or contigs beloging to
415 this bug. Default: Escherichia coli
416
417 --centrifuge_x : Absolute path to centrifuge database.
418 Default: /hpc/db/centrifuge/2022-04-12/ab
419
420 --centrifuge_save_unaligned : Save SINGLE-END reads that did not align.
421 For PAIRED-END reads, save read pairs that
422 did not align concordantly. Default: false
423
424 --centrifuge_save_aligned : Save SINGLE-END reads that aligned. For
425 PAIRED-END reads, save read pairs that
426 aligned concordantly. Default: false
427
428 --centrifuge_out_fmt_sam : Centrifuge output should be in SAM. Default:
429 false
430
431 --centrifuge_extract_bug : Extract this bug from centrifuge results.
432 Default: Escherichia coli
433
434 --centrifuge_ignore_quals : Treat all quality values as 30 on Phred
435 scale. Default: false
436
437 --flye_pacbio_raw : Input FASTQ reads are PacBio regular CLR
438 reads (<20% error) Defaut: false
439
440 --flye_pacbio_corr : Input FASTQ reads are PacBio reads that
441 were corrected with other methods (<3%
442 error). Default: false
443
444 --flye_pacbio_hifi : Input FASTQ reads are PacBio HiFi reads (<1%
445 error). Default: false
446
447 --flye_nano_raw : Input FASTQ reads are ONT regular reads,
448 pre-Guppy5 (<20% error). Default: true
449
450 --flye_nano_corr : Input FASTQ reads are ONT reads that were
451 corrected with other methods (<3% error).
452 Default: false
453
454 --flye_nano_hq : Input FASTQ reads are ONT high-quality
455 reads: Guppy5+ SUP or Q20 (<5% error).
456 Default: false
457
458 --flye_genome_size : Estimated genome size (for example, 5m or 2.
459 6g). Default: 5.5m
460
461 --flye_polish_iter : Number of genome polishing iterations.
462 Default: false
463
464 --flye_meta : Do a metagenome assembly (unenven coverage
465 mode). Default: true
466
467 --flye_min_overlap : Minimum overlap between reads. Default:
468 false
469
470 --flye_scaffold : Enable scaffolding using assembly graph.
471 Default: false
472
473 --serotypefinder_run : Run SerotypeFinder tool. Default: true
474
475 --serotypefinder_x : Generate extended output files. Default:
476 true
477
478 --serotypefinder_db : Path to SerotypeFinder databases. Default: /
479 hpc/db/serotypefinder/2.0.2
480
481 --serotypefinder_min_threshold : Minimum percent identity (in float)
482 required for calling a hit. Default: 0.85
483
484 --serotypefinder_min_cov : Minumum percent coverage (in float)
485 required for calling a hit. Default: 0.80
486
487 --seqsero2_run : Run SeqSero2 tool. Default: false
488
489 --seqsero2_t : '1' for interleaved paired-end reads, '2'
490 for separated paired-end reads, '3' for
491 single reads, '4' for genome assembly, '5'
492 for nanopore reads (fasta/fastq). Default:
493 4
494
495 --seqsero2_m : Which workflow to apply, 'a'(raw reads
496 allele micro-assembly), 'k'(raw reads and
497 genome assembly k-mer). Default: k
498
499 --seqsero2_c : SeqSero2 will only output serotype
500 prediction without the directory containing
501 log files. Default: false
502
503 --seqsero2_s : SeqSero2 will not output header in
504 SeqSero_result.tsv. Default: false
505
506 --mlst_run : Run MLST tool. Default: true
507
508 --mlst_minid : DNA %identity of full allelle to consider '
509 similar' [~]. Default: 95
510
511 --mlst_mincov : DNA %cov to report partial allele at all [?].
512 Default: 10
513
514 --mlst_minscore : Minumum score out of 100 to match a scheme.
515 Default: 50
516
517 --abricate_run : Run ABRicate tool. Default: true
518
519 --abricate_minid : Minimum DNA %identity. Defaut: 90
520
521 --abricate_mincov : Minimum DNA %coverage. Defaut: 80
522
523 --abricate_datadir : ABRicate databases folder. Defaut: /hpc/db/
524 abricate/1.0.1/db
525
526 Help options :
527
528 --help : Display this message.
529 ```
530
531 \
532 &nbsp;
533
534 ## `centriflaken_hy` CLI Help
535
536 ```text
537 cpipes --pipeline centriflaken_hy --help
538
539 N E X T F L O W ~ version 24.10.4
540
541 Launching `/home/user/centriflaken/cpipes` [big_ramanujan] DSL2 - revision: 55d6f63710
542
543 ================================================================================
544 (o)
545 ___ _ __ _ _ __ ___ ___
546 / __|| '_ \ | || '_ \ / _ \/ __|
547 | (__ | |_) || || |_) || __/\__ \
548 \___|| .__/ |_|| .__/ \___||___/
549 | | | |
550 |_| |_|
551 --------------------------------------------------------------------------------
552 A collection of modular pipelines at CFSAN, FDA.
553 --------------------------------------------------------------------------------
554 Name : CPIPES
555 Author : Kranti.Konganti@fda.hhs.gov
556 Version : 0.4.1
557 Center : CFSAN, FDA.
558 ================================================================================
559
560 Workflow : centriflaken_hy
561
562 Author : Kranti.Konganti@fda.hhs.gov
563
564 Version : 0.4.1
565
566
567 Usage : cpipes --pipeline centriflaken_hy [options]
568
569
570 Required :
571
572 --input : Absolute path to directory containing FASTQ
573 files. The directory should contain only
574 FASTQ files as all the files within the
575 mentioned directory will be read. Ex: --
576 input /path/to/fastq_pass
577
578 --output : Absolute path to directory where all the
579 pipeline outputs should be stored. Ex: --
580 output /path/to/output
581
582 Other options :
583
584 --metadata : Absolute path to metadata CSV file
585 containing five mandatory columns: sample,
586 fq1,fq2,strandedness,single_end. The fq1
587 and fq2 columns contain absolute paths to
588 the FASTQ files. This option can be used in
589 place of --input option. This is rare. Ex: --
590 metadata samplesheet.csv
591
592 --fq_suffix : The suffix of FASTQ files (Unpaired reads
593 or R1 reads or Long reads) if an input
594 directory is mentioned via --input option.
595 Default: _R1_001.fastq.gz
596
597 --fq2_suffix : The suffix of FASTQ files (Paired-end reads
598 or R2 reads) if an input directory is
599 mentioned via --input option. Default:
600 _R2_001.fastq.gz
601
602 --fq_filter_by_len : Remove FASTQ reads that are less than this
603 many bases. Default: 75
604
605 --fq_strandedness : The strandedness of the sequencing run.
606 This is mostly needed if your sequencing
607 run is RNA-SEQ. For most of the other runs,
608 it is probably safe to use unstranded for
609 the option. Default: unstranded
610
611 --fq_single_end : SINGLE-END information will be auto-
612 detected but this option forces PAIRED-END
613 FASTQ files to be treated as SINGLE-END so
614 only read 1 information is included in auto-
615 generated samplesheet. Default: false
616
617 --fq_filename_delim : Delimiter by which the file name is split
618 to obtain sample name. Default: _
619
620 --fq_filename_delim_idx : After splitting FASTQ file name by using
621 the --fq_filename_delim option, all
622 elements before this index (1-based) will
623 be joined to create final sample name.
624 Default: 1
625
626 --seqkit_rmdup_run : Remove duplicate sequences using seqkit
627 rmdup. Default: false
628
629 --seqkit_rmdup_n : Match and remove duplicate sequences by
630 full name instead of just ID. Defaut: false
631
632 --seqkit_rmdup_s : Match and remove duplicate sequences by
633 sequence content. Defaut: true
634
635 --seqkit_rmdup_d : Save the duplicated sequences to a file.
636 Defaut: false
637
638 --seqkit_rmdup_D : Save the number and list of duplicated
639 sequences to a file. Defaut: false
640
641 --seqkit_rmdup_i : Ignore case while using seqkit rmdup.
642 Defaut: false
643
644 --seqkit_rmdup_P : Only consider positive strand (i.e. 5')
645 when comparing by sequence content. Defaut:
646 false
647
648 --kraken2_db : Absolute path to kraken database. Default: /
649 hpc/db/kraken2/standard-210914
650
651 --kraken2_confidence : Confidence score threshold which must be
652 between 0 and 1. Default: 0.0
653
654 --kraken2_quick : Quick operation (use first hit or hits).
655 Default: false
656
657 --kraken2_use_mpa_style : Report output like Kraken 1's kraken-mpa-
658 report. Default: false
659
660 --kraken2_minimum_base_quality : Minimum base quality used in classification
661 which is only effective with FASTQ input.
662 Default: 0
663
664 --kraken2_report_zero_counts : Report counts for ALL taxa, even if counts
665 are zero. Default: false
666
667 --kraken2_report_minmizer_data : Report minimizer and distinct minimizer
668 count information in addition to normal
669 Kraken report. Default: false
670
671 --kraken2_use_names : Print scientific names instead of just
672 taxids. Default: true
673
674 --kraken2_extract_bug : Extract the reads or contigs beloging to
675 this bug. Default: Escherichia coli
676
677 --centrifuge_x : Absolute path to centrifuge database.
678 Default: /hpc/db/centrifuge/2022-04-12/ab
679
680 --centrifuge_save_unaligned : Save SINGLE-END reads that did not align.
681 For PAIRED-END reads, save read pairs that
682 did not align concordantly. Default: false
683
684 --centrifuge_save_aligned : Save SINGLE-END reads that aligned. For
685 PAIRED-END reads, save read pairs that
686 aligned concordantly. Default: false
687
688 --centrifuge_out_fmt_sam : Centrifuge output should be in SAM. Default:
689 false
690
691 --centrifuge_extract_bug : Extract this bug from centrifuge results.
692 Default: Escherichia coli
693
694 --centrifuge_ignore_quals : Treat all quality values as 30 on Phred
695 scale. Default: false
696
697 --megahit_run : Run MEGAHIT assembler. Default: true
698
699 --megahit_min_count : <int>. Minimum multiplicity for filtering (
700 k_min+1)-mers. Defaut: false
701
702 --megahit_k_list : Comma-separated list of kmer size. All
703 values must be odd, in the range 15-255,
704 increment should be <= 28. Ex: '21,29,39,59,
705 79,99,119,141'. Default: false
706
707 --megahit_no_mercy : Do not add mercy k-mers. Default: false
708
709 --megahit_bubble_level : <int>. Intensity of bubble merging (0-2), 0
710 to disable. Default: false
711
712 --megahit_merge_level : <l,s>. Merge complex bubbles of length <= l*
713 kmer_size and similarity >= s. Default:
714 false
715
716 --megahit_prune_level : <int>. Strength of low depth pruning (0-3).
717 Default: false
718
719 --megahit_prune_depth : <int>. Remove unitigs with avg k-mer depth
720 less than this value. Default: false
721
722 --megahit_low_local_ratio : <float>. Ratio threshold to define low
723 local coverage contigs. Default: false
724
725 --megahit_max_tip_len : <int>. remove tips less than this value [<
726 int> * k]. Default: false
727
728 --megahit_no_local : Disable local assembly. Default: false
729
730 --megahit_kmin_1pass : Use 1pass mode to build SdBG of k_min.
731 Default: false
732
733 --megahit_preset : <str>. Override a group of parameters.
734 Valid values are meta-sensitive which
735 enforces '--min-count 1 --k-list 21,29,39,
736 49,...,129,141', meta-large (large &
737 complex metagenomes, like soil) which
738 enforces '--k-min 27 --k-max 127 --k-step
739 10'. Default: meta-sensitive
740
741 --megahit_mem_flag : <int>. SdBG builder memory mode. 0: minimum;
742 1: moderate; 2: use all memory specified.
743 Default: 2
744
745 --megahit_min_contig_len : <int>. Minimum length of contigs to output.
746 Default: false
747
748 --spades_run : Run SPAdes assembler. Default: false
749
750 --spades_isolate : This flag is highly recommended for high-
751 coverage isolate and multi-cell data.
752 Defaut: false
753
754 --spades_sc : This flag is required for MDA (single-cell)
755 data. Default: false
756
757 --spades_meta : This flag is required for metagenomic data.
758 Default: true
759
760 --spades_bio : This flag is required for biosytheticSPAdes
761 mode. Default: false
762
763 --spades_corona : This flag is required for coronaSPAdes mode.
764 Default: false
765
766 --spades_rna : This flag is required for RNA-Seq data.
767 Default: false
768
769 --spades_plasmid : Runs plasmidSPAdes pipeline for plasmid
770 detection. Default: false
771
772 --spades_metaviral : Runs metaviralSPAdes pipeline for virus
773 detection. Default: false
774
775 --spades_metaplasmid : Runs metaplasmidSPAdes pipeline for plasmid
776 detection in metagenomics datasets. Default:
777 false
778
779 --spades_rnaviral : This flag enables virus assembly module
780 from RNA-Seq data. Default: false
781
782 --spades_iontorrent : This flag is required for IonTorrent data.
783 Default: false
784
785 --spades_only_assembler : Runs only the SPAdes assembler module (
786 without read error correction). Default:
787 false
788
789 --spades_careful : Tries to reduce the number of mismatches
790 and short indels in the assembly. Default:
791 false
792
793 --spades_cov_cutoff : Coverage cutoff value (a positive float
794 number). Default: false
795
796 --spades_k : List of k-mer sizes (must be odd and less
797 than 128). Default: false
798
799 --spades_hmm : Directory with custom hmms that replace the
800 default ones (very rare). Default: false
801
802 --serotypefinder_run : Run SerotypeFinder tool. Default: true
803
804 --serotypefinder_x : Generate extended output files. Default:
805 true
806
807 --serotypefinder_db : Path to SerotypeFinder databases. Default: /
808 hpc/db/serotypefinder/2.0.2
809
810 --serotypefinder_min_threshold : Minimum percent identity (in float)
811 required for calling a hit. Default: 0.85
812
813 --serotypefinder_min_cov : Minumum percent coverage (in float)
814 required for calling a hit. Default: 0.80
815
816 --seqsero2_run : Run SeqSero2 tool. Default: false
817
818 --seqsero2_t : '1' for interleaved paired-end reads, '2'
819 for separated paired-end reads, '3' for
820 single reads, '4' for genome assembly, '5'
821 for nanopore reads (fasta/fastq). Default:
822 4
823
824 --seqsero2_m : Which workflow to apply, 'a'(raw reads
825 allele micro-assembly), 'k'(raw reads and
826 genome assembly k-mer). Default: k
827
828 --seqsero2_c : SeqSero2 will only output serotype
829 prediction without the directory containing
830 log files. Default: false
831
832 --seqsero2_s : SeqSero2 will not output header in
833 SeqSero_result.tsv. Default: false
834
835 --mlst_run : Run MLST tool. Default: true
836
837 --mlst_minid : DNA %identity of full allelle to consider '
838 similar' [~]. Default: 95
839
840 --mlst_mincov : DNA %cov to report partial allele at all [?].
841 Default: 10
842
843 --mlst_minscore : Minumum score out of 100 to match a scheme.
844 Default: 50
845
846 --abricate_run : Run ABRicate tool. Default: true
847
848 --abricate_minid : Minimum DNA %identity. Defaut: 90
849
850 --abricate_mincov : Minimum DNA %coverage. Defaut: 80
851
852 --abricate_datadir : ABRicate databases folder. Defaut: /hpc/db/
853 abricate/1.0.1/db
854
855 Help options :
856
857 --help : Display this message.
858 ```