Mercurial > repos > galaxytrakr > hfp_centriflaken_awsbatch

diff 0.4.2/readme/centriflaken.md @ 0:082e0091e813 draft default tip
planemo upload
author: galaxytrakr
date: Fri, 29 May 2026 13:27:47 +0000
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/0.4.2/readme/centriflaken.md	Fri May 29 13:27:47 2026 +0000
@@ -0,0 +1,858 @@
+# centriflaken
+
+`centriflaken` is an automated precision metagenomics workflow for assembly and _in silico_ analyses of food-borne pathogens. `centriflaken` primarily fine-tuned for detecting and classifying Shiga toxin-producing **_Escherichia coli_** (**STEC**), can also be used for performing analyses on other food-borne pathogens such as **_Salmonella enterica_**.  `centriflaken` takes as input a UNIX path to FASTQ, generates MAGs, and performs in silico-based analysis for STECs as described in [Maguire et al. 2021](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0245172).
+
+`centriflaken` works on both **Illumina** short reads and **Oxford Nanopore** long reads.
+
+It is written in **Nextflow** and is part of the modular data analysis pipelines at **HFP**.
+
+\
+&nbsp;
+
+<!-- TOC -->
+
+- [Minimum Requirements](#minimum-requirements)
+- [HFP GalaxyTrakr](#hfp-galaxytrakr)
+- [Usage and Examples](#usage-and-examples)
+  - [Databases](#databases)
+  - [Input](#input)
+    - [Illumina short reads](#illumina-short-reads)
+  - [Output](#output)
+  - [Computational resources](#computational-resources)
+  - [Runtime profiles](#runtime-profiles)
+  - [your_institution.config](#your_institutionconfig)
+  - [Test run](#test-run)
+- [centriflaken CLI Help](#centriflaken-cli-help)
+- [centriflaken_hy CLI Help](#centriflaken_hy-cli-help)
+
+<!-- /TOC -->
+
+\
+&nbsp;
+
+## Minimum Requirements
+
+1. [Nextflow version 24.10.4](https://github.com/nextflow-io/nextflow/releases/download/v24.10.4/nextflow).
+    - Make the `nextflow` binary executable (`chmod 755 nextflow`) and also make sure that it is made available in your `$PATH`.
+    - If your existing `JAVA` install does not support the newest **Nextflow** version, you can try **Amazon**'s `JAVA` (OpenJDK):  [Corretto](https://docs.aws.amazon.com/corretto/latest/corretto-21-ug/downloads-list.html).
+2. Either of `micromamba` (version `1.5.9`) or `docker` or `singularity` installed and made available in your `$PATH`.
+    - Running the workflow via `micromamba` software provisioning is **preferred** as it does not require any `sudo` or `admin` privileges or any other configurations with respect to the various container providers.
+    - To install `micromamba` for your system type, please follow these [installation steps](https://mamba.readthedocs.io/en/latest/installation/micromamba-installation.html#linux-and-macos) and make sure that the `micromamba` binary is made available in your `$PATH`.
+    - Just the `curl` step is sufficient to download the binary as far as running the workflows are concerned.
+    - Once you have finished the installation, **it is important that you downgrade `micromamba` to version `1.5.9`**.
+    - First check, if your version is other than `1.5.9` and if not, do the downgrade.
+
+        ```bash
+        micromamba --version
+        micromamba self-update --version 1.5.9 -c conda-forge
+        ```
+
+3. Minimum of 10 CPU cores and about 60 GBs for main workflow steps. More memory may be required if your **FASTQ** files are big.
+
+\
+&nbsp;
+
+## HFP GalaxyTrakr
+
+The `centriflaken` pipeline is also available for use on the [Galaxy instance supported by HFP, FDA](https://galaxytrakr.org/). If you wish to run the analysis using **Galaxy**, please register for an account, after which [you can run the workflow using this protocol](https://www.protocols.io/view/centriflaken-an-automated-data-analysis-pipeline-f-kxygxzdbwv8j/v5).
+
+Please note that the pipeline on [HFP GalaxyTrakr](https://galaxytrakr.org) in most cases may be a version older than the one on **GitHub** due to testing prioritization.
+
+\
+&nbsp;
+
+## Usage and Examples
+
+Clone or download this repository and then call `cpipes`.
+
+```bash
+cpipes --pipeline centriflaken [options]
+```
+
+Alternatively, you can use `nextflow` to directly pull and run the pipeline.
+
+```bash
+nextflow pull CFSAN-Biostatistics/centriflaken
+nextflow list
+nextflow info CFSAN-Biostatistics/centriflaken
+nextflow run CFSAN-Biostatistics/centriflaken --pipeline centriflaken --help
+nextflow run CFSAN-Biostatistics/centriflaken --pipeline centriflaken_hy --help
+```
+
+\
+&nbsp;
+
+### Databases
+
+---
+
+The successful run of the workflow requires all of the following databases:
+
+- `kraken2`, `centrifuge`, `serotypefinder` and `abricate`: [Download](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/centriflaken/centriflaken_dbs.tar.bz2).
+
+Once you have downloaded the databases, uncompress and set the **UNIX** path's in the configuration files as follows:
+
+- [Line no. 4](../workflows/conf/centriflaken.config#L4): `centrifuge_x = /path/to/centriflaken_dbs/centrifuge/ab`. The `ab` prefix is necessary.
+- [Line no. 11](../workflows/conf/centriflaken_hy.config#L11): `centrifuge_x = /path/to/centriflaken_dbs/centrifuge/ab`. The `ab` prefix is necessary.
+- [Line no. 10](../workflows/conf/centriflaken.config#L10): `kraken2_db = /path/to/centriflaken_dbs/kraken2`.
+- [Line no. 17](../workflows/conf/centriflaken_hy.config#L17): `kraken2_db = /path/to/centriflaken_dbs/kraken2`.
+- [Line no. 36](../workflows/conf/centriflaken.config#L36): `serotypefinder_db = /path/to/centriflaken_dbs/serotypefinder`.
+- [Line no. 64](../workflows/conf/centriflaken_hy.config#L64): `serotypefinder_db = /path/to/centriflaken_dbs/serotypefinder`.
+- [Line no. 53](../workflows/conf/centriflaken.config#L53): `abricate_datadir = /path/to/centriflaken_dbs/abricate`.
+- [Line no. 81](../workflows/conf/centriflaken_hy.config#L81): `abricate_datadir = /path/to/centriflaken_dbs/abricate`.
+
+\
+&nbsp;
+
+### Input
+
+---
+
+The input to the workflow is a folder containing compressed (`.gz`) FASTQ files of long reads or short reads. Please note that the sample grouping happens automatically by the file name of the FASTQ file. If for example, a single sample is sequenced across multiple sequencing lanes, you can choose to group those FASTQ files into one sample by using the `--fq_filename_delim` and `--fq_filename_delim_idx` options. By default, `--fq_filename_delim` is set to `_` (underscore) and `--fq_filename_delim_idx` is set to 1.
+
+For example, if the directory contains FASTQ files as shown below:
+
+- KB-01_apple_L001_R1.fastq.gz
+- KB-01_apple_L001_R2.fastq.gz
+- KB-01_apple_L002_R1.fastq.gz
+- KB-01_apple_L002_R2.fastq.gz
+- KB-02_mango_L001_R1.fastq.gz
+- KB-02_mango_L001_R2.fastq.gz
+- KB-02_mango_L002_R1.fastq.gz
+- KB-02_mango_L002_R2.fastq.gz
+
+Then, to create 2 sample groups, `apple` and `mango`, we split the file name by the delimitor (underscore in the case, which is default) and group by the first 2 words (`--fq_filename_delim_idx 2`).
+
+This goes without saying that all the FASTQ files should have uniform naming patterns so that `--fq_filename_delim` and `--fq_filename_delim_idx` options do not have any adverse effect in collecting and creating a sample metadata sheet.
+
+\
+&nbsp;
+
+### Illumina short reads
+
+---
+
+`centriflaken` was primarily developed for **ONT** long reads but also supports **Illumina** short reads. Use the `--pipeline centriflaken_hy` instead of `--pipeline centriflaken` to activate this feature. The `centriflaken_hy` variant of the pipeline uses `megahit` instead of `flye` to perform short read assembly. There is no other change needed from the user other than using the `--pipeline centriflaken_hy` parameter for Illumina short reads.
+
+\
+&nbsp;
+
+### Output
+
+---
+
+All the outputs for each step are stored inside the folder mentioned with the `--output` option. A `multiqc_report.html` file inside the `centriflaken-multiqc` folder can be opened in any browser on your local workstation which contains a consolidated brief report.
+
+\
+&nbsp;
+
+### Computational resources
+
+---
+
+The workflows `centriflaken` and `centriflaken_hy` require at least a minimum of 60 GBs of memory to successfully finish the workflow.
+
+\
+&nbsp;
+
+### Runtime profiles
+
+---
+
+You can use different run time profiles that suit your specific compute environments i.e., you can run the workflow locally on your machine or in a grid computing infrastructure.
+
+\
+&nbsp;
+
+Example:
+
+```bash
+cd /data/scratch/$USER
+mkdir nf-cpipes
+cd nf-cpipes
+cpipes \
+    --pipeline centriflaken \
+    --input /path/to/fastq_pass_dir \
+    --output /path/to/where/output/should/go \
+    -profile your_institution
+```
+
+The above command would run the pipeline and store the output at the location per the `--output` flag and the **NEXTFLOW** reports are always stored in the current working directory from where `cpipes` is run. For example, for the above command, a directory called `CPIPES-centriflaken` would hold all the **NEXTFLOW** related logs, reports and trace files.
+
+\
+&nbsp;
+
+### `your_institution.config`
+
+---
+
+In the above example, we can see that we have mentioned the run time profile as `your_institution`. For this to work, add the following lines at the end of [`computeinfra.config`](../conf/computeinfra.config) file which should be located inside the `conf` folder. For example, if your institution uses **SGE** or **UNIVA** for grid computing instead of **SLURM** and has a job queue named `normal.q`, then add these lines:
+
+\
+&nbsp;
+
+```groovy
+your_institution {
+    process.executor = 'sge'
+    process.queue = 'normal.q'
+    singularity.enabled = false
+    singularity.autoMounts = true
+    docker.enabled = false
+    params.enable_conda = true
+    conda.enabled = true
+    conda.useMicromamba = true
+    params.enable_module = false
+}
+```
+
+In the above example, by default, all the software provisioning choices are disabled except `conda`. You can also choose to remove the `process.queue` line altogether and the `centriflaken` workflow will request the appropriate memory and number of CPU cores automatically, which ranges from 1 CPU, 1 GB and 1 hour for job completion up to 10 CPU cores, 1 TB and 120 hours for job completion.
+
+\
+&nbsp;
+
+### Cloud computing
+
+---
+
+You can run the workflow in the cloud (works only with proper set up of AWS resources). Add new run time profiles with required parameters per [Nextflow docs](https://www.nextflow.io/docs/latest/executor.html):
+
+\
+&nbsp;
+
+Example:
+
+```groovy
+my_aws_batch {
+    executor = 'awsbatch'
+    queue = 'my-batch-queue'
+    aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
+    aws.batch.region = 'us-east-1'
+    singularity.enabled = false
+    singularity.autoMounts = true
+    docker.enabled = true
+    params.conda_enabled = false
+    params.enable_module = false
+}
+```
+
+\
+&nbsp;
+
+### Test run
+
+---
+
+After you make sure that you have all the [minimum requirements](#minimum-requirements) to run the workflow, you can try the `centriflaken` pipeline on some subsampled reads belonging to the NCBI BioProject `PRJNA639799` as discussed in [Maguire _et al_](https://pmc.ncbi.nlm.nih.gov/articles/PMC10500926/).
+
+- Please note that the input reads are subsampled to validate the software install.
+- Download them [from S3](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/centriflaken/macguire_et_al_subsampled_reads.tar.bz2) (~ 20 GB).
+
+  | Samples                                                        | Biosample    | SRA accession | Flowcell |
+  |:---------------------------------------------------------------|:-------------|:--------------|:---------|
+  | FAL00958                                                       | SAMN46790801 | SRR32346290   | FAL00958 |
+  | FAL01198                                                       | SAMN46793213 | SRR32346289   | FAL01198 |
+  | FAL01556                                                       | SAMN46793220 | SRR32346278   | FAL01556 |
+  | ZymoBIOMICS Microbial Community DNA Standard R1                | SAMN46793392 | SRR32381322   | FAL11413 |
+  | ZymoBIOMICS Microbial Community DNA Standard R2                | SAMN46793393 | SRR32381321   | FAL01565 |
+  | ZymoBIOMICS Microbial Community Standard II - log distribution | SAMN46793397 | SRR32381320   | FAL01514 |
+
+- Download pre-formatted  databases (**MANDATORY**) [from S3](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/centriflaken/centriflaken_dbs.tar.bz2) (~ 47 GB).
+- One of the assembly jobs should fail to assemble the reads and the pipeline will ignore the failed assembly and finish to completion.
+- After successful download, untar and change the paths to the databases in **BOTH** the [long reads conf file](../workflows/conf/centriflaken.config) and [short reads conf file](../workflows/conf/centriflaken_hy.config) as described in the [Databases](#databases) section.
+- The following values should point to the UNIX paths of the downloaded databases.
+
+    ```bash
+    centrifuge_x = '/path/to/centrifuge/ab' # /ab suffix SHOULD NOT change. Only the /path/to/centrifuge changes to your specific UNIX path.
+    kraken2_db = '/path/to/kraken2'
+    serotypefinder_db = '/path/to/serotypefinder'
+    abricate_datadir = '/path/to/abricate'
+    amrfinderplus_db = '/hpc/db/amrfinderplus/3.10.24/latest' # IGNORE THIS PATH SINCE AMRFINDERPLUS SHOULD NOT BE RUN.
+    ```
+
+- It is always a best practice to use absolute UNIX paths and real destinations of symbolic links during pipeline execution. For example, find out the real path(s) of your absolute UNIX path(s) and use that for the `--input` and `--output` options of the pipeline.
+
+  ```bash
+  realpath /hpc/scratch/user/input/srr
+  ```
+
+- Now run the workflow by ignoring quality values since these are simulated base qualities:
+
+    ```bash
+    cpipes \
+        --pipeline centriflaken \
+        --input /path/to/macguire_et_al_subsampled_reads \
+        --output /path/to/centriflaken_test_output \
+        -profile stdkondagac \
+        -resume
+    ```
+
+- After succesful run of the workflow, your **MultiQC** report should look something like [this](https://cfsan-pub-xfer.s3.us-east-1.amazonaws.com/Kranti.Konganti/centriflaken/macquire_et_al_test_report.html).
+
+Please note that the run time profile `stdkondagac` will run jobs locally using `micromamba` for software provisioning. The first time you run the command, a new folder called `kondagac_cache` will be created and subsequent runs should use this `conda` cache.
+
+\
+&nbsp;
+
+## `centriflaken` CLI Help
+
+```text
+cpipes --pipeline centriflaken --help
+
+ N E X T F L O W   ~  version 24.10.4
+
+Launching `/home/user/centriflaken/cpipes` [sleepy_pauling] DSL2 - revision: 55d6f63710
+
+================================================================================
+             (o)                  
+  ___  _ __   _  _ __    ___  ___ 
+ / __|| '_ \ | || '_ \  / _ \/ __|
+| (__ | |_) || || |_) ||  __/\__ \
+ \___|| .__/ |_|| .__/  \___||___/
+      | |       | |               
+      |_|       |_|
+--------------------------------------------------------------------------------
+A collection of modular pipelines at CFSAN, FDA.
+--------------------------------------------------------------------------------
+Name                            : CPIPES
+Author                          : Kranti.Konganti@fda.hhs.gov
+Version                         : 0.4.1
+Center                          : CFSAN, FDA.
+================================================================================
+
+Workflow                        : centriflaken
+
+Author                          : Kranti.Konganti@fda.hhs.gov
+
+Version                         : 0.4.2
+
+
+Usage                           : cpipes --pipeline centriflaken [options]
+
+
+Required                        : 
+
+--input                         : Absolute path to directory containing FASTQ 
+                                  files. The directory should contain only 
+                                  FASTQ files as all the files within the 
+                                  mentioned directory will be read. Ex: --
+                                  input /path/to/fastq_pass
+
+--output                        : Absolute path to directory where all the 
+                                  pipeline outputs should be stored. Ex: --
+                                  output /path/to/output
+
+Other options                   : 
+
+--metadata                      : Absolute path to metadata CSV file 
+                                  containing five mandatory columns: sample,
+                                  fq1,fq2,strandedness,single_end. The fq1 
+                                  and fq2 columns contain absolute paths to 
+                                  the FASTQ files. This option can be used in 
+                                  place of --input option. This is rare. Ex: --
+                                  metadata samplesheet.csv
+
+--fq_suffix                     : The suffix of FASTQ files (Unpaired reads 
+                                  or R1 reads or Long reads) if an input 
+                                  directory is mentioned via --input option. 
+                                  Default: .fastq.gz
+
+--fq2_suffix                    : The suffix of FASTQ files (Paired-end reads 
+                                  or R2 reads) if an input directory is 
+                                  mentioned via --input option. Default: 
+                                  false
+
+--fq_filter_by_len              : Remove FASTQ reads that are less than this 
+                                  many bases. Default: 4000
+
+--fq_strandedness               : The strandedness of the sequencing run. 
+                                  This is mostly needed if your sequencing 
+                                  run is RNA-SEQ. For most of the other runs, 
+                                  it is probably safe to use unstranded for 
+                                  the option. Default: unstranded
+
+--fq_single_end                 : SINGLE-END information will be auto-
+                                  detected but this option forces PAIRED-END 
+                                  FASTQ files to be treated as SINGLE-END so 
+                                  only read 1 information is included in auto-
+                                  generated samplesheet. Default: false
+
+--fq_filename_delim             : Delimiter by which the file name is split 
+                                  to obtain sample name. Default: _
+
+--fq_filename_delim_idx         : After splitting FASTQ file name by using 
+                                  the --fq_filename_delim option, all 
+                                  elements before this index (1-based) will 
+                                  be joined to create final sample name. 
+                                  Default: 1
+
+--kraken2_db                    : Absolute path to kraken database. Default: /
+                                  hpc/db/kraken2/standard-210914
+
+--kraken2_confidence            : Confidence score threshold which must be 
+                                  between 0 and 1. Default: 0.0
+
+--kraken2_quick                 : Quick operation (use first hit or hits). 
+                                  Default: false
+
+--kraken2_use_mpa_style         : Report output like Kraken 1's kraken-mpa-
+                                  report. Default: false
+
+--kraken2_minimum_base_quality  : Minimum base quality used in classification  
+                                  which is only effective with FASTQ input. 
+                                  Default: 0
+
+--kraken2_report_zero_counts    : Report counts for ALL taxa, even if counts 
+                                  are zero. Default: false
+
+--kraken2_report_minmizer_data  : Report minimizer and distinct minimizer 
+                                  count information in addition to normal 
+                                  Kraken report. Default: false
+
+--kraken2_use_names             : Print scientific names instead of just 
+                                  taxids. Default: true
+
+--kraken2_extract_bug           : Extract the reads or contigs beloging to 
+                                  this bug. Default: Escherichia coli
+
+--centrifuge_x                  : Absolute path to centrifuge database. 
+                                  Default: /hpc/db/centrifuge/2022-04-12/ab
+
+--centrifuge_save_unaligned     : Save SINGLE-END reads that did not align. 
+                                  For PAIRED-END reads, save read pairs that 
+                                  did not align concordantly. Default: false
+
+--centrifuge_save_aligned       : Save SINGLE-END reads that aligned. For 
+                                  PAIRED-END reads, save read pairs that 
+                                  aligned concordantly. Default: false
+
+--centrifuge_out_fmt_sam        : Centrifuge output should be in SAM. Default: 
+                                  false
+
+--centrifuge_extract_bug        : Extract this bug from centrifuge results. 
+                                  Default: Escherichia coli
+
+--centrifuge_ignore_quals       : Treat all quality values as 30 on Phred 
+                                  scale. Default: false
+
+--flye_pacbio_raw               : Input FASTQ reads are PacBio regular CLR 
+                                  reads (<20% error) Defaut: false
+
+--flye_pacbio_corr              : Input FASTQ reads are PacBio reads that 
+                                  were corrected with other methods (<3% 
+                                  error). Default: false
+
+--flye_pacbio_hifi              : Input FASTQ reads are PacBio HiFi reads (<1% 
+                                  error). Default: false
+
+--flye_nano_raw                 : Input FASTQ reads are ONT regular reads, 
+                                  pre-Guppy5 (<20% error). Default: true
+
+--flye_nano_corr                : Input FASTQ reads are ONT reads that were 
+                                  corrected with other methods (<3% error). 
+                                  Default: false
+
+--flye_nano_hq                  : Input FASTQ reads are ONT high-quality 
+                                  reads: Guppy5+ SUP or Q20 (<5% error). 
+                                  Default: false
+
+--flye_genome_size              : Estimated genome size (for example, 5m or 2.
+                                  6g). Default: 5.5m
+
+--flye_polish_iter              : Number of genome polishing iterations. 
+                                  Default: false
+
+--flye_meta                     : Do a metagenome assembly (unenven coverage 
+                                  mode). Default: true
+
+--flye_min_overlap              : Minimum overlap between reads. Default: 
+                                  false
+
+--flye_scaffold                 : Enable scaffolding using assembly graph. 
+                                  Default: false
+
+--serotypefinder_run            : Run SerotypeFinder tool. Default: true
+
+--serotypefinder_x              : Generate extended output files. Default: 
+                                  true
+
+--serotypefinder_db             : Path to SerotypeFinder databases. Default: /
+                                  hpc/db/serotypefinder/2.0.2
+
+--serotypefinder_min_threshold  : Minimum percent identity (in float) 
+                                  required for calling a hit. Default: 0.85
+
+--serotypefinder_min_cov        : Minumum percent coverage (in float) 
+                                  required for calling a hit. Default: 0.80
+
+--seqsero2_run                  : Run SeqSero2 tool. Default: false
+
+--seqsero2_t                    : '1' for interleaved paired-end reads, '2' 
+                                  for separated paired-end reads, '3' for 
+                                  single reads, '4' for genome assembly, '5' 
+                                  for nanopore reads (fasta/fastq). Default: 
+                                  4
+
+--seqsero2_m                    : Which workflow to apply, 'a'(raw reads 
+                                  allele micro-assembly), 'k'(raw reads and 
+                                  genome assembly k-mer). Default: k
+
+--seqsero2_c                    : SeqSero2 will only output serotype 
+                                  prediction without the directory containing 
+                                  log files. Default: false
+
+--seqsero2_s                    : SeqSero2 will not output header in 
+                                  SeqSero_result.tsv. Default: false
+
+--mlst_run                      : Run MLST tool. Default: true
+
+--mlst_minid                    : DNA %identity of full allelle to consider '
+                                  similar' [~]. Default: 95
+
+--mlst_mincov                   : DNA %cov to report partial allele at all [?].
+                                  Default: 10
+
+--mlst_minscore                 : Minumum score out of 100 to match a scheme.
+                                  Default: 50
+
+--abricate_run                  : Run ABRicate tool. Default: true
+
+--abricate_minid                : Minimum DNA %identity. Defaut: 90
+
+--abricate_mincov               : Minimum DNA %coverage. Defaut: 80
+
+--abricate_datadir              : ABRicate databases folder. Defaut: /hpc/db/
+                                  abricate/1.0.1/db
+
+Help options                    : 
+
+--help                          : Display this message.
+```
+
+\
+&nbsp;
+
+## `centriflaken_hy` CLI Help
+
+```text
+cpipes --pipeline centriflaken_hy --help
+
+ N E X T F L O W   ~  version 24.10.4
+
+Launching `/home/user/centriflaken/cpipes` [big_ramanujan] DSL2 - revision: 55d6f63710
+
+================================================================================
+             (o)                  
+  ___  _ __   _  _ __    ___  ___ 
+ / __|| '_ \ | || '_ \  / _ \/ __|
+| (__ | |_) || || |_) ||  __/\__ \
+ \___|| .__/ |_|| .__/  \___||___/
+      | |       | |               
+      |_|       |_|
+--------------------------------------------------------------------------------
+A collection of modular pipelines at CFSAN, FDA.
+--------------------------------------------------------------------------------
+Name                            : CPIPES
+Author                          : Kranti.Konganti@fda.hhs.gov
+Version                         : 0.4.1
+Center                          : CFSAN, FDA.
+================================================================================
+
+Workflow                        : centriflaken_hy
+
+Author                          : Kranti.Konganti@fda.hhs.gov
+
+Version                         : 0.4.1
+
+
+Usage                           : cpipes --pipeline centriflaken_hy [options]
+
+
+Required                        : 
+
+--input                         : Absolute path to directory containing FASTQ 
+                                  files. The directory should contain only 
+                                  FASTQ files as all the files within the 
+                                  mentioned directory will be read. Ex: --
+                                  input /path/to/fastq_pass
+
+--output                        : Absolute path to directory where all the 
+                                  pipeline outputs should be stored. Ex: --
+                                  output /path/to/output
+
+Other options                   : 
+
+--metadata                      : Absolute path to metadata CSV file 
+                                  containing five mandatory columns: sample,
+                                  fq1,fq2,strandedness,single_end. The fq1 
+                                  and fq2 columns contain absolute paths to 
+                                  the FASTQ files. This option can be used in 
+                                  place of --input option. This is rare. Ex: --
+                                  metadata samplesheet.csv
+
+--fq_suffix                     : The suffix of FASTQ files (Unpaired reads 
+                                  or R1 reads or Long reads) if an input 
+                                  directory is mentioned via --input option. 
+                                  Default: _R1_001.fastq.gz
+
+--fq2_suffix                    : The suffix of FASTQ files (Paired-end reads 
+                                  or R2 reads) if an input directory is 
+                                  mentioned via --input option. Default: 
+                                  _R2_001.fastq.gz
+
+--fq_filter_by_len              : Remove FASTQ reads that are less than this 
+                                  many bases. Default: 75
+
+--fq_strandedness               : The strandedness of the sequencing run. 
+                                  This is mostly needed if your sequencing 
+                                  run is RNA-SEQ. For most of the other runs, 
+                                  it is probably safe to use unstranded for 
+                                  the option. Default: unstranded
+
+--fq_single_end                 : SINGLE-END information will be auto-
+                                  detected but this option forces PAIRED-END 
+                                  FASTQ files to be treated as SINGLE-END so 
+                                  only read 1 information is included in auto-
+                                  generated samplesheet. Default: false
+
+--fq_filename_delim             : Delimiter by which the file name is split 
+                                  to obtain sample name. Default: _
+
+--fq_filename_delim_idx         : After splitting FASTQ file name by using 
+                                  the --fq_filename_delim option, all 
+                                  elements before this index (1-based) will 
+                                  be joined to create final sample name. 
+                                  Default: 1
+
+--seqkit_rmdup_run              : Remove duplicate sequences using seqkit 
+                                  rmdup. Default: false
+
+--seqkit_rmdup_n                : Match and remove duplicate sequences by 
+                                  full name instead of just ID. Defaut: false
+
+--seqkit_rmdup_s                : Match and remove duplicate sequences by 
+                                  sequence content. Defaut: true
+
+--seqkit_rmdup_d                : Save the duplicated sequences to a file. 
+                                  Defaut: false
+
+--seqkit_rmdup_D                : Save the number and list of duplicated 
+                                  sequences to a file. Defaut: false
+
+--seqkit_rmdup_i                : Ignore case while using seqkit rmdup. 
+                                  Defaut: false
+
+--seqkit_rmdup_P                : Only consider positive strand (i.e. 5') 
+                                  when comparing by sequence content. Defaut: 
+                                  false
+
+--kraken2_db                    : Absolute path to kraken database. Default: /
+                                  hpc/db/kraken2/standard-210914
+
+--kraken2_confidence            : Confidence score threshold which must be 
+                                  between 0 and 1. Default: 0.0
+
+--kraken2_quick                 : Quick operation (use first hit or hits). 
+                                  Default: false
+
+--kraken2_use_mpa_style         : Report output like Kraken 1's kraken-mpa-
+                                  report. Default: false
+
+--kraken2_minimum_base_quality  : Minimum base quality used in classification  
+                                  which is only effective with FASTQ input. 
+                                  Default: 0
+
+--kraken2_report_zero_counts    : Report counts for ALL taxa, even if counts 
+                                  are zero. Default: false
+
+--kraken2_report_minmizer_data  : Report minimizer and distinct minimizer 
+                                  count information in addition to normal 
+                                  Kraken report. Default: false
+
+--kraken2_use_names             : Print scientific names instead of just 
+                                  taxids. Default: true
+
+--kraken2_extract_bug           : Extract the reads or contigs beloging to 
+                                  this bug. Default: Escherichia coli
+
+--centrifuge_x                  : Absolute path to centrifuge database. 
+                                  Default: /hpc/db/centrifuge/2022-04-12/ab
+
+--centrifuge_save_unaligned     : Save SINGLE-END reads that did not align. 
+                                  For PAIRED-END reads, save read pairs that 
+                                  did not align concordantly. Default: false
+
+--centrifuge_save_aligned       : Save SINGLE-END reads that aligned. For 
+                                  PAIRED-END reads, save read pairs that 
+                                  aligned concordantly. Default: false
+
+--centrifuge_out_fmt_sam        : Centrifuge output should be in SAM. Default: 
+                                  false
+
+--centrifuge_extract_bug        : Extract this bug from centrifuge results. 
+                                  Default: Escherichia coli
+
+--centrifuge_ignore_quals       : Treat all quality values as 30 on Phred 
+                                  scale. Default: false
+
+--megahit_run                   : Run MEGAHIT assembler. Default: true
+
+--megahit_min_count             : <int>. Minimum multiplicity for filtering (
+                                  k_min+1)-mers. Defaut: false
+
+--megahit_k_list                : Comma-separated list of kmer size. All 
+                                  values must be odd, in the range 15-255, 
+                                  increment should be <= 28. Ex: '21,29,39,59,
+                                  79,99,119,141'. Default: false
+
+--megahit_no_mercy              : Do not add mercy k-mers. Default: false
+
+--megahit_bubble_level          : <int>. Intensity of bubble merging (0-2), 0 
+                                  to disable. Default: false
+
+--megahit_merge_level           : <l,s>. Merge complex bubbles of length <= l*
+                                  kmer_size and similarity >= s. Default: 
+                                  false
+
+--megahit_prune_level           : <int>. Strength of low depth pruning (0-3). 
+                                  Default: false
+
+--megahit_prune_depth           : <int>. Remove unitigs with avg k-mer depth 
+                                  less than this value. Default: false
+
+--megahit_low_local_ratio       : <float>. Ratio threshold to define low 
+                                  local coverage contigs. Default: false
+
+--megahit_max_tip_len           : <int>. remove tips less than this value [<
+                                  int> * k]. Default: false
+
+--megahit_no_local              : Disable local assembly. Default: false
+
+--megahit_kmin_1pass            : Use 1pass mode to build SdBG of k_min. 
+                                  Default: false
+
+--megahit_preset                : <str>. Override a group of parameters. 
+                                  Valid values are meta-sensitive which 
+                                  enforces '--min-count 1 --k-list 21,29,39,
+                                  49,...,129,141', meta-large (large & 
+                                  complex metagenomes, like soil) which 
+                                  enforces '--k-min 27 --k-max 127 --k-step 
+                                  10'. Default: meta-sensitive
+
+--megahit_mem_flag              : <int>. SdBG builder memory mode. 0: minimum; 
+                                  1: moderate; 2: use all memory specified. 
+                                  Default: 2
+
+--megahit_min_contig_len        : <int>.  Minimum length of contigs to output. 
+                                  Default: false
+
+--spades_run                    : Run SPAdes assembler. Default: false
+
+--spades_isolate                : This flag is highly recommended for high-
+                                  coverage isolate and multi-cell data. 
+                                  Defaut: false
+
+--spades_sc                     : This flag is required for MDA (single-cell) 
+                                  data. Default: false
+
+--spades_meta                   : This flag is required for metagenomic data. 
+                                  Default: true
+
+--spades_bio                    : This flag is required for biosytheticSPAdes 
+                                  mode. Default: false
+
+--spades_corona                 : This flag is required for coronaSPAdes mode. 
+                                  Default: false
+
+--spades_rna                    : This flag is required for RNA-Seq data. 
+                                  Default: false
+
+--spades_plasmid                : Runs plasmidSPAdes pipeline for plasmid 
+                                  detection. Default: false
+
+--spades_metaviral              : Runs metaviralSPAdes pipeline for virus 
+                                  detection. Default: false
+
+--spades_metaplasmid            : Runs metaplasmidSPAdes pipeline for plasmid 
+                                  detection in metagenomics datasets. Default: 
+                                  false
+
+--spades_rnaviral               : This flag enables virus assembly module 
+                                  from RNA-Seq data. Default: false
+
+--spades_iontorrent             : This flag is required for IonTorrent data. 
+                                  Default: false
+
+--spades_only_assembler         : Runs only the SPAdes assembler module (
+                                  without read error correction). Default: 
+                                  false
+
+--spades_careful                : Tries to reduce the number of mismatches 
+                                  and short indels in the assembly. Default: 
+                                  false
+
+--spades_cov_cutoff             : Coverage cutoff value (a positive float 
+                                  number). Default: false
+
+--spades_k                      : List of k-mer sizes (must be odd and less 
+                                  than 128). Default: false
+
+--spades_hmm                    : Directory with custom hmms that replace the 
+                                  default ones (very rare). Default: false
+
+--serotypefinder_run            : Run SerotypeFinder tool. Default: true
+
+--serotypefinder_x              : Generate extended output files. Default: 
+                                  true
+
+--serotypefinder_db             : Path to SerotypeFinder databases. Default: /
+                                  hpc/db/serotypefinder/2.0.2
+
+--serotypefinder_min_threshold  : Minimum percent identity (in float) 
+                                  required for calling a hit. Default: 0.85
+
+--serotypefinder_min_cov        : Minumum percent coverage (in float) 
+                                  required for calling a hit. Default: 0.80
+
+--seqsero2_run                  : Run SeqSero2 tool. Default: false
+
+--seqsero2_t                    : '1' for interleaved paired-end reads, '2' 
+                                  for separated paired-end reads, '3' for 
+                                  single reads, '4' for genome assembly, '5' 
+                                  for nanopore reads (fasta/fastq). Default: 
+                                  4
+
+--seqsero2_m                    : Which workflow to apply, 'a'(raw reads 
+                                  allele micro-assembly), 'k'(raw reads and 
+                                  genome assembly k-mer). Default: k
+
+--seqsero2_c                    : SeqSero2 will only output serotype 
+                                  prediction without the directory containing 
+                                  log files. Default: false
+
+--seqsero2_s                    : SeqSero2 will not output header in 
+                                  SeqSero_result.tsv. Default: false
+
+--mlst_run                      : Run MLST tool. Default: true
+
+--mlst_minid                    : DNA %identity of full allelle to consider '
+                                  similar' [~]. Default: 95
+
+--mlst_mincov                   : DNA %cov to report partial allele at all [?].
+                                  Default: 10
+
+--mlst_minscore                 : Minumum score out of 100 to match a scheme.
+                                  Default: 50
+
+--abricate_run                  : Run ABRicate tool. Default: true
+
+--abricate_minid                : Minimum DNA %identity. Defaut: 90
+
+--abricate_mincov               : Minimum DNA %coverage. Defaut: 80
+
+--abricate_datadir              : ABRicate databases folder. Defaut: /hpc/db/
+                                  abricate/1.0.1/db
+
+Help options                    : 
+
+--help                          : Display this message.
+```
author	galaxytrakr
date	Fri, 29 May 2026 13:27:47 +0000
parents
children