comparison 0.4.2/readme/centriflaken.md @ 0:082e0091e813 draft default tip
planemo upload
| author | galaxytrakr |
|---|---|
| date | Fri, 29 May 2026 13:27:47 +0000 |
| parents | |
| children |
comparison
| -1:000000000000 | 0:082e0091e813 |
|---|---|
| 1 # centriflaken | |
| 2 | |
| 3 `centriflaken` is an automated precision metagenomics workflow for assembly and _in silico_ analyses of food-borne pathogens. Although `centriflaken` is primarily fine-tuned for detecting and classifying Shiga toxin-producing **_Escherichia coli_** (**STEC**), it can also be used to analyze other food-borne pathogens such as **_Salmonella enterica_**. `centriflaken` takes as input a UNIX path to a folder of FASTQ files, generates metagenome-assembled genomes (MAGs), and performs _in silico_ analysis for STECs as described in [Maguire et al. 2021](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0245172). | |
| 4 | |
| 5 `centriflaken` works on both **Illumina** short reads and **Oxford Nanopore** long reads. | |
| 6 | |
| 7 It is written in **Nextflow** and is part of the modular data analysis pipelines at **HFP**. | |
| 8 | |
| 9 \ | |
| 10 | |
| 11 | |
| 12 <!-- TOC --> | |
| 13 | |
| 14 - [Minimum Requirements](#minimum-requirements) | |
| 15 - [HFP GalaxyTrakr](#hfp-galaxytrakr) | |
| 16 - [Usage and Examples](#usage-and-examples) | |
| 17 - [Databases](#databases) | |
| 18 - [Input](#input) | |
| 19 - [Illumina short reads](#illumina-short-reads) | |
| 20 - [Output](#output) | |
| 21 - [Computational resources](#computational-resources) | |
| 22 - [Runtime profiles](#runtime-profiles) | |
| 23 - [your_institution.config](#your_institutionconfig) | |
| 24 - [Test run](#test-run) | |
| 25 - [centriflaken CLI Help](#centriflaken-cli-help) | |
| 26 - [centriflaken_hy CLI Help](#centriflaken_hy-cli-help) | |
| 27 | |
| 28 <!-- /TOC --> | |
| 29 | |
| 30 \ | |
| 31 | |
| 32 | |
| 33 ## Minimum Requirements | |
| 34 | |
| 35 1. [Nextflow version 24.10.4](https://github.com/nextflow-io/nextflow/releases/download/v24.10.4/nextflow). | |
| 36 - Make the `nextflow` binary executable (`chmod 755 nextflow`) and also make sure that it is made available in your `$PATH`. | |
| 37 - If your existing `JAVA` install does not support the newest **Nextflow** version, you can try **Amazon**'s `JAVA` (OpenJDK): [Corretto](https://docs.aws.amazon.com/corretto/latest/corretto-21-ug/downloads-list.html). | |
| 38 2. One of `micromamba` (version `1.5.9`), `docker`, or `singularity` installed and made available in your `$PATH`. | |
| 39 - Running the workflow via `micromamba` software provisioning is **preferred** as it does not require any `sudo` or `admin` privileges or any other configurations with respect to the various container providers. | |
| 40 - To install `micromamba` for your system type, please follow these [installation steps](https://mamba.readthedocs.io/en/latest/installation/micromamba-installation.html#linux-and-macos) and make sure that the `micromamba` binary is made available in your `$PATH`. | |
| 41 - Just the `curl` step is sufficient to download the binary as far as running the workflows is concerned. | |
| 42 - Once you have finished the installation, **it is important that you downgrade `micromamba` to version `1.5.9`**. | |
| 43 - First check whether your version is `1.5.9` and, if it is not, do the downgrade. | |
| 44 | |
| 45 ```bash | |
| 46 micromamba --version | |
| 47 micromamba self-update --version 1.5.9 -c conda-forge | |
| 48 ``` | |
| 49 | |
| 50 3. A minimum of 10 CPU cores and about 60 GB of memory for the main workflow steps. More memory may be required if your **FASTQ** files are big. | |
| 51 | |
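The version check and downgrade from step 2 can be combined into one guarded script. This is a minimal sketch, assuming that `micromamba --version` prints only the bare version string; the helper name `needs_downgrade` is hypothetical and not part of `cpipes`.

```bash
#!/usr/bin/env bash
# Sketch only: pin micromamba to the version this README asks for.
REQUIRED="1.5.9"

# Hypothetical helper: succeeds (exit 0) when the given version
# differs from the required one, i.e. when a downgrade is needed.
needs_downgrade() {
    [ "$1" != "$REQUIRED" ]
}

if command -v micromamba >/dev/null 2>&1; then
    # Assumption: the output is just the version, e.g. "1.5.9".
    current="$(micromamba --version)"
    if needs_downgrade "$current"; then
        echo "Found micromamba $current, switching to $REQUIRED"
        micromamba self-update --version "$REQUIRED" -c conda-forge
    else
        echo "micromamba $current is already the required version"
    fi
else
    echo "micromamba was not found in \$PATH" >&2
fi
```

Running it a second time should take the "already the required version" branch and leave the installation untouched.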
| 52 \ | |
| 53 | |
| 54 | |
| 55 ## HFP GalaxyTrakr | |
| 56 | |
| 57 The `centriflaken` pipeline is also available for use on the [Galaxy instance supported by HFP, FDA](https://galaxytrakr.org/). If you wish to run the analysis using **Galaxy**, please register for an account, after which [you can run the workflow using this protocol](https://www.protocols.io/view/centriflaken-an-automated-data-analysis-pipeline-f-kxygxzdbwv8j/v5). | |
| 58 | |
| 59 Please note that the pipeline on [HFP GalaxyTrakr](https://galaxytrakr.org) in most cases may be a version older than the one on **GitHub** due to testing prioritization. | |
| 60 | |
| 61 \ | |
| 62 | |
| 63 | |
| 64 ## Usage and Examples | |
| 65 | |
| 66 Clone or download this repository and then call `cpipes`. | |
| 67 | |
| 68 ```bash | |
| 69 cpipes --pipeline centriflaken [options] | |
| 70 ``` | |
| 71 | |
| 72 Alternatively, you can use `nextflow` to directly pull and run the pipeline. | |
| 73 | |
| 74 ```bash | |
| 75 nextflow pull CFSAN-Biostatistics/centriflaken | |
| 76 nextflow list | |
| 77 nextflow info CFSAN-Biostatistics/centriflaken | |
| 78 nextflow run CFSAN-Biostatistics/centriflaken --pipeline centriflaken --help | |
| 79 nextflow run CFSAN-Biostatistics/centriflaken --pipeline centriflaken_hy --help | |
| 80 ``` | |
| 81 | |
| 82 \ | |
| 83 | |
| 84 | |
| 85 ### Databases | |
| 86 | |
| 87 --- | |
| 88 | |
| 89 A successful run of the workflow requires all of the following databases: | |
| 90 | |
| 91 - `kraken2`, `centrifuge`, `serotypefinder` and `abricate`: [Download](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/centriflaken/centriflaken_dbs.tar.bz2). | |
| 92 | |
| 93 Once you have downloaded the databases, uncompress them and set the **UNIX** paths in the configuration files as follows: | |
| 94 | |
| 95 - [Line no. 4](../workflows/conf/centriflaken.config#L4): `centrifuge_x = /path/to/centriflaken_dbs/centrifuge/ab`. The `ab` prefix is necessary. | |
| 96 - [Line no. 11](../workflows/conf/centriflaken_hy.config#L11): `centrifuge_x = /path/to/centriflaken_dbs/centrifuge/ab`. The `ab` prefix is necessary. | |
| 97 - [Line no. 10](../workflows/conf/centriflaken.config#L10): `kraken2_db = /path/to/centriflaken_dbs/kraken2`. | |
| 98 - [Line no. 17](../workflows/conf/centriflaken_hy.config#L17): `kraken2_db = /path/to/centriflaken_dbs/kraken2`. | |
| 99 - [Line no. 36](../workflows/conf/centriflaken.config#L36): `serotypefinder_db = /path/to/centriflaken_dbs/serotypefinder`. | |
| 100 - [Line no. 64](../workflows/conf/centriflaken_hy.config#L64): `serotypefinder_db = /path/to/centriflaken_dbs/serotypefinder`. | |
| 101 - [Line no. 53](../workflows/conf/centriflaken.config#L53): `abricate_datadir = /path/to/centriflaken_dbs/abricate`. | |
| 102 - [Line no. 81](../workflows/conf/centriflaken_hy.config#L81): `abricate_datadir = /path/to/centriflaken_dbs/abricate`. | |
| 103 | |
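Editing eight lines by hand across two files is error-prone, so the paths above could also be set by key name. The following is a rough sketch, not part of `cpipes`: the helper `set_db_path` is hypothetical, `DB_ROOT` is a placeholder, and it assumes each setting sits on its own line in the form `key = value` and that the script is run from the repository root.

```bash
#!/usr/bin/env bash
# Sketch only: point the database settings at an uncompressed
# centriflaken_dbs folder by rewriting "key = value" lines.

# Hypothetical helper: replace the value of KEY in CONF with VALUE.
# Anything after the "=" on that line (including a trailing comment)
# is overwritten.
set_db_path() {
    local key="$1" value="$2" conf="$3" tmp
    tmp="$(mktemp)" || return 1
    sed -E "s|^([[:space:]]*${key}[[:space:]]*=[[:space:]]*).*|\1'${value}'|" "$conf" > "$tmp" \
        && cat "$tmp" > "$conf"
    local status=$?
    rm -f "$tmp"
    return "$status"
}

DB_ROOT="/path/to/centriflaken_dbs"   # placeholder, change to your path

# Assumed locations of the two config files relative to the repo root.
for conf in workflows/conf/centriflaken.config workflows/conf/centriflaken_hy.config; do
    [ -f "$conf" ] || continue
    set_db_path centrifuge_x      "$DB_ROOT/centrifuge/ab"   "$conf"  # keep the "ab" prefix
    set_db_path kraken2_db        "$DB_ROOT/kraken2"         "$conf"
    set_db_path serotypefinder_db "$DB_ROOT/serotypefinder"  "$conf"
    set_db_path abricate_datadir  "$DB_ROOT/abricate"        "$conf"
done
```

Matching on the key rather than on a line number keeps the sketch independent of the exact line positions listed above, which may shift between releases. Check the result with `grep` before starting a run.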
| 104 \ | |
| 105 | |
| 106 | |
| 107 ### Input | |
| 108 | |
| 109 --- | |
| 110 | |
| 111 The input to the workflow is a folder containing compressed (`.gz`) FASTQ files of long reads or short reads. Please note that sample grouping happens automatically based on the FASTQ file names. If, for example, a single sample is sequenced across multiple sequencing lanes, you can choose to group those FASTQ files into one sample by using the `--fq_filename_delim` and `--fq_filename_delim_idx` options. By default, `--fq_filename_delim` is set to `_` (underscore) and `--fq_filename_delim_idx` is set to 1. | |
| 112 | |
| 113 For example, if the directory contains FASTQ files as shown below: | |
| 114 | |
| 115 - KB-01_apple_L001_R1.fastq.gz | |
| 116 - KB-01_apple_L001_R2.fastq.gz | |
| 117 - KB-01_apple_L002_R1.fastq.gz | |
| 118 - KB-01_apple_L002_R2.fastq.gz | |
| 119 - KB-02_mango_L001_R1.fastq.gz | |
| 120 - KB-02_mango_L001_R2.fastq.gz | |
| 121 - KB-02_mango_L002_R1.fastq.gz | |
| 122 - KB-02_mango_L002_R2.fastq.gz | |
| 123 | |
| 124 Then, to create 2 sample groups, `apple` and `mango`, we split the file name by the delimiter (underscore in this case, which is the default) and group by the first 2 words (`--fq_filename_delim_idx 2`). | |
| 125 | |
| 126 It goes without saying that all the FASTQ files should have uniform naming patterns so that the `--fq_filename_delim` and `--fq_filename_delim_idx` options do not have any adverse effect when collecting the files and creating the sample metadata sheet. | |
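To preview how a set of file names would be grouped before starting a run, the split-and-join rule described above can be imitated with `cut`. This is only an illustration of the rule as described here, not the pipeline's own code; the function name `sample_name` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: derive a sample name by splitting the file name on a
# delimiter and keeping the first N fields.

# Usage: sample_name FILE [DELIM] [IDX]
# DELIM defaults to "_" and IDX to 1, mirroring the defaults of
# --fq_filename_delim and --fq_filename_delim_idx.
sample_name() {
    local file="$1" delim="${2:-_}" idx="${3:-1}"
    basename "$file" | cut -d "$delim" -f "1-${idx}"
}

# Group the example files by their first 2 fields.
for fq in \
    KB-01_apple_L001_R1.fastq.gz KB-01_apple_L001_R2.fastq.gz \
    KB-01_apple_L002_R1.fastq.gz KB-01_apple_L002_R2.fastq.gz \
    KB-02_mango_L001_R1.fastq.gz KB-02_mango_L001_R2.fastq.gz \
    KB-02_mango_L002_R1.fastq.gz KB-02_mango_L002_R2.fastq.gz
do
    sample_name "$fq" _ 2
done | sort -u
```

For the eight example files this prints two distinct names, `KB-01_apple` and `KB-02_mango`, so both lanes of each sample end up in the same group.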
| 127 | |
| 128 \ | |
| 129 | |
| 130 | |
| 131 ### Illumina short reads | |
| 132 | |
| 133 --- | |
| 134 | |
| 135 `centriflaken` was primarily developed for **ONT** long reads but also supports **Illumina** short reads. Use the `--pipeline centriflaken_hy` instead of `--pipeline centriflaken` to activate this feature. The `centriflaken_hy` variant of the pipeline uses `megahit` instead of `flye` to perform short read assembly. There is no other change needed from the user other than using the `--pipeline centriflaken_hy` parameter for Illumina short reads. | |
| 136 | |
| 137 \ | |
| 138 | |
| 139 | |
| 140 ### Output | |
| 141 | |
| 142 --- | |
| 143 | |
| 144 All the outputs for each step are stored inside the folder specified with the `--output` option. The `multiqc_report.html` file inside the `centriflaken-multiqc` folder contains a brief consolidated report and can be opened in any browser on your local workstation. | |
| 145 | |
| 146 \ | |
| 147 | |
| 148 | |
| 149 ### Computational resources | |
| 150 | |
| 151 --- | |
| 152 | |
| 153 The workflows `centriflaken` and `centriflaken_hy` require at least 60 GB of memory to finish successfully. | |
| 154 | |
| 155 \ | |
| 156 | |
| 157 | |
| 158 ### Runtime profiles | |
| 159 | |
| 160 --- | |
| 161 | |
| 162 You can use different run time profiles to suit your specific compute environment, i.e., you can run the workflow locally on your machine or in a grid computing infrastructure. | |
| 163 | |
| 164 \ | |
| 165 | |
| 166 | |
| 167 Example: | |
| 168 | |
| 169 ```bash | |
| 170 cd /data/scratch/$USER | |
| 171 mkdir nf-cpipes | |
| 172 cd nf-cpipes | |
| 173 cpipes \ | |
| 174 --pipeline centriflaken \ | |
| 175 --input /path/to/fastq_pass_dir \ | |
| 176 --output /path/to/where/output/should/go \ | |
| 177 -profile your_institution | |
| 178 ``` | |
| 179 | |
| 180 The above command would run the pipeline and store the output at the location per the `--output` flag and the **NEXTFLOW** reports are always stored in the current working directory from where `cpipes` is run. For example, for the above command, a directory called `CPIPES-centriflaken` would hold all the **NEXTFLOW** related logs, reports and trace files. | |
| 181 | |
| 182 \ | |
| 183 | |
| 184 | |
| 185 ### `your_institution.config` | |
| 186 | |
| 187 --- | |
| 188 | |
| 189 In the above example, we can see that we have mentioned the run time profile as `your_institution`. For this to work, add the following lines at the end of [`computeinfra.config`](../conf/computeinfra.config) file which should be located inside the `conf` folder. For example, if your institution uses **SGE** or **UNIVA** for grid computing instead of **SLURM** and has a job queue named `normal.q`, then add these lines: | |
| 190 | |
| 191 \ | |
| 192 | |
| 193 | |
| 194 ```groovy | |
| 195 your_institution { | |
| 196 process.executor = 'sge' | |
| 197 process.queue = 'normal.q' | |
| 198 singularity.enabled = false | |
| 199 singularity.autoMounts = true | |
| 200 docker.enabled = false | |
| 201 params.enable_conda = true | |
| 202 conda.enabled = true | |
| 203 conda.useMicromamba = true | |
| 204 params.enable_module = false | |
| 205 } | |
| 206 ``` | |
| 207 | |
| 208 In the above example, by default, all the software provisioning choices are disabled except `conda`. You can also choose to remove the `process.queue` line altogether and the `centriflaken` workflow will request the appropriate memory and number of CPU cores automatically, which ranges from 1 CPU, 1 GB and 1 hour for job completion up to 10 CPU cores, 1 TB and 120 hours for job completion. | |
| 209 | |
| 210 \ | |
| 211 | |
| 212 | |
| 213 ### Cloud computing | |
| 214 | |
| 215 --- | |
| 216 | |
| 217 You can run the workflow in the cloud (works only with proper set up of AWS resources). Add new run time profiles with required parameters per [Nextflow docs](https://www.nextflow.io/docs/latest/executor.html): | |
| 218 | |
| 219 \ | |
| 220 | |
| 221 | |
| 222 Example: | |
| 223 | |
| 224 ```groovy | |
| 225 my_aws_batch { | |
| 226 process.executor = 'awsbatch' | |
| 227 process.queue = 'my-batch-queue' | |
| 228 aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws' | |
| 229 aws.region = 'us-east-1' | |
| 230 singularity.enabled = false | |
| 231 singularity.autoMounts = true | |
| 232 docker.enabled = true | |
| 233 params.enable_conda = false | |
| 234 params.enable_module = false | |
| 235 } | |
| 236 ``` | |
| 237 | |
| 238 \ | |
| 239 | |
| 240 | |
| 241 ### Test run | |
| 242 | |
| 243 --- | |
| 244 | |
| 245 After you make sure that you have all the [minimum requirements](#minimum-requirements) to run the workflow, you can try the `centriflaken` pipeline on some subsampled reads belonging to the NCBI BioProject `PRJNA639799` as discussed in [Maguire _et al_](https://pmc.ncbi.nlm.nih.gov/articles/PMC10500926/). | |
| 246 | |
| 247 - Please note that the input reads are subsampled to validate the software install. | |
| 248 - Download them [from S3](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/centriflaken/macguire_et_al_subsampled_reads.tar.bz2) (~ 20 GB). | |
| 249 | |
| 250 | Samples | Biosample | SRA accession | Flowcell | | |
| 251 |:---------------------------------------------------------------|:-------------|:--------------|:---------| | |
| 252 | FAL00958 | SAMN46790801 | SRR32346290 | FAL00958 | | |
| 253 | FAL01198 | SAMN46793213 | SRR32346289 | FAL01198 | | |
| 254 | FAL01556 | SAMN46793220 | SRR32346278 | FAL01556 | | |
| 255 | ZymoBIOMICS Microbial Community DNA Standard R1 | SAMN46793392 | SRR32381322 | FAL11413 | | |
| 256 | ZymoBIOMICS Microbial Community DNA Standard R2 | SAMN46793393 | SRR32381321 | FAL01565 | | |
| 257 | ZymoBIOMICS Microbial Community Standard II - log distribution | SAMN46793397 | SRR32381320 | FAL01514 | | |
| 258 | |
| 259 - Download pre-formatted databases (**MANDATORY**) [from S3](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/centriflaken/centriflaken_dbs.tar.bz2) (~ 47 GB). | |
| 260 - One of the assembly jobs is expected to fail to assemble the reads; the pipeline will ignore the failed assembly and run to completion. | |
| 261 - After successful download, untar and change the paths to the databases in **BOTH** the [long reads conf file](../workflows/conf/centriflaken.config) and [short reads conf file](../workflows/conf/centriflaken_hy.config) as described in the [Databases](#databases) section. | |
| 262 - The following values should point to the UNIX paths of the downloaded databases. | |
| 263 | |
| 264 ```bash | |
| 265 centrifuge_x = '/path/to/centrifuge/ab' # /ab suffix SHOULD NOT change. Only the /path/to/centrifuge changes to your specific UNIX path. | |
| 266 kraken2_db = '/path/to/kraken2' | |
| 267 serotypefinder_db = '/path/to/serotypefinder' | |
| 268 abricate_datadir = '/path/to/abricate' | |
| 269 amrfinderplus_db = '/hpc/db/amrfinderplus/3.10.24/latest' # IGNORE THIS PATH SINCE AMRFINDERPLUS SHOULD NOT BE RUN. | |
| 270 ``` | |
| 271 | |
| 272 - It is always a best practice to use absolute UNIX paths and real destinations of symbolic links during pipeline execution. For example, find out the real path(s) of your absolute UNIX path(s) and use that for the `--input` and `--output` options of the pipeline. | |
| 273 | |
| 274 ```bash | |
| 275 realpath /hpc/scratch/user/input/srr | |
| 276 ``` | |
| 277 | |
| 278 - Now run the workflow by ignoring quality values since these are simulated base qualities: | |
| 279 | |
| 280 ```bash | |
| 281 cpipes \ | |
| 282 --pipeline centriflaken \ | |
| 283 --input /path/to/macguire_et_al_subsampled_reads \ | |
| 284 --output /path/to/centriflaken_test_output \ | |
| 285 -profile stdkondagac \ | |
| 286 -resume | |
| 287 ``` | |
| 288 | |
| 289 - After a successful run of the workflow, your **MultiQC** report should look something like [this](https://cfsan-pub-xfer.s3.us-east-1.amazonaws.com/Kranti.Konganti/centriflaken/macquire_et_al_test_report.html). | |
| 290 | |
| 291 Please note that the run time profile `stdkondagac` will run jobs locally using `micromamba` for software provisioning. The first time you run the command, a new folder called `kondagac_cache` will be created and subsequent runs should use this `conda` cache. | |
| 292 | |
| 293 \ | |
| 294 | |
| 295 | |
| 296 ## `centriflaken` CLI Help | |
| 297 | |
| 298 ```text | |
| 299 cpipes --pipeline centriflaken --help | |
| 300 | |
| 301 N E X T F L O W ~ version 24.10.4 | |
| 302 | |
| 303 Launching `/home/user/centriflaken/cpipes` [sleepy_pauling] DSL2 - revision: 55d6f63710 | |
| 304 | |
| 305 ================================================================================ | |
| 306 (o) | |
| 307 ___ _ __ _ _ __ ___ ___ | |
| 308 / __|| '_ \ | || '_ \ / _ \/ __| | |
| 309 | (__ | |_) || || |_) || __/\__ \ | |
| 310 \___|| .__/ |_|| .__/ \___||___/ | |
| 311 | | | | | |
| 312 |_| |_| | |
| 313 -------------------------------------------------------------------------------- | |
| 314 A collection of modular pipelines at CFSAN, FDA. | |
| 315 -------------------------------------------------------------------------------- | |
| 316 Name : CPIPES | |
| 317 Author : Kranti.Konganti@fda.hhs.gov | |
| 318 Version : 0.4.1 | |
| 319 Center : CFSAN, FDA. | |
| 320 ================================================================================ | |
| 321 | |
| 322 Workflow : centriflaken | |
| 323 | |
| 324 Author : Kranti.Konganti@fda.hhs.gov | |
| 325 | |
| 326 Version : 0.4.2 | |
| 327 | |
| 328 | |
| 329 Usage : cpipes --pipeline centriflaken [options] | |
| 330 | |
| 331 | |
| 332 Required : | |
| 333 | |
| 334 --input : Absolute path to directory containing FASTQ | |
| 335 files. The directory should contain only | |
| 336 FASTQ files as all the files within the | |
| 337 mentioned directory will be read. Ex: -- | |
| 338 input /path/to/fastq_pass | |
| 339 | |
| 340 --output : Absolute path to directory where all the | |
| 341 pipeline outputs should be stored. Ex: -- | |
| 342 output /path/to/output | |
| 343 | |
| 344 Other options : | |
| 345 | |
| 346 --metadata : Absolute path to metadata CSV file | |
| 347 containing five mandatory columns: sample, | |
| 348 fq1,fq2,strandedness,single_end. The fq1 | |
| 349 and fq2 columns contain absolute paths to | |
| 350 the FASTQ files. This option can be used in | |
| 351 place of --input option. This is rare. Ex: -- | |
| 352 metadata samplesheet.csv | |
| 353 | |
| 354 --fq_suffix : The suffix of FASTQ files (Unpaired reads | |
| 355 or R1 reads or Long reads) if an input | |
| 356 directory is mentioned via --input option. | |
| 357 Default: .fastq.gz | |
| 358 | |
| 359 --fq2_suffix : The suffix of FASTQ files (Paired-end reads | |
| 360 or R2 reads) if an input directory is | |
| 361 mentioned via --input option. Default: | |
| 362 false | |
| 363 | |
| 364 --fq_filter_by_len : Remove FASTQ reads that are less than this | |
| 365 many bases. Default: 4000 | |
| 366 | |
| 367 --fq_strandedness : The strandedness of the sequencing run. | |
| 368 This is mostly needed if your sequencing | |
| 369 run is RNA-SEQ. For most of the other runs, | |
| 370 it is probably safe to use unstranded for | |
| 371 the option. Default: unstranded | |
| 372 | |
| 373 --fq_single_end : SINGLE-END information will be auto- | |
| 374 detected but this option forces PAIRED-END | |
| 375 FASTQ files to be treated as SINGLE-END so | |
| 376 only read 1 information is included in auto- | |
| 377 generated samplesheet. Default: false | |
| 378 | |
| 379 --fq_filename_delim : Delimiter by which the file name is split | |
| 380 to obtain sample name. Default: _ | |
| 381 | |
| 382 --fq_filename_delim_idx : After splitting FASTQ file name by using | |
| 383 the --fq_filename_delim option, all | |
| 384 elements before this index (1-based) will | |
| 385 be joined to create final sample name. | |
| 386 Default: 1 | |
| 387 | |
| 388 --kraken2_db : Absolute path to kraken database. Default: / | |
| 389 hpc/db/kraken2/standard-210914 | |
| 390 | |
| 391 --kraken2_confidence : Confidence score threshold which must be | |
| 392 between 0 and 1. Default: 0.0 | |
| 393 | |
| 394 --kraken2_quick : Quick operation (use first hit or hits). | |
| 395 Default: false | |
| 396 | |
| 397 --kraken2_use_mpa_style : Report output like Kraken 1's kraken-mpa- | |
| 398 report. Default: false | |
| 399 | |
| 400 --kraken2_minimum_base_quality : Minimum base quality used in classification | |
| 401 which is only effective with FASTQ input. | |
| 402 Default: 0 | |
| 403 | |
| 404 --kraken2_report_zero_counts : Report counts for ALL taxa, even if counts | |
| 405 are zero. Default: false | |
| 406 | |
| 407 --kraken2_report_minmizer_data : Report minimizer and distinct minimizer | |
| 408 count information in addition to normal | |
| 409 Kraken report. Default: false | |
| 410 | |
| 411 --kraken2_use_names : Print scientific names instead of just | |
| 412 taxids. Default: true | |
| 413 | |
| 414 --kraken2_extract_bug : Extract the reads or contigs beloging to | |
| 415 this bug. Default: Escherichia coli | |
| 416 | |
| 417 --centrifuge_x : Absolute path to centrifuge database. | |
| 418 Default: /hpc/db/centrifuge/2022-04-12/ab | |
| 419 | |
| 420 --centrifuge_save_unaligned : Save SINGLE-END reads that did not align. | |
| 421 For PAIRED-END reads, save read pairs that | |
| 422 did not align concordantly. Default: false | |
| 423 | |
| 424 --centrifuge_save_aligned : Save SINGLE-END reads that aligned. For | |
| 425 PAIRED-END reads, save read pairs that | |
| 426 aligned concordantly. Default: false | |
| 427 | |
| 428 --centrifuge_out_fmt_sam : Centrifuge output should be in SAM. Default: | |
| 429 false | |
| 430 | |
| 431 --centrifuge_extract_bug : Extract this bug from centrifuge results. | |
| 432 Default: Escherichia coli | |
| 433 | |
| 434 --centrifuge_ignore_quals : Treat all quality values as 30 on Phred | |
| 435 scale. Default: false | |
| 436 | |
| 437 --flye_pacbio_raw : Input FASTQ reads are PacBio regular CLR | |
| 438 reads (<20% error) Defaut: false | |
| 439 | |
| 440 --flye_pacbio_corr : Input FASTQ reads are PacBio reads that | |
| 441 were corrected with other methods (<3% | |
| 442 error). Default: false | |
| 443 | |
| 444 --flye_pacbio_hifi : Input FASTQ reads are PacBio HiFi reads (<1% | |
| 445 error). Default: false | |
| 446 | |
| 447 --flye_nano_raw : Input FASTQ reads are ONT regular reads, | |
| 448 pre-Guppy5 (<20% error). Default: true | |
| 449 | |
| 450 --flye_nano_corr : Input FASTQ reads are ONT reads that were | |
| 451 corrected with other methods (<3% error). | |
| 452 Default: false | |
| 453 | |
| 454 --flye_nano_hq : Input FASTQ reads are ONT high-quality | |
| 455 reads: Guppy5+ SUP or Q20 (<5% error). | |
| 456 Default: false | |
| 457 | |
| 458 --flye_genome_size : Estimated genome size (for example, 5m or 2. | |
| 459 6g). Default: 5.5m | |
| 460 | |
| 461 --flye_polish_iter : Number of genome polishing iterations. | |
| 462 Default: false | |
| 463 | |
| 464 --flye_meta : Do a metagenome assembly (unenven coverage | |
| 465 mode). Default: true | |
| 466 | |
| 467 --flye_min_overlap : Minimum overlap between reads. Default: | |
| 468 false | |
| 469 | |
| 470 --flye_scaffold : Enable scaffolding using assembly graph. | |
| 471 Default: false | |
| 472 | |
| 473 --serotypefinder_run : Run SerotypeFinder tool. Default: true | |
| 474 | |
| 475 --serotypefinder_x : Generate extended output files. Default: | |
| 476 true | |
| 477 | |
| 478 --serotypefinder_db : Path to SerotypeFinder databases. Default: / | |
| 479 hpc/db/serotypefinder/2.0.2 | |
| 480 | |
| 481 --serotypefinder_min_threshold : Minimum percent identity (in float) | |
| 482 required for calling a hit. Default: 0.85 | |
| 483 | |
| 484 --serotypefinder_min_cov : Minumum percent coverage (in float) | |
| 485 required for calling a hit. Default: 0.80 | |
| 486 | |
| 487 --seqsero2_run : Run SeqSero2 tool. Default: false | |
| 488 | |
| 489 --seqsero2_t : '1' for interleaved paired-end reads, '2' | |
| 490 for separated paired-end reads, '3' for | |
| 491 single reads, '4' for genome assembly, '5' | |
| 492 for nanopore reads (fasta/fastq). Default: | |
| 493 4 | |
| 494 | |
| 495 --seqsero2_m : Which workflow to apply, 'a'(raw reads | |
| 496 allele micro-assembly), 'k'(raw reads and | |
| 497 genome assembly k-mer). Default: k | |
| 498 | |
| 499 --seqsero2_c : SeqSero2 will only output serotype | |
| 500 prediction without the directory containing | |
| 501 log files. Default: false | |
| 502 | |
| 503 --seqsero2_s : SeqSero2 will not output header in | |
| 504 SeqSero_result.tsv. Default: false | |
| 505 | |
| 506 --mlst_run : Run MLST tool. Default: true | |
| 507 | |
| 508 --mlst_minid : DNA %identity of full allelle to consider ' | |
| 509 similar' [~]. Default: 95 | |
| 510 | |
| 511 --mlst_mincov : DNA %cov to report partial allele at all [?]. | |
| 512 Default: 10 | |
| 513 | |
| 514 --mlst_minscore : Minumum score out of 100 to match a scheme. | |
| 515 Default: 50 | |
| 516 | |
| 517 --abricate_run : Run ABRicate tool. Default: true | |
| 518 | |
| 519 --abricate_minid : Minimum DNA %identity. Defaut: 90 | |
| 520 | |
| 521 --abricate_mincov : Minimum DNA %coverage. Defaut: 80 | |
| 522 | |
| 523 --abricate_datadir : ABRicate databases folder. Defaut: /hpc/db/ | |
| 524 abricate/1.0.1/db | |
| 525 | |
| 526 Help options : | |
| 527 | |
| 528 --help : Display this message. | |
| 529 ``` | |
| 530 | |
| 531 \ | |
| 532 | |
| 533 | |
| 534 ## `centriflaken_hy` CLI Help | |
| 535 | |
| 536 ```text | |
| 537 cpipes --pipeline centriflaken_hy --help | |
| 538 | |
| 539 N E X T F L O W ~ version 24.10.4 | |
| 540 | |
| 541 Launching `/home/user/centriflaken/cpipes` [big_ramanujan] DSL2 - revision: 55d6f63710 | |
| 542 | |
| 543 ================================================================================ | |
| 544 (o) | |
| 545 ___ _ __ _ _ __ ___ ___ | |
| 546 / __|| '_ \ | || '_ \ / _ \/ __| | |
| 547 | (__ | |_) || || |_) || __/\__ \ | |
| 548 \___|| .__/ |_|| .__/ \___||___/ | |
| 549 | | | | | |
| 550 |_| |_| | |
| 551 -------------------------------------------------------------------------------- | |
| 552 A collection of modular pipelines at CFSAN, FDA. | |
| 553 -------------------------------------------------------------------------------- | |
| 554 Name : CPIPES | |
| 555 Author : Kranti.Konganti@fda.hhs.gov | |
| 556 Version : 0.4.1 | |
| 557 Center : CFSAN, FDA. | |
| 558 ================================================================================ | |
| 559 | |
| 560 Workflow : centriflaken_hy | |
| 561 | |
| 562 Author : Kranti.Konganti@fda.hhs.gov | |
| 563 | |
| 564 Version : 0.4.1 | |
| 565 | |
| 566 | |
| 567 Usage : cpipes --pipeline centriflaken_hy [options] | |
| 568 | |
| 569 | |
| 570 Required : | |
| 571 | |
| 572 --input : Absolute path to directory containing FASTQ | |
| 573 files. The directory should contain only | |
| 574 FASTQ files as all the files within the | |
| 575 mentioned directory will be read. Ex: -- | |
| 576 input /path/to/fastq_pass | |
| 577 | |
| 578 --output : Absolute path to directory where all the | |
| 579 pipeline outputs should be stored. Ex: -- | |
| 580 output /path/to/output | |
| 581 | |
| 582 Other options : | |
| 583 | |
| 584 --metadata : Absolute path to metadata CSV file | |
| 585 containing five mandatory columns: sample, | |
| 586 fq1,fq2,strandedness,single_end. The fq1 | |
| 587 and fq2 columns contain absolute paths to | |
| 588 the FASTQ files. This option can be used in | |
| 589 place of --input option. This is rare. Ex: -- | |
| 590 metadata samplesheet.csv | |
| 591 | |
| 592 --fq_suffix : The suffix of FASTQ files (Unpaired reads | |
| 593 or R1 reads or Long reads) if an input | |
| 594 directory is mentioned via --input option. | |
| 595 Default: _R1_001.fastq.gz | |
| 596 | |
| 597 --fq2_suffix : The suffix of FASTQ files (Paired-end reads | |
| 598 or R2 reads) if an input directory is | |
| 599 mentioned via --input option. Default: | |
| 600 _R2_001.fastq.gz | |
| 601 | |
| 602 --fq_filter_by_len : Remove FASTQ reads that are less than this | |
| 603 many bases. Default: 75 | |
| 604 | |
| 605 --fq_strandedness : The strandedness of the sequencing run. | |
| 606 This is mostly needed if your sequencing | |
| 607 run is RNA-SEQ. For most of the other runs, | |
| 608 it is probably safe to use unstranded for | |
| 609 the option. Default: unstranded | |
| 610 | |
| 611 --fq_single_end : SINGLE-END information will be auto- | |
| 612 detected but this option forces PAIRED-END | |
| 613 FASTQ files to be treated as SINGLE-END so | |
| 614 only read 1 information is included in auto- | |
| 615 generated samplesheet. Default: false | |
| 616 | |
| 617 --fq_filename_delim : Delimiter by which the file name is split | |
| 618 to obtain sample name. Default: _ | |
| 619 | |
| 620 --fq_filename_delim_idx : After splitting FASTQ file name by using | |
| 621 the --fq_filename_delim option, all | |
| 622 elements before this index (1-based) will | |
| 623 be joined to create final sample name. | |
| 624 Default: 1 | |
| 625 | |
| 626 --seqkit_rmdup_run : Remove duplicate sequences using seqkit | |
| 627 rmdup. Default: false | |
| 628 | |
| 629 --seqkit_rmdup_n : Match and remove duplicate sequences by | |
| 630 full name instead of just ID. Defaut: false | |
| 631 | |
| 632 --seqkit_rmdup_s : Match and remove duplicate sequences by | |
| 633 sequence content. Defaut: true | |
| 634 | |
| 635 --seqkit_rmdup_d : Save the duplicated sequences to a file. | |
| 636 Defaut: false | |
| 637 | |
| 638 --seqkit_rmdup_D : Save the number and list of duplicated | |
| 639 sequences to a file. Defaut: false | |
| 640 | |
| 641 --seqkit_rmdup_i : Ignore case while using seqkit rmdup. | |
| 642 Defaut: false | |
| 643 | |
| 644 --seqkit_rmdup_P : Only consider positive strand (i.e. 5') | |
| 645 when comparing by sequence content. Defaut: | |
| 646 false | |
| 647 | |
| 648 --kraken2_db : Absolute path to kraken database. Default: / | |
| 649 hpc/db/kraken2/standard-210914 | |
| 650 | |
| 651 --kraken2_confidence : Confidence score threshold which must be | |
| 652 between 0 and 1. Default: 0.0 | |
| 653 | |
| 654 --kraken2_quick : Quick operation (use first hit or hits). | |
| 655 Default: false | |
| 656 | |
| 657 --kraken2_use_mpa_style : Report output like Kraken 1's kraken-mpa- | |
| 658 report. Default: false | |
| 659 | |
| 660 --kraken2_minimum_base_quality : Minimum base quality used in classification | |
| 661 which is only effective with FASTQ input. | |
| 662 Default: 0 | |
| 663 | |
| 664 --kraken2_report_zero_counts : Report counts for ALL taxa, even if counts | |
| 665 are zero. Default: false | |
| 666 | |
| 667 --kraken2_report_minmizer_data : Report minimizer and distinct minimizer | |
| 668 count information in addition to normal | |
| 669 Kraken report. Default: false | |
| 670 | |
| 671 --kraken2_use_names : Print scientific names instead of just | |
| 672 taxids. Default: true | |
| 673 | |
| 674 --kraken2_extract_bug : Extract the reads or contigs beloging to | |
| 675 this bug. Default: Escherichia coli | |
| 676 | |
| 677 --centrifuge_x : Absolute path to centrifuge database. | |
| 678 Default: /hpc/db/centrifuge/2022-04-12/ab | |
| 679 | |
| 680 --centrifuge_save_unaligned : Save SINGLE-END reads that did not align. | |
| 681 For PAIRED-END reads, save read pairs that | |
| 682 did not align concordantly. Default: false | |
| 683 | |
| 684 --centrifuge_save_aligned : Save SINGLE-END reads that aligned. For | |
| 685 PAIRED-END reads, save read pairs that | |
| 686 aligned concordantly. Default: false | |
| 687 | |
| 688 --centrifuge_out_fmt_sam : Centrifuge output should be in SAM. Default: | |
| 689 false | |
| 690 | |
| 691 --centrifuge_extract_bug : Extract this bug from centrifuge results. | |
| 692 Default: Escherichia coli | |
| 693 | |
| 694 --centrifuge_ignore_quals : Treat all quality values as 30 on Phred | |
| 695 scale. Default: false | |
| 696 | |
| 697 --megahit_run : Run MEGAHIT assembler. Default: true | |
| 698 | |
| 699 --megahit_min_count : <int>. Minimum multiplicity for filtering ( | |
| 700 k_min+1)-mers. Default: false | |
| 701 | |
| 702 --megahit_k_list : Comma-separated list of kmer size. All | |
| 703 values must be odd, in the range 15-255, | |
| 704 increment should be <= 28. Ex: '21,29,39,59, | |
| 705 79,99,119,141'. Default: false | |
| 706 | |
| 707 --megahit_no_mercy : Do not add mercy k-mers. Default: false | |
| 708 | |
| 709 --megahit_bubble_level : <int>. Intensity of bubble merging (0-2), 0 | |
| 710 to disable. Default: false | |
| 711 | |
| 712 --megahit_merge_level : <l,s>. Merge complex bubbles of length <= l* | |
| 713 kmer_size and similarity >= s. Default: | |
| 714 false | |
| 715 | |
| 716 --megahit_prune_level : <int>. Strength of low depth pruning (0-3). | |
| 717 Default: false | |
| 718 | |
| 719 --megahit_prune_depth : <int>. Remove unitigs with avg k-mer depth | |
| 720 less than this value. Default: false | |
| 721 | |
| 722 --megahit_low_local_ratio : <float>. Ratio threshold to define low | |
| 723 local coverage contigs. Default: false | |
| 724 | |
| 725 --megahit_max_tip_len : <int>. remove tips less than this value [< | |
| 726 int> * k]. Default: false | |
| 727 | |
| 728 --megahit_no_local : Disable local assembly. Default: false | |
| 729 | |
| 730 --megahit_kmin_1pass : Use 1pass mode to build SdBG of k_min. | |
| 731 Default: false | |
| 732 | |
| 733 --megahit_preset : <str>. Override a group of parameters. | |
| 734 Valid values are meta-sensitive which | |
| 735 enforces '--min-count 1 --k-list 21,29,39, | |
| 736 49,...,129,141', meta-large (large & | |
| 737 complex metagenomes, like soil) which | |
| 738 enforces '--k-min 27 --k-max 127 --k-step | |
| 739 10'. Default: meta-sensitive | |
| 740 | |
| 741 --megahit_mem_flag : <int>. SdBG builder memory mode. 0: minimum; | |
| 742 1: moderate; 2: use all memory specified. | |
| 743 Default: 2 | |
| 744 | |
| 745 --megahit_min_contig_len : <int>. Minimum length of contigs to output. | |
| 746 Default: false | |
| 747 | |
| 748 --spades_run : Run SPAdes assembler. Default: false | |
| 749 | |
| 750 --spades_isolate : This flag is highly recommended for high- | |
| 751 coverage isolate and multi-cell data. | |
| 752 Default: false | |
| 753 | |
| 754 --spades_sc : This flag is required for MDA (single-cell) | |
| 755 data. Default: false | |
| 756 | |
| 757 --spades_meta : This flag is required for metagenomic data. | |
| 758 Default: true | |
| 759 | |
| 760 --spades_bio : This flag is required for biosyntheticSPAdes | |
| 761 mode. Default: false | |
| 762 | |
| 763 --spades_corona : This flag is required for coronaSPAdes mode. | |
| 764 Default: false | |
| 765 | |
| 766 --spades_rna : This flag is required for RNA-Seq data. | |
| 767 Default: false | |
| 768 | |
| 769 --spades_plasmid : Runs plasmidSPAdes pipeline for plasmid | |
| 770 detection. Default: false | |
| 771 | |
| 772 --spades_metaviral : Runs metaviralSPAdes pipeline for virus | |
| 773 detection. Default: false | |
| 774 | |
| 775 --spades_metaplasmid : Runs metaplasmidSPAdes pipeline for plasmid | |
| 776 detection in metagenomics datasets. Default: | |
| 777 false | |
| 778 | |
| 779 --spades_rnaviral : This flag enables virus assembly module | |
| 780 from RNA-Seq data. Default: false | |
| 781 | |
| 782 --spades_iontorrent : This flag is required for IonTorrent data. | |
| 783 Default: false | |
| 784 | |
| 785 --spades_only_assembler : Runs only the SPAdes assembler module ( | |
| 786 without read error correction). Default: | |
| 787 false | |
| 788 | |
| 789 --spades_careful : Tries to reduce the number of mismatches | |
| 790 and short indels in the assembly. Default: | |
| 791 false | |
| 792 | |
| 793 --spades_cov_cutoff : Coverage cutoff value (a positive float | |
| 794 number). Default: false | |
| 795 | |
| 796 --spades_k : List of k-mer sizes (must be odd and less | |
| 797 than 128). Default: false | |
| 798 | |
| 799 --spades_hmm : Directory with custom hmms that replace the | |
| 800 default ones (very rare). Default: false | |
| 801 | |
| 802 --serotypefinder_run : Run SerotypeFinder tool. Default: true | |
| 803 | |
| 804 --serotypefinder_x : Generate extended output files. Default: | |
| 805 true | |
| 806 | |
| 807 --serotypefinder_db : Path to SerotypeFinder databases. Default: / | |
| 808 hpc/db/serotypefinder/2.0.2 | |
| 809 | |
| 810 --serotypefinder_min_threshold : Minimum percent identity (in float) | |
| 811 required for calling a hit. Default: 0.85 | |
| 812 | |
| 813 --serotypefinder_min_cov : Minimum percent coverage (in float) | |
| 814 required for calling a hit. Default: 0.80 | |
| 815 | |
| 816 --seqsero2_run : Run SeqSero2 tool. Default: false | |
| 817 | |
| 818 --seqsero2_t : '1' for interleaved paired-end reads, '2' | |
| 819 for separated paired-end reads, '3' for | |
| 820 single reads, '4' for genome assembly, '5' | |
| 821 for nanopore reads (fasta/fastq). Default: | |
| 822 4 | |
| 823 | |
| 824 --seqsero2_m : Which workflow to apply, 'a'(raw reads | |
| 825 allele micro-assembly), 'k'(raw reads and | |
| 826 genome assembly k-mer). Default: k | |
| 827 | |
| 828 --seqsero2_c : SeqSero2 will only output serotype | |
| 829 prediction without the directory containing | |
| 830 log files. Default: false | |
| 831 | |
| 832 --seqsero2_s : SeqSero2 will not output header in | |
| 833 SeqSero_result.tsv. Default: false | |
| 834 | |
| 835 --mlst_run : Run MLST tool. Default: true | |
| 836 | |
| 837 --mlst_minid : DNA %identity of full allele to consider ' | |
| 838 similar' [~]. Default: 95 | |
| 839 | |
| 840 --mlst_mincov : DNA %cov to report partial allele at all [?]. | |
| 841 Default: 10 | |
| 842 | |
| 843 --mlst_minscore : Minimum score out of 100 to match a scheme. | |
| 844 Default: 50 | |
| 845 | |
| 846 --abricate_run : Run ABRicate tool. Default: true | |
| 847 | |
| 848 --abricate_minid : Minimum DNA %identity. Default: 90 | |
| 849 | |
| 850 --abricate_mincov : Minimum DNA %coverage. Default: 80 | |
| 851 | |
| 852 --abricate_datadir : ABRicate databases folder. Default: /hpc/db/ | |
| 853 abricate/1.0.1/db | |
| 854 | |
| 855 Help options : | |
| 856 | |
| 857 --help : Display this message. | |
| 858 ``` |
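
Many of the options listed above have site-specific defaults (for example, the database paths under `/hpc/db/`). As a minimal sketch, assuming the pipeline reads these options as ordinary Nextflow parameters, several of them could be collected in a YAML file and supplied through Nextflow's `-params-file` option instead of being typed on the command line. The file name `params.yaml` and the choice of `0.1` for `kraken2_confidence` are illustrative assumptions; all other values simply repeat the defaults from the help text.

```yaml
# params.yaml (hypothetical file name): overrides for selected centriflaken options.
# Keys are the option names from the help text without the leading "--".

# Database locations; replace with the paths valid on your system.
kraken2_db: "/hpc/db/kraken2/standard-210914"
centrifuge_x: "/hpc/db/centrifuge/2022-04-12/ab"
serotypefinder_db: "/hpc/db/serotypefinder/2.0.2"
abricate_datadir: "/hpc/db/abricate/1.0.1/db"

# Taxonomic classification and extraction.
kraken2_confidence: 0.1          # illustrative value; the documented default is 0.0
kraken2_extract_bug: "Escherichia coli"
centrifuge_extract_bug: "Escherichia coli"

# Assembler selection (MEGAHIT on and SPAdes off are the documented defaults).
megahit_run: true
spades_run: false

# Downstream typing thresholds (documented defaults).
mlst_minid: 95
abricate_minid: 90
abricate_mincov: 80
```

Keeping the overrides in one file makes a run easier to reproduce and review than a long command line, and the file can be versioned alongside the institution-specific config.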
