
# CPIPES (CFSAN PIPELINES)

## The modular pipeline repository at CFSAN, FDA

**CPIPES** (CFSAN PIPELINES) is a collection of modular pipelines based on **NEXTFLOW**,
mostly for bioinformatics data analysis at **CFSAN, FDA.**

---

### **centriflaken**

---
Precision long-read metagenomics sequencing for food safety by detection and assembly of Shiga toxin-producing *Escherichia coli*.

#### Workflow Usage

```bash
module load cpipes/0.4.0

cpipes --pipeline centriflaken [options]
```

Example: Run the default `centriflaken` pipeline with *E. coli* as the taxon of interest.

```bash
cd /hpc/scratch/$USER
mkdir -p nf-cpipes   # -p: no error if the directory already exists
cd nf-cpipes
cpipes --pipeline centriflaken \
  --input /path/to/fastq/dir \
  --output /path/to/output \
  --user_email 'Kranti.Konganti@fda.hhs.gov'
```
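
The `--input` help text further down notes that every file in the directory is read, so the directory should contain only FASTQ files. A minimal pre-flight sketch of that check is shown here. The function name `non_fastq_entries` is hypothetical and not part of `cpipes`; the `.fastq.gz` default mirrors the documented `--fq_suffix` default, and hidden files are not inspected.

```bash
# Hypothetical helper (not part of cpipes): print every entry in DIR that is
# not a regular file ending with the expected FASTQ suffix.
# Usage: non_fastq_entries DIR [SUFFIX]
non_fastq_entries() {
  local dir="$1"
  local suffix="${2:-.fastq.gz}"
  local entry
  for entry in "$dir"/*; do
    [ -e "$entry" ] || continue          # empty directory: glob did not match
    case "$entry" in
      *"$suffix") [ -f "$entry" ] || printf '%s\n' "$entry" ;;
      *)          printf '%s\n' "$entry" ;;
    esac
  done
}

# Demo on a throwaway directory: only notes.txt is reported.
demo=$(mktemp -d)
touch "$demo/sampleA.fastq.gz" "$demo/sampleB.fastq.gz" "$demo/notes.txt"
non_fastq_entries "$demo"
rm -rf "$demo"
```

An empty result suggests the directory is safe to pass to `--input`; anything printed is worth moving elsewhere before launching the pipeline.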

Example: Run the `centriflaken` pipeline with *Salmonella* as the taxon of interest. In this mode, the `SeqSero2` tool replaces the `SerotypeFinder` tool.

```bash
cd /hpc/scratch/$USER
mkdir -p nf-cpipes   # -p: no error if the directory already exists
cd nf-cpipes
cpipes --pipeline centriflaken \
  --centrifuge_extract_bug 'Salmonella' \
  --input /path/to/fastq/dir \
  --output /path/to/output \
  --user_email 'Kranti.Konganti@fda.hhs.gov'
```
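
Taxon names may contain a space (the documented default for `--centrifuge_extract_bug` is `Escherichia coli`), so the value has to reach `cpipes` as a single quoted argument. The sketch below wraps the call in a function to make that explicit. The function `run_centriflaken` and the `CPIPES_BIN` override are hypothetical names introduced only for illustration; the flags are the ones used in the examples above.

```bash
# Hypothetical wrapper (not part of cpipes): run centriflaken for one taxon.
# CPIPES_BIN is an illustrative override that allows a dry run without
# launching the pipeline.
# Usage: run_centriflaken TAXON INPUT_DIR OUTPUT_DIR
run_centriflaken() {
  local taxon="$1"
  local input="$2"
  local output="$3"
  # Quoting "$taxon" keeps 'Escherichia coli' as one argument.
  ${CPIPES_BIN:-cpipes} --pipeline centriflaken \
    --centrifuge_extract_bug "$taxon" \
    --input "$input" \
    --output "$output"
}

# Dry run: print one argument per line instead of calling cpipes.
CPIPES_BIN='printf %s\n' run_centriflaken 'Escherichia coli' /path/to/fastq/dir /path/to/output
```

In the dry run, the taxon appears on a single line, which confirms it was passed as one argument. Unsetting `CPIPES_BIN` makes the same function call the real `cpipes` command.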

#### `centriflaken` Help

```text
[Kranti.Konganti@login2-slurm ]$ cpipes --pipeline centriflaken --help
N E X T F L O W  ~  version 21.12.1-edge
Launching `/nfs/software/apps/cpipes/0.4.0/cpipes` [crazy_euler] - revision: 72db279311
================================================================================
             (o)
  ___  _ __   _  _ __    ___  ___
 / __|| '_ \ | || '_ \  / _ \/ __|
| (__ | |_) || || |_) ||  __/\__ \
 \___|| .__/ |_|| .__/  \___||___/
      | |       | |
      |_|       |_|
--------------------------------------------------------------------------------
A collection of modular pipelines at CFSAN, FDA.
--------------------------------------------------------------------------------
Name                            : CPIPES
Author                          : Kranti.Konganti@fda.hhs.gov
Version                         : 0.4.0
Center                          : CFSAN, FDA.
================================================================================

Workflow                        : centriflaken

Author                          : Kranti.Konganti@fda.hhs.gov

Version                         : 0.2.1


Usage                           : cpipes --pipeline centriflaken [options]


Required                        :

--input                         : Absolute path to directory containing FASTQ
                                  files. The directory should contain only
                                  FASTQ files as all the files within the
                                  mentioned directory will be read. Ex: --
                                  input /path/to/fastq_pass

--output                        : Absolute path to directory where all the
                                  pipeline outputs should be stored. Ex: --
                                  output /path/to/output

Other options                   :

--metadata                      : Absolute path to metadata CSV file
                                  containing five mandatory columns: sample,
                                  fq1,fq2,strandedness,single_end. The fq1
                                  and fq2 columns contain absolute paths to
                                  the FASTQ files. This option can be used in
                                  place of --input option. This is rare. Ex: --
                                  metadata samplesheet.csv

--fq_suffix                     : The suffix of FASTQ files (Unpaired reads
                                  or R1 reads or Long reads) if an input
                                  directory is mentioned via --input option.
                                  Default: .fastq.gz

--fq2_suffix                    : The suffix of FASTQ files (Paired-end reads
                                  or R2 reads) if an input directory is
                                  mentioned via --input option. Default:
                                  false

--fq_filter_by_len              : Remove FASTQ reads that are less than this
                                  many bases. Default: 4000

--fq_strandedness               : The strandedness of the sequencing run.
                                  This is mostly needed if your sequencing
                                  run is RNA-SEQ. For most of the other runs,
                                  it is probably safe to use unstranded for
                                  the option. Default: unstranded

--fq_single_end                 : SINGLE-END information will be auto-
                                  detected but this option forces PAIRED-END
                                  FASTQ files to be treated as SINGLE-END so
                                  only read 1 information is included in auto-
                                  generated samplesheet. Default: false

--fq_filename_delim             : Delimiter by which the file name is split
                                  to obtain sample name. Default: _

--fq_filename_delim_idx         : After splitting FASTQ file name by using
                                  the --fq_filename_delim option, all
                                  elements before this index (1-based) will
                                  be joined to create final sample name.
                                  Default: 1

--kraken2_db                    : Absolute path to kraken database. Default: /
                                  hpc/db/kraken2/standard-210914

--kraken2_confidence            : Confidence score threshold which must be
                                  between 0 and 1. Default: 0.0

--kraken2_quick                 : Quick operation (use first hit or hits).
                                  Default: false

--kraken2_use_mpa_style         : Report output like Kraken 1's kraken-mpa-
                                  report. Default: false

--kraken2_minimum_base_quality  : Minimum base quality used in classification
                                  which is only effective with FASTQ input.
                                  Default: 0

--kraken2_report_zero_counts    : Report counts for ALL taxa, even if counts
                                  are zero. Default: false

--kraken2_report_minmizer_data  : Report minimizer and distinct minimizer
                                  count information in addition to normal
                                  Kraken report. Default: false

--kraken2_use_names             : Print scientific names instead of just
                                  taxids. Default: true

--kraken2_extract_bug           : Extract the reads or contigs beloging to
                                  this bug. Default: Escherichia coli

--centrifuge_x                  : Absolute path to centrifuge database.
                                  Default: /hpc/db/centrifuge/2022-04-12/ab

--centrifuge_save_unaligned     : Save SINGLE-END reads that did not align.
                                  For PAIRED-END reads, save read pairs that
                                  did not align concordantly. Default: false

--centrifuge_save_aligned       : Save SINGLE-END reads that aligned. For
                                  PAIRED-END reads, save read pairs that
                                  aligned concordantly. Default: false

--centrifuge_out_fmt_sam        : Centrifuge output should be in SAM. Default:
                                  false

--centrifuge_extract_bug        : Extract this bug from centrifuge results.
                                  Default: Escherichia coli

--centrifuge_ignore_quals       : Treat all quality values as 30 on Phred
                                  scale. Default: false

--flye_pacbio_raw               : Input FASTQ reads are PacBio regular CLR
                                  reads (<20% error) Defaut: false

--flye_pacbio_corr              : Input FASTQ reads are PacBio reads that
                                  were corrected with other methods (<3%
                                  error). Default: false

--flye_pacbio_hifi              : Input FASTQ reads are PacBio HiFi reads (<1%
                                  error). Default: false

--flye_nano_raw                 : Input FASTQ reads are ONT regular reads,
                                  pre-Guppy5 (<20% error). Default: true

--flye_nano_corr                : Input FASTQ reads are ONT reads that were
                                  corrected with other methods (<3% error).
                                  Default: false

--flye_nano_hq                  : Input FASTQ reads are ONT high-quality
                                  reads: Guppy5+ SUP or Q20 (<5% error).
                                  Default: false

--flye_genome_size              : Estimated genome size (for example, 5m or 2.
                                  6g). Default: 5.5m

--flye_polish_iter              : Number of genome polishing iterations.
                                  Default: false

--flye_meta                     : Do a metagenome assembly (unenven coverage
                                  mode). Default: true

--flye_min_overlap              : Minimum overlap between reads. Default:
                                  false

--flye_scaffold                 : Enable scaffolding using assembly graph.
                                  Default: false

--serotypefinder_run            : Run SerotypeFinder tool. Default: true

--serotypefinder_x              : Generate extended output files. Default:
                                  true

--serotypefinder_db             : Path to SerotypeFinder databases. Default: /
                                  hpc/db/serotypefinder/2.0.2

--serotypefinder_min_threshold  : Minimum percent identity (in float)
                                  required for calling a hit. Default: 0.85

--serotypefinder_min_cov        : Minumum percent coverage (in float)
                                  required for calling a hit. Default: 0.80

--seqsero2_run                  : Run SeqSero2 tool. Default: false

--seqsero2_t                    : '1' for interleaved paired-end reads, '2'
                                  for separated paired-end reads, '3' for
                                  single reads, '4' for genome assembly, '5'
                                  for nanopore reads (fasta/fastq). Default:
                                  4

--seqsero2_m                    : Which workflow to apply, 'a'(raw reads
                                  allele micro-assembly), 'k'(raw reads and
                                  genome assembly k-mer). Default: k

--seqsero2_c                    : SeqSero2 will only output serotype
                                  prediction without the directory containing
                                  log files. Default: false

--seqsero2_s                    : SeqSero2 will not output header in
                                  SeqSero_result.tsv. Default: false

--mlst_run                      : Run MLST tool. Default: true

--mlst_minid                    : DNA %identity of full allelle to consider '
                                  similar' [~]. Default: 95

--mlst_mincov                   : DNA %cov to report partial allele at all [?].
                                  Default: 10

--mlst_minscore                 : Minumum score out of 100 to match a scheme.
                                  Default: 50

--abricate_run                  : Run ABRicate tool. Default: true

--abricate_minid                : Minimum DNA %identity. Defaut: 90

--abricate_mincov               : Minimum DNA %coverage. Defaut: 80

--abricate_datadir              : ABRicate databases folder. Defaut: /hpc/db/
                                  abricate/1.0.1/db

Help options                    :

--help                          : Display this message.
```
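
The `--metadata` option expects a CSV with the columns `sample,fq1,fq2,strandedness,single_end`, and `--fq_filename_delim` with `--fq_filename_delim_idx` describe how sample names are derived from file names. The sketch below shows one way such a samplesheet could be written by hand for single-end long reads. It rests on several assumptions that the help text does not confirm: that the first `IDX` delimiter-separated fields (inclusive) form the sample name, that `fq2` may be left empty, and that `single_end` takes the value `true`. The function name `make_samplesheet` is hypothetical.

```bash
# Hypothetical helper (not part of cpipes): print a samplesheet for the
# FASTQ files found in DIR. Pass an absolute DIR so fq1 holds absolute paths.
# Usage: make_samplesheet DIR [SUFFIX] [DELIM] [IDX]
make_samplesheet() {
  local dir="$1"
  local suffix="${2:-.fastq.gz}"
  local delim="${3:-_}"
  local idx="${4:-1}"
  local fq base sample
  printf 'sample,fq1,fq2,strandedness,single_end\n'
  for fq in "$dir"/*"$suffix"; do
    [ -f "$fq" ] || continue             # no matching files: skip the pattern
    base=$(basename "$fq" "$suffix")
    # Assumption: fields 1..IDX, joined by DELIM, make up the sample name.
    sample=$(printf '%s\n' "$base" | cut -d "$delim" -f "1-$idx")
    # Assumption: empty fq2 and single_end=true for single-end long reads.
    printf '%s,%s,,unstranded,true\n' "$sample" "$fq"
  done
}

# Demo on a throwaway directory.
demo=$(mktemp -d)
touch "$demo/barcode01_run1.fastq.gz"
make_samplesheet "$demo"
rm -rf "$demo"
```

With the defaults, `barcode01_run1.fastq.gz` yields the sample name `barcode01`; passing `2` as the fourth argument would yield `barcode01_run1` under the same inclusive-index assumption.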

### **BETA**

---
The modular structure and flow are under active development and may change as various computational topics and other considerations are assessed.
author	kkonganti
date	Thu, 27 Jun 2024 14:17:26 -0400