# CPIPES (CFSAN PIPELINES)

## The modular pipeline repository at CFSAN, FDA

**CPIPES** (CFSAN PIPELINES) is a collection of modular pipelines based on **NEXTFLOW**,
mostly for bioinformatics data analysis at **CFSAN, FDA.**

---

### **centriflaken_hy**

---
`centriflaken_hy` is a variant of the original `centriflaken` pipeline, adapted for Illumina short reads (either single-end or paired-end).

#### Workflow Usage

```bash
module load cpipes/0.4.0

cpipes --pipeline centriflaken_hy [options]
```

Example: Run the default `centriflaken_hy` pipeline with *E. coli* as the taxon of interest.

```bash
cd /hpc/scratch/$USER
mkdir nf-cpipes
cd nf-cpipes
cpipes --pipeline centriflaken_hy --input /path/to/illumina/fastq/dir --output /path/to/output --user_email 'Kranti.Konganti@fda.hhs.gov'
```
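According to the `--input` description in the help output, every file in the input directory is read, so stray files can end up being treated as input. The following is a minimal pre-flight sketch, assuming bash and a `find` that supports `-maxdepth`. The helper names `non_fastq_files` and `check_input_dir` are hypothetical and not part of `cpipes`, and the list of extensions is an assumption to adjust to your own file naming.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of cpipes).

# List regular files directly inside a directory that do not carry a
# common FASTQ extension. The extension list is an assumption.
non_fastq_files() {
    local dir="$1"
    find "$dir" -maxdepth 1 -type f \
        ! -name '*.fastq.gz' ! -name '*.fq.gz' \
        ! -name '*.fastq' ! -name '*.fq' | sort
}

# Return 0 when the directory exists and holds only FASTQ files,
# 1 otherwise (offending files are reported on stderr).
check_input_dir() {
    local dir="$1" extras
    if [ ! -d "$dir" ]; then
        printf 'Not a directory: %s\n' "$dir" >&2
        return 1
    fi
    extras="$(non_fastq_files "$dir")"
    if [ -n "$extras" ]; then
        printf 'Non-FASTQ files found in %s:\n%s\n' "$dir" "$extras" >&2
        return 1
    fi
    return 0
}
```

One way to use it is to gate the run on the check: `check_input_dir /path/to/illumina/fastq/dir && cpipes --pipeline centriflaken_hy --input /path/to/illumina/fastq/dir --output /path/to/output`.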

Example: Run the `centriflaken_hy` pipeline with *Salmonella* as the taxon of interest. In this mode, the `SerotypeFinder` tool is replaced with the `SeqSero2` tool.

```bash
cd /hpc/scratch/$USER
mkdir nf-cpipes
cd nf-cpipes
cpipes --pipeline centriflaken_hy --centrifuge_extract_bug 'Salmonella' --input /path/to/illumina/fastq/dir --output /path/to/output --user_email 'Kranti.Konganti@fda.hhs.gov'
```
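When switching between taxa, it can be convenient to print the exact command for review before launching it. The sketch below is a hypothetical bash wrapper: the function name `build_cpipes_cmd` is illustrative and not part of `cpipes`. It uses only options that appear in the examples above and adds `--centrifuge_extract_bug` only when a taxon is supplied, so the pipeline default applies otherwise.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of cpipes): print a shell-quoted
# centriflaken_hy command line without running it.
build_cpipes_cmd() {
    local taxon="$1" input="$2" output="$3"
    local cmd=(cpipes --pipeline centriflaken_hy --input "$input" --output "$output")
    local quoted

    # Add the taxon flag only when one was given; an empty value
    # leaves the pipeline default in effect.
    if [ -n "$taxon" ]; then
        cmd+=(--centrifuge_extract_bug "$taxon")
    fi

    # %q quotes each argument so the printed line can be pasted
    # back into a shell safely (e.g. taxa containing spaces).
    printf -v quoted '%q ' "${cmd[@]}"
    echo "${quoted% }"
}
```

Calling `build_cpipes_cmd 'Salmonella' /path/to/illumina/fastq/dir /path/to/output` prints a single command line that can be inspected and then pasted into the shell.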

#### `centriflaken_hy` Help

```text
[Kranti.Konganti@login2-slurm ]$ cpipes --pipeline centriflaken_hy --help
N E X T F L O W  ~  version 21.12.1-edge
Launching `/home/Kranti.Konganti/apps/cpipes/cpipes` [soggy_curie] - revision: 72db279311
================================================================================
             (o)
  ___  _ __   _  _ __    ___  ___
 / __|| '_ \ | || '_ \  / _ \/ __|
| (__ | |_) || || |_) ||  __/\__ \
 \___|| .__/ |_|| .__/  \___||___/
      | |       | |
      |_|       |_|
--------------------------------------------------------------------------------
A collection of modular pipelines at CFSAN, FDA.
--------------------------------------------------------------------------------
Name                            : CPIPES
Author                          : Kranti.Konganti@fda.hhs.gov
Version                         : 0.4.0
Center                          : CFSAN, FDA.
================================================================================

Workflow                        : centriflaken_hy

Author                          : Kranti.Konganti@fda.hhs.gov

Version                         : 0.4.0


Usage                           : cpipes --pipeline centriflaken_hy [options]


Required                        :

--input                         : Absolute path to directory containing FASTQ
                                  files. The directory should contain only
                                  FASTQ files as all the files within the
                                  mentioned directory will be read. Ex:
                                  --input /path/to/fastq_pass

--output                        : Absolute path to directory where all the
                                  pipeline outputs should be stored. Ex:
                                  --output /path/to/output

Other options                   :

--metadata                      : Absolute path to metadata CSV file
                                  containing five mandatory columns: sample,
                                  fq1,fq2,strandedness,single_end. The fq1
                                  and fq2 columns contain absolute paths to
                                  the FASTQ files. This option can be used in
                                  place of --input option. This is rare. Ex:
                                  --metadata samplesheet.csv

--fq_suffix                     : The suffix of FASTQ files (Unpaired reads
                                  or R1 reads or Long reads) if an input
                                  directory is mentioned via --input option.
                                  Default: _R1_001.fastq.gz

--fq2_suffix                    : The suffix of FASTQ files (Paired-end reads
                                  or R2 reads) if an input directory is
                                  mentioned via --input option. Default:
                                  _R2_001.fastq.gz

--fq_filter_by_len              : Remove FASTQ reads that are less than this
                                  many bases. Default: 75

--fq_strandedness               : The strandedness of the sequencing run.
                                  This is mostly needed if your sequencing
                                  run is RNA-SEQ. For most of the other runs,
                                  it is probably safe to use unstranded for
                                  the option. Default: unstranded

--fq_single_end                 : SINGLE-END information will be auto-
                                  detected but this option forces PAIRED-END
                                  FASTQ files to be treated as SINGLE-END so
                                  only read 1 information is included in auto-
                                  generated samplesheet. Default: false

--fq_filename_delim             : Delimiter by which the file name is split
                                  to obtain sample name. Default: _

--fq_filename_delim_idx         : After splitting FASTQ file name by using
                                  the --fq_filename_delim option, all
                                  elements before this index (1-based) will
                                  be joined to create final sample name.
                                  Default: 1

--seqkit_rmdup_run              : Remove duplicate sequences using seqkit
                                  rmdup. Default: false

--seqkit_rmdup_n                : Match and remove duplicate sequences by
                                  full name instead of just ID. Default: false

--seqkit_rmdup_s                : Match and remove duplicate sequences by
                                  sequence content. Default: true

--seqkit_rmdup_d                : Save the duplicated sequences to a file.
                                  Default: false

--seqkit_rmdup_D                : Save the number and list of duplicated
                                  sequences to a file. Default: false

--seqkit_rmdup_i                : Ignore case while using seqkit rmdup.
                                  Default: false

--seqkit_rmdup_P                : Only consider positive strand (i.e. 5')
                                  when comparing by sequence content. Default:
                                  false

--kraken2_db                    : Absolute path to kraken database. Default:
                                  /hpc/db/kraken2/standard-210914

--kraken2_confidence            : Confidence score threshold which must be
                                  between 0 and 1. Default: 0.0

--kraken2_quick                 : Quick operation (use first hit or hits).
                                  Default: false

--kraken2_use_mpa_style         : Report output like Kraken 1's kraken-mpa-
                                  report. Default: false

--kraken2_minimum_base_quality  : Minimum base quality used in classification
                                  which is only effective with FASTQ input.
                                  Default: 0

--kraken2_report_zero_counts    : Report counts for ALL taxa, even if counts
                                  are zero. Default: false

--kraken2_report_minmizer_data  : Report minimizer and distinct minimizer
                                  count information in addition to normal
                                  Kraken report. Default: false

--kraken2_use_names             : Print scientific names instead of just
                                  taxids. Default: true

--kraken2_extract_bug           : Extract the reads or contigs belonging to
                                  this bug. Default: Escherichia coli

--centrifuge_x                  : Absolute path to centrifuge database.
                                  Default: /hpc/db/centrifuge/2022-04-12/ab

--centrifuge_save_unaligned     : Save SINGLE-END reads that did not align.
                                  For PAIRED-END reads, save read pairs that
                                  did not align concordantly. Default: false

--centrifuge_save_aligned       : Save SINGLE-END reads that aligned. For
                                  PAIRED-END reads, save read pairs that
                                  aligned concordantly. Default: false

--centrifuge_out_fmt_sam        : Centrifuge output should be in SAM. Default:
                                  false

--centrifuge_extract_bug        : Extract this bug from centrifuge results.
                                  Default: Escherichia coli

--centrifuge_ignore_quals       : Treat all quality values as 30 on Phred
                                  scale. Default: false

--megahit_run                   : Run MEGAHIT assembler. Default: true

--megahit_min_count             : <int>. Minimum multiplicity for filtering
                                  (k_min+1)-mers. Default: false

--megahit_k_list                : Comma-separated list of kmer size. All
                                  values must be odd, in the range 15-255,
                                  increment should be <= 28. Ex: '21,29,39,59,
                                  79,99,119,141'. Default: false

--megahit_no_mercy              : Do not add mercy k-mers. Default: false

--megahit_bubble_level          : <int>. Intensity of bubble merging (0-2), 0
                                  to disable. Default: false

--megahit_merge_level           : <l,s>. Merge complex bubbles of length <= l*
                                  kmer_size and similarity >= s. Default:
                                  false

--megahit_prune_level           : <int>. Strength of low depth pruning (0-3).
                                  Default: false

--megahit_prune_depth           : <int>. Remove unitigs with avg k-mer depth
                                  less than this value. Default: false

--megahit_low_local_ratio       : <float>. Ratio threshold to define low
                                  local coverage contigs. Default: false

--megahit_max_tip_len           : <int>. remove tips less than this value [<
                                  int> * k]. Default: false

--megahit_no_local              : Disable local assembly. Default: false

--megahit_kmin_1pass            : Use 1pass mode to build SdBG of k_min.
                                  Default: false

--megahit_preset                : <str>. Override a group of parameters.
                                  Valid values are meta-sensitive which
                                  enforces '--min-count 1 --k-list 21,29,39,
                                  49,...,129,141', meta-large (large &
                                  complex metagenomes, like soil) which
                                  enforces '--k-min 27 --k-max 127 --k-step
                                  10'. Default: meta-sensitive

--megahit_mem_flag              : <int>. SdBG builder memory mode. 0: minimum;
                                  1: moderate; 2: use all memory specified.
                                  Default: 2

--megahit_min_contig_len        : <int>.  Minimum length of contigs to output.
                                  Default: false

--spades_run                    : Run SPAdes assembler. Default: false

--spades_isolate                : This flag is highly recommended for high-
                                  coverage isolate and multi-cell data.
                                  Default: false

--spades_sc                     : This flag is required for MDA (single-cell)
                                  data. Default: false

--spades_meta                   : This flag is required for metagenomic data.
                                  Default: true

--spades_bio                    : This flag is required for biosyntheticSPAdes
                                  mode. Default: false

--spades_corona                 : This flag is required for coronaSPAdes mode.
                                  Default: false

--spades_rna                    : This flag is required for RNA-Seq data.
                                  Default: false

--spades_plasmid                : Runs plasmidSPAdes pipeline for plasmid
                                  detection. Default: false

--spades_metaviral              : Runs metaviralSPAdes pipeline for virus
                                  detection. Default: false

--spades_metaplasmid            : Runs metaplasmidSPAdes pipeline for plasmid
                                  detection in metagenomics datasets. Default:
                                  false

--spades_rnaviral               : This flag enables virus assembly module
                                  from RNA-Seq data. Default: false

--spades_iontorrent             : This flag is required for IonTorrent data.
                                  Default: false

--spades_only_assembler         : Runs only the SPAdes assembler module (
                                  without read error correction). Default:
                                  false

--spades_careful                : Tries to reduce the number of mismatches
                                  and short indels in the assembly. Default:
                                  false

--spades_cov_cutoff             : Coverage cutoff value (a positive float
                                  number). Default: false

--spades_k                      : List of k-mer sizes (must be odd and less
                                  than 128). Default: false

--spades_hmm                    : Directory with custom hmms that replace the
                                  default ones (very rare). Default: false

--serotypefinder_run            : Run SerotypeFinder tool. Default: true

--serotypefinder_x              : Generate extended output files. Default:
                                  true

--serotypefinder_db             : Path to SerotypeFinder databases. Default:
                                  /hpc/db/serotypefinder/2.0.2

--serotypefinder_min_threshold  : Minimum percent identity (in float)
                                  required for calling a hit. Default: 0.85

--serotypefinder_min_cov        : Minimum percent coverage (in float)
                                  required for calling a hit. Default: 0.80

--seqsero2_run                  : Run SeqSero2 tool. Default: false

--seqsero2_t                    : '1' for interleaved paired-end reads, '2'
                                  for separated paired-end reads, '3' for
                                  single reads, '4' for genome assembly, '5'
                                  for nanopore reads (fasta/fastq). Default:
                                  4

--seqsero2_m                    : Which workflow to apply, 'a'(raw reads
                                  allele micro-assembly), 'k'(raw reads and
                                  genome assembly k-mer). Default: k

--seqsero2_c                    : SeqSero2 will only output serotype
                                  prediction without the directory containing
                                  log files. Default: false

--seqsero2_s                    : SeqSero2 will not output header in
                                  SeqSero_result.tsv. Default: false

--mlst_run                      : Run MLST tool. Default: true

--mlst_minid                    : DNA %identity of full allele to consider
                                  'similar' [~]. Default: 95

--mlst_mincov                   : DNA %cov to report partial allele at all [?].
                                  Default: 10

--mlst_minscore                 : Minimum score out of 100 to match a scheme.
                                  Default: 50

--abricate_run                  : Run ABRicate tool. Default: true

--abricate_minid                : Minimum DNA %identity. Default: 90

--abricate_mincov               : Minimum DNA %coverage. Default: 80

--abricate_datadir              : ABRicate databases folder. Default:
                                  /hpc/db/abricate/1.0.1/db

Help options                    :

--help                          : Display this message.
```
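The `--metadata` option above expects a CSV with five mandatory columns: `sample`, `fq1`, `fq2`, `strandedness` and `single_end`. As an illustration only, the sketch below writes such a samplesheet for a directory of FASTQ files, using the documented defaults of `--fq_suffix`, `--fq2_suffix`, `--fq_filename_delim` and `--fq_filename_delim_idx`. The function name `make_samplesheet` is hypothetical and not part of `cpipes`. Three points are assumptions rather than documented behavior: that the sample name is the first `idx` delimiter-separated fields of the file name, that an empty `fq2` column marks a single-end sample, and that `true`/`false` are the accepted values for `single_end`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of cpipes): print a samplesheet CSV
# for the FASTQ files found in a directory.
make_samplesheet() {
    local dir
    dir="$(cd "$1" && pwd)" || return 1        # fq1/fq2 must be absolute paths
    local r1_suffix="${2:-_R1_001.fastq.gz}"   # default of --fq_suffix
    local r2_suffix="${3:-_R2_001.fastq.gz}"   # default of --fq2_suffix
    local delim="${4:-_}"                      # default of --fq_filename_delim (one character)
    local idx="${5:-1}"                        # default of --fq_filename_delim_idx
    local fq1 fq2 base sample

    # Header: the five mandatory columns listed in the help text.
    echo 'sample,fq1,fq2,strandedness,single_end'

    for fq1 in "$dir"/*"$r1_suffix"; do
        [ -e "$fq1" ] || continue              # the glob matched nothing
        fq2="${fq1%"$r1_suffix"}$r2_suffix"    # expected mate file
        base="$(basename "$fq1")"
        # ASSUMPTION: sample name = first $idx delimiter-separated fields.
        sample="$(printf '%s' "$base" | cut -d "$delim" -f "1-$idx")"
        if [ -e "$fq2" ]; then
            echo "$sample,$fq1,$fq2,unstranded,false"
        else
            # ASSUMPTION: an empty fq2 column marks a single-end sample.
            echo "$sample,$fq1,,unstranded,true"
        fi
    done
}
```

For example, `make_samplesheet /path/to/illumina/fastq/dir > samplesheet.csv` writes a file that could then be passed via `--metadata` in place of `--input`. It is worth inspecting the result by hand before launching a run.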

### **BETA**

---
The development of the modular structure and flow is an ongoing effort and may change depending on assessment of various computational topics and other considerations.
author	kkonganti
date	Wed, 03 Jul 2024 15:16:39 -0400