kkonganti@1: # bettercallsal_db
kkonganti@1: 
kkonganti@1: `bettercallsal_db` is an end-to-end automated workflow that generates and consolidates the required DB flat files from the [NCBI Pathogens Database for Salmonella](https://ftp.ncbi.nlm.nih.gov/pathogen/Results/Salmonella/). It first downloads the metadata for the provided release identifier (e.g. `latest_snps` or `PDG000000002.2537`) and then creates a `mash sketch` according to the filtering strategy. It generates two types of sketches: one prioritizes genome collection based on SNP clustering (`per_snp_cluster`), and the other collects up to N genome accessions for each value in the `computed_serotype` column of the metadata file (`per_computed_serotype`).
kkonganti@1: 
kkonganti@1: The `bettercallsal_db` workflow should finish within an hour given a stable internet connection.
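
As a rough illustration of the `per_computed_serotype` idea (keep at most N accessions per serotype), here is a minimal sketch on toy data. The two-column layout, the accession names and the variable names are assumptions made for this example only; the real PDG metadata file has many more columns and the workflow's own filtering logic may differ.

```bash
# Toy metadata: accession <TAB> computed_serotype.
# Hypothetical values, not the real PDG metadata layout.
metadata=$(printf '%s\t%s\n' \
  GCA_1 Agona \
  GCA_2 Agona \
  GCA_3 Agona \
  GCA_4 Typhimurium)

max_per_serotype=2   # stand-in for "N" in the description above

# Count rows per serotype (column 2) and keep a row's accession
# only while that serotype has been seen at most N times.
selected=$(printf '%s\n' "$metadata" |
  awk -F '\t' -v n="$max_per_serotype" '++seen[$2] <= n { print $1 }')

printf '%s\n' "$selected"
```

With N set to 2, the third `Agona` row is dropped while the single `Typhimurium` row is kept.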
kkonganti@1: 
kkonganti@1: \
kkonganti@1: &nbsp;
kkonganti@1: 
kkonganti@1: ## Workflow Usage
kkonganti@1: 
kkonganti@1: ```bash
kkonganti@1: cpipes --pipeline bettercallsal_db [options]
kkonganti@1: ```
kkonganti@1: 
kkonganti@1: \
kkonganti@1: &nbsp;
kkonganti@1: 
kkonganti@1: Example: Run the `bettercallsal_db` pipeline and store output at `/data/Kranti_Konganti/bettercallsal_db`.
kkonganti@1: 
kkonganti@1: ```bash
kkonganti@1: cpipes \
kkonganti@1:       --pipeline bettercallsal_db \
kkonganti@1:       --pdg_release PDG000000002.2537 \
kkonganti@1:       --output /data/Kranti_Konganti/bettercallsal_db
kkonganti@1: ```
kkonganti@1: 
kkonganti@1: \
kkonganti@1: &nbsp;
kkonganti@1: 
kkonganti@1: You can now run the `bettercallsal` workflow against the newly created database by passing the database root path with the `--bcs_root_dbdir` option.
kkonganti@1: 
kkonganti@1: ```bash
kkonganti@1: cpipes \
kkonganti@1:       --pipeline bettercallsal \
kkonganti@1:       --input /path/to/illumina/fastq/dir \
kkonganti@1:       --output /path/to/output \
kkonganti@1:       --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db
kkonganti@1: ```
kkonganti@1: 
kkonganti@1: \
kkonganti@1: &nbsp;
kkonganti@1: 
kkonganti@1: ## Note
kkonganti@1: 
kkonganti@1: Please note that the last step of the `bettercallsal_db` workflow, `SCAFFOLD_GENOMES`, spawns multiple processes and is not cached by **Nextflow**. This setup is intentional for this stage of the workflow, to speed up database creation. It is therefore recommended that you run this workflow in a grid computing or similar cloud computing environment.
kkonganti@1: 
kkonganti@1: \
kkonganti@1: &nbsp;
kkonganti@1: 
kkonganti@1: ## `bettercallsal_db` CLI Help
kkonganti@1: 
kkonganti@1: ```text
kkonganti@1: [Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal_db --help
kkonganti@1: N E X T F L O W  ~  version 22.10.0
kkonganti@1: Launching `./bettercallsal/cpipes` [hopeful_franklin] DSL2 - revision: 93f5293f50
kkonganti@1: ================================================================================
kkonganti@1:              (o)
kkonganti@1:   ___  _ __   _  _ __    ___  ___
kkonganti@1:  / __|| '_ \ | || '_ \  / _ \/ __|
kkonganti@1: | (__ | |_) || || |_) ||  __/\__ \
kkonganti@1:  \___|| .__/ |_|| .__/  \___||___/
kkonganti@1:       | |       | |
kkonganti@1:       |_|       |_|
kkonganti@1: --------------------------------------------------------------------------------
kkonganti@1: A collection of modular pipelines at CFSAN, FDA.
kkonganti@1: --------------------------------------------------------------------------------
kkonganti@1: Name                            : CPIPES
kkonganti@1: Author                          : Kranti Konganti
kkonganti@1: Version                         : 0.5.0
kkonganti@1: Center                          : CFSAN, FDA.
kkonganti@1: ================================================================================
kkonganti@1: 
kkonganti@1: Workflow                        : bettercallsal_db
kkonganti@1: 
kkonganti@1: Author                          : Kranti Konganti
kkonganti@1: 
kkonganti@1: Version                         : 0.4.0
kkonganti@1: 
kkonganti@1: 
kkonganti@1: Required                        :
kkonganti@1: 
kkonganti@1: --output                        : Absolute path to directory where all the
kkonganti@1:                                   pipeline outputs should be stored. Ex: --
kkonganti@1:                                   output /path/to/output
kkonganti@1: 
kkonganti@1: Other options                   :
kkonganti@1: 
kkonganti@1: --wcomp_serocol                 : Column number (non 0-based index) of the
kkonganti@1:                                   PDG metadata file by which the serotypes
kkonganti@1:                                   are collected. Default: false
kkonganti@1: 
kkonganti@1: --wcomp_complete_sero           : Skip indexing serotypes when the serotype
kkonganti@1:                                   name in the column number 49 (non 0-based)
kkonganti@1:                                   of PDG metadata file consists a "-". For
kkonganti@1:                                   example, if an accession has a serotype=
kkonganti@1:                                   string as such in column number 49 (non 0-
kkonganti@1:                                   based): "serotype=- 13:z4,z23:-" then, the
kkonganti@1:                                   indexing of that accession is skipped.
kkonganti@1:                                   Default: false
kkonganti@1: 
kkonganti@1: --wcomp_not_null_serovar        : Only index the computed_serotype column i.e
kkonganti@1:                                   . column number 49 (non 0-based), if the
kkonganti@1:                                   serovar column is not NULL.  Default: false
kkonganti@1: 
kkonganti@1: --wcomp_i                       : Force include this serovar. Ignores --
kkonganti@1:                                   wcomp_complete_sero for only this serovar.
kkonganti@1:                                   Mention multiple serovars separated by a
kkonganti@1:                                   ! (Exclamation mark). Ex: --
kkonganti@1:                                   wcomp_complete_sero I 4,[5],12:i:-!Agona
kkonganti@1:                                   Default: false
kkonganti@1: 
kkonganti@1: --wcomp_num                     : Number of genome accessions to be collected
kkonganti@1:                                   per serotype. Default: false
kkonganti@1: 
kkonganti@1: --wcomp_min_contig_size         : Minimum contig size to consider a genome
kkonganti@1:                                   for indexing. Default: false
kkonganti@1: 
kkonganti@1: --wsnp_serocol                  : Column number (non 0-based index) of the
kkonganti@1:                                   PDG metadata file by which the serotypes
kkonganti@1:                                   are collected. Default: false
kkonganti@1: 
kkonganti@1: --wsnp_complete_sero            : Skip indexing serotypes when the serotype
kkonganti@1:                                   name in the column number 49 (non 0-based)
kkonganti@1:                                   of PDG metadata file consists a "-". For
kkonganti@1:                                   example, if an accession has a serotype=
kkonganti@1:                                   string as such in column number 49 (non 0-
kkonganti@1:                                   based): "serotype=- 13:z4,z23:-" then, the
kkonganti@1:                                   indexing of that accession is skipped.
kkonganti@1:                                   Default: true
kkonganti@1: 
kkonganti@1: --wsnp_not_null_serovar         : Only index the computed_serotype column i.e
kkonganti@1:                                   . column number 49 (non 0-based), if the
kkonganti@1:                                   serovar column is not NULL.  Default: false
kkonganti@1: 
kkonganti@1: --wsnp_i                        : Force include this serovar. Ignores --
kkonganti@1:                                   wsnp_complete_sero for only this serovar.
kkonganti@1:                                   Mention multiple serovars separated by a
kkonganti@1:                                   ! (Exclamation mark). Ex: --
kkonganti@1:                                   wsnp_complete_sero I 4,[5],12:i:-!Agona
kkonganti@1:                                   Default: 'I 4,[5],12:i
kkonganti@1: 
kkonganti@1: --wsnp_num                      : Number of genome accessions to collect per
kkonganti@1:                                   SNP cluster. Default: false
kkonganti@1: 
kkonganti@1: --mashsketch_run                : Run `mash screen` tool. Default: true
kkonganti@1: 
kkonganti@1: --mashsketch_l                  : List input. Lines in each <input> specify
kkonganti@1:                                   paths to sequence files, one per line.
kkonganti@1:                                   Default: true
kkonganti@1: 
kkonganti@1: --mashsketch_I                  : <path>  ID field for sketch of reads (
kkonganti@1:                                   instead of first sequence ID). Default:
kkonganti@1:                                   false
kkonganti@1: 
kkonganti@1: --mashsketch_C                  : <path>  Comment for a sketch of reads (
kkonganti@1:                                   instead of first sequence comment). Default
kkonganti@1:                                   : false
kkonganti@1: 
kkonganti@1: --mashsketch_k                  : <int>   K-mer size. Hashes will be based on
kkonganti@1:                                   strings of this many nucleotides.
kkonganti@1:                                   Canonical nucleotides are used by default (
kkonganti@1:                                   see Alphabet options below). (1-32) Default
kkonganti@1:                                   : 21
kkonganti@1: 
kkonganti@1: --mashsketch_s                  : <int>   Sketch size. Each sketch will have
kkonganti@1:                                   at most this many non-redundant min-hashes
kkonganti@1:                                   . Default: 1000
kkonganti@1: 
kkonganti@1: --mashsketch_i                  : Sketch individual sequences, rather than
kkonganti@1:                                   whole files, e.g. for multi-fastas of
kkonganti@1:                                   single-chromosome genomes or pair-wise gene
kkonganti@1:                                   comparisons. Default: false
kkonganti@1: 
kkonganti@1: --mashsketch_S                  : <int>   Seed to provide to the hash
kkonganti@1:                                   function. (0-4294967296) [42] Default:
kkonganti@1:                                   false
kkonganti@1: 
kkonganti@1: --mashsketch_w                  : <num>   Probability threshold for warning
kkonganti@1:                                   about low k-mer size. (0-1) Default: false
kkonganti@1: 
kkonganti@1: --mashsketch_r                  : Input is a read set. See Reads options
kkonganti@1:                                   below. Incompatible with --mashsketch_i.
kkonganti@1:                                   Default: false
kkonganti@1: 
kkonganti@1: --mashsketch_b                  : <size>  Use a Bloom filter of this size (
kkonganti@1:                                   raw bytes or with K/M/G/T) to filter out
kkonganti@1:                                   unique k-mers. This is useful if exact
kkonganti@1:                                   filtering with --mashsketch_m uses too much
kkonganti@1:                                   memory. However, some unique k-mers may
kkonganti@1:                                   pass erroneously, and copies cannot be
kkonganti@1:                                   counted beyond 2. Implies --mashsketch_r.
kkonganti@1:                                   Default: false
kkonganti@1: 
kkonganti@1: --mashsketch_m                  : <int>   Minimum copies of each k-mer
kkonganti@1:                                   required to pass noise filter for reads.
kkonganti@1:                                   Implies --mashsketch_r. Default: false
kkonganti@1: 
kkonganti@1: --mashsketch_c                  : <num>   Target coverage. Sketching will
kkonganti@1:                                   conclude if this coverage is reached before
kkonganti@1:                                   the end of the input file (estimated by
kkonganti@1:                                   average k-mer multiplicity). Implies --
kkonganti@1:                                   mashsketch_r. Default: false
kkonganti@1: 
kkonganti@1: --mashsketch_g                  : <size>  Genome size (raw bases or with K/M/
kkonganti@1:                                   G/T). If specified, will be used for p-
kkonganti@1:                                   value calculation instead of an estimated
kkonganti@1:                                   size from k-mer content. Implies --
kkonganti@1:                                   mashsketch_r. Default: false
kkonganti@1: 
kkonganti@1: --mashsketch_n                  : Preserve strand (by default, strand is
kkonganti@1:                                   ignored by using canonical DNA k-mers,
kkonganti@1:                                   which are alphabetical minima of forward-
kkonganti@1:                                   reverse pairs). Implied if an alphabet is
kkonganti@1:                                   specified with --mashsketch_a or --
kkonganti@1:                                   mashsketch_z. Default: false
kkonganti@1: 
kkonganti@1: --mashsketch_a                  : Use amino acid alphabet (A-Z, except BJOUXZ
kkonganti@1:                                   ). Implies --mashsketch_n --mashsketch_k 9
kkonganti@1:                                   . Default: false
kkonganti@1: 
kkonganti@1: --mashsketch_z                  : <text>  Alphabet to base hashes on (case
kkonganti@1:                                   ignored by default; see --mashsketch_Z). K-
kkonganti@1:                                   mers with other characters will be ignored
kkonganti@1:                                   . Implies --mashsketch_n. Default: false
kkonganti@1: 
kkonganti@1: --mashsketch_Z                  : Preserve case in k-mers and alphabet (case
kkonganti@1:                                   is ignored by default). Sequence letters
kkonganti@1:                                   whose case is not in the current alphabet
kkonganti@1:                                   will be skipped when sketching. Default:
kkonganti@1:                                   false
kkonganti@1: 
kkonganti@1: Help options                    :
kkonganti@1: 
kkonganti@1: --help                          : Display this message.
kkonganti@1: 
kkonganti@1: ```
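
The options listed above can be combined with the earlier example. The sketch below only assembles and prints such a command as a dry run. The values `10` and `5` and the shell variable names are hypothetical and chosen purely for illustration; check the help text for the behaviour of each option before running it.

```bash
# Dry run: build the command string and print it instead of executing it.
out_dir="/data/Kranti_Konganti/bettercallsal_db"
per_serotype=10   # hypothetical value for --wcomp_num
per_cluster=5     # hypothetical value for --wsnp_num

cmd="cpipes --pipeline bettercallsal_db --pdg_release PDG000000002.2537 --output $out_dir --wcomp_num $per_serotype --wsnp_num $per_cluster"

echo "$cmd"
```

Once the printed command looks right, it can be copied and run directly.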