author kkonganti
date Mon, 15 Jul 2024 12:01:00 -0400
# bettercallsal_db

`bettercallsal_db` is an end-to-end automated workflow that generates and consolidates the required DB flat files from the [NCBI Pathogens Database for Salmonella](https://ftp.ncbi.nlm.nih.gov/pathogen/Results/Salmonella/). It first downloads the metadata for the provided release identifier (Ex: `latest_snps` or `PDG000000002.2876`) and then creates a `mash sketch` according to the filtering strategy. It generates two types of sketches: one prioritizes genome collection based on SNP clustering (`per_snp_cluster`), and the other collects up to N genome accessions for each value of the `computed_serotype` column of the metadata file (`per_computed_serotype`).
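
The `per_computed_serotype` idea of "up to N accessions per serotype" can be illustrated with a minimal sketch. This is not the workflow's actual implementation: the helper name `cap_per_serotype`, the toy table, and the made-up accession IDs are assumptions for illustration only. Only the column names `acc` and `computed_serotype` come from the workflow's own description.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Toy tab-separated metadata table. The rows and accession IDs are made up;
# a real PDG metadata file has many more columns.
meta="$(mktemp)"
printf 'acc\tcomputed_serotype\nACC_1\tAgona\nACC_2\tAgona\nACC_3\tAgona\nACC_4\tTyphimurium\n' > "$meta"

# Hypothetical helper: keep at most N accessions per serotype,
# preserving the order in which rows appear in the input.
cap_per_serotype() {
    local n="$1" file="$2"
    awk -F'\t' -v n="$n" '
        NR == 1 {
            # Locate the two columns of interest by header name.
            for (i = 1; i <= NF; i++) {
                if ($i == "acc") a = i
                if ($i == "computed_serotype") s = i
            }
            next
        }
        # Print a row only while fewer than n rows of its serotype were seen.
        seen[$s]++ < n { print $a "\t" $s }
    ' "$file"
}

cap_per_serotype 2 "$meta"
```

With N set to 2, the third `Agona` row is dropped while the single `Typhimurium` row is kept.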

The `bettercallsal_db` workflow should finish within an hour on a stable internet connection.

\
 

## Workflow Usage

```bash
cpipes --pipeline bettercallsal_db [options]
```

\
 

Example: Run the `bettercallsal_db` pipeline and store output at `/data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876`.

```bash
cpipes \
      --pipeline bettercallsal_db \
      --pdg_release PDG000000002.2876 \
      --output /data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876
```
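
Before launching a long run, it may help to sanity-check the release identifier. The sketch below is a hypothetical helper and not part of `cpipes`. Its pattern is an assumption inferred from the two example values in this README (`latest_snps` and `PDG000000002.2876`); NCBI may publish identifiers that do not match it.

```bash
#!/usr/bin/env bash

# Hypothetical helper: accept "latest_snps" or a "PDG<digits>.<digits>" string.
# The regular expression is inferred from this README's examples only.
is_valid_pdg_release() {
    local release="$1"
    local re='^PDG[0-9]+\.[0-9]+$'
    [[ "$release" == "latest_snps" || "$release" =~ $re ]]
}

if is_valid_pdg_release "PDG000000002.2876"; then
    echo "release identifier looks well-formed"
fi
```

The function only checks the shape of the string; it does not confirm that the release exists on the NCBI FTP site.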

\
 

Now you can run the `bettercallsal` workflow with the created database by passing the root path of the database with the `--bcs_root_dbdir` option.

```bash
cpipes \
      --pipeline bettercallsal \
      --input /path/to/illumina/fastq/dir \
      --output /path/to/output \
      --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876
```

\
 

## Note

Please note that the last step of the `bettercallsal_db` workflow, `SCAFFOLD_GENOMES`, spawns multiple processes and is not cached by **Nextflow**. This is intentional for this specific stage, to speed up database creation; as such, it is recommended that you run this workflow in a grid computing or similar cloud computing setting.

\
 

## `bettercallsal_db` CLI Help

```text
[Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal_db --help
N E X T F L O W  ~  version 23.04.3
Launching `./bettercallsal/cpipes` [special_brenner] DSL2 - revision: 8da4e11078
================================================================================
             (o)                  
  ___  _ __   _  _ __    ___  ___ 
 / __|| '_ \ | || '_ \  / _ \/ __|
| (__ | |_) || || |_) ||  __/\__ \
 \___|| .__/ |_|| .__/  \___||___/
      | |       | |               
      |_|       |_|
--------------------------------------------------------------------------------
A collection of modular pipelines at CFSAN, FDA.
--------------------------------------------------------------------------------
Name                            : bettercallsal
Author                          : Kranti Konganti
Version                         : 0.7.0
Center                          : CFSAN, FDA.
================================================================================

Workflow                        : bettercallsal_db

Author                          : Kranti Konganti

Version                         : 0.7.0


Required                        : 

--output                        : Absolute path to directory where all the
                                  pipeline outputs should be stored. Ex: --
                                  output /path/to/output

Other options                   : 

--wcomp_serocol                 : Column number (non 0-based index) of the
                                  PDG metadata file by which the serotypes
                                  are collected. Default: false

--wcomp_seronamecol             : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "
                                  serovar".  Default: false

--wcomp_acc_col                 : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "acc
                                  ".  Default: false

--wcomp_target_acc_col          : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "
                                  target_acc".  Default: false

--wcomp_complete_sero           : Skip indexing serotypes when the serotype
                                  name in the column number 49 (non 0-based)
                                  of PDG metadata file consists a "-". For
                                  example, if an accession has a serotype=
                                  string as such in column number 49 (non 0-
                                  based): "serotype=- 13:z4,z23:-" then, the
                                  indexing of that accession is skipped.
                                  Default: false

--wcomp_not_null_serovar        : Only index the computed_serotype column i.e
                                  . column number 49 (non 0-based), if the
                                  serovar column is not NULL.  Default: false

--wcomp_i                       : Force include this serovar. Ignores --
                                  wcomp_complete_sero for only this serovar.
                                  Mention multiple serovars separated by a
                                  ! (Exclamation mark). Ex: --
                                  wcomp_complete_sero I 4,[5],12:i:-!Agona
                                  Default: false

--wcomp_num                     : Number of genome accessions to be collected
                                  per serotype. Default: false

--wcomp_min_contig_size         : Minimum contig size to consider a genome
                                  for indexing. Default: false

--wsnp_serocol                  : Column number (non 0-based index) of the
                                  PDG metadata file by which the serotypes
                                  are collected. Default: false

--wsnp_seronamecol              : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "
                                  serovar".  Default: false

--wsnp_acc_col                  : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "acc
                                  ".  Default: false

--wsnp_target_acc_col           : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "
                                  target_acc".  Default: false

--wsnp_complete_sero            : Skip indexing serotypes when the serotype
                                  name in the column number 49 (non 0-based)
                                  of PDG metadata file consists a "-". For
                                  example, if an accession has a serotype=
                                  string as such in column number 49 (non 0-
                                  based): "serotype=- 13:z4,z23:-" then, the
                                  indexing of that accession is skipped.
                                  Default: true

--wsnp_not_null_serovar         : Only index the computed_serotype column i.e
                                  . column number 49 (non 0-based), if the
                                  serovar column is not NULL.  Default: false

--wsnp_i                        : Force include this serovar. Ignores --
                                  wsnp_complete_sero for only this serovar.
                                  Mention multiple serovars separated by a
                                  ! (Exclamation mark). Ex: --
                                  wsnp_complete_sero I 4,[5],12:i:-!Agona
                                  Default: 'I 4,[5],12:i

--wsnp_num                      : Number of genome accessions to collect per
                                  SNP cluster. Default: false

--mashsketch_run                : Run `mash screen` tool. Default: true

--mashsketch_l                  : List input. Lines in each <input> specify
                                  paths to sequence files, one per line.
                                  Default: true

--mashsketch_I                  : <path>  ID field for sketch of reads (
                                  instead of first sequence ID). Default:
                                  false

--mashsketch_C                  : <path>  Comment for a sketch of reads (
                                  instead of first sequence comment). Default
                                  : false

--mashsketch_k                  : <int>   K-mer size. Hashes will be based on
                                  strings of this many nucleotides.
                                  Canonical nucleotides are used by default (
                                  see Alphabet options below). (1-32) Default
                                  : 21

--mashsketch_s                  : <int>   Sketch size. Each sketch will have
                                  at most this many non-redundant min-hashes
                                  . Default: 1000

--mashsketch_i                  : Sketch individual sequences, rather than
                                  whole files, e.g. for multi-fastas of
                                  single-chromosome genomes or pair-wise gene
                                  comparisons. Default: false

--mashsketch_S                  : <int>   Seed to provide to the hash
                                  function. (0-4294967296) [42] Default:
                                  false

--mashsketch_w                  : <num>   Probability threshold for warning
                                  about low k-mer size. (0-1) Default: false

--mashsketch_r                  : Input is a read set. See Reads options
                                  below. Incompatible with --mashsketch_i.
                                  Default: false

--mashsketch_b                  : <size>  Use a Bloom filter of this size (
                                  raw bytes or with K/M/G/T) to filter out
                                  unique k-mers. This is useful if exact
                                  filtering with --mashsketch_m uses too much
                                  memory. However, some unique k-mers may
                                  pass erroneously, and copies cannot be
                                  counted beyond 2. Implies --mashsketch_r.
                                  Default: false

--mashsketch_m                  : <int>   Minimum copies of each k-mer
                                  required to pass noise filter for reads.
                                  Implies --mashsketch_r. Default: false

--mashsketch_c                  : <num>   Target coverage. Sketching will
                                  conclude if this coverage is reached before
                                  the end of the input file (estimated by
                                  average k-mer multiplicity). Implies --
                                  mashsketch_r. Default: false

--mashsketch_g                  : <size>  Genome size (raw bases or with K/M/
                                  G/T). If specified, will be used for p-
                                  value calculation instead of an estimated
                                  size from k-mer content. Implies --
                                  mashsketch_r. Default: false

--mashsketch_n                  : Preserve strand (by default, strand is
                                  ignored by using canonical DNA k-mers,
                                  which are alphabetical minima of forward-
                                  reverse pairs). Implied if an alphabet is
                                  specified with --mashsketch_a or --
                                  mashsketch_z. Default: false

--mashsketch_a                  : Use amino acid alphabet (A-Z, except BJOUXZ
                                  ). Implies --mashsketch_n --mashsketch_k 9
                                  . Default: false

--mashsketch_z                  : <text>  Alphabet to base hashes on (case
                                  ignored by default; see --mashsketch_Z). K-
                                  mers with other characters will be ignored
                                  . Implies --mashsketch_n. Default: false

--mashsketch_Z                  : Preserve case in k-mers and alphabet (case
                                  is ignored by default). Sequence letters
                                  whose case is not in the current alphabet
                                  will be skipped when sketching. Default:
                                  false

Help options                    : 

--help                          : Display this message.

```
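
Several options above expect a 1-based column number of the PDG metadata file. The following is a minimal sketch of how such a number could be looked up by column name. It assumes the metadata file is tab-separated with a header row. The helper name `col_number` and the toy header (including its column order) are hypothetical and do not reflect the real file layout.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Toy header line; the column order here is made up for illustration.
header_file="$(mktemp)"
printf 'target_acc\tserovar\tacc\tcomputed_serotype\n' > "$header_file"

# Hypothetical helper: print the 1-based index of the column called NAME,
# reading only the first line of FILE. Prints nothing if NAME is absent.
col_number() {
    local name="$1" file="$2"
    awk -F'\t' -v name="$name" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == name) { print i; exit }
            }
            exit
        }
    ' "$file"
}

col_number computed_serotype "$header_file"
```

On the toy header this prints `4`. Against a real metadata file, the printed number would be the value to pass to an option such as `--wcomp_serocol`.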