# bettercallsal_db

`bettercallsal_db` is an end-to-end automated workflow that generates and consolidates the required DB flat files from the [NCBI Pathogens Database for Salmonella](https://ftp.ncbi.nlm.nih.gov/pathogen/Results/Salmonella/). It first downloads the metadata for the provided release identifier (e.g. `latest_snps` or `PDG000000002.2876`) and then creates a `mash sketch` according to the filtering strategy. It generates two types of sketches: one prioritizes genome collection based on SNP clustering (`per_snp_cluster`), and the other collects up to N genome accessions for each value in the `computed_serotype` column of the metadata file (`per_computed_serotype`).

The `bettercallsal_db` workflow should finish within an hour with a stable internet connection.

\
&nbsp;

## Workflow Usage

```bash
cpipes --pipeline bettercallsal_db [options]
```

\
&nbsp;

Example: Run the `bettercallsal_db` pipeline and store the output at `/data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876`.

```bash
cpipes \
    --pipeline bettercallsal_db \
    --pdg_release PDG000000002.2876 \
    --output /data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876
```

\
&nbsp;

Now you can run the `bettercallsal` workflow with the created database by passing the root path of the database to the `--bcs_root_dbdir` option.

```bash
cpipes \
    --pipeline bettercallsal \
    --input /path/to/illumina/fastq/dir \
    --output /path/to/output \
    --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876
```

\
&nbsp;

## Note

The last step of the `bettercallsal_db` workflow, named `SCAFFOLD_GENOMES`, spawns multiple processes and is not cached by **Nextflow**. This setup is intentional for this specific stage of the workflow and speeds up database creation. It is therefore recommended that you run this workflow in a grid computing or similar cloud computing setting.

\
&nbsp;

## `bettercallsal_db` CLI Help

```text
[Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal_db --help
N E X T F L O W  ~  version 23.04.3
Launching `./bettercallsal/cpipes` [special_brenner] DSL2 - revision: 8da4e11078
================================================================================
             (o)
  ___  _ __   _  _ __    ___  ___
 / __|| '_ \ | || '_ \  / _ \/ __|
| (__ | |_) || || |_) ||  __/\__ \
 \___|| .__/ |_|| .__/  \___||___/
      | |       | |
      |_|       |_|
--------------------------------------------------------------------------------
A collection of modular pipelines at CFSAN, FDA.
--------------------------------------------------------------------------------
Name                            : bettercallsal
Author                          : Kranti Konganti
Version                         : 0.7.0
Center                          : CFSAN, FDA.
================================================================================

Workflow                        : bettercallsal_db

Author                          : Kranti Konganti

Version                         : 0.7.0


Required                        :

--output                        : Absolute path to directory where all the
                                  pipeline outputs should be stored. Ex: --
                                  output /path/to/output

Other options                   :

--wcomp_serocol                 : Column number (non 0-based index) of the
                                  PDG metadata file by which the serotypes
                                  are collected. Default: false

--wcomp_seronamecol             : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "
                                  serovar". Default: false

--wcomp_acc_col                 : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "acc
                                  ". Default: false

--wcomp_target_acc_col          : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "
                                  target_acc". Default: false

--wcomp_complete_sero           : Skip indexing serotypes when the serotype
                                  name in the column number 49 (non 0-based)
                                  of PDG metadata file consists a "-". For
                                  example, if an accession has a serotype=
                                  string as such in column number 49 (non 0-
                                  based): "serotype=- 13:z4,z23:-" then, the
                                  indexing of that accession is skipped.
                                  Default: false

--wcomp_not_null_serovar        : Only index the computed_serotype column i.e
                                  . column number 49 (non 0-based), if the
                                  serovar column is not NULL. Default: false

--wcomp_i                       : Force include this serovar. Ignores --
                                  wcomp_complete_sero for only this serovar.
                                  Mention multiple serovars separated by a
                                  ! (Exclamation mark). Ex: --
                                  wcomp_complete_sero I 4,[5],12:i:-!Agona
                                  Default: false

--wcomp_num                     : Number of genome accessions to be collected
                                  per serotype. Default: false

--wcomp_min_contig_size         : Minimum contig size to consider a genome
                                  for indexing. Default: false

--wsnp_serocol                  : Column number (non 0-based index) of the
                                  PDG metadata file by which the serotypes
                                  are collected. Default: false

--wsnp_seronamecol              : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "
                                  serovar". Default: false

--wsnp_acc_col                  : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "acc
                                  ". Default: false

--wsnp_target_acc_col           : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "
                                  target_acc". Default: false

--wsnp_complete_sero            : Skip indexing serotypes when the serotype
                                  name in the column number 49 (non 0-based)
                                  of PDG metadata file consists a "-". For
                                  example, if an accession has a serotype=
                                  string as such in column number 49 (non 0-
                                  based): "serotype=- 13:z4,z23:-" then, the
                                  indexing of that accession is skipped.
                                  Default: true

--wsnp_not_null_serovar         : Only index the computed_serotype column i.e
                                  . column number 49 (non 0-based), if the
                                  serovar column is not NULL. Default: false

--wsnp_i                        : Force include this serovar. Ignores --
                                  wsnp_complete_sero for only this serovar.
                                  Mention multiple serovars separated by a
                                  ! (Exclamation mark). Ex: --
                                  wsnp_complete_sero I 4,[5],12:i:-!Agona
                                  Default: 'I 4,[5],12:i

--wsnp_num                      : Number of genome accessions to collect per
                                  SNP cluster. Default: false

--mashsketch_run                : Run `mash screen` tool. Default: true

--mashsketch_l                  : List input. Lines in each specify
                                  paths to sequence files, one per line.
                                  Default: true

--mashsketch_I                  : ID field for sketch of reads (
                                  instead of first sequence ID). Default:
                                  false

--mashsketch_C                  : Comment for a sketch of reads (
                                  instead of first sequence comment). Default
                                  : false

--mashsketch_k                  : K-mer size. Hashes will be based on
                                  strings of this many nucleotides.
                                  Canonical nucleotides are used by default (
                                  see Alphabet options below). (1-32) Default
                                  : 21

--mashsketch_s                  : Sketch size. Each sketch will have
                                  at most this many non-redundant min-hashes
                                  . Default: 1000

--mashsketch_i                  : Sketch individual sequences, rather than
                                  whole files, e.g. for multi-fastas of
                                  single-chromosome genomes or pair-wise gene
                                  comparisons. Default: false

--mashsketch_S                  : Seed to provide to the hash
                                  function. (0-4294967296) [42] Default:
                                  false

--mashsketch_w                  : Probability threshold for warning
                                  about low k-mer size. (0-1) Default: false

--mashsketch_r                  : Input is a read set. See Reads options
                                  below. Incompatible with --mashsketch_i.
                                  Default: false

--mashsketch_b                  : Use a Bloom filter of this size (
                                  raw bytes or with K/M/G/T) to filter out
                                  unique k-mers. This is useful if exact
                                  filtering with --mashsketch_m uses too much
                                  memory. However, some unique k-mers may
                                  pass erroneously, and copies cannot be
                                  counted beyond 2. Implies --mashsketch_r.
                                  Default: false

--mashsketch_m                  : Minimum copies of each k-mer
                                  required to pass noise filter for reads.
                                  Implies --mashsketch_r. Default: false

--mashsketch_c                  : Target coverage. Sketching will
                                  conclude if this coverage is reached before
                                  the end of the input file (estimated by
                                  average k-mer multiplicity). Implies --
                                  mashsketch_r. Default: false

--mashsketch_g                  : Genome size (raw bases or with K/M/
                                  G/T). If specified, will be used for p-
                                  value calculation instead of an estimated
                                  size from k-mer content. Implies --
                                  mashsketch_r. Default: false

--mashsketch_n                  : Preserve strand (by default, strand is
                                  ignored by using canonical DNA k-mers,
                                  which are alphabetical minima of forward-
                                  reverse pairs). Implied if an alphabet is
                                  specified with --mashsketch_a or --
                                  mashsketch_z. Default: false

--mashsketch_a                  : Use amino acid alphabet (A-Z, except BJOUXZ
                                  ). Implies --mashsketch_n --mashsketch_k 9
                                  . Default: false

--mashsketch_z                  : Alphabet to base hashes on (case
                                  ignored by default; see --mashsketch_Z). K-
                                  mers with other characters will be ignored
                                  . Implies --mashsketch_n. Default: false

--mashsketch_Z                  : Preserve case in k-mers and alphabet (case
                                  is ignored by default). Sequence letters
                                  whose case is not in the current alphabet
                                  will be skipped when sketching. Default:
                                  false

Help options                    :

--help                          : Display this message.

```
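
\
&nbsp;

When building databases for more than one PDG release, it can help to generate the `cpipes` command line from the release identifier so that each database lands in its own directory, as in the Workflow Usage example. The following is a minimal sketch, not part of the workflow: the function name `bcs_db_cmd`, the identifier check, and the `<root>/<release>` directory layout are assumptions made for illustration. It only prints the command and does not run `cpipes`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with bettercallsal): print, but do not run,
# a bettercallsal_db command for one PDG release.
bcs_db_cmd() {
    local release="$1" out_root="$2"

    # Assumption: a release is either "latest_snps" or shaped like
    # PDG000000002.2876, the two forms given as examples in this README.
    if [[ "$release" != "latest_snps" && ! "$release" =~ ^PDG[0-9]+\.[0-9]+$ ]]; then
        echo "Unrecognized release identifier: $release" >&2
        return 1
    fi

    # Assumption: one output directory per release under a common root.
    printf 'cpipes --pipeline bettercallsal_db --pdg_release %s --output %s/%s\n' \
        "$release" "$out_root" "$release"
}

# Print the command for the release used in the example above.
bcs_db_cmd PDG000000002.2876 /data/Kranti_Konganti/bettercallsal_db
```

The output directory printed for a release is the same path you would later pass to `--bcs_root_dbdir` when running the `bettercallsal` workflow.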