annotate 0.5.0/readme/bettercallsal_db.md @ 1:365849f031fd

"planemo upload"
author kkonganti
date Mon, 05 Jun 2023 18:48:51 -0400
rev   line source
kkonganti@1 1 # bettercallsal_db
kkonganti@1 2
kkonganti@1 3 `bettercallsal_db` is an end-to-end automated workflow that generates and consolidates the required DB flat files from the [NCBI Pathogens Database for Salmonella](https://ftp.ncbi.nlm.nih.gov/pathogen/Results/Salmonella/). It first downloads the metadata for the provided release identifier (e.g. `latest_snps` or `PDG000000002.2537`) and then creates a `mash sketch` according to the filtering strategy. It generates two types of sketches: one prioritizes genome collection based on SNP clustering (`per_snp_cluster`), and the other collects up to N genome accessions for each value in the `computed_serotype` column of the metadata file (`per_computed_serotype`).
kkonganti@1 4
kkonganti@1 5 The `bettercallsal_db` workflow should finish within an hour with a stable internet connection.
kkonganti@1 6
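
To make the `per_computed_serotype` strategy concrete, the following is a minimal sketch of "collect up to N accessions per serotype" on a tab-separated table. It is not the workflow's actual implementation: the helper name `take_up_to_n_per_group` and the toy two-column table are assumptions made for illustration, and the real PDG metadata file has many more columns.

```bash
# Hypothetical helper (not part of cpipes): read a tab-separated table with a
# header row on stdin and keep at most N rows per distinct value in column COL.
# Usage: take_up_to_n_per_group COL N < table.tsv
take_up_to_n_per_group() {
    awk -F'\t' -v col="$1" -v n="$2" '
        NR == 1 { next }                            # skip the header row
        $col != "" && ++seen[$col] <= n { print }   # cap rows per group at n
    '
}

# Toy input: accession and serotype only (assumed layout for this sketch).
printf 'accession\tcomputed_serotype\nacc1\tAgona\nacc2\tAgona\nacc3\tAgona\nacc4\tNewport\n' |
    take_up_to_n_per_group 2 2
```

With a cap of 2, the third `Agona` row is dropped while the single `Newport` row is kept. Rows with an empty value in the chosen column are skipped in this sketch.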
kkonganti@1 7 \
kkonganti@1 8  
kkonganti@1 9
kkonganti@1 10 ## Workflow Usage
kkonganti@1 11
kkonganti@1 12 ```bash
kkonganti@1 13 cpipes --pipeline bettercallsal_db [options]
kkonganti@1 14 ```
kkonganti@1 15
kkonganti@1 16 \
kkonganti@1 17  
kkonganti@1 18
kkonganti@1 19 Example: Run the `bettercallsal_db` pipeline and store output at `/data/Kranti_Konganti/bettercallsal_db`.
kkonganti@1 20
kkonganti@1 21 ```bash
kkonganti@1 22 cpipes \
kkonganti@1 23 --pipeline bettercallsal_db \
kkonganti@1 24 --pdg_release PDG000000002.2537 \
kkonganti@1 25 --output /data/Kranti_Konganti/bettercallsal_db
kkonganti@1 26 ```
kkonganti@1 27
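
Before launching a long-running build, it can help to sanity-check the two arguments used above. The sketch below is a hypothetical pre-flight check, not something `cpipes` provides: the function names are made up, and the accepted release pattern is inferred only from the two examples in this README (`latest_snps` and `PDG000000002.2537`), so real identifiers may take other forms. The absolute-path check mirrors the `--output` description in the CLI help.

```bash
# Hypothetical check: accept "latest_snps" or an identifier shaped like
# PDG<digits>.<digits> (pattern assumed from the README examples).
is_valid_pdg_release() {
    printf '%s\n' "$1" | grep -Eq '^(latest_snps|PDG[0-9]+\.[0-9]+)$'
}

# Hypothetical check: the CLI help asks for an absolute --output path.
is_absolute_path() {
    case "$1" in
        /*) return 0 ;;
        *)  return 1 ;;
    esac
}

release="PDG000000002.2537"
outdir="/data/Kranti_Konganti/bettercallsal_db"

if is_valid_pdg_release "$release" && is_absolute_path "$outdir"; then
    echo "arguments look OK: --pdg_release $release --output $outdir"
else
    echo "check --pdg_release and --output before running cpipes" >&2
fi
```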
kkonganti@1 28 \
kkonganti@1 29  
kkonganti@1 30
kkonganti@1 31 You can now run the `bettercallsal` workflow against the newly created database by passing the database root path with the `--bcs_root_dbdir` option.
kkonganti@1 32
kkonganti@1 33 ```bash
kkonganti@1 34 cpipes \
kkonganti@1 35 --pipeline bettercallsal \
kkonganti@1 36 --input /path/to/illumina/fastq/dir \
kkonganti@1 37 --output /path/to/output \
kkonganti@1 38 --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db
kkonganti@1 39 ```
kkonganti@1 40
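
Since the database build and the `bettercallsal` run are separate steps, a quick check that the database root exists and is not empty can catch a wrong path early. This is a minimal sketch; `require_nonempty_dir` is a hypothetical helper, and it makes no assumption about which files the database directory should contain.

```bash
# Hypothetical helper: succeed only if the path is a directory with at least
# one entry in it. It does not inspect the database contents.
require_nonempty_dir() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

db_root="/data/Kranti_Konganti/bettercallsal_db"

if require_nonempty_dir "$db_root"; then
    echo "database root found: $db_root"
else
    echo "database root missing or empty: $db_root" >&2
fi
```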
kkonganti@1 41 \
kkonganti@1 42  
kkonganti@1 43
kkonganti@1 44 ## Note
kkonganti@1 45
kkonganti@1 46 Please note that the last step of the `bettercallsal_db` workflow, `SCAFFOLD_GENOMES`, spawns multiple processes and is not cached by **Nextflow**. This is intentional for this stage of the workflow, to speed up database creation. It is therefore recommended that you run this workflow in a grid computing or similar cloud computing environment.
kkonganti@1 47
kkonganti@1 48 \
kkonganti@1 49  
kkonganti@1 50
kkonganti@1 51 ## `bettercallsal_db` CLI Help
kkonganti@1 52
kkonganti@1 53 ```text
kkonganti@1 54 [Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal_db --help
kkonganti@1 55 N E X T F L O W ~ version 22.10.0
kkonganti@1 56 Launching `./bettercallsal/cpipes` [hopeful_franklin] DSL2 - revision: 93f5293f50
kkonganti@1 57 ================================================================================
kkonganti@1 58 (o)
kkonganti@1 59 ___ _ __ _ _ __ ___ ___
kkonganti@1 60 / __|| '_ \ | || '_ \ / _ \/ __|
kkonganti@1 61 | (__ | |_) || || |_) || __/\__ \
kkonganti@1 62 \___|| .__/ |_|| .__/ \___||___/
kkonganti@1 63 | | | |
kkonganti@1 64 |_| |_|
kkonganti@1 65 --------------------------------------------------------------------------------
kkonganti@1 66 A collection of modular pipelines at CFSAN, FDA.
kkonganti@1 67 --------------------------------------------------------------------------------
kkonganti@1 68 Name : CPIPES
kkonganti@1 69 Author : Kranti Konganti
kkonganti@1 70 Version : 0.5.0
kkonganti@1 71 Center : CFSAN, FDA.
kkonganti@1 72 ================================================================================
kkonganti@1 73
kkonganti@1 74 Workflow : bettercallsal_db
kkonganti@1 75
kkonganti@1 76 Author : Kranti Konganti
kkonganti@1 77
kkonganti@1 78 Version : 0.4.0
kkonganti@1 79
kkonganti@1 80
kkonganti@1 81 Required :
kkonganti@1 82
kkonganti@1 83 --output : Absolute path to directory where all the
kkonganti@1 84 pipeline outputs should be stored. Ex: --
kkonganti@1 85 output /path/to/output
kkonganti@1 86
kkonganti@1 87 Other options :
kkonganti@1 88
kkonganti@1 89 --wcomp_serocol : Column number (non 0-based index) of the
kkonganti@1 90 PDG metadata file by which the serotypes
kkonganti@1 91 are collected. Default: false
kkonganti@1 92
kkonganti@1 93 --wcomp_complete_sero : Skip indexing serotypes when the serotype
kkonganti@1 94 name in the column number 49 (non 0-based)
kkonganti@1 95 of PDG metadata file consists a "-". For
kkonganti@1 96 example, if an accession has a serotype=
kkonganti@1 97 string as such in column number 49 (non 0-
kkonganti@1 98 based): "serotype=- 13:z4,z23:-" then, the
kkonganti@1 99 indexing of that accession is skipped.
kkonganti@1 100 Default: false
kkonganti@1 101
kkonganti@1 102 --wcomp_not_null_serovar : Only index the computed_serotype column i.e
kkonganti@1 103 . column number 49 (non 0-based), if the
kkonganti@1 104 serovar column is not NULL. Default: false
kkonganti@1 105
kkonganti@1 106 --wcomp_i : Force include this serovar. Ignores --
kkonganti@1 107 wcomp_complete_sero for only this serovar.
kkonganti@1 108 Mention multiple serovars separated by a
kkonganti@1 109 ! (Exclamation mark). Ex: --
kkonganti@1 110 wcomp_complete_sero I 4,[5],12:i:-!Agona
kkonganti@1 111 Default: false
kkonganti@1 112
kkonganti@1 113 --wcomp_num : Number of genome accessions to be collected
kkonganti@1 114 per serotype. Default: false
kkonganti@1 115
kkonganti@1 116 --wcomp_min_contig_size : Minimum contig size to consider a genome
kkonganti@1 117 for indexing. Default: false
kkonganti@1 118
kkonganti@1 119 --wsnp_serocol : Column number (non 0-based index) of the
kkonganti@1 120 PDG metadata file by which the serotypes
kkonganti@1 121 are collected. Default: false
kkonganti@1 122
kkonganti@1 123 --wsnp_complete_sero : Skip indexing serotypes when the serotype
kkonganti@1 124 name in the column number 49 (non 0-based)
kkonganti@1 125 of PDG metadata file consists a "-". For
kkonganti@1 126 example, if an accession has a serotype=
kkonganti@1 127 string as such in column number 49 (non 0-
kkonganti@1 128 based): "serotype=- 13:z4,z23:-" then, the
kkonganti@1 129 indexing of that accession is skipped.
kkonganti@1 130 Default: true
kkonganti@1 131
kkonganti@1 132 --wsnp_not_null_serovar : Only index the computed_serotype column i.e
kkonganti@1 133 . column number 49 (non 0-based), if the
kkonganti@1 134 serovar column is not NULL. Default: false
kkonganti@1 135
kkonganti@1 136 --wsnp_i : Force include this serovar. Ignores --
kkonganti@1 137 wsnp_complete_sero for only this serovar.
kkonganti@1 138 Mention multiple serovars separated by a
kkonganti@1 139 ! (Exclamation mark). Ex: --
kkonganti@1 140 wsnp_complete_sero I 4,[5],12:i:-!Agona
kkonganti@1 141 Default: 'I 4,[5],12:i
kkonganti@1 142
kkonganti@1 143 --wsnp_num : Number of genome accessions to collect per
kkonganti@1 144 SNP cluster. Default: false
kkonganti@1 145
kkonganti@1 146 --mashsketch_run : Run `mash screen` tool. Default: true
kkonganti@1 147
kkonganti@1 148 --mashsketch_l : List input. Lines in each <input> specify
kkonganti@1 149 paths to sequence files, one per line.
kkonganti@1 150 Default: true
kkonganti@1 151
kkonganti@1 152 --mashsketch_I : <path> ID field for sketch of reads (
kkonganti@1 153 instead of first sequence ID). Default:
kkonganti@1 154 false
kkonganti@1 155
kkonganti@1 156 --mashsketch_C : <path> Comment for a sketch of reads (
kkonganti@1 157 instead of first sequence comment). Default
kkonganti@1 158 : false
kkonganti@1 159
kkonganti@1 160 --mashsketch_k : <int> K-mer size. Hashes will be based on
kkonganti@1 161 strings of this many nucleotides.
kkonganti@1 162 Canonical nucleotides are used by default (
kkonganti@1 163 see Alphabet options below). (1-32) Default
kkonganti@1 164 : 21
kkonganti@1 165
kkonganti@1 166 --mashsketch_s : <int> Sketch size. Each sketch will have
kkonganti@1 167 at most this many non-redundant min-hashes
kkonganti@1 168 . Default: 1000
kkonganti@1 169
kkonganti@1 170 --mashsketch_i : Sketch individual sequences, rather than
kkonganti@1 171 whole files, e.g. for multi-fastas of
kkonganti@1 172 single-chromosome genomes or pair-wise gene
kkonganti@1 173 comparisons. Default: false
kkonganti@1 174
kkonganti@1 175 --mashsketch_S : <int> Seed to provide to the hash
kkonganti@1 176 function. (0-4294967296) [42] Default:
kkonganti@1 177 false
kkonganti@1 178
kkonganti@1 179 --mashsketch_w : <num> Probability threshold for warning
kkonganti@1 180 about low k-mer size. (0-1) Default: false
kkonganti@1 181
kkonganti@1 182 --mashsketch_r : Input is a read set. See Reads options
kkonganti@1 183 below. Incompatible with --mashsketch_i.
kkonganti@1 184 Default: false
kkonganti@1 185
kkonganti@1 186 --mashsketch_b : <size> Use a Bloom filter of this size (
kkonganti@1 187 raw bytes or with K/M/G/T) to filter out
kkonganti@1 188 unique k-mers. This is useful if exact
kkonganti@1 189 filtering with --mashsketch_m uses too much
kkonganti@1 190 memory. However, some unique k-mers may
kkonganti@1 191 pass erroneously, and copies cannot be
kkonganti@1 192 counted beyond 2. Implies --mashsketch_r.
kkonganti@1 193 Default: false
kkonganti@1 194
kkonganti@1 195 --mashsketch_m : <int> Minimum copies of each k-mer
kkonganti@1 196 required to pass noise filter for reads.
kkonganti@1 197 Implies --mashsketch_r. Default: false
kkonganti@1 198
kkonganti@1 199 --mashsketch_c : <num> Target coverage. Sketching will
kkonganti@1 200 conclude if this coverage is reached before
kkonganti@1 201 the end of the input file (estimated by
kkonganti@1 202 average k-mer multiplicity). Implies --
kkonganti@1 203 mashsketch_r. Default: false
kkonganti@1 204
kkonganti@1 205 --mashsketch_g : <size> Genome size (raw bases or with K/M/
kkonganti@1 206 G/T). If specified, will be used for p-
kkonganti@1 207 value calculation instead of an estimated
kkonganti@1 208 size from k-mer content. Implies --
kkonganti@1 209 mashsketch_r. Default: false
kkonganti@1 210
kkonganti@1 211 --mashsketch_n : Preserve strand (by default, strand is
kkonganti@1 212 ignored by using canonical DNA k-mers,
kkonganti@1 213 which are alphabetical minima of forward-
kkonganti@1 214 reverse pairs). Implied if an alphabet is
kkonganti@1 215 specified with --mashsketch_a or --
kkonganti@1 216 mashsketch_z. Default: false
kkonganti@1 217
kkonganti@1 218 --mashsketch_a : Use amino acid alphabet (A-Z, except BJOUXZ
kkonganti@1 219 ). Implies --mashsketch_n --mashsketch_k 9
kkonganti@1 220 . Default: false
kkonganti@1 221
kkonganti@1 222 --mashsketch_z : <text> Alphabet to base hashes on (case
kkonganti@1 223 ignored by default; see --mashsketch_Z). K-
kkonganti@1 224 mers with other characters will be ignored
kkonganti@1 225 . Implies --mashsketch_n. Default: false
kkonganti@1 226
kkonganti@1 227 --mashsketch_Z : Preserve case in k-mers and alphabet (case
kkonganti@1 228 is ignored by default). Sequence letters
kkonganti@1 229 whose case is not in the current alphabet
kkonganti@1 230 will be skipped when sketching. Default:
kkonganti@1 231 false
kkonganti@1 232
kkonganti@1 233 Help options :
kkonganti@1 234
kkonganti@1 235 --help : Display this message.
kkonganti@1 236
kkonganti@1 237 ```
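
The `--wcomp_i` and `--wsnp_i` options take several serovars joined by `!`. Serovar names can contain characters that the shell treats specially (spaces, `[`, `]`, and `!`, which triggers history expansion in interactive bash), so single-quoting each name is the safest approach. The sketch below shows one way to build such a value; `join_serovars` is a hypothetical helper and not part of `cpipes`.

```bash
# Hypothetical helper: join all arguments with "!" as the separator.
join_serovars() {
    sep='!'        # kept in single quotes so interactive bash does not expand it
    result=""
    for s in "$@"; do
        if [ -z "$result" ]; then
            result="$s"
        else
            result="${result}${sep}${s}"
        fi
    done
    printf '%s\n' "$result"
}

# Single quotes keep the spaces, brackets, and trailing "-" intact.
include="$(join_serovars 'I 4,[5],12:i:-' 'Agona')"
printf '%s\n' "$include"
```

The resulting value can then be passed as one quoted argument, for example `--wsnp_i "$include"`.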