author kkonganti
date Mon, 15 Jul 2024 11:01:13 -0400
# bettercallsal_db

`bettercallsal_db` is an end-to-end automated workflow that generates and consolidates the required DB flat files from the [NCBI Pathogens Database for Salmonella](https://ftp.ncbi.nlm.nih.gov/pathogen/Results/Salmonella/). It first downloads the metadata for the provided release identifier (e.g., `latest_snps` or `PDG000000002.2876`) and then creates a `mash sketch` based on the filtering strategy. It generates two types of sketches: one prioritizes genome collection based on SNP clustering (`per_snp_cluster`), and the other collects up to N genome accessions for each serotype in the `computed_serotype` column of the metadata file (`per_computed_serotype`).

The `bettercallsal_db` workflow should finish within an hour given a stable internet connection.
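
The `per_computed_serotype` idea described above (collect up to N accessions per serotype) can be pictured with a toy example. This is only an illustrative sketch, not the workflow's actual implementation: the two-column layout (accession, serotype), the accession values, and the helper name `cap_per_serotype` are all assumptions made for the demonstration.

```bash
#!/usr/bin/env bash
# Keep at most N accessions per serotype from a two-column, tab-separated table.
# The columns (accession, serotype) are a simplified stand-in for PDG metadata.
cap_per_serotype() {
    local max_per_group="$1"
    awk -F'\t' -v n="$max_per_group" '
        # seen[$2] holds how many accessions were already kept for this serotype.
        seen[$2]++ < n { print $1 "\t" $2 }
    '
}

# Toy input with made-up accessions: three Agona genomes, one Typhimurium genome.
toy_metadata=$'GCA_000000001.1\tAgona\nGCA_000000002.1\tAgona\nGCA_000000003.1\tAgona\nGCA_000000004.1\tTyphimurium'

# With N=2, the third Agona accession is dropped.
selected="$(printf '%s\n' "$toy_metadata" | cap_per_serotype 2)"
printf '%s\n' "$selected"
```

The real workflow applies additional filters (see the `--wcomp_*` options in the CLI help), so treat this only as a mental model of the per-serotype cap.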

\
&nbsp;

## Workflow Usage

```bash
cpipes --pipeline bettercallsal_db [options]
```

\
&nbsp;

Example: Run the `bettercallsal_db` pipeline and store the output at `/data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876`.

```bash
cpipes \
    --pipeline bettercallsal_db \
    --pdg_release PDG000000002.2876 \
    --output /data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876
```
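
Because `--output` is documented as an absolute path and the example stores each database under a directory named after its release, it can help to assemble the command from two variables. The snippet below is a small sketch that only prints the command rather than running it; `build_db_cmd` is a hypothetical helper name, and only the `cpipes` options shown in this README are used.

```bash
#!/usr/bin/env bash
# Print (not run) a bettercallsal_db command for a given release and output root.
# build_db_cmd is a hypothetical helper; the cpipes options come from this README.
build_db_cmd() {
    local release="$1" out_root="$2"
    # --output is documented as an absolute path, so reject anything else early.
    case "$out_root" in
        /*) ;;
        *)
            echo "error: output root must be an absolute path: $out_root" >&2
            return 1
            ;;
    esac
    # Strip one trailing slash from the root, then append the release name.
    printf 'cpipes --pipeline bettercallsal_db --pdg_release %s --output %s/%s\n' \
        "$release" "${out_root%/}" "$release"
}

build_db_cmd PDG000000002.2876 /data/Kranti_Konganti/bettercallsal_db
```

Printing the command first makes it easy to review the paths before launching a run that downloads data.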

\
&nbsp;

You can now run the `bettercallsal` workflow against the newly created database by passing the database root path with the `--bcs_root_dbdir` option.

```bash
cpipes \
    --pipeline bettercallsal \
    --input /path/to/illumina/fastq/dir \
    --output /path/to/output \
    --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876
```
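
Before launching `bettercallsal`, a quick sanity check on the path given to `--bcs_root_dbdir` can catch typos early. The sketch below assumes nothing about the database layout: the hypothetical helper `require_db_root` only verifies that the path is an existing, non-empty directory, and the demo uses a throwaway directory in place of a real database.

```bash
#!/usr/bin/env bash
# require_db_root is a hypothetical helper: it only checks that the path is an
# existing, non-empty directory. It does not validate the database contents.
require_db_root() {
    local db_root="$1"
    if [ ! -d "$db_root" ]; then
        echo "error: database root not found: $db_root" >&2
        return 1
    fi
    if [ -z "$(ls -A "$db_root")" ]; then
        echo "error: database root is empty: $db_root" >&2
        return 1
    fi
}

# Demo against a throwaway directory standing in for the real database root.
demo_root="$(mktemp -d)"
touch "$demo_root/placeholder"
require_db_root "$demo_root" && echo "database root looks usable: $demo_root"
```

In practice you would call `require_db_root` with the same path you pass to `--bcs_root_dbdir`.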

\
&nbsp;

## Note

Please note that the last step of the `bettercallsal_db` workflow, `SCAFFOLD_GENOMES`, spawns multiple processes and is not cached by **Nextflow**. This is intentional for this specific stage of the workflow, to speed up database creation. It is therefore recommended that you run this workflow on a grid computing cluster or in a similar cloud computing environment.

\
&nbsp;

## `bettercallsal_db` CLI Help

```text
[Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal_db --help
N E X T F L O W  ~  version 23.04.3
Launching `./bettercallsal/cpipes` [special_brenner] DSL2 - revision: 8da4e11078
================================================================================
             (o)
  ___  _ __   _  _ __    ___  ___
 / __|| '_ \ | || '_ \  / _ \/ __|
| (__ | |_) || || |_) ||  __/\__ \
 \___|| .__/ |_|| .__/  \___||___/
      | |       | |
      |_|       |_|
--------------------------------------------------------------------------------
A collection of modular pipelines at CFSAN, FDA.
--------------------------------------------------------------------------------
Name                            : bettercallsal
Author                          : Kranti Konganti
Version                         : 0.7.0
Center                          : CFSAN, FDA.
================================================================================

Workflow                        : bettercallsal_db

Author                          : Kranti Konganti

Version                         : 0.7.0


Required                        :

--output                        : Absolute path to directory where all the
                                  pipeline outputs should be stored. Ex: --
                                  output /path/to/output

Other options                   :

--wcomp_serocol                 : Column number (non 0-based index) of the
                                  PDG metadata file by which the serotypes
                                  are collected. Default: false

--wcomp_seronamecol             : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "
                                  serovar". Default: false

--wcomp_acc_col                 : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "acc
                                  ". Default: false

--wcomp_target_acc_col          : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "
                                  target_acc". Default: false

--wcomp_complete_sero           : Skip indexing serotypes when the serotype
                                  name in the column number 49 (non 0-based)
                                  of PDG metadata file consists a "-". For
                                  example, if an accession has a serotype=
                                  string as such in column number 49 (non 0-
                                  based): "serotype=- 13:z4,z23:-" then, the
                                  indexing of that accession is skipped.
                                  Default: false

--wcomp_not_null_serovar        : Only index the computed_serotype column i.e
                                  . column number 49 (non 0-based), if the
                                  serovar column is not NULL. Default: false

--wcomp_i                       : Force include this serovar. Ignores --
                                  wcomp_complete_sero for only this serovar.
                                  Mention multiple serovars separated by a
                                  ! (Exclamation mark). Ex: --
                                  wcomp_complete_sero I 4,[5],12:i:-!Agona
                                  Default: false

--wcomp_num                     : Number of genome accessions to be collected
                                  per serotype. Default: false

--wcomp_min_contig_size         : Minimum contig size to consider a genome
                                  for indexing. Default: false

--wsnp_serocol                  : Column number (non 0-based index) of the
                                  PDG metadata file by which the serotypes
                                  are collected. Default: false

--wsnp_seronamecol              : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "
                                  serovar". Default: false

--wsnp_acc_col                  : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "acc
                                  ". Default: false

--wsnp_target_acc_col           : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "
                                  target_acc". Default: false

--wsnp_complete_sero            : Skip indexing serotypes when the serotype
                                  name in the column number 49 (non 0-based)
                                  of PDG metadata file consists a "-". For
                                  example, if an accession has a serotype=
                                  string as such in column number 49 (non 0-
                                  based): "serotype=- 13:z4,z23:-" then, the
                                  indexing of that accession is skipped.
                                  Default: true

--wsnp_not_null_serovar         : Only index the computed_serotype column i.e
                                  . column number 49 (non 0-based), if the
                                  serovar column is not NULL. Default: false

--wsnp_i                        : Force include this serovar. Ignores --
                                  wsnp_complete_sero for only this serovar.
                                  Mention multiple serovars separated by a
                                  ! (Exclamation mark). Ex: --
                                  wsnp_complete_sero I 4,[5],12:i:-!Agona
                                  Default: 'I 4,[5],12:i

--wsnp_num                      : Number of genome accessions to collect per
                                  SNP cluster. Default: false

--mashsketch_run                : Run `mash screen` tool. Default: true

--mashsketch_l                  : List input. Lines in each <input> specify
                                  paths to sequence files, one per line.
                                  Default: true

--mashsketch_I                  : <path> ID field for sketch of reads (
                                  instead of first sequence ID). Default:
                                  false

--mashsketch_C                  : <path> Comment for a sketch of reads (
                                  instead of first sequence comment). Default
                                  : false

--mashsketch_k                  : <int> K-mer size. Hashes will be based on
                                  strings of this many nucleotides.
                                  Canonical nucleotides are used by default (
                                  see Alphabet options below). (1-32) Default
                                  : 21

--mashsketch_s                  : <int> Sketch size. Each sketch will have
                                  at most this many non-redundant min-hashes
                                  . Default: 1000

--mashsketch_i                  : Sketch individual sequences, rather than
                                  whole files, e.g. for multi-fastas of
                                  single-chromosome genomes or pair-wise gene
                                  comparisons. Default: false

--mashsketch_S                  : <int> Seed to provide to the hash
                                  function. (0-4294967296) [42] Default:
                                  false

--mashsketch_w                  : <num> Probability threshold for warning
                                  about low k-mer size. (0-1) Default: false

--mashsketch_r                  : Input is a read set. See Reads options
                                  below. Incompatible with --mashsketch_i.
                                  Default: false

--mashsketch_b                  : <size> Use a Bloom filter of this size (
                                  raw bytes or with K/M/G/T) to filter out
                                  unique k-mers. This is useful if exact
                                  filtering with --mashsketch_m uses too much
                                  memory. However, some unique k-mers may
                                  pass erroneously, and copies cannot be
                                  counted beyond 2. Implies --mashsketch_r.
                                  Default: false

--mashsketch_m                  : <int> Minimum copies of each k-mer
                                  required to pass noise filter for reads.
                                  Implies --mashsketch_r. Default: false

--mashsketch_c                  : <num> Target coverage. Sketching will
                                  conclude if this coverage is reached before
                                  the end of the input file (estimated by
                                  average k-mer multiplicity). Implies --
                                  mashsketch_r. Default: false

--mashsketch_g                  : <size> Genome size (raw bases or with K/M/
                                  G/T). If specified, will be used for p-
                                  value calculation instead of an estimated
                                  size from k-mer content. Implies --
                                  mashsketch_r. Default: false

--mashsketch_n                  : Preserve strand (by default, strand is
                                  ignored by using canonical DNA k-mers,
                                  which are alphabetical minima of forward-
                                  reverse pairs). Implied if an alphabet is
                                  specified with --mashsketch_a or --
                                  mashsketch_z. Default: false

--mashsketch_a                  : Use amino acid alphabet (A-Z, except BJOUXZ
                                  ). Implies --mashsketch_n --mashsketch_k 9
                                  . Default: false

--mashsketch_z                  : <text> Alphabet to base hashes on (case
                                  ignored by default; see --mashsketch_Z). K-
                                  mers with other characters will be ignored
                                  . Implies --mashsketch_n. Default: false

--mashsketch_Z                  : Preserve case in k-mers and alphabet (case
                                  is ignored by default). Sequence letters
                                  whose case is not in the current alphabet
                                  will be skipped when sketching. Default:
                                  false

Help options                    :

--help                          : Display this message.

```
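
Several options above (`--wcomp_serocol`, `--wsnp_acc_col`, and similar) expect a 1-based ("non 0-based") column number from the PDG metadata file. One way to look such a number up from the header row is sketched below. It assumes the metadata file is tab-separated with column names on its first line; the helper name `column_number` and the four-column toy header are made up for illustration.

```bash
#!/usr/bin/env bash
# Print the 1-based position of a named column in a tab-separated header line.
# Returns non-zero when the column name is absent or the file is empty.
column_number() {
    local name="$1" file="$2"
    awk -F'\t' -v want="$name" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == want) { print i; found = 1; break }
            }
            exit    # only the header line is needed; jumps to END
        }
        END { exit !found }
    ' "$file"
}

# Toy header: the real PDG metadata file has many more columns.
toy_tsv="$(mktemp)"
printf 'target_acc\tserovar\tacc\tcomputed_serotype\n' > "$toy_tsv"

column_number computed_serotype "$toy_tsv"   # prints 4 for this toy header
```

Running the same function against a downloaded metadata file would give the value to pass to the corresponding `--wcomp_*` or `--wsnp_*` column option, provided the file really is tab-separated with a header row.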