comparison 1.0.0/readme/bettercallsal_db.md @ 0:801b85b03a17 draft default tip
planemo upload
| author | galaxytrakr |
|---|---|
| date | Thu, 28 May 2026 20:31:42 +0000 |
| parents | |
| children |
| -1:000000000000 | 0:801b85b03a17 |
|---|---|
| 1 # bettercallsal_db | |
| 2 | |
| 3 `bettercallsal_db` is an end-to-end automated workflow that generates and consolidates the required DB flat files from the [NCBI Pathogens Database for Salmonella](https://ftp.ncbi.nlm.nih.gov/pathogen/Results/Salmonella/). It first downloads the metadata for the provided release identifier (e.g. `latest_snps` or `PDG000000002.3082`) and then creates a `mash sketch` according to the filtering strategy. It generates two types of sketches: one prioritizes genome collection based on SNP clustering (`per_snp_cluster`), and the other collects up to N genome accessions for each value in the `computed_serotype` column of the metadata file (`per_computed_serotype`). | |
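
To make the `per_computed_serotype` idea concrete, here is a minimal sketch of "keep at most N accessions per serotype". It is not the workflow's actual implementation: the function name `pick_per_serotype`, the two-column `accession<TAB>serotype` input layout, and the toy accession IDs are all assumptions made for illustration.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: keep at most N accessions per serotype.
# Input (stdin): tab-separated "accession<TAB>serotype" lines. This layout is
# an assumption for the example, not the real PDG metadata format.
pick_per_serotype() {
    local max_per_serotype="$1"
    awk -F '\t' -v n="$max_per_serotype" '
        # seen[] counts how many accessions were already kept per serotype;
        # a line is printed only while that count is still below n.
        seen[$2]++ < n { print $1 "\t" $2 }
    '
}

# Toy data (made-up IDs): three Agona genomes and one Typhimurium genome.
printf 'acc_A1\tAgona\nacc_A2\tAgona\nacc_A3\tAgona\nacc_T1\tTyphimurium\n' |
    pick_per_serotype 2
```

With a cap of 2, the third Agona line is dropped while the single Typhimurium line is kept, which is the behavior the `per_computed_serotype` description implies.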
| 4 | |
| 5 The `bettercallsal_db` workflow should finish within an hour with a stable internet connection. | |
| 6 | |
| 7 \ | |
| 8 | |
| 9 | |
| 10 ## Workflow Usage | |
| 11 | |
| 12 ```bash | |
| 13 cpipes --pipeline bettercallsal_db [options] | |
| 14 ``` | |
| 15 | |
| 16 \ | |
| 17 | |
| 18 | |
| 19 Example: Run the `bettercallsal_db` pipeline and store output at `/data/Kranti_Konganti/bettercallsal_db/PDG000000002.3082`. | |
| 20 | |
| 21 ```bash | |
| 22 cpipes \ | |
| 23 --pipeline bettercallsal_db \ | |
| 24 --pdg_release PDG000000002.3082 \ | |
| 25 --output /data/Kranti_Konganti/bettercallsal_db/PDG000000002.3082 | |
| 26 ``` | |
| 27 | |
| 28 \ | |
| 29 | |
| 30 | |
| 31 You can now run the `bettercallsal` workflow against the newly created database by passing the database root path with the `--bcs_root_dbdir` option. | |
| 32 | |
| 33 ```bash | |
| 34 cpipes \ | |
| 35 --pipeline bettercallsal \ | |
| 36 --input /path/to/illumina/fastq/dir \ | |
| 37 --output /path/to/output \ | |
| 38 --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db/PDG000000002.3082 | |
| 39 ``` | |
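
The two steps above share one path: the `--output` directory of `bettercallsal_db` is the value later given to `--bcs_root_dbdir`. The sketch below only prints both commands so that the shared path can be checked before anything is run. The function `print_bcs_commands` and its four positional arguments are hypothetical; the `cpipes` options are the ones shown in the examples above.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print (do not execute) the two cpipes invocations for
# a given release, database root, FASTQ directory and results directory.
print_bcs_commands() {
    local release="$1" db_root="$2" fastq_dir="$3" out_dir="$4"
    # The database for a release lives under <db_root>/<release>.
    local db_dir="${db_root}/${release}"

    # Step 1: build the database for this release.
    echo "cpipes --pipeline bettercallsal_db --pdg_release ${release} --output ${db_dir}"
    # Step 2: point the main workflow at the directory written by step 1.
    echo "cpipes --pipeline bettercallsal --input ${fastq_dir} --output ${out_dir} --bcs_root_dbdir ${db_dir}"
}

print_bcs_commands PDG000000002.3082 /data/Kranti_Konganti/bettercallsal_db \
    /path/to/illumina/fastq/dir /path/to/output
```

Deriving the database directory from a single variable avoids the two commands drifting apart when the release identifier changes.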
| 40 | |
| 41 \ | |
| 42 | |
| 43 | |
| 44 ## Note | |
| 45 | |
| 46 Note that `SCAFFOLD_GENOMES`, the last step of the `bettercallsal_db` workflow, spawns multiple processes and is not cached by **Nextflow**. This is intentional for this stage, to speed up database creation, so it is recommended that you run this workflow on a grid computing cluster or in a comparable cloud computing environment. | |
| 47 | |
| 48 \ | |
| 49 | |
| 50 | |
| 51 ## `bettercallsal_db` CLI Help | |
| 52 | |
| 53 ```text | |
| 54 [Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal_db --help | |
| 55 | |
| 56 N E X T F L O W ~ version 24.04.3 | |
| 57 | |
| 58 Launching `~/apps/bettercallsal/1.0.0/cpipes` [shrivelled_hamilton] DSL2 - revision: d9b4be42be | |
| 59 | |
| 60 ================================================================================ | |
| 61 (o) | |
| 62 ___ _ __ _ _ __ ___ ___ | |
| 63 / __|| '_ \ | || '_ \ / _ \/ __| | |
| 64 | (__ | |_) || || |_) || __/\__ \ | |
| 65 \___|| .__/ |_|| .__/ \___||___/ | |
| 66 | | | | | |
| 67 |_| |_| | |
| 68 -------------------------------------------------------------------------------- | |
| 69 A collection of modular pipelines at CFSAN, FDA. | |
| 70 -------------------------------------------------------------------------------- | |
| 71 Name : bettercallsal | |
| 72 Author : Kranti Konganti | |
| 73 Version : 0.9.0 | |
| 74 Center : CFSAN, FDA. | |
| 75 ================================================================================ | |
| 76 | |
| 77 Workflow : bettercallsal_db | |
| 78 | |
| 79 Author : Kranti Konganti | |
| 80 | |
| 81 Version : 1.0.0 | |
| 82 | |
| 83 | |
| 84 Required : | |
| 85 | |
| 86 --output : Absolute path to directory where all the | |
| 87 pipeline outputs should be stored. Ex: -- | |
| 88 output /path/to/output | |
| 89 | |
| 90 Other options : | |
| 91 | |
| 92 --wcomp_serocol : Column number (non 0-based index) of the | |
| 93 PDG metadata file by which the serotypes | |
| 94 are collected. Default: false | |
| 95 | |
| 96 --wcomp_seronamecol : Column number (non 0-based index) of the | |
| 97 PDG metadata file whose column name is " | |
| 98 serovar". Default: false | |
| 99 | |
| 100 --wcomp_acc_col : Column number (non 0-based index) of the | |
| 101 PDG metadata file whose column name is "acc | |
| 102 ". Default: false | |
| 103 | |
| 104 --wcomp_target_acc_col : Column number (non 0-based index) of the | |
| 105 PDG metadata file whose column name is " | |
| 106 target_acc". Default: false | |
| 107 | |
| 108 --wcomp_complete_sero : Skip indexing serotypes when the serotype | |
| 109 name in the column number 49 (non 0-based) | |
| 110 of PDG metadata file consists a "-". For | |
| 111 example, if an accession has a serotype= | |
| 112 string as such in column number 49 (non 0- | |
| 113 based): "serotype=- 13:z4,z23:-" then, the | |
| 114 indexing of that accession is skipped. | |
| 115 Default: false | |
| 116 | |
| 117 --wcomp_not_null_serovar : Only index the computed_serotype column i.e | |
| 118 . column number 49 (non 0-based), if the | |
| 119 serovar column is not NULL. Default: false | |
| 120 | |
| 121 --wcomp_i : Force include this serovar. Ignores -- | |
| 122 wcomp_complete_sero for only this serovar. | |
| 123 Mention multiple serovars separated by a | |
| 124 ! (Exclamation mark). Ex: -- | |
| 125 wcomp_complete_sero I 4,[5],12:i:-!Agona | |
| 126 Default: false | |
| 127 | |
| 128 --wcomp_num : Number of genome accessions to be collected | |
| 129 per serotype. Default: false | |
| 130 | |
| 131 --wcomp_min_contig_size : Minimum contig size to consider a genome | |
| 132 for indexing. Default: false | |
| 133 | |
| 134 --wsnp_serocol : Column number (non 0-based index) of the | |
| 135 PDG metadata file by which the serotypes | |
| 136 are collected. Default: false | |
| 137 | |
| 138 --wsnp_seronamecol : Column number (non 0-based index) of the | |
| 139 PDG metadata file whose column name is " | |
| 140 serovar". Default: false | |
| 141 | |
| 142 --wsnp_acc_col : Column number (non 0-based index) of the | |
| 143 PDG metadata file whose column name is "acc | |
| 144 ". Default: false | |
| 145 | |
| 146 --wsnp_target_acc_col : Column number (non 0-based index) of the | |
| 147 PDG metadata file whose column name is " | |
| 148 target_acc". Default: false | |
| 149 | |
| 150 --wsnp_complete_sero : Skip indexing serotypes when the serotype | |
| 151 name in the column number 49 (non 0-based) | |
| 152 of PDG metadata file consists a "-". For | |
| 153 example, if an accession has a serotype= | |
| 154 string as such in column number 49 (non 0- | |
| 155 based): "serotype=- 13:z4,z23:-" then, the | |
| 156 indexing of that accession is skipped. | |
| 157 Default: true | |
| 158 | |
| 159 --wsnp_not_null_serovar : Only index the computed_serotype column i.e | |
| 160 . column number 49 (non 0-based), if the | |
| 161 serovar column is not NULL. Default: false | |
| 162 | |
| 163 --wsnp_i : Force include this serovar. Ignores -- | |
| 164 wsnp_complete_sero for only this serovar. | |
| 165 Mention multiple serovars separated by a | |
| 166 ! (Exclamation mark). Ex: -- | |
| 167 wsnp_complete_sero I 4,[5],12:i:-!Agona | |
| 168 Default: 'I 4,[5],12:i | |
| 169 | |
| 170 --wsnp_num : Number of genome accessions to collect per | |
| 171 SNP cluster. Default: false | |
| 172 | |
| 173 --mashsketch_run : Run `mash screen` tool. Default: true | |
| 174 | |
| 175 --mashsketch_l : List input. Lines in each <input> specify | |
| 176 paths to sequence files, one per line. | |
| 177 Default: true | |
| 178 | |
| 179 --mashsketch_I : <path> ID field for sketch of reads ( | |
| 180 instead of first sequence ID). Default: | |
| 181 false | |
| 182 | |
| 183 --mashsketch_C : <path> Comment for a sketch of reads ( | |
| 184 instead of first sequence comment). Default | |
| 185 : false | |
| 186 | |
| 187 --mashsketch_k : <int> K-mer size. Hashes will be based on | |
| 188 strings of this many nucleotides. | |
| 189 Canonical nucleotides are used by default ( | |
| 190 see Alphabet options below). (1-32) Default | |
| 191 : 21 | |
| 192 | |
| 193 --mashsketch_s : <int> Sketch size. Each sketch will have | |
| 194 at most this many non-redundant min-hashes | |
| 195 . Default: 1000 | |
| 196 | |
| 197 --mashsketch_i : Sketch individual sequences, rather than | |
| 198 whole files, e.g. for multi-fastas of | |
| 199 single-chromosome genomes or pair-wise gene | |
| 200 comparisons. Default: false | |
| 201 | |
| 202 --mashsketch_S : <int> Seed to provide to the hash | |
| 203 function. (0-4294967296) [42] Default: | |
| 204 false | |
| 205 | |
| 206 --mashsketch_w : <num> Probability threshold for warning | |
| 207 about low k-mer size. (0-1) Default: false | |
| 208 | |
| 209 --mashsketch_r : Input is a read set. See Reads options | |
| 210 below. Incompatible with --mashsketch_i. | |
| 211 Default: false | |
| 212 | |
| 213 --mashsketch_b : <size> Use a Bloom filter of this size ( | |
| 214 raw bytes or with K/M/G/T) to filter out | |
| 215 unique k-mers. This is useful if exact | |
| 216 filtering with --mashsketch_m uses too much | |
| 217 memory. However, some unique k-mers may | |
| 218 pass erroneously, and copies cannot be | |
| 219 counted beyond 2. Implies --mashsketch_r. | |
| 220 Default: false | |
| 221 | |
| 222 --mashsketch_m : <int> Minimum copies of each k-mer | |
| 223 required to pass noise filter for reads. | |
| 224 Implies --mashsketch_r. Default: false | |
| 225 | |
| 226 --mashsketch_c : <num> Target coverage. Sketching will | |
| 227 conclude if this coverage is reached before | |
| 228 the end of the input file (estimated by | |
| 229 average k-mer multiplicity). Implies -- | |
| 230 mashsketch_r. Default: false | |
| 231 | |
| 232 --mashsketch_g : <size> Genome size (raw bases or with K/M/ | |
| 233 G/T). If specified, will be used for p- | |
| 234 value calculation instead of an estimated | |
| 235 size from k-mer content. Implies -- | |
| 236 mashsketch_r. Default: false | |
| 237 | |
| 238 --mashsketch_n : Preserve strand (by default, strand is | |
| 239 ignored by using canonical DNA k-mers, | |
| 240 which are alphabetical minima of forward- | |
| 241 reverse pairs). Implied if an alphabet is | |
| 242 specified with --mashsketch_a or -- | |
| 243 mashsketch_z. Default: false | |
| 244 | |
| 245 --mashsketch_a : Use amino acid alphabet (A-Z, except BJOUXZ | |
| 246 ). Implies --mashsketch_n --mashsketch_k 9 | |
| 247 . Default: false | |
| 248 | |
| 249 --mashsketch_z : <text> Alphabet to base hashes on (case | |
| 250 ignored by default; see --mashsketch_Z). K- | |
| 251 mers with other characters will be ignored | |
| 252 . Implies --mashsketch_n. Default: false | |
| 253 | |
| 254 --mashsketch_Z : Preserve case in k-mers and alphabet (case | |
| 255 is ignored by default). Sequence letters | |
| 256 whose case is not in the current alphabet | |
| 257 will be skipped when sketching. Default: | |
| 258 false | |
| 259 | |
| 260 Help options : | |
| 261 | |
| 262 --help : Display this message. | |
| 263 | |
| 264 ``` |
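
Several of the options above take a 1-based ("non 0-based") column number of the PDG metadata file. The sketch below shows one way to look up such a number from a column name in a tab-separated header line. The helper `column_number` and the four-column toy header are assumptions for illustration; a real PDG metadata file has many more columns, and its exact layout should be checked against the downloaded release.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the 1-based position of a named column in the
# header (first line) of tab-separated text read from stdin.
# Returns non-zero if the column name is not present.
column_number() {
    local name="$1"
    awk -F '\t' -v name="$name" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == name) { print i; found = 1; exit }
            }
            exit
        }
        END { exit(found ? 0 : 1) }
    '
}

# Toy header; the column order here is made up for the example.
header=$'target_acc\tserovar\tacc\tcomputed_serotype'
column_number serovar <<< "$header"            # prints 2 for this toy header
column_number computed_serotype <<< "$header"  # prints 4 for this toy header
```

A number obtained this way is the kind of value that options such as `--wcomp_seronamecol` or `--wsnp_acc_col` expect, since awk field numbers are also 1-based.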
