annotate 0.6.1/readme/bettercallsal_db.md @ 16:b90e5a7a3d4f

"planemo upload"
author kkonganti
date Thu, 07 Sep 2023 15:22:10 -0400
parents 749faef1caa9
# bettercallsal_db

`bettercallsal_db` is an end-to-end automated workflow that generates and consolidates the required DB flat files from the [NCBI Pathogens Database for Salmonella](https://ftp.ncbi.nlm.nih.gov/pathogen/Results/Salmonella/). It first downloads the metadata for the provided release identifier (e.g. `latest_snps` or `PDG000000002.2727`) and then creates a `mash sketch` according to the filtering strategy. It generates two types of sketches: one prioritizes genome collection based on SNP clustering (`per_snp_cluster`), and the other collects up to N genome accessions for each value in the `computed_serotype` column of the metadata file (`per_computed_serotype`).

The `bettercallsal_db` workflow should finish within an hour on a stable internet connection.
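
The `per_computed_serotype` idea (keep at most N accessions for each serotype) can be pictured with a small shell sketch. This is only an illustration and not the workflow's actual code: the function name `take_n_per_serotype`, the two-column tab-separated layout, and the sample rows are all assumptions made up for the example.

```bash
#!/usr/bin/env bash
# Illustrative only: keep at most N rows per serotype from a TSV stream.
# Assumed (hypothetical) input layout: accession <TAB> computed_serotype.
take_n_per_serotype() {
    local n="$1"
    # seen[] counts rows per serotype (column 2); a row is printed
    # only while the count for its serotype is still <= n.
    awk -F'\t' -v n="$n" '++seen[$2] <= n'
}

# Made-up sample rows, not real NCBI accessions.
sample=$'ACC_1\tTyphimurium\nACC_2\tTyphimurium\nACC_3\tTyphimurium\nACC_4\tAgona'

result="$(printf '%s\n' "$sample" | take_n_per_serotype 2)"
printf '%s\n' "$result"
```

With N set to 2, the third `Typhimurium` row is dropped while the single `Agona` row is kept. The real metadata file has many more columns, so the column index would need to be adjusted accordingly.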

## Workflow Usage

```bash
cpipes --pipeline bettercallsal_db [options]
```

Example: Run the `bettercallsal_db` pipeline and store the output at `/data/Kranti_Konganti/bettercallsal_db/PDG000000002.2727`.

```bash
cpipes \
    --pipeline bettercallsal_db \
    --pdg_release PDG000000002.2727 \
    --output /data/Kranti_Konganti/bettercallsal_db/PDG000000002.2727
```
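
As a variation on the example above, the release identifier can be kept in a variable so that the output directory always matches the release being built. This is a hedged sketch: the helper name `build_db_cmd` and the `/tmp/bettercallsal_db` location are placeholders, it assumes `--pdg_release` accepts `latest_snps` the same way it accepts a `PDG...` identifier, and the command is only printed, not executed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build (but do not run) the cpipes command line
# for a given release identifier and an absolute output root.
build_db_cmd() {
    local release="$1" out_root="$2"
    # Only flags that appear in the example above are used here.
    printf '%s\n' "cpipes --pipeline bettercallsal_db --pdg_release ${release} --output ${out_root}/${release}"
}

db_cmd="$(build_db_cmd latest_snps /tmp/bettercallsal_db)"
echo "$db_cmd"
```

Printing the command first makes it easy to review the release and output path before starting a run that downloads data.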

Now you can run the `bettercallsal` workflow against the newly created database by passing the database root path with the `--bcs_root_dbdir` option.

```bash
cpipes \
    --pipeline bettercallsal \
    --input /path/to/illumina/fastq/dir \
    --output /path/to/output \
    --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db/PDG000000002.2727
```

## Note

The last step of the `bettercallsal_db` workflow, `SCAFFOLD_GENOMES`, spawns multiple processes and is not cached by **Nextflow**. This is intentional for this stage of the workflow to speed up database creation, so it is recommended that you run this workflow on a grid computing cluster or in a similar cloud computing environment.

## `bettercallsal_db` CLI Help

```text
[Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal_db --help
N E X T F L O W  ~  version 23.04.3
Launching `./bettercallsal/cpipes` [special_brenner] DSL2 - revision: 8da4e11078
================================================================================
             (o)
  ___  _ __   _  _ __    ___  ___
 / __|| '_ \ | || '_ \  / _ \/ __|
| (__ | |_) || || |_) ||  __/\__ \
 \___|| .__/ |_|| .__/  \___||___/
      | |       | |
      |_|       |_|
--------------------------------------------------------------------------------
A collection of modular pipelines at CFSAN, FDA.
--------------------------------------------------------------------------------
Name                     : bettercallsal
Author                   : Kranti Konganti
Version                  : 0.6.1
Center                   : CFSAN, FDA.
================================================================================

Workflow                 : bettercallsal_db

Author                   : Kranti Konganti

Version                  : 0.6.1


Required                 :

--output                 : Absolute path to directory where all the
                           pipeline outputs should be stored. Ex: --
                           output /path/to/output

Other options            :

--wcomp_serocol          : Column number (non 0-based index) of the
                           PDG metadata file by which the serotypes
                           are collected. Default: false

--wcomp_complete_sero    : Skip indexing serotypes when the serotype
                           name in the column number 49 (non 0-based)
                           of PDG metadata file consists a "-". For
                           example, if an accession has a serotype=
                           string as such in column number 49 (non 0-
                           based): "serotype=- 13:z4,z23:-" then, the
                           indexing of that accession is skipped.
                           Default: false

--wcomp_not_null_serovar : Only index the computed_serotype column i.e
                           . column number 49 (non 0-based), if the
                           serovar column is not NULL. Default: false

--wcomp_i                : Force include this serovar. Ignores --
                           wcomp_complete_sero for only this serovar.
                           Mention multiple serovars separated by a
                           ! (Exclamation mark). Ex: --
                           wcomp_complete_sero I 4,[5],12:i:-!Agona
                           Default: false

--wcomp_num              : Number of genome accessions to be collected
                           per serotype. Default: false

--wcomp_min_contig_size  : Minimum contig size to consider a genome
                           for indexing. Default: false

--wsnp_serocol           : Column number (non 0-based index) of the
                           PDG metadata file by which the serotypes
                           are collected. Default: false

--wsnp_complete_sero     : Skip indexing serotypes when the serotype
                           name in the column number 49 (non 0-based)
                           of PDG metadata file consists a "-". For
                           example, if an accession has a serotype=
                           string as such in column number 49 (non 0-
                           based): "serotype=- 13:z4,z23:-" then, the
                           indexing of that accession is skipped.
                           Default: true

--wsnp_not_null_serovar  : Only index the computed_serotype column i.e
                           . column number 49 (non 0-based), if the
                           serovar column is not NULL. Default: false

--wsnp_i                 : Force include this serovar. Ignores --
                           wsnp_complete_sero for only this serovar.
                           Mention multiple serovars separated by a
                           ! (Exclamation mark). Ex: --
                           wsnp_complete_sero I 4,[5],12:i:-!Agona
                           Default: 'I 4,[5],12:i

--wsnp_num               : Number of genome accessions to collect per
                           SNP cluster. Default: false

--mashsketch_run         : Run `mash screen` tool. Default: true

--mashsketch_l           : List input. Lines in each <input> specify
                           paths to sequence files, one per line.
                           Default: true

--mashsketch_I           : <path> ID field for sketch of reads (
                           instead of first sequence ID). Default:
                           false

--mashsketch_C           : <path> Comment for a sketch of reads (
                           instead of first sequence comment). Default
                           : false

--mashsketch_k           : <int> K-mer size. Hashes will be based on
                           strings of this many nucleotides.
                           Canonical nucleotides are used by default (
                           see Alphabet options below). (1-32) Default
                           : 21

--mashsketch_s           : <int> Sketch size. Each sketch will have
                           at most this many non-redundant min-hashes
                           . Default: 1000

--mashsketch_i           : Sketch individual sequences, rather than
                           whole files, e.g. for multi-fastas of
                           single-chromosome genomes or pair-wise gene
                           comparisons. Default: false

--mashsketch_S           : <int> Seed to provide to the hash
                           function. (0-4294967296) [42] Default:
                           false

--mashsketch_w           : <num> Probability threshold for warning
                           about low k-mer size. (0-1) Default: false

--mashsketch_r           : Input is a read set. See Reads options
                           below. Incompatible with --mashsketch_i.
                           Default: false

--mashsketch_b           : <size> Use a Bloom filter of this size (
                           raw bytes or with K/M/G/T) to filter out
                           unique k-mers. This is useful if exact
                           filtering with --mashsketch_m uses too much
                           memory. However, some unique k-mers may
                           pass erroneously, and copies cannot be
                           counted beyond 2. Implies --mashsketch_r.
                           Default: false

--mashsketch_m           : <int> Minimum copies of each k-mer
                           required to pass noise filter for reads.
                           Implies --mashsketch_r. Default: false

--mashsketch_c           : <num> Target coverage. Sketching will
                           conclude if this coverage is reached before
                           the end of the input file (estimated by
                           average k-mer multiplicity). Implies --
                           mashsketch_r. Default: false

--mashsketch_g           : <size> Genome size (raw bases or with K/M/
                           G/T). If specified, will be used for p-
                           value calculation instead of an estimated
                           size from k-mer content. Implies --
                           mashsketch_r. Default: false

--mashsketch_n           : Preserve strand (by default, strand is
                           ignored by using canonical DNA k-mers,
                           which are alphabetical minima of forward-
                           reverse pairs). Implied if an alphabet is
                           specified with --mashsketch_a or --
                           mashsketch_z. Default: false

--mashsketch_a           : Use amino acid alphabet (A-Z, except BJOUXZ
                           ). Implies --mashsketch_n --mashsketch_k 9
                           . Default: false

--mashsketch_z           : <text> Alphabet to base hashes on (case
                           ignored by default; see --mashsketch_Z). K-
                           mers with other characters will be ignored
                           . Implies --mashsketch_n. Default: false

--mashsketch_Z           : Preserve case in k-mers and alphabet (case
                           is ignored by default). Sequence letters
                           whose case is not in the current alphabet
                           will be skipped when sketching. Default:
                           false

Help options             :

--help                   : Display this message.

```
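
The interaction between the `*_complete_sero` and `*_i` options described in the help text can be sketched in a few lines of shell. This is an interpretation of the help text only, not the workflow's implementation: the function name `keep_serotype` is hypothetical, and it assumes a force-included serovar must match the serotype name exactly.

```bash
#!/usr/bin/env bash
# Illustrative only: mimic the described skip / force-include behaviour.
# Usage: keep_serotype SEROTYPE FORCE_LIST
# Returns 0 when the serotype would be kept, 1 when it would be skipped.
keep_serotype() {
    local sero="$1" force_list="$2" forced
    local -a forced_arr
    # Split the force-include list on "!" as described for --wcomp_i / --wsnp_i.
    IFS='!' read -r -a forced_arr <<< "$force_list"
    for forced in "${forced_arr[@]}"; do
        # Quoted right-hand side: literal comparison, so "[5]" is not a glob.
        [[ "$sero" == "$forced" ]] && return 0
    done
    # Otherwise skip any serotype name that contains a "-".
    [[ "$sero" != *-* ]]
}

force='I 4,[5],12:i:-!Agona'
for s in 'I 4,[5],12:i:-' '- 13:z4,z23:-' 'Typhimurium'; do
    if keep_serotype "$s" "$force"; then
        echo "keep: $s"
    else
        echo "skip: $s"
    fi
done
```

Here `I 4,[5],12:i:-` is kept despite containing a `-` because it is on the force-include list, `- 13:z4,z23:-` is skipped, and `Typhimurium` is kept because it has no `-` in its name.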