# bettercallsal_db

`bettercallsal_db` is an end-to-end automated workflow that generates and consolidates the required DB flat files from the [NCBI Pathogens Database for Salmonella](https://ftp.ncbi.nlm.nih.gov/pathogen/Results/Salmonella/). It first downloads the metadata for the provided release identifier (Ex: `latest_snps` or `PDG000000002.2876`) and then creates a `mash sketch` based on the filtering strategy. It generates two types of sketches: one prioritizes genome collection based on SNP clustering (`per_snp_cluster`), and the other collects up to N genome accessions for each value in the `computed_serotype` column of the metadata file (`per_computed_serotype`).

The `bettercallsal_db` workflow should finish within an hour on a stable internet connection.

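The `per_computed_serotype` idea described above (keep at most N accessions per serotype) can be illustrated with a small sketch. This is not the workflow's actual implementation: the two-column table, the accession names, and the variable names below are all made up for illustration, and the real PDG metadata file has many more columns.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Toy metadata table (hypothetical values): accession <TAB> computed_serotype.
metadata=$'acc\tcomputed_serotype\nACC_A1\tAgona\nACC_A2\tAgona\nACC_A3\tAgona\nACC_T1\tTyphimurium\n'

# Assumed cap on the number of accessions kept per serotype.
max_per_serotype=2

# Skip the header row, then keep only the first N accessions seen per serotype.
selected=$(printf '%s' "${metadata}" |
  awk -F'\t' -v n="${max_per_serotype}" 'NR > 1 && ++seen[$2] <= n { print $1 "\t" $2 }')

printf '%s\n' "${selected}"
```

With the toy table above, the third `Agona` accession is dropped because the cap of two has already been reached, while the single `Typhimurium` accession is kept.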
## Workflow Usage

```bash
cpipes --pipeline bettercallsal_db [options]
```

Example: Run the `bettercallsal_db` pipeline and store the output at `/data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876`.

```bash
cpipes \
    --pipeline bettercallsal_db \
    --pdg_release PDG000000002.2876 \
    --output /data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876
```

Now you can run the `bettercallsal` workflow with the newly created database by passing the root path of the database with the `--bcs_root_dbdir` option.

```bash
cpipes \
    --pipeline bettercallsal \
    --input /path/to/illumina/fastq/dir \
    --output /path/to/output \
    --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876
```

## Note

Please note that the last step of the `bettercallsal_db` workflow, named `SCAFFOLD_GENOMES`, spawns multiple processes and is not cached by **Nextflow**. This setup is intentional for this stage of the workflow to speed up database creation. As such, it is recommended that you run this workflow in a grid computing or similar cloud computing setting.

## `bettercallsal_db` CLI Help

```text
[Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal_db --help
N E X T F L O W  ~  version 23.04.3
Launching `./bettercallsal/cpipes` [special_brenner] DSL2 - revision: 8da4e11078
================================================================================
             (o)
  ___  _ __   _  _ __    ___  ___
 / __|| '_ \ | || '_ \  / _ \/ __|
| (__ | |_) || || |_) ||  __/\__ \
 \___|| .__/ |_|| .__/  \___||___/
      | |       | |
      |_|       |_|
--------------------------------------------------------------------------------
A collection of modular pipelines at CFSAN, FDA.
--------------------------------------------------------------------------------
Name                            : bettercallsal
Author                          : Kranti Konganti
Version                         : 0.7.0
Center                          : CFSAN, FDA.
================================================================================

Workflow                        : bettercallsal_db

Author                          : Kranti Konganti

Version                         : 0.7.0


Required                        :

--output                        : Absolute path to directory where all the
                                  pipeline outputs should be stored. Ex: --
                                  output /path/to/output

Other options                   :

--wcomp_serocol                 : Column number (non 0-based index) of the
                                  PDG metadata file by which the serotypes
                                  are collected. Default: false

--wcomp_seronamecol             : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "
                                  serovar". Default: false

--wcomp_acc_col                 : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "acc
                                  ". Default: false

--wcomp_target_acc_col          : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "
                                  target_acc". Default: false

--wcomp_complete_sero           : Skip indexing serotypes when the serotype
                                  name in the column number 49 (non 0-based)
                                  of PDG metadata file consists a "-". For
                                  example, if an accession has a serotype=
                                  string as such in column number 49 (non 0-
                                  based): "serotype=- 13:z4,z23:-" then, the
                                  indexing of that accession is skipped.
                                  Default: false

--wcomp_not_null_serovar        : Only index the computed_serotype column i.e
                                  . column number 49 (non 0-based), if the
                                  serovar column is not NULL. Default: false

--wcomp_i                       : Force include this serovar. Ignores --
                                  wcomp_complete_sero for only this serovar.
                                  Mention multiple serovars separated by a
                                  ! (Exclamation mark). Ex: --
                                  wcomp_complete_sero I 4,[5],12:i:-!Agona
                                  Default: false

--wcomp_num                     : Number of genome accessions to be collected
                                  per serotype. Default: false

--wcomp_min_contig_size         : Minimum contig size to consider a genome
                                  for indexing. Default: false

--wsnp_serocol                  : Column number (non 0-based index) of the
                                  PDG metadata file by which the serotypes
                                  are collected. Default: false

--wsnp_seronamecol              : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "
                                  serovar". Default: false

--wsnp_acc_col                  : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "acc
                                  ". Default: false

--wsnp_target_acc_col           : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "
                                  target_acc". Default: false

--wsnp_complete_sero            : Skip indexing serotypes when the serotype
                                  name in the column number 49 (non 0-based)
                                  of PDG metadata file consists a "-". For
                                  example, if an accession has a serotype=
                                  string as such in column number 49 (non 0-
                                  based): "serotype=- 13:z4,z23:-" then, the
                                  indexing of that accession is skipped.
                                  Default: true

--wsnp_not_null_serovar         : Only index the computed_serotype column i.e
                                  . column number 49 (non 0-based), if the
                                  serovar column is not NULL. Default: false

--wsnp_i                        : Force include this serovar. Ignores --
                                  wsnp_complete_sero for only this serovar.
                                  Mention multiple serovars separated by a
                                  ! (Exclamation mark). Ex: --
                                  wsnp_complete_sero I 4,[5],12:i:-!Agona
                                  Default: 'I 4,[5],12:i

--wsnp_num                      : Number of genome accessions to collect per
                                  SNP cluster. Default: false

--mashsketch_run                : Run `mash screen` tool. Default: true

--mashsketch_l                  : List input. Lines in each <input> specify
                                  paths to sequence files, one per line.
                                  Default: true

--mashsketch_I                  : <path> ID field for sketch of reads (
                                  instead of first sequence ID). Default:
                                  false

--mashsketch_C                  : <path> Comment for a sketch of reads (
                                  instead of first sequence comment). Default
                                  : false

--mashsketch_k                  : <int> K-mer size. Hashes will be based on
                                  strings of this many nucleotides.
                                  Canonical nucleotides are used by default (
                                  see Alphabet options below). (1-32) Default
                                  : 21

--mashsketch_s                  : <int> Sketch size. Each sketch will have
                                  at most this many non-redundant min-hashes
                                  . Default: 1000

--mashsketch_i                  : Sketch individual sequences, rather than
                                  whole files, e.g. for multi-fastas of
                                  single-chromosome genomes or pair-wise gene
                                  comparisons. Default: false

--mashsketch_S                  : <int> Seed to provide to the hash
                                  function. (0-4294967296) [42] Default:
                                  false

--mashsketch_w                  : <num> Probability threshold for warning
                                  about low k-mer size. (0-1) Default: false

--mashsketch_r                  : Input is a read set. See Reads options
                                  below. Incompatible with --mashsketch_i.
                                  Default: false

--mashsketch_b                  : <size> Use a Bloom filter of this size (
                                  raw bytes or with K/M/G/T) to filter out
                                  unique k-mers. This is useful if exact
                                  filtering with --mashsketch_m uses too much
                                  memory. However, some unique k-mers may
                                  pass erroneously, and copies cannot be
                                  counted beyond 2. Implies --mashsketch_r.
                                  Default: false

--mashsketch_m                  : <int> Minimum copies of each k-mer
                                  required to pass noise filter for reads.
                                  Implies --mashsketch_r. Default: false

--mashsketch_c                  : <num> Target coverage. Sketching will
                                  conclude if this coverage is reached before
                                  the end of the input file (estimated by
                                  average k-mer multiplicity). Implies --
                                  mashsketch_r. Default: false

--mashsketch_g                  : <size> Genome size (raw bases or with K/M/
                                  G/T). If specified, will be used for p-
                                  value calculation instead of an estimated
                                  size from k-mer content. Implies --
                                  mashsketch_r. Default: false

--mashsketch_n                  : Preserve strand (by default, strand is
                                  ignored by using canonical DNA k-mers,
                                  which are alphabetical minima of forward-
                                  reverse pairs). Implied if an alphabet is
                                  specified with --mashsketch_a or --
                                  mashsketch_z. Default: false

--mashsketch_a                  : Use amino acid alphabet (A-Z, except BJOUXZ
                                  ). Implies --mashsketch_n --mashsketch_k 9
                                  . Default: false

--mashsketch_z                  : <text> Alphabet to base hashes on (case
                                  ignored by default; see --mashsketch_Z). K-
                                  mers with other characters will be ignored
                                  . Implies --mashsketch_n. Default: false

--mashsketch_Z                  : Preserve case in k-mers and alphabet (case
                                  is ignored by default). Sequence letters
                                  whose case is not in the current alphabet
                                  will be skipped when sketching. Default:
                                  false

Help options                    :

--help                          : Display this message.

```
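As a rough sketch of how the options listed in the help text might be combined, the script below assembles a `bettercallsal_db` command that caps the number of genome accessions per serotype and per SNP cluster. The values `10` for `--wcomp_num` and `--wsnp_num` are arbitrary illustrations, not recommended settings, and the script only prints the command (a dry run) rather than running it.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Release identifier and output path reused from the example earlier in this README.
release="PDG000000002.2876"
outdir="/data/Kranti_Konganti/bettercallsal_db/${release}"

# Build the command as an array so that each argument stays a single word.
# The two numeric values are illustrative assumptions, not tuned defaults.
cmd=(
  cpipes
  --pipeline bettercallsal_db
  --pdg_release "${release}"
  --output "${outdir}"
  --wcomp_num 10
  --wsnp_num 10
)

# Dry run: print the assembled command instead of executing it.
cmd_line="${cmd[*]}"
printf '%s\n' "${cmd_line}"

# To launch the workflow for real, run the array directly:
# "${cmd[@]}"
```

Keeping the command in an array makes it easy to review what will be submitted before committing to a long-running database build.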