comparison 0.7.0/readme/bettercallsal_db.md @ 17:0e7a0053e4a6
planemo upload
author | kkonganti |
---|---|
date | Mon, 15 Jul 2024 10:42:02 -0400 |
# bettercallsal_db

`bettercallsal_db` is an end-to-end automated workflow that generates and consolidates the required DB flat files from the [NCBI Pathogens Database for Salmonella](https://ftp.ncbi.nlm.nih.gov/pathogen/Results/Salmonella/). It first downloads the metadata for the provided release identifier (e.g. `latest_snps` or `PDG000000002.2876`) and then creates a `mash sketch` according to the filtering strategy. It generates two types of sketches: one prioritizes genome collection based on SNP clustering (`per_snp_cluster`), and the other collects up to N genome accessions for each value in the `computed_serotype` column of the metadata file (`per_computed_serotype`).

The `bettercallsal_db` workflow should finish within an hour given a stable internet connection.

\


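To make the `per_computed_serotype` idea concrete, here is a minimal sketch of capping the number of accessions kept per serotype. It is not the workflow's actual code. It assumes a tab-delimited table with a header row, and the function names `col_index` and `cap_per_group` as well as the toy data are hypothetical.

```bash
# Illustrative only: the real workflow's filtering logic may differ.

# Print the 1-based index of the column named $1 in the header of TSV file $2.
col_index() {
    awk -F'\t' -v name="$1" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == name) { print i; exit }
            exit 1
        }' "$2"
}

# Keep at most $1 data rows for each distinct value in column $2 of TSV file $3.
cap_per_group() {
    awk -F'\t' -v max="$1" -v col="$2" 'NR > 1 && ++seen[$col] <= max' "$3"
}

# Toy data (hypothetical): three Agona accessions and one Newport accession.
tmp=$(mktemp)
printf 'acc\tcomputed_serotype\nA1\tAgona\nA2\tAgona\nA3\tAgona\nB1\tNewport\n' > "$tmp"
col=$(col_index computed_serotype "$tmp")
cap_per_group 2 "$col" "$tmp"   # keeps A1, A2 and B1; drops A3
rm -f "$tmp"
```

The column lookup returns a 1-based index, which matches the "non 0-based" column numbering used by the workflow's `--wcomp_*` and `--wsnp_*` options.
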
## Workflow Usage

```bash
cpipes --pipeline bettercallsal_db [options]
```

\


Example: Run the `bettercallsal_db` pipeline and store the output at `/data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876`.

```bash
cpipes \
    --pipeline bettercallsal_db \
    --pdg_release PDG000000002.2876 \
    --output /data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876
```

\


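Because the output directory in the example is named after the release, it can help to derive one from the other. The following is a minimal sketch, assuming you want that naming scheme. `build_db_cmd` is a hypothetical helper, it only prints the command (a dry run), and it uses only the flags shown in the example above.

```bash
# Hypothetical helper: print the cpipes command for a PDG release ($1),
# storing output under <base_dir>/<release>, where base_dir is $2.
build_db_cmd() {
    release="$1"
    base_dir="$2"
    # ${base_dir%/} strips one trailing slash so the path has no "//".
    printf 'cpipes --pipeline bettercallsal_db --pdg_release %s --output %s/%s\n' \
        "$release" "${base_dir%/}" "$release"
}

build_db_cmd PDG000000002.2876 /data/Kranti_Konganti/bettercallsal_db
```

Review the printed command before running it. The sketch does no quoting, so it is only suitable for paths without spaces.
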
You can now run the `bettercallsal` workflow with the newly created database by passing the root path of the database to the `--bcs_root_dbdir` option.

```bash
cpipes \
    --pipeline bettercallsal \
    --input /path/to/illumina/fastq/dir \
    --output /path/to/output \
    --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876
```

\


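Before launching `bettercallsal`, it may be worth confirming that the path given to `--bcs_root_dbdir` points at a directory that exists and is not empty. The check below is a minimal sketch. `db_dir_ready` is a hypothetical helper, and it does not validate the database contents, since the expected file layout is not described here.

```bash
# Hypothetical pre-flight check: succeed only if $1 is an existing,
# non-empty directory. It does not inspect the database files themselves.
db_dir_ready() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

db=/data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876
if db_dir_ready "$db"; then
    echo "database directory looks usable: $db"
else
    echo "missing or empty database directory: $db" >&2
fi
```
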
## Note

Please note that the last step of the `bettercallsal_db` workflow, `SCAFFOLD_GENOMES`, spawns multiple processes and is not cached by **Nextflow**. This setup is intentional for this stage of the workflow, to speed up database creation. It is therefore recommended that you run this workflow on a grid computing cluster or in a similar cloud computing environment.

\


## `bettercallsal_db` CLI Help

```text
[Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal_db --help
N E X T F L O W ~ version 23.04.3
Launching `./bettercallsal/cpipes` [special_brenner] DSL2 - revision: 8da4e11078
================================================================================
             (o)
  ___  _ __   _  _ __    ___  ___
 / __|| '_ \ | || '_ \  / _ \/ __|
| (__ | |_) || || |_) ||  __/\__ \
 \___|| .__/ |_|| .__/  \___||___/
      | |       | |
      |_|       |_|
--------------------------------------------------------------------------------
A collection of modular pipelines at CFSAN, FDA.
--------------------------------------------------------------------------------
Name                            : bettercallsal
Author                          : Kranti Konganti
Version                         : 0.7.0
Center                          : CFSAN, FDA.
================================================================================

Workflow                        : bettercallsal_db

Author                          : Kranti Konganti

Version                         : 0.7.0


Required                        :

--output                        : Absolute path to directory where all the
                                  pipeline outputs should be stored. Ex: --
                                  output /path/to/output

Other options                   :

--wcomp_serocol                 : Column number (non 0-based index) of the
                                  PDG metadata file by which the serotypes
                                  are collected. Default: false

--wcomp_seronamecol             : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "
                                  serovar". Default: false

--wcomp_acc_col                 : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "acc
                                  ". Default: false

--wcomp_target_acc_col          : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "
                                  target_acc". Default: false

--wcomp_complete_sero           : Skip indexing serotypes when the serotype
                                  name in the column number 49 (non 0-based)
                                  of PDG metadata file consists a "-". For
                                  example, if an accession has a serotype=
                                  string as such in column number 49 (non 0-
                                  based): "serotype=- 13:z4,z23:-" then, the
                                  indexing of that accession is skipped.
                                  Default: false

--wcomp_not_null_serovar        : Only index the computed_serotype column i.e
                                  . column number 49 (non 0-based), if the
                                  serovar column is not NULL. Default: false

--wcomp_i                       : Force include this serovar. Ignores --
                                  wcomp_complete_sero for only this serovar.
                                  Mention multiple serovars separated by a
                                  ! (Exclamation mark). Ex: --
                                  wcomp_complete_sero I 4,[5],12:i:-!Agona
                                  Default: false

--wcomp_num                     : Number of genome accessions to be collected
                                  per serotype. Default: false

--wcomp_min_contig_size         : Minimum contig size to consider a genome
                                  for indexing. Default: false

--wsnp_serocol                  : Column number (non 0-based index) of the
                                  PDG metadata file by which the serotypes
                                  are collected. Default: false

--wsnp_seronamecol              : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "
                                  serovar". Default: false

--wsnp_acc_col                  : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "acc
                                  ". Default: false

--wsnp_target_acc_col           : Column number (non 0-based index) of the
                                  PDG metadata file whose column name is "
                                  target_acc". Default: false

--wsnp_complete_sero            : Skip indexing serotypes when the serotype
                                  name in the column number 49 (non 0-based)
                                  of PDG metadata file consists a "-". For
                                  example, if an accession has a serotype=
                                  string as such in column number 49 (non 0-
                                  based): "serotype=- 13:z4,z23:-" then, the
                                  indexing of that accession is skipped.
                                  Default: true

--wsnp_not_null_serovar         : Only index the computed_serotype column i.e
                                  . column number 49 (non 0-based), if the
                                  serovar column is not NULL. Default: false

--wsnp_i                        : Force include this serovar. Ignores --
                                  wsnp_complete_sero for only this serovar.
                                  Mention multiple serovars separated by a
                                  ! (Exclamation mark). Ex: --
                                  wsnp_complete_sero I 4,[5],12:i:-!Agona
                                  Default: 'I 4,[5],12:i

--wsnp_num                      : Number of genome accessions to collect per
                                  SNP cluster. Default: false

--mashsketch_run                : Run `mash screen` tool. Default: true

--mashsketch_l                  : List input. Lines in each <input> specify
                                  paths to sequence files, one per line.
                                  Default: true

--mashsketch_I                  : <path> ID field for sketch of reads (
                                  instead of first sequence ID). Default:
                                  false

--mashsketch_C                  : <path> Comment for a sketch of reads (
                                  instead of first sequence comment). Default
                                  : false

--mashsketch_k                  : <int> K-mer size. Hashes will be based on
                                  strings of this many nucleotides.
                                  Canonical nucleotides are used by default (
                                  see Alphabet options below). (1-32) Default
                                  : 21

--mashsketch_s                  : <int> Sketch size. Each sketch will have
                                  at most this many non-redundant min-hashes
                                  . Default: 1000

--mashsketch_i                  : Sketch individual sequences, rather than
                                  whole files, e.g. for multi-fastas of
                                  single-chromosome genomes or pair-wise gene
                                  comparisons. Default: false

--mashsketch_S                  : <int> Seed to provide to the hash
                                  function. (0-4294967296) [42] Default:
                                  false

--mashsketch_w                  : <num> Probability threshold for warning
                                  about low k-mer size. (0-1) Default: false

--mashsketch_r                  : Input is a read set. See Reads options
                                  below. Incompatible with --mashsketch_i.
                                  Default: false

--mashsketch_b                  : <size> Use a Bloom filter of this size (
                                  raw bytes or with K/M/G/T) to filter out
                                  unique k-mers. This is useful if exact
                                  filtering with --mashsketch_m uses too much
                                  memory. However, some unique k-mers may
                                  pass erroneously, and copies cannot be
                                  counted beyond 2. Implies --mashsketch_r.
                                  Default: false

--mashsketch_m                  : <int> Minimum copies of each k-mer
                                  required to pass noise filter for reads.
                                  Implies --mashsketch_r. Default: false

--mashsketch_c                  : <num> Target coverage. Sketching will
                                  conclude if this coverage is reached before
                                  the end of the input file (estimated by
                                  average k-mer multiplicity). Implies --
                                  mashsketch_r. Default: false

--mashsketch_g                  : <size> Genome size (raw bases or with K/M/
                                  G/T). If specified, will be used for p-
                                  value calculation instead of an estimated
                                  size from k-mer content. Implies --
                                  mashsketch_r. Default: false

--mashsketch_n                  : Preserve strand (by default, strand is
                                  ignored by using canonical DNA k-mers,
                                  which are alphabetical minima of forward-
                                  reverse pairs). Implied if an alphabet is
                                  specified with --mashsketch_a or --
                                  mashsketch_z. Default: false

--mashsketch_a                  : Use amino acid alphabet (A-Z, except BJOUXZ
                                  ). Implies --mashsketch_n --mashsketch_k 9
                                  . Default: false

--mashsketch_z                  : <text> Alphabet to base hashes on (case
                                  ignored by default; see --mashsketch_Z). K-
                                  mers with other characters will be ignored
                                  . Implies --mashsketch_n. Default: false

--mashsketch_Z                  : Preserve case in k-mers and alphabet (case
                                  is ignored by default). Sequence letters
                                  whose case is not in the current alphabet
                                  will be skipped when sketching. Default:
                                  false

Help options                    :

--help                          : Display this message.

```
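The `--wcomp_i` and `--wsnp_i` options take one or more serovar names joined by `!`. Such values contain characters the shell treats specially (`!`, `[`, `]`, and spaces), so they need quoting. The following is a minimal sketch of building such a value. `join_serovars` is a hypothetical helper and is not part of `cpipes`.

```bash
# Hypothetical helper: join serovar names with "!" for --wcomp_i / --wsnp_i.
join_serovars() {
    sep='!'   # kept in single quotes to avoid history expansion in interactive shells
    out=""
    for s in "$@"; do
        # Add the separator only when something has already been collected.
        out="${out:+$out$sep}$s"
    done
    printf '%s\n' "$out"
}

# Single quotes keep "[5]" from being treated as a glob and preserve the space.
join_serovars 'I 4,[5],12:i:-' 'Agona'   # prints: I 4,[5],12:i:-!Agona
```

The result can then be passed as a single quoted argument, for example `--wsnp_i "$(join_serovars 'I 4,[5],12:i:-' 'Agona')"`.
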