comparison 0.5.0/readme/bettercallsal_db.md @ 1:365849f031fd
"planemo upload"
author | kkonganti |
---|---|
date | Mon, 05 Jun 2023 18:48:51 -0400 |
parents | |
children |
0:a4b1ee4b68b1 | 1:365849f031fd |
---|---|
1 # bettercallsal_db | |
2 | |
3 `bettercallsal_db` is an end-to-end automated workflow that generates and consolidates the required DB flat files from the [NCBI Pathogens Database for Salmonella](https://ftp.ncbi.nlm.nih.gov/pathogen/Results/Salmonella/). It first downloads the metadata for the provided release identifier (Ex: `latest_snps` or `PDG000000002.2537`) and then creates a `mash sketch` according to the filtering strategy. It generates two types of sketches: one prioritizes genome collection based on SNP clustering (`per_snp_cluster`), and the other collects up to N genome accessions for each serotype in the `computed_serotype` column of the metadata file (`per_computed_serotype`). | |
4 | |
5 The `bettercallsal_db` workflow should finish within an hour given a stable internet connection. | |
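
The `per_computed_serotype` idea of keeping up to N accessions per serotype can be pictured with a small sketch. This is only an illustration under stated assumptions, not the workflow's actual implementation: the two-column toy table, the accession names, and the `cap_per_serotype` helper are all hypothetical, and the real PDG metadata file has many more columns.

```bash
#!/usr/bin/env bash
# Toy metadata (hypothetical): accession <TAB> computed_serotype.
metadata=$'GCA_0001\tAgona\nGCA_0002\tAgona\nGCA_0003\tAgona\nGCA_0004\tNewport'

# Keep at most N rows per serotype (column 2), preserving input order.
cap_per_serotype() {
  local n="$1"
  awk -F'\t' -v n="$n" '++seen[$2] <= n + 0 { print $1 "\t" $2 }'
}

selected="$(printf '%s\n' "$metadata" | cap_per_serotype 2)"
printf '%s\n' "$selected"
```

With N set to 2, the third `Agona` row is dropped while the single `Newport` row is kept.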
6 | |
7 \ | |
8 | |
9 | |
10 ## Workflow Usage | |
11 | |
12 ```bash | |
13 cpipes --pipeline bettercallsal_db [options] | |
14 ``` | |
15 | |
16 \ | |
17 | |
18 | |
19 Example: Run the `bettercallsal_db` pipeline and store output at `/data/Kranti_Konganti/bettercallsal_db`. | |
20 | |
21 ```bash | |
22 cpipes \ | |
23 --pipeline bettercallsal_db \ | |
24 --pdg_release PDG000000002.2537 \ | |
25 --output /data/Kranti_Konganti/bettercallsal_db | |
26 ``` | |
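
Since the release identifier can also be `latest_snps`, the same command can be assembled for that value. The sketch below builds the command as a bash array and prints it for inspection instead of launching it. The `build_db_cmd` helper and the output path are hypothetical; only the `--pipeline`, `--pdg_release`, and `--output` options come from the example above.

```bash
#!/usr/bin/env bash
# Build the cpipes command as an array so it can be inspected before running.
build_db_cmd() {
  local release="$1" outdir="$2"
  db_cmd=(cpipes --pipeline bettercallsal_db --pdg_release "$release" --output "$outdir")
}

# Hypothetical output path; --output expects an absolute path.
build_db_cmd latest_snps /path/to/bettercallsal_db

# Dry run: print the shell-quoted command instead of executing it.
printf '%q ' "${db_cmd[@]}"; echo

# To actually launch the workflow, run: "${db_cmd[@]}"
```

Keeping the arguments in an array avoids the line-continuation backslashes and keeps each value intact as a single argument.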
27 | |
28 \ | |
29 | |
30 | |
31 You can now run the `bettercallsal` workflow against the newly created database by passing the database root path with the `--bcs_root_dbdir` option. | |
32 | |
33 ```bash | |
34 cpipes \ | |
35 --pipeline bettercallsal \ | |
36 --input /path/to/illumina/fastq/dir \ | |
37 --output /path/to/output \ | |
38 --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db | |
39 ``` | |
40 | |
41 \ | |
42 | |
43 | |
44 ## Note | |
45 | |
46 Note that the last step of the `bettercallsal_db` workflow, `SCAFFOLD_GENOMES`, spawns multiple processes and is not cached by **Nextflow**. This setup is intentional for this stage and speeds up database creation. It is therefore recommended that you run this workflow in a grid computing or similar cloud computing environment. | |
47 | |
48 \ | |
49 | |
50 | |
51 ## `bettercallsal_db` CLI Help | |
52 | |
53 ```text | |
54 [Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal_db --help | |
55 N E X T F L O W ~ version 22.10.0 | |
56 Launching `./bettercallsal/cpipes` [hopeful_franklin] DSL2 - revision: 93f5293f50 | |
57 ================================================================================ | |
58 (o) | |
59 ___ _ __ _ _ __ ___ ___ | |
60 / __|| '_ \ | || '_ \ / _ \/ __| | |
61 | (__ | |_) || || |_) || __/\__ \ | |
62 \___|| .__/ |_|| .__/ \___||___/ | |
63 | | | | | |
64 |_| |_| | |
65 -------------------------------------------------------------------------------- | |
66 A collection of modular pipelines at CFSAN, FDA. | |
67 -------------------------------------------------------------------------------- | |
68 Name : CPIPES | |
69 Author : Kranti Konganti | |
70 Version : 0.5.0 | |
71 Center : CFSAN, FDA. | |
72 ================================================================================ | |
73 | |
74 Workflow : bettercallsal_db | |
75 | |
76 Author : Kranti Konganti | |
77 | |
78 Version : 0.4.0 | |
79 | |
80 | |
81 Required : | |
82 | |
83 --output : Absolute path to directory where all the | |
84 pipeline outputs should be stored. Ex: -- | |
85 output /path/to/output | |
86 | |
87 Other options : | |
88 | |
89 --wcomp_serocol : Column number (non 0-based index) of the | |
90 PDG metadata file by which the serotypes | |
91 are collected. Default: false | |
92 | |
93 --wcomp_complete_sero : Skip indexing serotypes when the serotype | |
94 name in the column number 49 (non 0-based) | |
95 of PDG metadata file consists a "-". For | |
96 example, if an accession has a serotype= | |
97 string as such in column number 49 (non 0- | |
98 based): "serotype=- 13:z4,z23:-" then, the | |
99 indexing of that accession is skipped. | |
100 Default: false | |
101 | |
102 --wcomp_not_null_serovar : Only index the computed_serotype column i.e | |
103 . column number 49 (non 0-based), if the | |
104 serovar column is not NULL. Default: false | |
105 | |
106 --wcomp_i : Force include this serovar. Ignores -- | |
107 wcomp_complete_sero for only this serovar. | |
108 Mention multiple serovars separated by a | |
109 ! (Exclamation mark). Ex: -- | |
110 wcomp_complete_sero I 4,[5],12:i:-!Agona | |
111 Default: false | |
112 | |
113 --wcomp_num : Number of genome accessions to be collected | |
114 per serotype. Default: false | |
115 | |
116 --wcomp_min_contig_size : Minimum contig size to consider a genome | |
117 for indexing. Default: false | |
118 | |
119 --wsnp_serocol : Column number (non 0-based index) of the | |
120 PDG metadata file by which the serotypes | |
121 are collected. Default: false | |
122 | |
123 --wsnp_complete_sero : Skip indexing serotypes when the serotype | |
124 name in the column number 49 (non 0-based) | |
125 of PDG metadata file consists a "-". For | |
126 example, if an accession has a serotype= | |
127 string as such in column number 49 (non 0- | |
128 based): "serotype=- 13:z4,z23:-" then, the | |
129 indexing of that accession is skipped. | |
130 Default: true | |
131 | |
132 --wsnp_not_null_serovar : Only index the computed_serotype column i.e | |
133 . column number 49 (non 0-based), if the | |
134 serovar column is not NULL. Default: false | |
135 | |
136 --wsnp_i : Force include this serovar. Ignores -- | |
137 wsnp_complete_sero for only this serovar. | |
138 Mention multiple serovars separated by a | |
139 ! (Exclamation mark). Ex: -- | |
140 wsnp_complete_sero I 4,[5],12:i:-!Agona | |
141 Default: 'I 4,[5],12:i | |
142 | |
143 --wsnp_num : Number of genome accessions to collect per | |
144 SNP cluster. Default: false | |
145 | |
146 --mashsketch_run : Run `mash screen` tool. Default: true | |
147 | |
148 --mashsketch_l : List input. Lines in each <input> specify | |
149 paths to sequence files, one per line. | |
150 Default: true | |
151 | |
152 --mashsketch_I : <path> ID field for sketch of reads ( | |
153 instead of first sequence ID). Default: | |
154 false | |
155 | |
156 --mashsketch_C : <path> Comment for a sketch of reads ( | |
157 instead of first sequence comment). Default | |
158 : false | |
159 | |
160 --mashsketch_k : <int> K-mer size. Hashes will be based on | |
161 strings of this many nucleotides. | |
162 Canonical nucleotides are used by default ( | |
163 see Alphabet options below). (1-32) Default | |
164 : 21 | |
165 | |
166 --mashsketch_s : <int> Sketch size. Each sketch will have | |
167 at most this many non-redundant min-hashes | |
168 . Default: 1000 | |
169 | |
170 --mashsketch_i : Sketch individual sequences, rather than | |
171 whole files, e.g. for multi-fastas of | |
172 single-chromosome genomes or pair-wise gene | |
173 comparisons. Default: false | |
174 | |
175 --mashsketch_S : <int> Seed to provide to the hash | |
176 function. (0-4294967296) [42] Default: | |
177 false | |
178 | |
179 --mashsketch_w : <num> Probability threshold for warning | |
180 about low k-mer size. (0-1) Default: false | |
181 | |
182 --mashsketch_r : Input is a read set. See Reads options | |
183 below. Incompatible with --mashsketch_i. | |
184 Default: false | |
185 | |
186 --mashsketch_b : <size> Use a Bloom filter of this size ( | |
187 raw bytes or with K/M/G/T) to filter out | |
188 unique k-mers. This is useful if exact | |
189 filtering with --mashsketch_m uses too much | |
190 memory. However, some unique k-mers may | |
191 pass erroneously, and copies cannot be | |
192 counted beyond 2. Implies --mashsketch_r. | |
193 Default: false | |
194 | |
195 --mashsketch_m : <int> Minimum copies of each k-mer | |
196 required to pass noise filter for reads. | |
197 Implies --mashsketch_r. Default: false | |
198 | |
199 --mashsketch_c : <num> Target coverage. Sketching will | |
200 conclude if this coverage is reached before | |
201 the end of the input file (estimated by | |
202 average k-mer multiplicity). Implies -- | |
203 mashsketch_r. Default: false | |
204 | |
205 --mashsketch_g : <size> Genome size (raw bases or with K/M/ | |
206 G/T). If specified, will be used for p- | |
207 value calculation instead of an estimated | |
208 size from k-mer content. Implies -- | |
209 mashsketch_r. Default: false | |
210 | |
211 --mashsketch_n : Preserve strand (by default, strand is | |
212 ignored by using canonical DNA k-mers, | |
213 which are alphabetical minima of forward- | |
214 reverse pairs). Implied if an alphabet is | |
215 specified with --mashsketch_a or -- | |
216 mashsketch_z. Default: false | |
217 | |
218 --mashsketch_a : Use amino acid alphabet (A-Z, except BJOUXZ | |
219 ). Implies --mashsketch_n --mashsketch_k 9 | |
220 . Default: false | |
221 | |
222 --mashsketch_z : <text> Alphabet to base hashes on (case | |
223 ignored by default; see --mashsketch_Z). K- | |
224 mers with other characters will be ignored | |
225 . Implies --mashsketch_n. Default: false | |
226 | |
227 --mashsketch_Z : Preserve case in k-mers and alphabet (case | |
228 is ignored by default). Sequence letters | |
229 whose case is not in the current alphabet | |
230 will be skipped when sketching. Default: | |
231 false | |
232 | |
233 Help options : | |
234 | |
235 --help : Display this message. | |
236 | |
237 ``` |
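
The help text says multiple serovars are separated by `!`, and the example value also contains `[`, `]`, and a space, all of which are special to the shell. A minimal sketch of how such a value might be quoted follows. It assumes the value is passed to `--wsnp_i`, the option that help entry documents; the variable names are hypothetical.

```bash
#!/usr/bin/env bash
# Single quotes keep "!", "[", "]" and the space literal,
# so the value reaches the program as one unchanged argument.
serovars='I 4,[5],12:i:-!Agona'
db_args=(--pipeline bettercallsal_db --wsnp_i "$serovars")

# Split on "!" only to show how the value reads as a list of serovars.
IFS='!' read -r -a serovar_list <<< "$serovars"
printf '%s\n' "${serovar_list[@]}"
```

Single quotes matter most in an interactive bash session, where an unquoted or double-quoted `!` can trigger history expansion.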