comparison 0.5.0/readme/bettercallsal_db.md @ 1:365849f031fd

"planemo upload"
author kkonganti
date Mon, 05 Jun 2023 18:48:51 -0400
0:a4b1ee4b68b1 1:365849f031fd
1 # bettercallsal_db
2
3 `bettercallsal_db` is an end-to-end automated workflow that generates and consolidates the required DB flat files from the [NCBI Pathogens Database for Salmonella](https://ftp.ncbi.nlm.nih.gov/pathogen/Results/Salmonella/). It first downloads the metadata for the provided release identifier (Ex: `latest_snps` or `PDG000000002.2537`) and then creates a `mash sketch` according to the filtering strategy. It generates two types of sketches: one prioritizes genome collection based on SNP clustering (`per_snp_cluster`), and the other collects up to N genome accessions for each value of the `computed_serotype` column in the metadata file (`per_computed_serotype`).
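
To make the `per_computed_serotype` idea concrete, here is a minimal sketch of "collect up to N accessions per serotype" on a toy table. It does not reproduce the workflow's own implementation: the two-column layout, the accession names (`ACC_1` ...) and the function name `cap_per_serotype` are all hypothetical and chosen only for illustration.

```bash
#!/usr/bin/env bash
# Illustrative only: keep at most N accessions per serotype from a toy
# two-column table (accession <TAB> serotype). The real PDG metadata layout
# and the workflow's filtering logic are NOT reproduced here.
cap_per_serotype() {
    local max_per_group="$1"
    awk -F'\t' -v n="$max_per_group" '
        # count[] tracks how many rows were already kept for each serotype
        count[$2] < n { count[$2]++; print $1 "\t" $2 }
    '
}

# Hypothetical toy input: three Agona genomes and one Newport genome.
toy_metadata=$(printf 'ACC_1\tAgona\nACC_2\tAgona\nACC_3\tAgona\nACC_4\tNewport\n')

# Keep at most 2 accessions per serotype.
capped=$(printf '%s\n' "$toy_metadata" | cap_per_serotype 2)
printf '%s\n' "$capped"
```

With a cap of 2, the third `Agona` row is dropped while the single `Newport` row is kept, which mirrors the "up to N per serotype" behaviour described above.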
4
5 The `bettercallsal_db` workflow should finish within an hour given a stable internet connection.
6
7 \
8  
9
10 ## Workflow Usage
11
12 ```bash
13 cpipes --pipeline bettercallsal_db [options]
14 ```
15
16 \
17  
18
19 Example: Run the `bettercallsal_db` pipeline and store output at `/data/Kranti_Konganti/bettercallsal_db`.
20
21 ```bash
22 cpipes \
23 --pipeline bettercallsal_db \
24 --pdg_release PDG000000002.2537 \
25 --output /data/Kranti_Konganti/bettercallsal_db
26 ```
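
Since the CLI help asks for an absolute path for `--output`, a small wrapper can catch a relative path before the workflow starts. The sketch below only assembles and prints the command shown above. The function name `build_db_cmd` is hypothetical, and the check is an assumption about what is convenient, not something `cpipes` itself is known to do.

```bash
#!/usr/bin/env bash
# Sketch: assemble the bettercallsal_db command, refusing a relative
# --output path. The function only prints the command; running it is left
# to the caller. build_db_cmd is a hypothetical helper name.
build_db_cmd() {
    local release="$1" outdir="$2"
    case "$outdir" in
        /*) ;;  # absolute path, accepted
        *)  echo "error: --output must be an absolute path: $outdir" >&2
            return 1 ;;
    esac
    printf 'cpipes --pipeline bettercallsal_db --pdg_release %s --output %s\n' \
        "$release" "$outdir"
}

db_cmd=$(build_db_cmd PDG000000002.2537 /data/Kranti_Konganti/bettercallsal_db)
echo "$db_cmd"
```

Printing the command first makes it easy to review or log it before launching a run that takes up to an hour.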
27
28 \
29  
30
31 You can now run the `bettercallsal` workflow against the created database by passing its root path with the `--bcs_root_dbdir` option.
32
33 ```bash
34 cpipes \
35 --pipeline bettercallsal \
36 --input /path/to/illumina/fastq/dir \
37 --output /path/to/output \
38 --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db
39 ```
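
Before pointing `--bcs_root_dbdir` at a directory, it may help to confirm that the database run actually left something there. The following is a rough pre-flight sketch. The helper name `db_dir_ready` is hypothetical, and it only checks that the directory exists and is not empty, because the expected file layout is not described here.

```bash
#!/usr/bin/env bash
# Sketch: verify the database root exists and is not empty before starting
# the bettercallsal run. Which files the directory must contain is not
# checked, since that layout is an unknown here.
db_dir_ready() {
    local dbdir="$1"
    [ -d "$dbdir" ] && [ -n "$(ls -A "$dbdir" 2>/dev/null)" ]
}

bcs_db="/data/Kranti_Konganti/bettercallsal_db"
if db_dir_ready "$bcs_db"; then
    echo "database directory looks populated: $bcs_db"
else
    echo "database directory missing or empty: $bcs_db" >&2
fi
```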
40
41 \
42  
43
44 ## Note
45
46 The last step of the `bettercallsal_db` workflow, `SCAFFOLD_GENOMES`, spawns multiple processes and is not cached by **Nextflow**. This setup is intentional for this stage of the workflow to speed up database creation, so it is recommended that you run the workflow on a grid computing cluster or in a similar cloud computing setting.
47
48 \
49  
50
51 ## `bettercallsal_db` CLI Help
52
53 ```text
54 [Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal_db --help
55 N E X T F L O W ~ version 22.10.0
56 Launching `./bettercallsal/cpipes` [hopeful_franklin] DSL2 - revision: 93f5293f50
57 ================================================================================
58              (o)
59   ___  _ __   _  _ __    ___  ___
60  / __|| '_ \ | || '_ \  / _ \/ __|
61 | (__ | |_) || || |_) ||  __/\__ \
62  \___|| .__/ |_|| .__/  \___||___/
63       | |       | |
64       |_|       |_|
65 --------------------------------------------------------------------------------
66 A collection of modular pipelines at CFSAN, FDA.
67 --------------------------------------------------------------------------------
68 Name : CPIPES
69 Author : Kranti Konganti
70 Version : 0.5.0
71 Center : CFSAN, FDA.
72 ================================================================================
73
74 Workflow : bettercallsal_db
75
76 Author : Kranti Konganti
77
78 Version : 0.4.0
79
80
81 Required :
82
83 --output : Absolute path to directory where all the
84 pipeline outputs should be stored. Ex: --
85 output /path/to/output
86
87 Other options :
88
89 --wcomp_serocol : Column number (non 0-based index) of the
90 PDG metadata file by which the serotypes
91 are collected. Default: false
92
93 --wcomp_complete_sero : Skip indexing serotypes when the serotype
94 name in the column number 49 (non 0-based)
95 of PDG metadata file consists a "-". For
96 example, if an accession has a serotype=
97 string as such in column number 49 (non 0-
98 based): "serotype=- 13:z4,z23:-" then, the
99 indexing of that accession is skipped.
100 Default: false
101
102 --wcomp_not_null_serovar : Only index the computed_serotype column i.e
103 . column number 49 (non 0-based), if the
104 serovar column is not NULL. Default: false
105
106 --wcomp_i : Force include this serovar. Ignores --
107 wcomp_complete_sero for only this serovar.
108 Mention multiple serovars separated by a
109 ! (Exclamation mark). Ex: --
110 wcomp_complete_sero I 4,[5],12:i:-!Agona
111 Default: false
112
113 --wcomp_num : Number of genome accessions to be collected
114 per serotype. Default: false
115
116 --wcomp_min_contig_size : Minimum contig size to consider a genome
117 for indexing. Default: false
118
119 --wsnp_serocol : Column number (non 0-based index) of the
120 PDG metadata file by which the serotypes
121 are collected. Default: false
122
123 --wsnp_complete_sero : Skip indexing serotypes when the serotype
124 name in the column number 49 (non 0-based)
125 of PDG metadata file consists a "-". For
126 example, if an accession has a serotype=
127 string as such in column number 49 (non 0-
128 based): "serotype=- 13:z4,z23:-" then, the
129 indexing of that accession is skipped.
130 Default: true
131
132 --wsnp_not_null_serovar : Only index the computed_serotype column i.e
133 . column number 49 (non 0-based), if the
134 serovar column is not NULL. Default: false
135
136 --wsnp_i : Force include this serovar. Ignores --
137 wsnp_complete_sero for only this serovar.
138 Mention multiple serovars separated by a
139 ! (Exclamation mark). Ex: --
140 wsnp_complete_sero I 4,[5],12:i:-!Agona
141 Default: 'I 4,[5],12:i
142
143 --wsnp_num : Number of genome accessions to collect per
144 SNP cluster. Default: false
145
146 --mashsketch_run : Run `mash screen` tool. Default: true
147
148 --mashsketch_l : List input. Lines in each <input> specify
149 paths to sequence files, one per line.
150 Default: true
151
152 --mashsketch_I : <path> ID field for sketch of reads (
153 instead of first sequence ID). Default:
154 false
155
156 --mashsketch_C : <path> Comment for a sketch of reads (
157 instead of first sequence comment). Default
158 : false
159
160 --mashsketch_k : <int> K-mer size. Hashes will be based on
161 strings of this many nucleotides.
162 Canonical nucleotides are used by default (
163 see Alphabet options below). (1-32) Default
164 : 21
165
166 --mashsketch_s : <int> Sketch size. Each sketch will have
167 at most this many non-redundant min-hashes
168 . Default: 1000
169
170 --mashsketch_i : Sketch individual sequences, rather than
171 whole files, e.g. for multi-fastas of
172 single-chromosome genomes or pair-wise gene
173 comparisons. Default: false
174
175 --mashsketch_S : <int> Seed to provide to the hash
176 function. (0-4294967296) [42] Default:
177 false
178
179 --mashsketch_w : <num> Probability threshold for warning
180 about low k-mer size. (0-1) Default: false
181
182 --mashsketch_r : Input is a read set. See Reads options
183 below. Incompatible with --mashsketch_i.
184 Default: false
185
186 --mashsketch_b : <size> Use a Bloom filter of this size (
187 raw bytes or with K/M/G/T) to filter out
188 unique k-mers. This is useful if exact
189 filtering with --mashsketch_m uses too much
190 memory. However, some unique k-mers may
191 pass erroneously, and copies cannot be
192 counted beyond 2. Implies --mashsketch_r.
193 Default: false
194
195 --mashsketch_m : <int> Minimum copies of each k-mer
196 required to pass noise filter for reads.
197 Implies --mashsketch_r. Default: false
198
199 --mashsketch_c : <num> Target coverage. Sketching will
200 conclude if this coverage is reached before
201 the end of the input file (estimated by
202 average k-mer multiplicity). Implies --
203 mashsketch_r. Default: false
204
205 --mashsketch_g : <size> Genome size (raw bases or with K/M/
206 G/T). If specified, will be used for p-
207 value calculation instead of an estimated
208 size from k-mer content. Implies --
209 mashsketch_r. Default: false
210
211 --mashsketch_n : Preserve strand (by default, strand is
212 ignored by using canonical DNA k-mers,
213 which are alphabetical minima of forward-
214 reverse pairs). Implied if an alphabet is
215 specified with --mashsketch_a or --
216 mashsketch_z. Default: false
217
218 --mashsketch_a : Use amino acid alphabet (A-Z, except BJOUXZ
219 ). Implies --mashsketch_n --mashsketch_k 9
220 . Default: false
221
222 --mashsketch_z : <text> Alphabet to base hashes on (case
223 ignored by default; see --mashsketch_Z). K-
224 mers with other characters will be ignored
225 . Implies --mashsketch_n. Default: false
226
227 --mashsketch_Z : Preserve case in k-mers and alphabet (case
228 is ignored by default). Sequence letters
229 whose case is not in the current alphabet
230 will be skipped when sketching. Default:
231 false
232
233 Help options :
234
235 --help : Display this message.
236
237 ```
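
The help text above says that `--wcomp_i` and `--wsnp_i` take multiple serovars separated by `!`. Serovar names such as `I 4,[5],12:i:-` contain spaces, brackets and, once joined, an exclamation mark, all of which a shell may interpret unless they are quoted. A minimal sketch of building such a value safely follows. The helper name `join_serovars` is hypothetical, and how `cpipes` parses the value is taken from the help text only.

```bash
#!/usr/bin/env bash
# Sketch: join serovar names with "!" for options such as --wsnp_i.
# Single-quoting each name keeps spaces, "[", "]" and "!" literal.
join_serovars() {
    local IFS='!'
    # "$*" joins all arguments using the first character of IFS.
    printf '%s\n' "$*"
}

serovar_arg=$(join_serovars 'I 4,[5],12:i:-' 'Agona')
printf '%s\n' "$serovar_arg"
```

The resulting value can then be passed as a single quoted argument, for example `--wsnp_i "$serovar_arg"`.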