comparison 0.5.0/readme/bettercallsal.md @ 1:365849f031fd
"planemo upload"
author | kkonganti |
---|---|
date | Mon, 05 Jun 2023 18:48:51 -0400 |
parents | |
children |
0:a4b1ee4b68b1 | 1:365849f031fd |
---|---|
1 # bettercallsal | |
2 | |
3 `bettercallsal` is an automated workflow that assigns Salmonella serotypes based on the [NCBI Pathogens Database](https://www.ncbi.nlm.nih.gov/pathogens). It uses `MASH` to reduce the search space, followed by additional genome filtering with `sourmash`. It then performs genome-based alignment with `kma`, followed by count generation using `salmon`. This workflow is especially useful when a sample contains a mixture of multiple serovars. | |
4 | |
5 \ | |
6 | |
7 | |
8 <!-- TOC --> | |
9 | |
10 - [Minimum Requirements](#minimum-requirements) | |
11 - [Usage and Examples](#usage-and-examples) | |
12 - [Database](#database) | |
13 - [Input](#input) | |
14 - [Output](#output) | |
15 - [Computational resources](#computational-resources) | |
16 - [Runtime profiles](#runtime-profiles) | |
17 - [your_institution.config](#your_institutionconfig) | |
18 - [Cloud computing](#cloud-computing) | |
19 - [Example data](#example-data) | |
20 - [Using sourmash](#using-sourmash) | |
21 - [bettercallsal CLI Help](#bettercallsal-cli-help) | |
22 | |
23 <!-- /TOC --> | |
24 | |
25 \ | |
26 | |
27 | |
28 ## Minimum Requirements | |
29 | |
30 1. [Nextflow version 22.10.0](https://github.com/nextflow-io/nextflow/releases/download/v22.10.0/nextflow). | |
31 - Make the `nextflow` binary executable (`chmod 755 nextflow`) and make sure that it is available in your `$PATH`. | |
32 - If your existing `JAVA` install does not support the newest **Nextflow** version, you can try **Amazon**'s `JAVA` (OpenJDK): [Corretto](https://corretto.aws/downloads/latest/amazon-corretto-17-x64-linux-jdk.tar.gz). | |
33 2. One of `micromamba`, `docker`, or `singularity` installed and available in your `$PATH`. | |
34 - Running the workflow via `micromamba` software provisioning is **preferred** as it does not require any `sudo` or `admin` privileges or any other configuration with respect to the various container providers. | |
35 - To install `micromamba` for your system type, please follow these [installation steps](https://mamba.readthedocs.io/en/latest/installation.html#manual-installation) and make sure that the `micromamba` binary is available in your `$PATH`. | |
36 - Just the `curl` step is sufficient to download the binary as far as running the workflows is concerned. | |
37 3. A minimum of 10 CPU cores and about 16 GB of memory for the main workflow steps. More memory may be required if your **FASTQ** files are large. | |
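The tool requirements above can be checked before the first run. The following is a minimal sketch, not part of `cpipes`: the function names `have_any` and `preflight` are hypothetical, and the check only confirms that the binaries are on `$PATH`, not that their versions are suitable.

```bash
#!/usr/bin/env bash
# Illustrative pre-flight check; the helper names are assumptions
# and are not provided by cpipes or bettercallsal.

# Succeed if at least one of the given commands is found on $PATH.
have_any() {
    local cmd
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            return 0
        fi
    done
    return 1
}

# Report missing requirements on stderr; return non-zero if any is missing.
preflight() {
    local status=0
    if ! have_any nextflow; then
        echo "nextflow was not found in \$PATH" >&2
        status=1
    fi
    if ! have_any micromamba docker singularity; then
        echo "none of micromamba, docker or singularity was found in \$PATH" >&2
        status=1
    fi
    return "$status"
}

# Usage: preflight && echo "requirements look OK"
```

Calling `preflight` before launching the workflow gives an early, readable error instead of a failure partway through a run.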
38 | |
39 \ | |
40 | |
41 | |
42 ## Usage and Examples | |
43 | |
44 Clone or download this repository and then call `cpipes`. | |
45 | |
46 ```bash | |
47 cpipes --pipeline bettercallsal [options] | |
48 ``` | |
49 | |
50 \ | |
51 | |
52 | |
53 **Example**: Run the default `bettercallsal` pipeline in single-end mode. | |
54 | |
55 ```bash | |
56 cd /data/scratch/$USER | |
57 mkdir nf-cpipes | |
58 cd nf-cpipes | |
59 cpipes \ | |
60 --pipeline bettercallsal \ | |
61 --input /path/to/illumina/fastq/dir \ | |
62 --output /path/to/output \ | |
63 --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db | |
64 ``` | |
65 | |
66 \ | |
67 | |
68 | |
69 **Example**: Run the `bettercallsal` pipeline in paired-end mode. In this mode, the `R1` and `R2` files are concatenated. We have found that concatenated reads yield better calling rates. Please refer to the **Methods** and **Results** sections of our [preprint](https://www.biorxiv.org/content/10.1101/2023.04.06.535929v1.full) for more information. Users can still choose to use `bbmerge.sh` by adding the following options on the command line: `--bbmerge_run true --bcs_concat_pe false`. | |
70 | |
71 ```bash | |
72 cd /data/scratch/$USER | |
73 mkdir nf-cpipes | |
74 cd nf-cpipes | |
75 cpipes \ | |
76 --pipeline bettercallsal \ | |
77 --input /path/to/illumina/fastq/dir \ | |
78 --output /path/to/output \ | |
79 --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db \ | |
80 --fq_single_end false \ | |
81 --fq_suffix '_R1_001.fastq.gz' | |
82 ``` | |
83 | |
84 \ | |
85 | |
86 | |
87 ### Database | |
88 | |
89 --- | |
90 | |
91 A successful run of the workflow requires certain database flat files specific to the workflow. | |
92 | |
93 Please refer to the `bettercallsal_db` [README](./bettercallsal_db.md) if you would like to run the workflow on the latest version of the **PDG** release. | |
94 | |
95 | |
96 | |
97 ### Input | |
98 | |
99 --- | |
100 | |
101 The input to the workflow is a folder containing compressed (`.gz`) FASTQ files. Please note that sample grouping happens automatically based on the FASTQ file name. If, for example, a single sample is sequenced across multiple sequencing lanes, you can group those FASTQ files into one sample by using the `--fq_filename_delim` and `--fq_filename_delim_idx` options. By default, `--fq_filename_delim` is set to `_` (underscore) and `--fq_filename_delim_idx` is set to 1. | |
102 | |
103 For example, if the directory contains FASTQ files as shown below: | |
104 | |
105 - KB-01_apple_L001_R1.fastq.gz | |
106 - KB-01_apple_L001_R2.fastq.gz | |
107 - KB-01_apple_L002_R1.fastq.gz | |
108 - KB-01_apple_L002_R2.fastq.gz | |
109 - KB-02_mango_L001_R1.fastq.gz | |
110 - KB-02_mango_L001_R2.fastq.gz | |
111 - KB-02_mango_L002_R1.fastq.gz | |
112 - KB-02_mango_L002_R2.fastq.gz | |
113 | |
114 Then, to create 2 sample groups, `apple` and `mango`, we split the file name by the delimiter (underscore in this case, which is the default) and group by the first 2 words (`--fq_filename_delim_idx 2`). | |
115 | |
116 It goes without saying that all the FASTQ files should follow a uniform naming pattern so that the `--fq_filename_delim` and `--fq_filename_delim_idx` options do not have any adverse effect on collecting the files and creating the sample metadata sheet. | |
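The grouping rule can be illustrated outside the workflow. The sketch below is not the workflow's own code: `sample_name` is a hypothetical helper, and it assumes that the leading fields are joined back together with the same delimiter.

```bash
#!/usr/bin/env bash
# Hypothetical helper that mimics the grouping rule described above:
# split the file name on a delimiter and keep the first N fields.
sample_name() {
    local fname="$1" delim="${2:-_}" idx="${3:-1}"
    printf '%s\n' "$fname" | cut -d "$delim" -f "1-$idx"
}

# Files from different lanes that share the first 2 fields fall into one group.
for fq in KB-01_apple_L001_R1.fastq.gz \
          KB-01_apple_L002_R1.fastq.gz \
          KB-02_mango_L001_R1.fastq.gz; do
    printf '%s -> %s\n' "$fq" "$(sample_name "$fq" _ 2)"
done
```

With an index of 2, both `KB-01_apple` lanes map to the same name, whereas the default index of 1 would keep only `KB-01`.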
117 | |
118 \ | |
119 | |
120 | |
121 ### Output | |
122 | |
123 --- | |
124 | |
125 All the outputs for each step are stored inside the folder specified with the `--output` option. The `multiqc_report.html` file inside the `bettercallsal-multiqc` folder contains a brief consolidated report and can be opened in any browser on your local workstation. | |
126 | |
127 \ | |
128 | |
129 | |
130 ### Computational resources | |
131 | |
132 --- | |
133 | |
134 The `bettercallsal` workflow requires at least 16 GB of memory to finish successfully. By default, `bettercallsal` uses 10 CPU cores where possible. You can adjust the number of CPU cores with the `--max_cpus` option. | |
135 | |
136 \ | |
137 | |
138 | |
139 Example: | |
140 | |
141 ```bash | |
142 cpipes \ | |
143 --pipeline bettercallsal \ | |
144 --input /path/to/bettercallsal_sim_reads \ | |
145 --output /path/to/bettercallsal_sim_reads_output \ | |
146 --bcs_root_dbdir /path/to/PDG000000002.2537 \ | |
147 --kmaalign_ignorequals \ | |
148 --max_cpus 5 \ | |
149 -profile stdkondagac \ | |
150 -resume | |
151 ``` | |
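On machines with fewer than 10 cores, a value for `--max_cpus` can be derived from the detected core count. This is a minimal sketch under assumptions: `cap_cpus` is a hypothetical helper, and `getconf _NPROCESSORS_ONLN` is used for core detection, which is widely available but not guaranteed on every system.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the smaller of the available core count
# and the requested limit (10 is the workflow's documented default).
cap_cpus() {
    local avail="$1" limit="${2:-10}"
    if [ "$avail" -lt "$limit" ]; then
        echo "$avail"
    else
        echo "$limit"
    fi
}

# Detect online cores; fall back to 1 if getconf cannot report them.
cores="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
echo "--max_cpus $(cap_cpus "$cores" 10)"
```

The printed option can be appended to the `cpipes` command line in place of a hard-coded value.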
152 | |
153 \ | |
154 | |
155 | |
156 ### Runtime profiles | |
157 | |
158 --- | |
159 | |
160 You can use different run time profiles to suit your specific compute environment. For example, you can run the workflow locally on your machine or on a grid computing infrastructure. | |
161 | |
162 \ | |
163 | |
164 | |
165 Example: | |
166 | |
167 ```bash | |
168 cd /data/scratch/$USER | |
169 mkdir nf-cpipes | |
170 cd nf-cpipes | |
171 cpipes \ | |
172 --pipeline bettercallsal \ | |
173 --input /path/to/fastq_pass_dir \ | |
174 --output /path/to/where/output/should/go \ | |
175 -profile your_institution | |
176 ``` | |
177 | |
178 The above command runs the pipeline and stores the output at the location given by the `--output` flag. The **NEXTFLOW** reports are always stored in the current working directory from which `cpipes` is run. For example, for the above command, a directory called `CPIPES-bettercallsal` would hold all the **NEXTFLOW** related logs, reports and trace files. | |
179 | |
180 \ | |
181 | |
182 | |
183 ### `your_institution.config` | |
184 | |
185 --- | |
186 | |
187 In the above example, the run time profile is specified as `your_institution`. For this to work, add the following lines at the end of the [`computeinfra.config`](../conf/computeinfra.config) file, which should be located inside the `conf` folder. For example, if your institution uses **SGE** or **UNIVA** for grid computing instead of **SLURM** and has a job queue named `normal.q`, then add these lines: | |
188 | |
189 \ | |
190 | |
191 | |
192 ```groovy | |
193 your_institution { | |
194 process.executor = 'sge' | |
195 process.queue = 'normal.q' | |
196 singularity.enabled = false | |
197 singularity.autoMounts = true | |
198 docker.enabled = false | |
199 params.enable_conda = true | |
200 conda.enabled = true | |
201 conda.useMicromamba = true | |
202 params.enable_module = false | |
203 } | |
204 ``` | |
205 | |
206 In the above example, all the software provisioning choices are disabled except `conda`. You can also remove the `process.queue` line altogether, and the `bettercallsal` workflow will request the appropriate memory and number of CPU cores automatically, ranging from 1 CPU core, 1 GB and 1 hour for job completion up to 10 CPU cores, 1 TB and 120 hours for job completion. | |
207 | |
208 \ | |
209 | |
210 | |
211 ### Cloud computing | |
212 | |
213 --- | |
214 | |
215 You can run the workflow in the cloud (this works only if the AWS resources are set up properly). Add new run time profiles with the required parameters per the [Nextflow docs](https://www.nextflow.io/docs/latest/executor.html): | |
216 | |
217 \ | |
218 | |
219 | |
220 Example: | |
221 | |
222 ```groovy | |
223 my_aws_batch { | |
224 executor = 'awsbatch' | |
225 queue = 'my-batch-queue' | |
226 aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws' | |
227 aws.batch.region = 'us-east-1' | |
228 singularity.enabled = false | |
229 singularity.autoMounts = true | |
230 docker.enabled = true | |
231 params.conda_enabled = false | |
232 params.enable_module = false | |
233 } | |
234 ``` | |
235 | |
236 \ | |
237 | |
238 | |
239 ### Example data | |
240 | |
241 --- | |
242 | |
243 After you make sure that you have all the [minimum requirements](#minimum-requirements) to run the workflow, you can try the `bettercallsal` pipeline on some simulated reads. The following input dataset contains simulated reads for `Montevideo` and `I 4,[5],12:i:-` in roughly equal proportions. | |
244 | |
245 - Download simulated reads: [S3](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/bettercallsal_sim_reads.tar.bz2) (~ 3 GB). | |
246 - Download pre-formatted test database: [S3](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/PDG000000002.2491.test-db.tar.bz2) (~ 75 MB). This test database works only with the simulated reads. | |
247 - Download pre-formatted full database (**Optional**): If you would like to do a complete run with your own **FASTQ** datasets, you can either create your own [database](./bettercallsal_db.md) or use [PDG000000002.2537](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/PDG000000002.2537.tar.bz2) version of the database (~ 37 GB). | |
248 - After a successful run of the workflow, your **MultiQC** report should look something like [this](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/bettercallsal_sim_reads_mqc.html). | |
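The downloads listed above can also be scripted. The following is a sketch only: `archive_name` and `fetch_and_extract` are hypothetical helpers, the destination `/path/to/data` is a placeholder, and the download calls are left commented out because of the archive sizes.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for fetching the example archives listed above.

# Derive the local archive file name from a download URL.
archive_name() {
    basename "$1"
}

# Download a .tar.bz2 archive into a directory and unpack it there.
fetch_and_extract() {
    local url="$1" dest="${2:-.}"
    local archive
    archive="$dest/$(archive_name "$url")"
    mkdir -p "$dest"
    curl -fL -o "$archive" "$url"      # -f: fail on HTTP errors, -L: follow redirects
    tar -xjf "$archive" -C "$dest"     # -j: bzip2-compressed archive
}

base_url='https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal'

# Uncomment to download the simulated reads (~3 GB) and the test database (~75 MB):
# fetch_and_extract "$base_url/bettercallsal_sim_reads.tar.bz2" /path/to/data
# fetch_and_extract "$base_url/PDG000000002.2491.test-db.tar.bz2" /path/to/data
```

After extraction, point `--input` and `--bcs_root_dbdir` at wherever the unpacked reads and database end up on your system.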
249 | |
250 Now run the workflow by ignoring quality values since these are simulated base qualities: | |
251 | |
252 \ | |
253 | |
254 | |
255 ```bash | |
256 cpipes \ | |
257 --pipeline bettercallsal \ | |
258 --input /path/to/bettercallsal_sim_reads \ | |
259 --output /path/to/bettercallsal_sim_reads_output \ | |
260 --bcs_root_dbdir /path/to/PDG000000002.2537 \ | |
261 --kmaalign_ignorequals \ | |
262 -profile stdkondagac \ | |
263 -resume | |
264 ``` | |
265 | |
266 Please note that the run time profile `stdkondagac` will run jobs locally using `micromamba` for software provisioning. The first time you run the command, a new folder called `kondagac_cache` will be created and subsequent runs should use this `conda` cache. | |
267 | |
268 \ | |
269 | |
270 | |
271 ## Using `sourmash` | |
272 | |
273 Beginning with `v0.3.0` of the `bettercallsal` workflow, `sourmash` sketching is used to further narrow down possible serotype hits. It is **ON** by default. This enables the generation of an **ANI Containment** matrix for **Samples** vs **Genomes**. There may be multiple hits for the same serotype in the final **MultiQC** report, as multiple genome accessions can belong to a single serotype. | |
274 | |
275 You can turn this feature **OFF** with the `--sourmashsketch_run false` option. | |
276 | |
277 \ | |
278 | |
279 | |
280 ## `bettercallsal` CLI Help | |
281 | |
282 ```text | |
283 [Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal --help | |
284 N E X T F L O W ~ version 22.10.0 | |
285 Launching `./bettercallsal/cpipes` [awesome_chandrasekhar] DSL2 - revision: 8da4e11078 | |
286 ================================================================================ | |
287 (o) | |
288 ___ _ __ _ _ __ ___ ___ | |
289 / __|| '_ \ | || '_ \ / _ \/ __| | |
290 | (__ | |_) || || |_) || __/\__ \ | |
291 \___|| .__/ |_|| .__/ \___||___/ | |
292 | | | | | |
293 |_| |_| | |
294 -------------------------------------------------------------------------------- | |
295 A collection of modular pipelines at CFSAN, FDA. | |
296 -------------------------------------------------------------------------------- | |
297 Name : CPIPES | |
298 Author : Kranti Konganti | |
299 Version : 0.5.0 | |
300 Center : CFSAN, FDA. | |
301 ================================================================================ | |
302 | |
303 Workflow : bettercallsal | |
304 | |
305 Author : Kranti Konganti | |
306 | |
307 Version : 0.5.0 | |
308 | |
309 | |
310 Usage : cpipes --pipeline bettercallsal [options] | |
311 | |
312 | |
313 Required : | |
314 | |
315 --input : Absolute path to directory containing FASTQ | |
316 files. The directory should contain only | |
317 FASTQ files as all the files within the | |
318 mentioned directory will be read. Ex: -- | |
319 input /path/to/fastq_pass | |
320 | |
321 --output : Absolute path to directory where all the | |
322 pipeline outputs should be stored. Ex: -- | |
323 output /path/to/output | |
324 | |
325 Other options : | |
326 | |
327 --metadata : Absolute path to metadata CSV file | |
328 containing five mandatory columns: sample, | |
329 fq1,fq2,strandedness,single_end. The fq1 | |
330 and fq2 columns contain absolute paths to | |
331 the FASTQ files. This option can be used in | |
332 place of --input option. This is rare. Ex | |
333 : --metadata samplesheet.csv | |
334 | |
335 --fq_suffix : The suffix of FASTQ files (Unpaired reads | |
336 or R1 reads or Long reads) if an input | |
337 directory is mentioned via --input option. | |
338 Default: .fastq.gz | |
339 | |
340 --fq2_suffix : The suffix of FASTQ files (Paired-end reads | |
341 or R2 reads) if an input directory is | |
342 mentioned via --input option. Default: | |
343 _R2_001.fastq.gz | |
344 | |
345 --fq_filter_by_len : Remove FASTQ reads that are less than this | |
346 many bases. Default: 0 | |
347 | |
348 --fq_strandedness : The strandedness of the sequencing run. | |
349 This is mostly needed if your sequencing | |
350 run is RNA-SEQ. For most of the other runs | |
351 , it is probably safe to use unstranded for | |
352 the option. Default: unstranded | |
353 | |
354 --fq_single_end : SINGLE-END information will be auto- | |
355 detected but this option forces PAIRED-END | |
356 FASTQ files to be treated as SINGLE-END so | |
357 only read 1 information is included in auto | |
358 -generated samplesheet. Default: true | |
359 | |
360 --fq_filename_delim : Delimiter by which the file name is split | |
361 to obtain sample name. Default: _ | |
362 | |
363 --fq_filename_delim_idx : After splitting FASTQ file name by using | |
364 the --fq_filename_delim option, all | |
365 elements before this index (1-based) will | |
366 be joined to create final sample name. | |
367 Default: 1 | |
368 | |
369 --bcs_concat_pe : Concatenate paired-end files. Default: true | |
370 | |
371 --bbmerge_run : Run BBMerge tool. Default: false | |
372 | |
373 --bbmerge_reads : Quit after this many read pairs (-1 means | |
374 all) Default: -1 | |
375 | |
376 --bbmerge_adapters : Absolute UNIX path pointing to the adapters | |
377 file in FASTA format. Default: false | |
378 | |
379 --bbmerge_ziplevel : Set to 1 (lowest) through 9 (max) to change | |
380 compression level; lower compression is | |
381 faster. Default: 1 | |
382 | |
383 --bbmerge_ordered : Output reads in the same order as input. | |
384 Default: false | |
385 | |
386 --bbmerge_qtrim : Trim read ends to remove bases with quality | |
387 below --bbmerge_minq. Trims BEFORE merging | |
388 . Values: t (trim both ends), f (neither | |
389 end), r (right end only), l (left end only | |
390 ). Default: true | |
391 | |
392 --bbmerge_qtrim2 : May be specified instead of --bbmerge_qtrim | |
393 to perform trimming only if merging is | |
394 unsuccesful. then retry merging. Default: | |
395 false | |
396 | |
397 --bbmerge_trimq : Trim quality threshold. This may be comma- | |
398 delimited list (ascending) to try multiple | |
399 values. Default: 10 | |
400 | |
401 --bbmerge_minlength : (ml) Reads shorter than this after trimming | |
402 , but before merging, will be discarded. | |
403 Pairs will be discarded onlyif both are | |
404 shorter. Default: 1 | |
405 | |
406 --bbmerge_tbo : (trimbyoverlap). Trim overlapping reads to | |
407 remove right most (3') non-overlaping | |
408 portion instead of joining Default: false | |
409 | |
410 --bbmerge_minavgquality : (maq). Reads with average quality below | |
411 this after trimming will not be attempted | |
412 to merge. Default: 30 | |
413 | |
414 --bbmerge_trimpolya : Trim trailing poly-A tail from adapter | |
415 output. Only affects outadapter. This also | |
416 trims poly-A followed by poly-G, which | |
417 occurs on NextSeq. Default: true | |
418 | |
419 --bbmerge_pfilter : Ban improbable overlaps. Higher is more | |
420 strict. 0 will disable the filter; 1 will | |
421 allow only perfect overlaps. Default: 1 | |
422 | |
423 --bbmerge_ouq : Calculate best overlap using quality values | |
424 . Default: false | |
425 | |
426 --bbmerge_owq : Calculate best overlap without using | |
427 quality values. Default: true | |
428 | |
429 --bbmerge_strict : Decrease false positive rate and merging | |
430 rate. Default: false | |
431 | |
432 --bbmerge_verystrict : Greatly decrease false positive rate and | |
433 merging rate. Default: false | |
434 | |
435 --bbmerge_ultrastrict : Decrease false positive rate and merging | |
436 rate even more. Default: true | |
437 | |
438 --bbmerge_maxstrict : Maxiamally decrease false positive rate and | |
439 merging rate. Default: false | |
440 | |
441 --bbmerge_loose : Increase false positive rate and merging | |
442 rate. Default: false | |
443 | |
444 --bbmerge_veryloose : Greatly increase false positive rate and | |
445 merging rate. Default: false | |
446 | |
447 --bbmerge_ultraloose : Increase false positive rate and merging | |
448 rate even more. Default: false | |
449 | |
450 --bbmerge_maxloose : Maximally increase false positive rate and | |
451 merging rate. Default: false | |
452 | |
453 --bbmerge_fast : Fastest possible preset. Default: false | |
454 | |
455 --bbmerge_k : Kmer length. 31 (or less) is fastest and | |
456 uses the least memory, but higher values | |
457 may be more accurate. 60 tends to work well | |
458 for 150bp reads. Default: 60 | |
459 | |
460 --bbmerge_prealloc : Pre-allocate memory rather than dynamically | |
461 growing. Faster and more memory-efficient | |
462 for large datasets. A float fraction (0-1) | |
463 may be specified, default 1. Default: true | |
464 | |
465 --fastp_run : Run fastp tool. Default: true | |
466 | |
467 --fastp_failed_out : Specify whether to store reads that cannot | |
468 pass the filters. Default: false | |
469 | |
470 --fastp_merged_out : Specify whether to store merged output or | |
471 not. Default: false | |
472 | |
473 --fastp_overlapped_out : For each read pair, output the overlapped | |
474 region if it has no mismatched base. | |
475 Default: false | |
476 | |
477 --fastp_6 : Indicate that the input is using phred64 | |
478 scoring (it'll be converted to phred33, so | |
479 the output will still be phred33). Default | |
480 : false | |
481 | |
482 --fastp_reads_to_process : Specify how many reads/pairs are to be | |
483 processed. Default value 0 means process | |
484 all reads. Default: 0 | |
485 | |
486 --fastp_fix_mgi_id : The MGI FASTQ ID format is not compatible | |
487 with many BAM operation tools, enable this | |
488 option to fix it. Default: false | |
489 | |
490 --fastp_A : Disable adapter trimming. On by default. | |
491 Default: false | |
492 | |
493 --fastp_adapter_fasta : Specify a FASTA file to trim both read1 and | |
494 read2 (if PE) by all the sequences in this | |
495 FASTA file. Default: false | |
496 | |
497 --fastp_f : Trim how many bases in front of read1. | |
498 Default: 0 | |
499 | |
500 --fastp_t : Trim how many bases at the end of read1. | |
501 Default: 0 | |
502 | |
503 --fastp_b : Max length of read1 after trimming. Default | |
504 : 0 | |
505 | |
506 --fastp_F : Trim how many bases in front of read2. | |
507 Default: 0 | |
508 | |
509 --fastp_T : Trim how many bases at the end of read2. | |
510 Default: 0 | |
511 | |
512 --fastp_B : Max length of read2 after trimming. Default | |
513 : 0 | |
514 | |
515 --fastp_dedup : Enable deduplication to drop the duplicated | |
516 reads/pairs. Default: true | |
517 | |
518 --fastp_dup_calc_accuracy : Accuracy level to calculate duplication (1~ | |
519 6), higher level uses more memory (1G, 2G, | |
520 4G, 8G, 16G, 24G). Default 1 for no-dedup | |
521 mode, and 3 for dedup mode. Default: 6 | |
522 | |
523 --fastp_poly_g_min_len : The minimum length to detect polyG in the | |
524 read tail. Default: 10 | |
525 | |
526 --fastp_G : Disable polyG tail trimming. Default: true | |
527 | |
528 --fastp_x : Enable polyX trimming in 3' ends. Default: | |
529 false | |
530 | |
531 --fastp_poly_x_min_len : The minimum length to detect polyX in the | |
532 read tail. Default: 10 | |
533 | |
534 --fastp_cut_front : Move a sliding window from front (5') to | |
535 tail, drop the bases in the window if its | |
536 mean quality < threshold, stop otherwise. | |
537 Default: true | |
538 | |
539 --fastp_cut_tail : Move a sliding window from tail (3') to | |
540 front, drop the bases in the window if its | |
541 mean quality < threshold, stop otherwise. | |
542 Default: false | |
543 | |
544 --fastp_cut_right : Move a sliding window from tail, drop the | |
545 bases in the window and the right part if | |
546 its mean quality < threshold, and then stop | |
547 . Default: true | |
548 | |
549 --fastp_W : Sliding window size shared by -- | |
550 fastp_cut_front, --fastp_cut_tail and -- | |
551 fastp_cut_right. Default: 20 | |
552 | |
553 --fastp_M : The mean quality requirement shared by -- | |
554 fastp_cut_front, --fastp_cut_tail and -- | |
555 fastp_cut_right. Default: 30 | |
556 | |
557 --fastp_q : The quality value below which a base should | |
558 is not qualified. Default: 30 | |
559 | |
560 --fastp_u : What percent of bases are allowed to be | |
561 unqualified. Default: 40 | |
562 | |
563 --fastp_n : How many N's can a read have. Default: 5 | |
564 | |
565 --fastp_e : If the full reads' average quality is below | |
566 this value, then it is discarded. Default | |
567 : 0 | |
568 | |
569 --fastp_l : Reads shorter than this length will be | |
570 discarded. Default: 35 | |
571 | |
572 --fastp_max_len : Reads longer than this length will be | |
573 discarded. Default: 0 | |
574 | |
575 --fastp_y : Enable low complexity filter. The | |
576 complexity is defined as the percentage of | |
577 bases that are different from its next base | |
578 (base[i] != base[i+1]). Default: true | |
579 | |
580 --fastp_Y : The threshold for low complexity filter (0~ | |
581 100). Ex: A value of 30 means 30% | |
582 complexity is required. Default: 30 | |
583 | |
584 --fastp_U : Enable Unique Molecular Identifier (UMI) | |
585 pre-processing. Default: false | |
586 | |
587 --fastp_umi_loc : Specify the location of UMI, can be one of | |
588 index1/index2/read1/read2/per_index/ | |
589 per_read. Default: false | |
590 | |
591 --fastp_umi_len : If the UMI is in read1 or read2, its length | |
592 should be provided. Default: false | |
593 | |
594 --fastp_umi_prefix : If specified, an underline will be used to | |
595 connect prefix and UMI (i.e. prefix=UMI, | |
596 UMI=AATTCG, final=UMI_AATTCG). Default: | |
597 false | |
598 | |
599 --fastp_umi_skip : If the UMI is in read1 or read2, fastp can | |
600 skip several bases following the UMI. | |
601 Default: false | |
602 | |
603 --fastp_p : Enable overrepresented sequence analysis. | |
604 Default: true | |
605 | |
606 --fastp_P : One in this many number of reads will be | |
607 computed for overrepresentation analysis (1 | |
608 ~10000), smaller is slower. Default: 20 | |
609 | |
610 --fastp_use_custom_adapaters : Use custom adapter FASTA with fastp on top | |
611 of built-in adapter sequence auto-detection | |
612 . Enabling this option will attempt to find | |
613 and remove all possible Illumina adapter | |
614 and primer sequences but will make the | |
615 workflow run slow. Default: false | |
616 | |
617 --mashscreen_run : Run `mash screen` tool. Default: true | |
618 | |
619 --mashscreen_w : Winner-takes-all strategy for identity | |
620 estimates. After counting hashes for each | |
621 query, hashes that appear in multiple | |
622 queries will be removed from all except the | |
623 one with the best identity (ties broken by | |
624 larger query), and other identities will | |
625 be reduced. This removes output redundancy | |
626 , providing a rough compositional outline | |
627 . Default: false | |
628 | |
629 --mashscreen_i : Minimum identity to report. Inclusive | |
630 unless set to zero, in which case only | |
631 identities greater than zero (i.e. with at | |
632 least one shared hash) will be reported. | |
633 Set to -1 to output everything. (-1-1). | |
634 Default: false | |
635 | |
636 --mashscreen_v : Maximum p-value to report (0-1). Default: | |
637 false | |
638 | |
639 --tuspy_run : Run the get_top_unique_mash_hits_genomes.py | |
640 script. Default: true | |
641 | |
642 --tuspy_s : Absolute UNIX path to metadata text file | |
643 with the field separator, | and 5 fields: | |
644 serotype|asm_lvl|asm_url|snp_cluster_idEx: | |
645 serotype=Derby,antigen_formula=4:f,g:-| | |
646 Scaffold|402440|ftp://...|PDS000096654.2. | |
647 Mentioning this option will create a pickle | |
648 file for the provided metadata and exits. | |
649 Default: false | |
650 | |
651 --tuspy_m : Absolute UNIX path to mash screen results | |
652 file. Default: false | |
653 | |
654 --tuspy_ps : Absolute UNIX Path to serialized metadata | |
655 object in a pickle file. Default: /hpc/db/ | |
656 bettercallsal/latest/index_metadata/ | |
657 per_snp_cluster.ACC2SERO.pickle | |
658 | |
659 --tuspy_gd : Absolute UNIX Path to directory containing | |
660 gzipped genome FASTA files. Default: /hpc/ | |
661 db/bettercallsal/latest/scaffold_genomes | |
662 | |
663 --tuspy_gds : Genome FASTA file suffix to search for in | |
664 the genome directory. Default: | |
665 _scaffolded_genomic.fna.gz | |
666 | |
667 --tuspy_n : Return up to this many number of top N | |
668 unique genome accession hits. Default: 10 | |
669 | |
670 --sourmashsketch_run : Run `sourmash sketch dna` tool. Default: | |
671 true | |
672 | |
673 --sourmashsketch_mode : Select which type of signatures to be | |
674 created: dna, protein, fromfile or | |
675 translate. Default: dna | |
676 | |
677 --sourmashsketch_p : Signature parameters to use. Default: abund | |
678 ,scaled=1000,k=51,k=61,k=71 | |
679 | |
680 --sourmashsketch_file : <path> A text file containing a list of | |
681 sequence files to load. Default: false | |
682 | |
683 --sourmashsketch_f : Recompute signatures even if the file | |
684 exists. Default: false | |
685 | |
686 --sourmashsketch_merge : Merge all input files into one signature | |
687 file with the specified name. Default: | |
688 false | |
689 | |
690 --sourmashsketch_singleton : Compute a signature for each sequence | |
691 record individually. Default: true | |
692 | |
693 --sourmashsketch_name : Name the signature generated from each file | |
694 after the first record in the file. | |
695 Default: false | |
696 | |
697 --sourmashsketch_randomize : Shuffle the list of input files randomly. | |
698 Default: false | |
699 | |
700 --sourmashgather_run : Run `sourmash gather` tool. Default: true | |
701 | |
702 --sourmashgather_n : Number of results to report. By default, | |
703 will terminate at --sourmashgather_thr_bp | |
704 value. Default: false | |
705 | |
706 --sourmashgather_thr_bp : Reporting threshold (in bp) for estimated | |
707 overlap with remaining query. Default: | |
708 false | |
709 | |
710 --sourmashgather_ignoreabn : Do NOT use k-mer abundances if present. | |
711 Default: false | |
712 | |
713 --sourmashgather_prefetch : Use prefetch before gather. Default: false | |
714 | |
715 --sourmashgather_noprefetch : Do not use prefetch before gather. Default | |
716 : false | |
717 | |
718 --sourmashgather_ani_ci : Output confidence intervals for ANI | |
719 estimates. Default: true | |
720 | |
721 --sourmashgather_k : The k-mer size to select. Default: 71 | |
722 | |
723 --sourmashgather_protein : Choose a protein signature. Default: false | |
724 | |
725 --sourmashgather_noprotein : Do not choose a protein signature. Default | |
726 : false | |
727 | |
728 --sourmashgather_dayhoff : Choose Dayhoff-encoded amino acid | |
729 signatures. Default: false | |
730 | |
731 --sourmashgather_nodayhoff : Do not choose Dayhoff-encoded amino acid | |
732 signatures. Default: false | |
733 | |
734 --sourmashgather_hp : Choose hydrophobic-polar-encoded amino acid | |
735 signatures. Default: false | |
736 | |
737 --sourmashgather_nohp : Do not choose hydrophobic-polar-encoded | |
738 amino acid signatures. Default: false | |
739 | |
740 --sourmashgather_dna : Choose DNA signature. Default: true | |
741 | |
742 --sourmashgather_nodna : Do not choose DNA signature. Default: false | |
743 | |
744 --sourmashgather_scaled : Scaled value should be between 100 and 1e6 | |
745 . Default: false | |
746 | |
747 --sourmashgather_inc_pat : Search only signatures that match this | |
748 pattern in name, filename, or md5. Default | |
749 : false | |
750 | |
751 --sourmashgather_exc_pat : Search only signatures that do not match | |
752 this pattern in name, filename, or md5. | |
753 Default: false | |
754 | |
755 --sourmashsearch_run : Run `sourmash search` tool. Default: false | |
756 | |
757 --sourmashsearch_n : Number of results to report. By default, | |
758 will terminate at --sourmashsearch_thr | |
759 value. Default: false | |
760 | |
761 --sourmashsearch_thr : Reporting threshold (similarity) to return | |
762 results. Default: 0 | |
763 | |
764 --sourmashsearch_contain : Score based on containment rather than | |
765 similarity. Default: false | |
766 | |
767 --sourmashsearch_maxcontain : Score based on max containment rather than | |
768 similarity. Default: false | |
769 | |
770 --sourmashsearch_ignoreabn : Do NOT use k-mer abundances if present. | |
771 Default: true | |
772 | |
773 --sourmashsearch_ani_ci : Output confidence intervals for ANI | |
774 estimates. Default: false | |
775 | |
776 --sourmashsearch_k : The k-mer size to select. Default: 71 | |
777 | |
778 --sourmashsearch_protein : Choose a protein signature. Default: false | |
779 | |
780 --sourmashsearch_noprotein : Do not choose a protein signature. | |
781 Default: false | |
782 | |
783 --sourmashsearch_dayhoff : Choose Dayhoff-encoded amino acid | |
784 signatures. Default: false | |
785 | |
786 --sourmashsearch_nodayhoff : Do not choose Dayhoff-encoded amino acid | |
787 signatures. Default: false | |
788 | |
789 --sourmashsearch_hp : Choose hydrophobic-polar-encoded amino acid | |
790 signatures. Default: false | |
791 | |
792 --sourmashsearch_nohp : Do not choose hydrophobic-polar-encoded | |
793 amino acid signatures. Default: false | |
794 | |
795 --sourmashsearch_dna : Choose DNA signature. Default: true | |
796 | |
797 --sourmashsearch_nodna : Do not choose DNA signature. Default: false | |
798 | |
799 --sourmashsearch_scaled : Scaled value should be between 100 and | |
800 1e6. Default: false | |
801 | |
802 --sourmashsearch_inc_pat : Search only signatures that match this | |
803 pattern in name, filename, or md5. | |
804 Default: false | |
805 | |
806 --sourmashsearch_exc_pat : Search only signatures that do not match | |
807 this pattern in name, filename, or md5. | |
808 Default: false | |
809 | |
810 --sfhpy_run : Run the sourmash_filter_hits.py script. | |
811 Default: true | |
812 | |
813 --sfhpy_fcn : Column name by which filtering of rows | |
814 should be applied. Default: f_match | |
815 | |
816 --sfhpy_fcv : Remove genomes whose match with the query | |
817 FASTQ is less than this much. Default: 0.1 | |
818 | |
819 --sfhpy_gt : Apply greater than or equal to condition | |
820 on numeric values of --sfhpy_fcn column. | |
821 Default: true | |
822 | |
823 --sfhpy_lt : Apply less than or equal to condition on | |
824 numeric values of --sfhpy_fcn column. | |
825 Default: false | |
826 | |
827 --kmaindex_run : Run kma index tool. Default: true | |
828 | |
829 --kmaindex_t_db : Add to existing DB. Default: false | |
830 | |
831 --kmaindex_k : k-mer size. Default: 31 | |
832 | |
833 --kmaindex_m : Minimizer size. Default: false | |
834 | |
835 --kmaindex_hc : Homopolymer compression. Default: false | |
836 | |
837 --kmaindex_ML : Minimum length of templates. Defaults to | |
838 --kmaindex_k. Default: false | |
839 | |
840 --kmaindex_ME : Mega DB. Default: false | |
841 | |
842 --kmaindex_Sparse : Make Sparse DB. Default: false | |
843 | |
844 --kmaindex_ht : Homology template. Default: false | |
845 | |
846 --kmaindex_hq : Homology query. Default: false | |
847 | |
848 --kmaindex_and : Both homology thresholds must be reached. | |
849 Default: false | |
850 | |
851 --kmaindex_nbp : No bias print. Default: false | |
852 | |
853 --kmaalign_run : Run kma tool. Default: true | |
854 | |
855 --kmaalign_int : Input file has interleaved reads. | |
856 Default: false | |
857 | |
858 --kmaalign_ef : Output additional features. Default: false | |
859 | |
860 --kmaalign_vcf : Output vcf file. 2 to apply FT. Default: | |
861 false | |
862 | |
863 --kmaalign_sam : Output SAM, 4/2096 for mapped/aligned. | |
864 Default: false | |
865 | |
866 --kmaalign_nc : No consensus file. Default: true | |
867 | |
868 --kmaalign_na : No aln file. Default: true | |
869 | |
870 --kmaalign_nf : No frag file. Default: true | |
871 | |
872 --kmaalign_a : Output all template mappings. Default: | |
873 false | |
874 | |
875 --kmaalign_and : Use both -mrs and p-value on consensus. | |
876 Default: false | |
877 | |
878 --kmaalign_oa : Use neither -mrs nor p-value on consensus. | |
879 Default: false | |
880 | |
881 --kmaalign_bc : Minimum support to call bases. Default: | |
882 false | |
883 | |
884 --kmaalign_bcNano : Altered indel calling for ONT data. | |
885 Default: false | |
886 | |
887 --kmaalign_bcd : Minimum depth to call bases. Default: false | |
888 | |
889 --kmaalign_bcg : Maintain insignificant gaps. Default: false | |
890 | |
891 --kmaalign_ID : Minimum consensus ID. Default: false | |
892 | |
893 --kmaalign_md : Minimum depth. Default: false | |
894 | |
895 --kmaalign_dense : Skip insertion in consensus. Default: false | |
896 | |
897 --kmaalign_ref_fsa : Use Ns on indels. Default: false | |
898 | |
899 --kmaalign_Mt1 : Map everything to one template. Default: | |
900 false | |
901 | |
902 --kmaalign_1t1 : Map one query to one template. Default: | |
903 false | |
904 | |
905 --kmaalign_mrs : Minimum relative alignment score. Default: | |
906 false | |
907 | |
908 --kmaalign_mrc : Minimum query coverage. Default: 0.99 | |
909 | |
910 --kmaalign_mp : Minimum phred score of trailing and leading | |
911 bases. Default: 30 | |
912 | |
913 --kmaalign_mq : Set the minimum mapping quality. Default: | |
914 false | |
915 | |
916 --kmaalign_eq : Minimum average quality score. Default: 30 | |
917 | |
918 --kmaalign_5p : Trim 5 prime by this many bases. Default: | |
919 false | |
920 | |
921 --kmaalign_3p : Trim 3 prime by this many bases. Default: | |
922 false | |
923 | |
924 --kmaalign_apm : Sets both -pm and -fpm. Default: false | |
925 | |
926 --kmaalign_cge : Set CGE penalties and rewards. Default: | |
927 false | |
928 | |
929 --salmonidx_run : Run `salmon index` tool. Default: true | |
930 | |
931 --salmonidx_k : The size of k-mers that should be used for | |
932 the quasi index. Default: false | |
933 | |
934 --salmonidx_gencode : This flag will expect the input transcript | |
935 FASTA to be in GENCODE format, and will | |
936 split the transcript name at the first `|` | |
937 character. These reduced names will be used | |
938 in the output and when looking for these | |
939 transcripts in a gene to transcript GTF. | |
940 Default: false | |
941 | |
942 --salmonidx_features : This flag will expect the input reference | |
943 to be in the tsv file format, and will | |
944 split the feature name at the first `tab` | |
945 character. These reduced names will be used | |
946 in the output and when looking for the | |
947 sequence of the features. Default: | |
948 false | |
949 | |
950 --salmonidx_keepDuplicates : This flag will disable the default indexing | |
951 behavior of discarding sequence-identical | |
952 duplicate transcripts. If this flag is | |
953 passed then duplicate transcripts that | |
954 appear in the input will be retained and | |
955 quantified separately. Default: false | |
956 | |
957 --salmonidx_keepFixedFasta : Retain the fixed fasta file (without short | |
958 transcripts and duplicates, clipped, etc.) | |
959 generated during indexing. Default: false | |
960 | |
961 --salmonidx_filterSize : The size of the Bloom filter that will be | |
962 used by TwoPaCo during indexing. The filter | |
963 will be of size 2^{filterSize}. A value of | |
964 -1 means that the filter size will be | |
965 automatically set based on the number of | |
966 distinct k-mers in the input, as estimated | |
967 by nthll. Default: false | |
968 | |
969 --salmonidx_sparse : Build the index using a sparse sampling of | |
970 k-mer positions. This will require less | |
971 memory (especially during quantification), | |
972 but will take longer to construct and can | |
973 slow down mapping / alignment. Default: | |
974 false | |
975 | |
976 --salmonidx_n : Do not clip poly-A tails from the ends of | |
977 target sequences. Default: false | |
978 | |
979 --gsrpy_run : Run the gen_salmon_res_table.py script. | |
980 Default: true | |
981 | |
982 --gsrpy_url : Generate an additional column in final | |
983 results table which links out to NCBI | |
984 Pathogens Isolate Browser. Default: true | |
985 | |
986 Help options : | |
987 | |
988 --help : Display this message. | |
989 | |
990 ``` |
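
The options listed above can also be collected in a small Nextflow config file instead of being typed on the command line each time. The following is a minimal sketch, assuming the workflow reads these options as ordinary Nextflow `params`. The file name `custom_params.config` is hypothetical, and the values are illustrative only, not recommendations. The defaults in the comments are taken from the help text above.

```
// custom_params.config (hypothetical file name)
// Assumption: each CLI option above maps to a Nextflow param of the same name.
params {
    // Minimum match fraction a genome needs to be kept (help text default: 0.1)
    sfhpy_fcv = 0.2

    // Minimum query coverage for kma alignment (help text default: 0.99)
    kmaalign_mrc = 0.95

    // Skip the NCBI Pathogens Isolate Browser link column (help text default: true)
    gsrpy_url = false
}
```

Nextflow loads an additional config file through its `-c` option, and any option given directly on the command line still takes precedence over the value set in the file.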