comparison 0.5.0/readme/bettercallsal.md @ 1:365849f031fd

"planemo upload"
author kkonganti
date Mon, 05 Jun 2023 18:48:51 -0400
parents
children
0:a4b1ee4b68b1 1:365849f031fd
1 # bettercallsal
2
3 `bettercallsal` is an automated workflow that assigns Salmonella serotypes based on the [NCBI Pathogens Database](https://www.ncbi.nlm.nih.gov/pathogens). It uses `MASH` to reduce the search space, followed by additional genome filtering with `sourmash`. It then performs genome-based alignment with `kma`, followed by count generation using `salmon`. This workflow is especially useful when a sample contains a mixture of multiple serovars.
4
5 \
6  
7
8 <!-- TOC -->
9
10 - [Minimum Requirements](#minimum-requirements)
11 - [Usage and Examples](#usage-and-examples)
12 - [Database](#database)
13 - [Input](#input)
14 - [Output](#output)
15 - [Computational resources](#computational-resources)
16 - [Runtime profiles](#runtime-profiles)
17 - [your_institution.config](#your_institutionconfig)
18 - [Cloud computing](#cloud-computing)
19 - [Example data](#example-data)
20 - [Using sourmash](#using-sourmash)
21 - [bettercallsal CLI Help](#bettercallsal-cli-help)
22
23 <!-- /TOC -->
24
25 \
26 &nbsp;
27
28 ## Minimum Requirements
29
30 1. [Nextflow version 22.10.0](https://github.com/nextflow-io/nextflow/releases/download/v22.10.0/nextflow).
31 - Make the `nextflow` binary executable (`chmod 755 nextflow`) and also make sure that it is made available in your `$PATH`.
32 - If your existing `JAVA` install does not support the newest **Nextflow** version, you can try **Amazon**'s `JAVA` (OpenJDK): [Corretto](https://corretto.aws/downloads/latest/amazon-corretto-17-x64-linux-jdk.tar.gz).
33 2. One of `micromamba`, `docker`, or `singularity` installed and made available in your `$PATH`.
34 - Running the workflow via `micromamba` software provisioning is **preferred** as it does not require any `sudo` or `admin` privileges or any other configurations with respect to the various container providers.
35 - To install `micromamba` for your system type, please follow these [installation steps](https://mamba.readthedocs.io/en/latest/installation.html#manual-installation) and make sure that the `micromamba` binary is made available in your `$PATH`.
36 - Just the `curl` step is sufficient to download the binary as far as running the workflow is concerned.
37 3. Minimum of 10 CPU cores and about 16 GB of memory for the main workflow steps. More memory may be required if your **FASTQ** files are big.
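
The requirements above can be checked before a first run with a small pre-flight script. This is only a sketch: the helper names `have` and `first_provisioner` are hypothetical and are not part of `cpipes`, and the script checks only that the tools are on your `$PATH`, not their versions.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check; not shipped with cpipes.

# Succeed if the named command can be found on $PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# Print the first software provisioner found on $PATH.
# Returns non-zero (and prints nothing) if none is available.
first_provisioner() {
    for tool in micromamba docker singularity; do
        if have "$tool"; then
            echo "$tool"
            return 0
        fi
    done
    return 1
}

have nextflow || echo "nextflow not found in \$PATH" >&2
first_provisioner || echo "none of micromamba, docker, singularity found in \$PATH" >&2
```

The order in the loop mirrors the stated preference for `micromamba`, so it is reported first when several provisioners are installed.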
38
39 \
40 &nbsp;
41
42 ## Usage and Examples
43
44 Clone or download this repository and then call `cpipes`.
45
46 ```bash
47 cpipes --pipeline bettercallsal [options]
48 ```
49
50 \
51 &nbsp;
52
53 **Example**: Run the default `bettercallsal` pipeline in single-end mode.
54
55 ```bash
56 cd /data/scratch/$USER
57 mkdir nf-cpipes
58 cd nf-cpipes
59 cpipes \
60 --pipeline bettercallsal \
61 --input /path/to/illumina/fastq/dir \
62 --output /path/to/output \
63 --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db
64 ```
65
66 \
67 &nbsp;
68
69 **Example**: Run the `bettercallsal` pipeline in paired-end mode. In this mode, the `R1` and `R2` files are concatenated. We have found that concatenated reads yield better calling rates. Please refer to the **Methods** and **Results** sections in our [preprint](https://www.biorxiv.org/content/10.1101/2023.04.06.535929v1.full) for more information. Users can still choose to use `bbmerge.sh` by adding the following options on the command-line: `--bbmerge_run true --bcs_concat_pe false`.
70
71 ```bash
72 cd /data/scratch/$USER
73 mkdir nf-cpipes
74 cd nf-cpipes
75 cpipes \
76 --pipeline bettercallsal \
77 --input /path/to/illumina/fastq/dir \
78 --output /path/to/output \
79 --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db \
80 --fq_single_end false \
81 --fq_suffix '_R1_001.fastq.gz'
82 ```
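
For the `bbmerge.sh` alternative mentioned above, the paired-end command might look like the following sketch. The two extra flags come from the text above and the paths are the same placeholders as in the previous example. The wrapper function `bcs_paired_end_bbmerge` and the `RUN` variable are hypothetical conveniences that print the command by default instead of launching it.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run wrapper; not part of cpipes.
# By default the command is only printed. Call it as
#   RUN= bcs_paired_end_bbmerge
# (RUN set to an empty string) to actually launch cpipes.
bcs_paired_end_bbmerge() {
    ${RUN-echo} cpipes \
        --pipeline bettercallsal \
        --input /path/to/illumina/fastq/dir \
        --output /path/to/output \
        --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db \
        --fq_single_end false \
        --fq_suffix '_R1_001.fastq.gz' \
        --bbmerge_run true \
        --bcs_concat_pe false
}

bcs_paired_end_bbmerge
```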
83
84 \
85 &nbsp;
86
87 ### Database
88
89 ---
90
91 A successful run of the workflow requires certain database flat files that are specific to the workflow.
92
93 Please refer to `bettercallsal_db` [README](./bettercallsal_db.md) if you would like to run the workflow on the latest version of the **PDG** release.
94
95 &nbsp;
96
97 ### Input
98
99 ---
100
101 The input to the workflow is a folder containing compressed (`.gz`) FASTQ files. Please note that sample grouping happens automatically based on the file name of each FASTQ file. If, for example, a single sample is sequenced across multiple sequencing lanes, you can choose to group those FASTQ files into one sample by using the `--fq_filename_delim` and `--fq_filename_delim_idx` options. By default, `--fq_filename_delim` is set to `_` (underscore) and `--fq_filename_delim_idx` is set to 1.
102
103 For example, if the directory contains FASTQ files as shown below:
104
105 - KB-01_apple_L001_R1.fastq.gz
106 - KB-01_apple_L001_R2.fastq.gz
107 - KB-01_apple_L002_R1.fastq.gz
108 - KB-01_apple_L002_R2.fastq.gz
109 - KB-02_mango_L001_R1.fastq.gz
110 - KB-02_mango_L001_R2.fastq.gz
111 - KB-02_mango_L002_R1.fastq.gz
112 - KB-02_mango_L002_R2.fastq.gz
113
114 Then, to create 2 sample groups, `apple` and `mango`, we split the file name by the delimiter (underscore in this case, which is the default) and group by the first 2 words (`--fq_filename_delim_idx 2`).
115
116 It goes without saying that all the FASTQ files should have uniform naming patterns so that the `--fq_filename_delim` and `--fq_filename_delim_idx` options do not have any adverse effect on collecting the files and creating the sample metadata sheet.
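
The grouping rule described above can be previewed on your own file names before launching a run. The sketch below is not the workflow's own code: `sample_name` is a hypothetical helper that assumes the sample name is formed by joining the first N delimiter-separated fields of the file name, as in the `apple` and `mango` example.

```bash
#!/usr/bin/env bash
# Hypothetical helper; it mimics the grouping rule described in this section
# and is not the code cpipes uses internally.
# Usage: sample_name <fastq file> [delimiter] [index]
sample_name() {
    local file="$1" delim="${2:-_}" idx="${3:-1}"
    # Keep fields 1..idx of the file name, joined by the delimiter.
    basename "$file" | cut -d "$delim" -f "1-$idx"
}

# Preview the sample groups for the example files with --fq_filename_delim_idx 2.
for fq in \
    KB-01_apple_L001_R1.fastq.gz KB-01_apple_L002_R1.fastq.gz \
    KB-02_mango_L001_R1.fastq.gz KB-02_mango_L002_R1.fastq.gz; do
    sample_name "$fq" _ 2
done | sort -u
# prints:
# KB-01_apple
# KB-02_mango
```

With the default index of 1, the same files would instead collapse to `KB-01` and `KB-02`.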
117
118 \
119 &nbsp;
120
121 ### Output
122
123 ---
124
125 All the outputs for each step are stored inside the folder specified with the `--output` option. The `multiqc_report.html` file inside the `bettercallsal-multiqc` folder contains a brief consolidated report and can be opened in any browser on your local workstation.
126
127 \
128 &nbsp;
129
130 ### Computational resources
131
132 ---
133
134 The `bettercallsal` workflow requires at least 16 GB of memory to finish successfully. By default, `bettercallsal` uses 10 CPU cores where possible. You can change this behavior and adjust the CPU cores with the `--max_cpus` option.
135
136 \
137 &nbsp;
138
139 Example:
140
141 ```bash
142 cpipes \
143 --pipeline bettercallsal \
144 --input /path/to/bettercallsal_sim_reads \
145 --output /path/to/bettercallsal_sim_reads_output \
146 --bcs_root_dbdir /path/to/PDG000000002.2537 \
147 --kmaalign_ignorequals \
148 --max_cpus 5 \
149 -profile stdkondagac \
150 -resume
151 ```
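
If you are unsure what to pass to `--max_cpus`, one option is to cap it at the number of cores the machine reports. The sketch below assumes the default ceiling of 10 cores stated above. `pick_max_cpus` is a hypothetical helper, and `getconf _NPROCESSORS_ONLN` is used here as one common way to read the core count.

```bash
#!/usr/bin/env bash
# Hypothetical helper; not part of cpipes.
# Usage: pick_max_cpus <available cores> [ceiling]
# Prints the smaller of the two values (the ceiling defaults to 10).
pick_max_cpus() {
    local avail="$1" ceiling="${2:-10}"
    if [ "$avail" -lt "$ceiling" ]; then
        echo "$avail"
    else
        echo "$ceiling"
    fi
}

# Fall back to 1 core if the core count cannot be determined.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "--max_cpus $(pick_max_cpus "$cores")"
```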
152
153 \
154 &nbsp;
155
156 ### Runtime profiles
157
158 ---
159
160 You can use different run time profiles to suit your specific compute environment. For example, you can run the workflow locally on your machine or on a grid computing infrastructure.
161
162 \
163 &nbsp;
164
165 Example:
166
167 ```bash
168 cd /data/scratch/$USER
169 mkdir nf-cpipes
170 cd nf-cpipes
171 cpipes \
172 --pipeline bettercallsal \
173 --input /path/to/fastq_pass_dir \
174 --output /path/to/where/output/should/go \
175 -profile your_institution
176 ```
177
178 The above command runs the pipeline and stores the output at the location given by the `--output` flag. The **NEXTFLOW** reports are always stored in the current working directory from which `cpipes` is run. For example, for the above command, a directory called `CPIPES-bettercallsal` would hold all the **NEXTFLOW** related logs, reports and trace files.
179
180 \
181 &nbsp;
182
183 ### `your_institution.config`
184
185 ---
186
187 In the above example, the run time profile is specified as `your_institution`. For this to work, add the following lines at the end of the [`computeinfra.config`](../conf/computeinfra.config) file, which should be located inside the `conf` folder. For example, if your institution uses **SGE** or **UNIVA** for grid computing instead of **SLURM** and has a job queue named `normal.q`, then add these lines:
188
189 \
190 &nbsp;
191
192 ```groovy
193 your_institution {
194 process.executor = 'sge'
195 process.queue = 'normal.q'
196 singularity.enabled = false
197 singularity.autoMounts = true
198 docker.enabled = false
199 params.enable_conda = true
200 conda.enabled = true
201 conda.useMicromamba = true
202 params.enable_module = false
203 }
204 ```
205
206 In the above example, all the software provisioning choices are disabled except `conda`. You can also choose to remove the `process.queue` line altogether, and the `bettercallsal` workflow will request the appropriate memory and number of CPU cores automatically, ranging from 1 CPU core, 1 GB and 1 hour for job completion up to 10 CPU cores, 1 TB and 120 hours for job completion.
207
208 \
209 &nbsp;
210
211 ### Cloud computing
212
213 ---
214
215 You can run the workflow in the cloud (this works only with properly set up AWS resources). Add new run time profiles with the required parameters per the [Nextflow docs](https://www.nextflow.io/docs/latest/executor.html):
216
217 \
218 &nbsp;
219
220 Example:
221
222 ```groovy
223 my_aws_batch {
224 executor = 'awsbatch'
225 queue = 'my-batch-queue'
226 aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
227 aws.batch.region = 'us-east-1'
228 singularity.enabled = false
229 singularity.autoMounts = true
230 docker.enabled = true
231 params.conda_enabled = false
232 params.enable_module = false
233 }
234 ```
235
236 \
237 &nbsp;
238
239 ### Example data
240
241 ---
242
243 After you make sure that you have all the [minimum requirements](#minimum-requirements) to run the workflow, you can try the `bettercallsal` pipeline on some simulated reads. The following input dataset contains simulated reads for `Montevideo` and `I 4,[5],12:i:-` in roughly equal proportions.
244
245 - Download simulated reads: [S3](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/bettercallsal_sim_reads.tar.bz2) (~ 3 GB).
246 - Download pre-formatted test database: [S3](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/PDG000000002.2491.test-db.tar.bz2) (~ 75 MB). This test database works only with the simulated reads.
247 - Download pre-formatted full database (**Optional**): If you would like to do a complete run with your own **FASTQ** datasets, you can either create your own [database](./bettercallsal_db.md) or use [PDG000000002.2537](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/PDG000000002.2537.tar.bz2) version of the database (~ 37 GB).
248 - After a successful run of the workflow, your **MultiQC** report should look something like [this](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/bettercallsal_sim_reads_mqc.html).
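
The downloads listed above can be scripted. The following is a minimal sketch using `curl` and `tar`: the URLs are the ones given in the list, while the helper `fetch_and_extract` and the destination directory are hypothetical. The names of the extracted folders depend on the archive contents and are not assumed here.

```bash
#!/usr/bin/env bash
# Hypothetical download helper; not part of cpipes.
BASE_URL="https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal"
READS_TARBALL="bettercallsal_sim_reads.tar.bz2"       # ~ 3 GB
TESTDB_TARBALL="PDG000000002.2491.test-db.tar.bz2"    # ~ 75 MB

# Usage: fetch_and_extract <url> <destination directory>
# Downloads the archive into the destination and unpacks it there.
fetch_and_extract() {
    local url="$1" dest="$2"
    local archive
    archive="$dest/$(basename "$url")"
    mkdir -p "$dest"
    curl -fL -o "$archive" "$url"     # -f: fail on HTTP errors, -L: follow redirects
    tar -xjf "$archive" -C "$dest"    # -j: bzip2-compressed archive
}

# Uncomment to download (the destination path is a placeholder):
# fetch_and_extract "$BASE_URL/$READS_TARBALL"  /path/to/downloads
# fetch_and_extract "$BASE_URL/$TESTDB_TARBALL" /path/to/downloads
```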
249
250 Now run the workflow by ignoring quality values since these are simulated base qualities:
251
252 \
253 &nbsp;
254
255 ```bash
256 cpipes \
257 --pipeline bettercallsal \
258 --input /path/to/bettercallsal_sim_reads \
259 --output /path/to/bettercallsal_sim_reads_output \
260 --bcs_root_dbdir /path/to/PDG000000002.2537 \
261 --kmaalign_ignorequals \
262 -profile stdkondagac \
263 -resume
264 ```
265
266 Please note that the run time profile `stdkondagac` will run jobs locally using `micromamba` for software provisioning. The first time you run the command, a new folder called `kondagac_cache` will be created and subsequent runs should use this `conda` cache.
267
268 \
269 &nbsp;
270
271 ## Using `sourmash`
272
273 Beginning with `v0.3.0` of the `bettercallsal` workflow, `sourmash` sketching is used to further narrow down possible serotype hits. It is **ON** by default. This enables the generation of an **ANI Containment** matrix for **Samples** vs **Genomes**. There may be multiple hits for the same serotype in the final **MultiQC** report, as multiple genome accessions can belong to a single serotype.
274
275 You can turn **OFF** this feature with the `--sourmashsketch_run false` option.
276
277 \
278 &nbsp;
279
280 ## `bettercallsal` CLI Help
281
282 ```text
283 [Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal --help
284 N E X T F L O W ~ version 22.10.0
285 Launching `./bettercallsal/cpipes` [awesome_chandrasekhar] DSL2 - revision: 8da4e11078
286 ================================================================================
287 (o)
288 ___ _ __ _ _ __ ___ ___
289 / __|| '_ \ | || '_ \ / _ \/ __|
290 | (__ | |_) || || |_) || __/\__ \
291 \___|| .__/ |_|| .__/ \___||___/
292 | | | |
293 |_| |_|
294 --------------------------------------------------------------------------------
295 A collection of modular pipelines at CFSAN, FDA.
296 --------------------------------------------------------------------------------
297 Name : CPIPES
298 Author : Kranti Konganti
299 Version : 0.5.0
300 Center : CFSAN, FDA.
301 ================================================================================
302
303 Workflow : bettercallsal
304
305 Author : Kranti Konganti
306
307 Version : 0.5.0
308
309
310 Usage : cpipes --pipeline bettercallsal [options]
311
312
313 Required :
314
315 --input : Absolute path to directory containing FASTQ
316 files. The directory should contain only
317 FASTQ files as all the files within the
318 mentioned directory will be read. Ex: --
319 input /path/to/fastq_pass
320
321 --output : Absolute path to directory where all the
322 pipeline outputs should be stored. Ex: --
323 output /path/to/output
324
325 Other options :
326
327 --metadata : Absolute path to metadata CSV file
328 containing five mandatory columns: sample,
329 fq1,fq2,strandedness,single_end. The fq1
330 and fq2 columns contain absolute paths to
331 the FASTQ files. This option can be used in
332 place of --input option. This is rare. Ex
333 : --metadata samplesheet.csv
334
335 --fq_suffix : The suffix of FASTQ files (Unpaired reads
336 or R1 reads or Long reads) if an input
337 directory is mentioned via --input option.
338 Default: .fastq.gz
339
340 --fq2_suffix : The suffix of FASTQ files (Paired-end reads
341 or R2 reads) if an input directory is
342 mentioned via --input option. Default:
343 _R2_001.fastq.gz
344
345 --fq_filter_by_len : Remove FASTQ reads that are less than this
346 many bases. Default: 0
347
348 --fq_strandedness : The strandedness of the sequencing run.
349 This is mostly needed if your sequencing
350 run is RNA-SEQ. For most of the other runs
351 , it is probably safe to use unstranded for
352 the option. Default: unstranded
353
354 --fq_single_end : SINGLE-END information will be auto-
355 detected but this option forces PAIRED-END
356 FASTQ files to be treated as SINGLE-END so
357 only read 1 information is included in auto
358 -generated samplesheet. Default: true
359
360 --fq_filename_delim : Delimiter by which the file name is split
361 to obtain sample name. Default: _
362
363 --fq_filename_delim_idx : After splitting FASTQ file name by using
364 the --fq_filename_delim option, all
365 elements before this index (1-based) will
366 be joined to create final sample name.
367 Default: 1
368
369 --bcs_concat_pe : Concatenate paired-end files. Default: true
370
371 --bbmerge_run : Run BBMerge tool. Default: false
372
373 --bbmerge_reads : Quit after this many read pairs (-1 means
374 all) Default: -1
375
376 --bbmerge_adapters : Absolute UNIX path pointing to the adapters
377 file in FASTA format. Default: false
378
379 --bbmerge_ziplevel : Set to 1 (lowest) through 9 (max) to change
380 compression level; lower compression is
381 faster. Default: 1
382
383 --bbmerge_ordered : Output reads in the same order as input.
384 Default: false
385
386 --bbmerge_qtrim : Trim read ends to remove bases with quality
387 below --bbmerge_minq. Trims BEFORE merging
388 . Values: t (trim both ends), f (neither
389 end), r (right end only), l (left end only
390 ). Default: true
391
392 --bbmerge_qtrim2 : May be specified instead of --bbmerge_qtrim
393 to perform trimming only if merging is
394 unsuccesful. then retry merging. Default:
395 false
396
397 --bbmerge_trimq : Trim quality threshold. This may be comma-
398 delimited list (ascending) to try multiple
399 values. Default: 10
400
401 --bbmerge_minlength : (ml) Reads shorter than this after trimming
402 , but before merging, will be discarded.
403 Pairs will be discarded onlyif both are
404 shorter. Default: 1
405
406 --bbmerge_tbo : (trimbyoverlap). Trim overlapping reads to
407 remove right most (3') non-overlaping
408 portion instead of joining Default: false
409
410 --bbmerge_minavgquality : (maq). Reads with average quality below
411 this after trimming will not be attempted
412 to merge. Default: 30
413
414 --bbmerge_trimpolya : Trim trailing poly-A tail from adapter
415 output. Only affects outadapter. This also
416 trims poly-A followed by poly-G, which
417 occurs on NextSeq. Default: true
418
419 --bbmerge_pfilter : Ban improbable overlaps. Higher is more
420 strict. 0 will disable the filter; 1 will
421 allow only perfect overlaps. Default: 1
422
423 --bbmerge_ouq : Calculate best overlap using quality values
424 . Default: false
425
426 --bbmerge_owq : Calculate best overlap without using
427 quality values. Default: true
428
429 --bbmerge_strict : Decrease false positive rate and merging
430 rate. Default: false
431
432 --bbmerge_verystrict : Greatly decrease false positive rate and
433 merging rate. Default: false
434
435 --bbmerge_ultrastrict : Decrease false positive rate and merging
436 rate even more. Default: true
437
438 --bbmerge_maxstrict : Maxiamally decrease false positive rate and
439 merging rate. Default: false
440
441 --bbmerge_loose : Increase false positive rate and merging
442 rate. Default: false
443
444 --bbmerge_veryloose : Greatly increase false positive rate and
445 merging rate. Default: false
446
447 --bbmerge_ultraloose : Increase false positive rate and merging
448 rate even more. Default: false
449
450 --bbmerge_maxloose : Maximally increase false positive rate and
451 merging rate. Default: false
452
453 --bbmerge_fast : Fastest possible preset. Default: false
454
455 --bbmerge_k : Kmer length. 31 (or less) is fastest and
456 uses the least memory, but higher values
457 may be more accurate. 60 tends to work well
458 for 150bp reads. Default: 60
459
460 --bbmerge_prealloc : Pre-allocate memory rather than dynamically
461 growing. Faster and more memory-efficient
462 for large datasets. A float fraction (0-1)
463 may be specified, default 1. Default: true
464
465 --fastp_run : Run fastp tool. Default: true
466
467 --fastp_failed_out : Specify whether to store reads that cannot
468 pass the filters. Default: false
469
470 --fastp_merged_out : Specify whether to store merged output or
471 not. Default: false
472
473 --fastp_overlapped_out : For each read pair, output the overlapped
474 region if it has no mismatched base.
475 Default: false
476
477 --fastp_6 : Indicate that the input is using phred64
478 scoring (it'll be converted to phred33, so
479 the output will still be phred33). Default
480 : false
481
482 --fastp_reads_to_process : Specify how many reads/pairs are to be
483 processed. Default value 0 means process
484 all reads. Default: 0
485
486 --fastp_fix_mgi_id : The MGI FASTQ ID format is not compatible
487 with many BAM operation tools, enable this
488 option to fix it. Default: false
489
490 --fastp_A : Disable adapter trimming. On by default.
491 Default: false
492
493 --fastp_adapter_fasta : Specify a FASTA file to trim both read1 and
494 read2 (if PE) by all the sequences in this
495 FASTA file. Default: false
496
497 --fastp_f : Trim how many bases in front of read1.
498 Default: 0
499
500 --fastp_t : Trim how many bases at the end of read1.
501 Default: 0
502
503 --fastp_b : Max length of read1 after trimming. Default
504 : 0
505
506 --fastp_F : Trim how many bases in front of read2.
507 Default: 0
508
509 --fastp_T : Trim how many bases at the end of read2.
510 Default: 0
511
512 --fastp_B : Max length of read2 after trimming. Default
513 : 0
514
515 --fastp_dedup : Enable deduplication to drop the duplicated
516 reads/pairs. Default: true
517
518 --fastp_dup_calc_accuracy : Accuracy level to calculate duplication (1~
519 6), higher level uses more memory (1G, 2G,
520 4G, 8G, 16G, 24G). Default 1 for no-dedup
521 mode, and 3 for dedup mode. Default: 6
522
523 --fastp_poly_g_min_len : The minimum length to detect polyG in the
524 read tail. Default: 10
525
526 --fastp_G : Disable polyG tail trimming. Default: true
527
528 --fastp_x : Enable polyX trimming in 3' ends. Default:
529 false
530
531 --fastp_poly_x_min_len : The minimum length to detect polyX in the
532 read tail. Default: 10
533
534 --fastp_cut_front : Move a sliding window from front (5') to
535 tail, drop the bases in the window if its
536 mean quality < threshold, stop otherwise.
537 Default: true
538
539 --fastp_cut_tail : Move a sliding window from tail (3') to
540 front, drop the bases in the window if its
541 mean quality < threshold, stop otherwise.
542 Default: false
543
544 --fastp_cut_right : Move a sliding window from tail, drop the
545 bases in the window and the right part if
546 its mean quality < threshold, and then stop
547 . Default: true
548
549 --fastp_W : Sliding window size shared by --
550 fastp_cut_front, --fastp_cut_tail and --
551 fastp_cut_right. Default: 20
552
553 --fastp_M : The mean quality requirement shared by --
554 fastp_cut_front, --fastp_cut_tail and --
555 fastp_cut_right. Default: 30
556
557 --fastp_q : The quality value below which a base should
558 is not qualified. Default: 30
559
560 --fastp_u : What percent of bases are allowed to be
561 unqualified. Default: 40
562
563 --fastp_n : How many N's can a read have. Default: 5
564
565 --fastp_e : If the full reads' average quality is below
566 this value, then it is discarded. Default
567 : 0
568
569 --fastp_l : Reads shorter than this length will be
570 discarded. Default: 35
571
572 --fastp_max_len : Reads longer than this length will be
573 discarded. Default: 0
574
575 --fastp_y : Enable low complexity filter. The
576 complexity is defined as the percentage of
577 bases that are different from its next base
578 (base[i] != base[i+1]). Default: true
579
580 --fastp_Y : The threshold for low complexity filter (0~
581 100). Ex: A value of 30 means 30%
582 complexity is required. Default: 30
583
584 --fastp_U : Enable Unique Molecular Identifier (UMI)
585 pre-processing. Default: false
586
587 --fastp_umi_loc : Specify the location of UMI, can be one of
588 index1/index2/read1/read2/per_index/
589 per_read. Default: false
590
591 --fastp_umi_len : If the UMI is in read1 or read2, its length
592 should be provided. Default: false
593
594 --fastp_umi_prefix : If specified, an underline will be used to
595 connect prefix and UMI (i.e. prefix=UMI,
596 UMI=AATTCG, final=UMI_AATTCG). Default:
597 false
598
599 --fastp_umi_skip : If the UMI is in read1 or read2, fastp can
600 skip several bases following the UMI.
601 Default: false
602
603 --fastp_p : Enable overrepresented sequence analysis.
604 Default: true
605
606 --fastp_P : One in this many number of reads will be
607 computed for overrepresentation analysis (1
608 ~10000), smaller is slower. Default: 20
609
610 --fastp_use_custom_adapaters : Use custom adapter FASTA with fastp on top
611 of built-in adapter sequence auto-detection
612 . Enabling this option will attempt to find
613 and remove all possible Illumina adapter
614 and primer sequences but will make the
615 workflow run slow. Default: false
616
617 --mashscreen_run : Run `mash screen` tool. Default: true
618
619 --mashscreen_w : Winner-takes-all strategy for identity
620 estimates. After counting hashes for each
621 query, hashes that appear in multiple
622 queries will be removed from all except the
623 one with the best identity (ties broken by
624 larger query), and other identities will
625 be reduced. This removes output redundancy
626 , providing a rough compositional outline
627 . Default: false
628
629 --mashscreen_i : Minimum identity to report. Inclusive
630 unless set to zero, in which case only
631 identities greater than zero (i.e. with at
632 least one shared hash) will be reported.
633 Set to -1 to output everything. (-1-1).
634 Default: false
635
636 --mashscreen_v : Maximum p-value to report (0-1). Default:
637 false
638
639 --tuspy_run : Run the get_top_unique_mash_hits_genomes.py
640 script. Default: true
641
642 --tuspy_s : Absolute UNIX path to metadata text file
643 with the field separator, | and 5 fields:
644 serotype|asm_lvl|asm_url|snp_cluster_idEx:
645 serotype=Derby,antigen_formula=4:f,g:-|
646 Scaffold|402440|ftp://...|PDS000096654.2.
647 Mentioning this option will create a pickle
648 file for the provided metadata and exits.
649 Default: false
650
651 --tuspy_m : Absolute UNIX path to mash screen results
652 file. Default: false
653
654 --tuspy_ps : Absolute UNIX Path to serialized metadata
655 object in a pickle file. Default: /hpc/db/
656 bettercallsal/latest/index_metadata/
657 per_snp_cluster.ACC2SERO.pickle
658
659 --tuspy_gd : Absolute UNIX Path to directory containing
660 gzipped genome FASTA files. Default: /hpc/
661 db/bettercallsal/latest/scaffold_genomes
662
663 --tuspy_gds : Genome FASTA file suffix to search for in
664 the genome directory. Default:
665 _scaffolded_genomic.fna.gz
666
667 --tuspy_n : Return up to this many number of top N
668 unique genome accession hits. Default: 10
669
670 --sourmashsketch_run : Run `sourmash sketch dna` tool. Default:
671 true
672
673 --sourmashsketch_mode : Select which type of signatures to be
674 created: dna, protein, fromfile or
675 translate. Default: dna
676
677 --sourmashsketch_p : Signature parameters to use. Default: abund
678 ,scaled=1000,k=51,k=61,k=71
679
680 --sourmashsketch_file : <path> A text file containing a list of
681 sequence files to load. Default: false
682
683 --sourmashsketch_f : Recompute signatures even if the file
684 exists. Default: false
685
686 --sourmashsketch_merge : Merge all input files into one signature
687 file with the specified name. Default:
688 false
689
690 --sourmashsketch_singleton : Compute a signature for each sequence
691 record individually. Default: true
692
693 --sourmashsketch_name : Name the signature generated from each file
694 after the first record in the file.
695 Default: false
696
697 --sourmashsketch_randomize : Shuffle the list of input files randomly.
698 Default: false
699
700 --sourmashgather_run : Run `sourmash gather` tool. Default: true
701
702 --sourmashgather_n : Number of results to report. By default,
703 will terminate at --sourmashgather_thr_bp
704 value. Default: false
705
706 --sourmashgather_thr_bp : Reporting threshold (in bp) for estimated
707 overlap with remaining query. Default:
708 false
709
710 --sourmashgather_ignoreabn : Do NOT use k-mer abundances if present.
711 Default: false
712
713 --sourmashgather_prefetch : Use prefetch before gather. Default: false
714
715 --sourmashgather_noprefetch : Do not use prefetch before gather. Default
716 : false
717
718 --sourmashgather_ani_ci : Output confidence intervals for ANI
719 estimates. Default: true
720
721 --sourmashgather_k : The k-mer size to select. Default: 71
722
723 --sourmashgather_protein : Choose a protein signature. Default: false
724
725 --sourmashgather_noprotein : Do not choose a protein signature. Default
726 : false
727
728 --sourmashgather_dayhoff : Choose Dayhoff-encoded amino acid
729 signatures. Default: false
730
731 --sourmashgather_nodayhoff : Do not choose Dayhoff-encoded amino acid
732 signatures. Default: false
733
734 --sourmashgather_hp : Choose hydrophobic-polar-encoded amino acid
735 signatures. Default: false
736
737 --sourmashgather_nohp : Do not choose hydrophobic-polar-encoded
738 amino acid signatures. Default: false
739
740 --sourmashgather_dna : Choose DNA signature. Default: true
741
742 --sourmashgather_nodna : Do not choose DNA signature. Default: false
743
744 --sourmashgather_scaled : Scaled value should be between 100 and 1e6
745 . Default: false
746
747 --sourmashgather_inc_pat : Search only signatures that match this
748 pattern in name, filename, or md5. Default
749 : false
750
751 --sourmashgather_exc_pat : Search only signatures that do not match
752 this pattern in name, filename, or md5.
753 Default: false
754
755 --sourmashsearch_run : Run `sourmash search` tool. Default: false
756
757 --sourmashsearch_n : Number of results to report. By default,
758 will terminate at --sourmashsearch_thr
759 value. Default: false
760
761 --sourmashsearch_thr : Reporting threshold (similarity) to return
762 results. Default: 0
763
764 --sourmashsearch_contain : Score based on containment rather than
765 similarity. Default: false
766
767 --sourmashsearch_maxcontain : Score based on max containment rather than
768 similarity. Default: false
769
770 --sourmashsearch_ignoreabn : Do NOT use k-mer abundances if present.
771 Default: true
772
773 --sourmashsearch_ani_ci : Output confidence intervals for ANI
774 estimates. Default: false
775
776 --sourmashsearch_k : The k-mer size to select. Default: 71
777
778 --sourmashsearch_protein : Choose a protein signature. Default: false
779
780 --sourmashsearch_noprotein : Do not choose a protein signature. Default
781 : false
782
783 --sourmashsearch_dayhoff : Choose Dayhoff-encoded amino acid
784 signatures. Default: false
785
786 --sourmashsearch_nodayhoff : Do not choose Dayhoff-encoded amino acid
787 signatures. Default: false
788
789 --sourmashsearch_hp : Choose hydrophobic-polar-encoded amino acid
790 signatures. Default: false
791
792 --sourmashsearch_nohp : Do not choose hydrophobic-polar-encoded
793 amino acid signatures. Default: false
794
795 --sourmashsearch_dna : Choose DNA signature. Default: true
796
797 --sourmashsearch_nodna : Do not choose DNA signature. Default: false
798
799 --sourmashsearch_scaled : Scaled value should be between 100 and 1e6
800 . Default: false
801
802 --sourmashsearch_inc_pat : Search only signatures that match this
803 pattern in name, filename, or md5. Default
804 : false
805
806 --sourmashsearch_exc_pat : Search only signatures that do not match
807 this pattern in name, filename, or md5.
808 Default: false
809
810 --sfhpy_run : Run the sourmash_filter_hits.py script.
811 Default: true
812
813 --sfhpy_fcn : Column name by which filtering of rows
814 should be applied. Default: f_match
815
816 --sfhpy_fcv : Remove genomes whose match with the query
817 FASTQ is less than this much. Default: 0.1
818
819 --sfhpy_gt : Apply greather than or equal to condition
820 on numeric values of --sfhpy_fcn column.
821 Default: true
822
823 --sfhpy_lt : Apply less than or equal to condition on
824 numeric values of --sfhpy_fcn column.
825 Default: false
826
827 --kmaindex_run : Run kma index tool. Default: true
828
829 --kmaindex_t_db : Add to existing DB. Default: false
830
831 --kmaindex_k : k-mer size. Default: 31
832
833 --kmaindex_m : Minimizer size. Default: false
834
835 --kmaindex_hc : Homopolymer compression. Default: false
836
837 --kmaindex_ML : Minimum length of templates. Defaults to --
838 kmaindex_k Default: false
839
840 --kmaindex_ME : Mega DB. Default: false
841
842 --kmaindex_Sparse : Make Sparse DB. Default: false
843
844 --kmaindex_ht : Homology template. Default: false
845
846 --kmaindex_hq : Homology query. Default: false
847
848 --kmaindex_and : Both homology thresholds have to reach.
849 Default: false
850
851 --kmaindex_nbp : No bias print. Default: false
852
853 --kmaalign_run : Run kma tool. Default: true
854
855 --kmaalign_int : Input file has interleaved reads. Default
856 : false
857
858 --kmaalign_ef : Output additional features. Default: false
859
860 --kmaalign_vcf : Output vcf file. 2 to apply FT. Default:
861 false
862
863 --kmaalign_sam : Output SAM, 4/2096 for mapped/aligned.
864 Default: false
865
866 --kmaalign_nc : No consensus file. Default: true
867
868 --kmaalign_na : No aln file. Default: true
869
870 --kmaalign_nf : No frag file. Default: true
871
872 --kmaalign_a : Output all template mappings. Default:
873 false
874
875 --kmaalign_and : Use both -mrs and p-value on consensus.
876 Default: false
877
878 --kmaalign_oa : Use neither -mrs or p-value on consensus.
879 Default: false
880
881 --kmaalign_bc : Minimum support to call bases. Default:
882 false
883
884 --kmaalign_bcNano : Altered indel calling for ONT data. Default
885 : false
886
887 --kmaalign_bcd : Minimum depth to call bases. Default: false
888
889 --kmaalign_bcg : Maintain insignificant gaps. Default: false
890
891 --kmaalign_ID : Minimum consensus ID. Default: false
892
893 --kmaalign_md : Minimum depth. Default: false
894
895 --kmaalign_dense : Skip insertion in consensus. Default: false
896
897 --kmaalign_ref_fsa : Use Ns on indels. Default: false
898
899 --kmaalign_Mt1 : Map everything to one template. Default:
900 false
901
902 --kmaalign_1t1 : Map one query to one template. Default:
903 false
904
905 --kmaalign_mrs : Minimum relative alignment score. Default:
906 false
907
908 --kmaalign_mrc : Minimum query coverage. Default: 0.99
909
910 --kmaalign_mp : Minimum phred score of trailing and leading
911 bases. Default: 30
912
913 --kmaalign_mq : Set the minimum mapping quality. Default:
914 false
915
916 --kmaalign_eq : Minimum average quality score. Default: 30
917
918 --kmaalign_5p : Trim 5 prime by this many bases. Default:
919 false
920
921 --kmaalign_3p : Trim 3 prime by this many bases Default:
922 false
923
924 --kmaalign_apm : Sets both -pm and -fpm Default: false
925
926 --kmaalign_cge : Set CGE penalties and rewards Default:
927 false
928
929 --salmonidx_run : Run `salmon index` tool. Default: true
930
931 --salmonidx_k : The size of k-mers that should be used for
932 the quasi index. Default: false
933
934 --salmonidx_gencode : This flag will expect the input transcript
935 FASTA to be in GENCODE format, and will
936 split the transcript name at the first `|`
937 character. These reduced names will be used
938 in the output and when looking for these
939 transcripts in a gene to transcript GTF.
940 Default: false
941
942 --salmonidx_features : This flag will expect the input reference
943 to be in the tsv file format, and will
944 split the feature name at the first `tab`
945 character. These reduced names will be used
946 in the output and when looking for the
947 sequence of the features. GTF. Default:
948 false
949
950 --salmonidx_keepDuplicates : This flag will disable the default indexing
951 behavior of discarding sequence-identical
952 duplicate transcripts. If this flag is
953 passed then duplicate transcripts that
954 appear in the input will be retained and
955 quantified separately. Default: false
956
957 --salmonidx_keepFixedFasta : Retain the fixed fasta file (without short
958 transcripts and duplicates, clipped, etc.)
959 generated during indexing. Default: false
960
961 --salmonidx_filterSize : The size of the Bloom filter that will be
962 used by TwoPaCo during indexing. The filter
963 will be of size 2^{filterSize}. A value of
964 -1 means that the filter size will be
965 automatically set based on the number of
966 distinct k-mers in the input, as estimated
967 by nthll. Default: false
968
969 --salmonidx_sparse : Build the index using a sparse sampling of
970 k-mer positions This will require less
971 memory (especially during quantification),
972 but will take longer to constructand can
973 slow down mapping / alignment. Default:
974 false
975
976 --salmonidx_n : Do not clip poly-A tails from the ends of
977 target sequences. Default: false
978
979 --gsrpy_run : Run the gen_salmon_res_table.py script.
980 Default: true
981
982 --gsrpy_url : Generate an additional column in final
983 results table which links out to NCBI
984 Pathogens Isolate Browser. Default: true
985
986 Help options :
987
988 --help : Display this message.
989
990 ```