Mercurial > repos > kkonganti > cfsan_bettercallsal

diff 0.5.0/readme/bettercallsal.md @ 1:365849f031fd
"planemo upload"
author: kkonganti
date: Mon, 05 Jun 2023 18:48:51 -0400
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/0.5.0/readme/bettercallsal.md	Mon Jun 05 18:48:51 2023 -0400
@@ -0,0 +1,990 @@
+# bettercallsal
+
+`bettercallsal` is an automated workflow to assign Salmonella serotype based on [NCBI Pathogens Database](https://www.ncbi.nlm.nih.gov/pathogens). It uses `MASH` to reduce the search space followed by additional genome filtering with `sourmash`. It then performs genome based alignment with `kma` followed by count generation using `salmon`. This workflow is especially useful in a case where a sample is of multi-serovar mixture.
+
+\
+&nbsp;
+
+<!-- TOC -->
+
+- [Minimum Requirements](#minimum-requirements)
+- [Usage and Examples](#usage-and-examples)
+  - [Database](#database)
+  - [Input](#input)
+  - [Output](#output)
+  - [Computational resources](#computational-resources)
+  - [Runtime profiles](#runtime-profiles)
+  - [your_institution.config](#your_institutionconfig)
+  - [Cloud computing](#cloud-computing)
+  - [Example data](#example-data)
+- [Using sourmash](#using-sourmash)
+- [bettercallsal CLI Help](#bettercallsal-cli-help)
+
+<!-- /TOC -->
+
+\
+&nbsp;
+
+## Minimum Requirements
+
+1. [Nextflow version 22.10.0](https://github.com/nextflow-io/nextflow/releases/download/v22.10.0/nextflow).
+    - Make the `nextflow` binary executable (`chmod 755 nextflow`) and also make sure that it is made available in your `$PATH`.
+    - If your existing `JAVA` install does not support the newest **Nextflow** version, you can try **Amazon**'s `JAVA` (OpenJDK):  [Corretto](https://corretto.aws/downloads/latest/amazon-corretto-17-x64-linux-jdk.tar.gz).
+2. Either of `micromamba` or `docker` or `singularity` installed and made available in your `$PATH`.
+    - Running the workflow via `micromamba` software provisioning is **preferred** as it does not require any `sudo` or `admin` privileges or any other configurations with respect to the various container providers.
+    - To install `micromamba` for your system type, please follow these [installation steps](https://mamba.readthedocs.io/en/latest/installation.html#manual-installation) and make sure that the `micromamba` binary is made available in your `$PATH`.
+    - Just the `curl` step is sufficient to download the binary as far as running the workflows are concerned.
+3. Minimum of 10 CPU cores and about 16 GBs for main workflow steps. More memory may be required if your **FASTQ** files are big.
+
+\
+&nbsp;
+
+## Usage and Examples
+
+Clone or download this repository and then call `cpipes`.
+
+```bash
+cpipes --pipeline bettercallsal [options]
+```
+
+\
+&nbsp;
+
+**Example**: Run the default `bettercallsal` pipeline in single-end mode.
+
+```bash
+cd /data/scratch/$USER
+mkdir nf-cpipes
+cd nf-cpipes
+cpipes
+      --pipeline bettercallsal \
+      --input /path/to/illumina/fastq/dir \
+      --output /path/to/output \
+      --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db
+```
+
+\
+&nbsp;
+
+**Example**: Run the `bettercallsal` pipeline in paired-end mode. In this mode, the `R1` and `R2` files are concatenated. We have found that concatenated reads yields better calling rates. Please refer to the **Methods** and the **Results** section in our [preprint](https://www.biorxiv.org/content/10.1101/2023.04.06.535929v1.full) for more information. Users can still choose to use `bbmerge.sh` by adding the following options on the command-line: `--bbmerge_run true --bcs_concat_pe false`.
+
+```bash
+cd /data/scratch/$USER
+mkdir nf-cpipes
+cd nf-cpipes
+cpipes \
+      --pipeline bettercallsal \
+      --input /path/to/illumina/fastq/dir \
+      --output /path/to/output \
+      --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db \
+      --fq_single_end false \
+      --fq_suffix '_R1_001.fastq.gz'
+```
+
+\
+&nbsp;
+
+### Database
+
+---
+
+The successful run of the workflow requires certain database flat files specific for the workflow.
+
+Please refer to `bettercallsal_db` [README](./bettercallsal_db.md) if you would like to run the workflow on the latest version of the **PDG** release.
+
+&nbsp;
+
+### Input
+
+---
+
+The input to the workflow is a folder containing compressed (`.gz`) FASTQ files. Please note that the sample grouping happens automatically by the file name of the FASTQ file. If for example, a single sample is sequenced across multiple sequencing lanes, you can choose to group those FASTQ files into one sample by using the `--fq_filename_delim` and `--fq_filename_delim_idx` options. By default, `--fq_filename_delim` is set to `_` (underscore) and `--fq_filename_delim_idx` is set to 1.
+
+For example, if the directory contains FASTQ files as shown below:
+
+- KB-01_apple_L001_R1.fastq.gz
+- KB-01_apple_L001_R2.fastq.gz
+- KB-01_apple_L002_R1.fastq.gz
+- KB-01_apple_L002_R2.fastq.gz
+- KB-02_mango_L001_R1.fastq.gz
+- KB-02_mango_L001_R2.fastq.gz
+- KB-02_mango_L002_R1.fastq.gz
+- KB-02_mango_L002_R2.fastq.gz
+
+Then, to create 2 sample groups, `apple` and `mango`, we split the file name by the delimitor (underscore in the case, which is default) and group by the first 2 words (`--fq_filename_delim_idx 2`).
+
+This goes without saying that all the FASTQ files should have uniform naming patterns so that `--fq_filename_delim` and `--fq_filename_delim_idx` options do not have any adverse effect in collecting and creating a sample metadata sheet.
+
+\
+&nbsp;
+
+### Output
+
+---
+
+All the outputs for each step are stored inside the folder mentioned with the `--output` option. A `multiqc_report.html` file inside the `bettercallsal-multiqc` folder can be opened in any browser on your local workstation which contains a consolidated brief report.
+
+\
+&nbsp;
+
+### Computational resources
+
+---
+
+The workflow `bettercallsal` requires at least a minimum of 16 GBs of memory to successfully finish the workflow. By default, `bettercallsal` uses 10 CPU cores where possible. You can change this behavior and adjust the CPU cores with `--max_cpus` option.
+
+\
+&nbsp;
+
+Example:
+
+```bash
+cpipes \
+    --pipeline bettercallsal \
+    --input /path/to/bettercallsal_sim_reads \
+    --output /path/to/bettercallsal_sim_reads_output \
+    --bcs_root_dbdir /path/to/PDG000000002.2537
+    --kmaalign_ignorequals \
+    --max_cpus 5 \
+    -profile stdkondagac \
+    -resume
+```
+
+\
+&nbsp;
+
+### Runtime profiles
+
+---
+
+You can use different run time profiles that suit your specific compute environments i.e., you can run the workflow locally on your machine or in a grid computing infrastructure.
+
+\
+&nbsp;
+
+Example:
+
+```bash
+cd /data/scratch/$USER
+mkdir nf-cpipes
+cd nf-cpipes
+cpipes \
+    --pipeline bettercallsal \
+    --input /path/to/fastq_pass_dir \
+    --output /path/to/where/output/should/go \
+    -profile your_institution
+```
+
+The above command would run the pipeline and store the output at the location per the `--output` flag and the **NEXTFLOW** reports are always stored in the current working directory from where `cpipes` is run. For example, for the above command, a directory called `CPIPES-bettercallsal` would hold all the **NEXTFLOW** related logs, reports and trace files.
+
+\
+&nbsp;
+
+### `your_institution.config`
+
+---
+
+In the above example, we can see that we have mentioned the run time profile as `your_institution`. For this to work, add the following lines at the end of [`computeinfra.config`](../conf/computeinfra.config) file which should be located inside the `conf` folder. For example, if your institution uses **SGE** or **UNIVA** for grid computing instead of **SLURM** and has a job queue named `normal.q`, then add these lines:
+
+\
+&nbsp;
+
+```groovy
+your_institution {
+    process.executor = 'sge'
+    process.queue = 'normal.q'
+    singularity.enabled = false
+    singularity.autoMounts = true
+    docker.enabled = false
+    params.enable_conda = true
+    conda.enabled = true
+    conda.useMicromamba = true
+    params.enable_module = false
+}
+```
+
+In the above example, by default, all the software provisioning choices are disabled except `conda`. You can also choose to remove the `process.queue` line altogether and the `bettercallsal` workflow will request the appropriate memory and number of CPU cores automatically, which ranges from 1 CPU, 1 GB and 1 hour for job completion up to 10 CPU cores, 1 TB and 120 hours for job completion.
+
+\
+&nbsp;
+
+### Cloud computing
+
+---
+
+You can run the workflow in the cloud (works only with proper set up of AWS resources). Add new run time profiles with required parameters per [Nextflow docs](https://www.nextflow.io/docs/latest/executor.html):
+
+\
+&nbsp;
+
+Example:
+
+```groovy
+my_aws_batch {
+    executor = 'awsbatch'
+    queue = 'my-batch-queue'
+    aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
+    aws.batch.region = 'us-east-1'
+    singularity.enabled = false
+    singularity.autoMounts = true
+    docker.enabled = true
+    params.conda_enabled = false
+    params.enable_module = false
+}
+```
+
+\
+&nbsp;
+
+### Example data
+
+---
+
+After you make sure that you have all the [minimum requirements](#minimum-requirements) to run the workflow, you can try the `bettercallsal` pipeline on some simulated reads. The following input dataset contains simulated reads for `Montevideo` and `I 4,[5],12:i:-` in about roughly equal proportions.
+
+- Download simulated reads: [S3](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/bettercallsal_sim_reads.tar.bz2) (~ 3 GB).
+- Download pre-formatted test database: [S3](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/PDG000000002.2491.test-db.tar.bz2) (~ 75 MB). This test database works only with the simulated reads.
+- Download pre-formatted full database (**Optional**): If you would like to do a complete run with your own **FASTQ** datasets, you can either create your own [database](./bettercallsal_db.md) or use [PDG000000002.2537](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/PDG000000002.2537.tar.bz2) version of the database (~ 37 GB).
+- After succesful run of the workflow, your **MultiQC** report should look something like [this](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/bettercallsal_sim_reads_mqc.html).
+
+Now run the workflow by ignoring quality values since these are simulated base qualities:
+
+\
+&nbsp;
+
+```bash
+cpipes \
+    --pipeline bettercallsal \
+    --input /path/to/bettercallsal_sim_reads \
+    --output /path/to/bettercallsal_sim_reads_output \
+    --bcs_root_dbdir /path/to/PDG000000002.2537
+    --kmaalign_ignorequals \
+    -profile stdkondagac \
+    -resume
+```
+
+Please note that the run time profile `stdkondagac` will run jobs locally using `micromamba` for software provisioning. The first time you run the command, a new folder called `kondagac_cache` will be created and subsequent runs should use this `conda` cache.
+
+\
+&nbsp;
+
+## Using `sourmash`
+
+Beginning with `v0.3.0` of `bettercallsal` workflow, `sourmash` sketching is used to further narrow down possible serotype hits. It is **ON** by default. This will enable the generation of **ANI Containment** matrix for **Samples** vs **Genomes**. There may be multiple hits for the same serotype in the final **MultiQC** report as multiple genome accessions can belong to a single serotype.
+
+You can turn **OFF** this feature with `--sourmashsketch_run false` option.
+
+\
+&nbsp;
+
+## `bettercallsal` CLI Help
+
+```text
+[Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal --help
+N E X T F L O W  ~  version 22.10.0
+Launching `./bettercallsal/cpipes` [awesome_chandrasekhar] DSL2 - revision: 8da4e11078
+================================================================================
+             (o)
+  ___  _ __   _  _ __    ___  ___
+ / __|| '_ \ | || '_ \  / _ \/ __|
+| (__ | |_) || || |_) ||  __/\__ \
+ \___|| .__/ |_|| .__/  \___||___/
+      | |       | |
+      |_|       |_|
+--------------------------------------------------------------------------------
+A collection of modular pipelines at CFSAN, FDA.
+--------------------------------------------------------------------------------
+Name                            : CPIPES
+Author                          : Kranti Konganti
+Version                         : 0.5.0
+Center                          : CFSAN, FDA.
+================================================================================
+
+Workflow                        : bettercallsal
+
+Author                          : Kranti Konganti
+
+Version                         : 0.5.0
+
+
+Usage                           : cpipes --pipeline bettercallsal [options]
+
+
+Required                        :
+
+--input                         : Absolute path to directory containing FASTQ
+                                  files. The directory should contain only
+                                  FASTQ files as all the files within the
+                                  mentioned directory will be read. Ex: --
+                                  input /path/to/fastq_pass
+
+--output                        : Absolute path to directory where all the
+                                  pipeline outputs should be stored. Ex: --
+                                  output /path/to/output
+
+Other options                   :
+
+--metadata                      : Absolute path to metadata CSV file
+                                  containing five mandatory columns: sample,
+                                  fq1,fq2,strandedness,single_end. The fq1
+                                  and fq2 columns contain absolute paths to
+                                  the FASTQ files. This option can be used in
+                                  place of --input option. This is rare. Ex
+                                  : --metadata samplesheet.csv
+
+--fq_suffix                     : The suffix of FASTQ files (Unpaired reads
+                                  or R1 reads or Long reads) if an input
+                                  directory is mentioned via --input option.
+                                  Default: .fastq.gz
+
+--fq2_suffix                    : The suffix of FASTQ files (Paired-end reads
+                                  or R2 reads) if an input directory is
+                                  mentioned via --input option. Default:
+                                  _R2_001.fastq.gz
+
+--fq_filter_by_len              : Remove FASTQ reads that are less than this
+                                  many bases. Default: 0
+
+--fq_strandedness               : The strandedness of the sequencing run.
+                                  This is mostly needed if your sequencing
+                                  run is RNA-SEQ. For most of the other runs
+                                  , it is probably safe to use unstranded for
+                                  the option. Default: unstranded
+
+--fq_single_end                 : SINGLE-END information will be auto-
+                                  detected but this option forces PAIRED-END
+                                  FASTQ files to be treated as SINGLE-END so
+                                  only read 1 information is included in auto
+                                  -generated samplesheet. Default: true
+
+--fq_filename_delim             : Delimiter by which the file name is split
+                                  to obtain sample name. Default: _
+
+--fq_filename_delim_idx         : After splitting FASTQ file name by using
+                                  the --fq_filename_delim option, all
+                                  elements before this index (1-based) will
+                                  be joined to create final sample name.
+                                  Default: 1
+
+--bcs_concat_pe                 : Concatenate paired-end files. Default: true
+
+--bbmerge_run                   : Run BBMerge tool. Default: false
+
+--bbmerge_reads                 : Quit after this many read pairs (-1 means
+                                  all) Default: -1
+
+--bbmerge_adapters              : Absolute UNIX path pointing to the adapters
+                                  file in FASTA format. Default: false
+
+--bbmerge_ziplevel              : Set to 1 (lowest) through 9 (max) to change
+                                  compression level; lower compression is
+                                  faster. Default: 1
+
+--bbmerge_ordered               : Output reads in the same order as input.
+                                  Default: false
+
+--bbmerge_qtrim                 : Trim read ends to remove bases with quality
+                                  below --bbmerge_minq. Trims BEFORE merging
+                                  . Values: t (trim both ends), f (neither
+                                  end), r (right end only), l (left end only
+                                  ). Default: true
+
+--bbmerge_qtrim2                : May be specified instead of --bbmerge_qtrim
+                                  to perform trimming only if merging is
+                                  unsuccesful. then retry merging. Default:
+                                  false
+
+--bbmerge_trimq                 : Trim quality threshold. This may be comma-
+                                  delimited list (ascending) to try multiple
+                                  values. Default: 10
+
+--bbmerge_minlength             : (ml) Reads shorter than this after trimming
+                                  , but before merging, will be discarded.
+                                  Pairs will be discarded onlyif both are
+                                  shorter. Default: 1
+
+--bbmerge_tbo                   : (trimbyoverlap). Trim overlapping reads to
+                                  remove right most (3') non-overlaping
+                                  portion instead of joining Default: false
+
+--bbmerge_minavgquality         : (maq). Reads with average quality below
+                                  this after trimming will not be attempted
+                                  to merge. Default: 30
+
+--bbmerge_trimpolya             : Trim trailing poly-A tail from adapter
+                                  output. Only affects outadapter.  This also
+                                  trims poly-A followed by poly-G, which
+                                  occurs on NextSeq. Default: true
+
+--bbmerge_pfilter               : Ban improbable overlaps. Higher is more
+                                  strict. 0 will disable the filter; 1 will
+                                  allow only perfect overlaps. Default: 1
+
+--bbmerge_ouq                   : Calculate best overlap using quality values
+                                  . Default: false
+
+--bbmerge_owq                   : Calculate best overlap without using
+                                  quality values. Default: true
+
+--bbmerge_strict                : Decrease false positive rate and merging
+                                  rate. Default: false
+
+--bbmerge_verystrict            : Greatly decrease false positive rate and
+                                  merging rate. Default: false
+
+--bbmerge_ultrastrict           : Decrease false positive rate and merging
+                                  rate even more. Default: true
+
+--bbmerge_maxstrict             : Maxiamally decrease false positive rate and
+                                  merging rate. Default: false
+
+--bbmerge_loose                 : Increase false positive rate and merging
+                                  rate. Default: false
+
+--bbmerge_veryloose             : Greatly increase false positive rate and
+                                  merging rate. Default: false
+
+--bbmerge_ultraloose            : Increase false positive rate and merging
+                                  rate even more. Default: false
+
+--bbmerge_maxloose              : Maximally increase false positive rate and
+                                  merging rate. Default: false
+
+--bbmerge_fast                  : Fastest possible preset. Default: false
+
+--bbmerge_k                     : Kmer length.  31 (or less) is fastest and
+                                  uses the least memory, but higher values
+                                  may be more accurate. 60 tends to work well
+                                  for 150bp reads. Default: 60
+
+--bbmerge_prealloc              : Pre-allocate memory rather than dynamically
+                                  growing. Faster and more memory-efficient
+                                  for large datasets. A float fraction (0-1)
+                                  may be specified, default 1. Default: true
+
+--fastp_run                     : Run fastp tool. Default: true
+
+--fastp_failed_out              : Specify whether to store reads that cannot
+                                  pass the filters. Default: false
+
+--fastp_merged_out              : Specify whether to store merged output or
+                                  not. Default: false
+
+--fastp_overlapped_out          : For each read pair, output the overlapped
+                                  region if it has no mismatched base.
+                                  Default: false
+
+--fastp_6                       : Indicate that the input is using phred64
+                                  scoring (it'll be converted to phred33, so
+                                  the output will still be phred33). Default
+                                  : false
+
+--fastp_reads_to_process        : Specify how many reads/pairs are to be
+                                  processed. Default value 0 means process
+                                  all reads. Default: 0
+
+--fastp_fix_mgi_id              : The MGI FASTQ ID format is not compatible
+                                  with many BAM operation tools, enable this
+                                  option to fix it. Default: false
+
+--fastp_A                       : Disable adapter trimming. On by default.
+                                  Default: false
+
+--fastp_adapter_fasta           : Specify a FASTA file to trim both read1 and
+                                  read2 (if PE) by all the sequences in this
+                                  FASTA file. Default: false
+
+--fastp_f                       : Trim how many bases in front of read1.
+                                  Default: 0
+
+--fastp_t                       : Trim how many bases at the end of read1.
+                                  Default: 0
+
+--fastp_b                       : Max length of read1 after trimming. Default
+                                  : 0
+
+--fastp_F                       : Trim how many bases in front of read2.
+                                  Default: 0
+
+--fastp_T                       : Trim how many bases at the end of read2.
+                                  Default: 0
+
+--fastp_B                       : Max length of read2 after trimming. Default
+                                  : 0
+
+--fastp_dedup                   : Enable deduplication to drop the duplicated
+                                  reads/pairs. Default: true
+
+--fastp_dup_calc_accuracy       : Accuracy level to calculate duplication (1~
+                                  6), higher level uses more memory (1G, 2G,
+                                  4G, 8G, 16G, 24G). Default 1 for no-dedup
+                                  mode, and 3 for dedup mode. Default: 6
+
+--fastp_poly_g_min_len          : The minimum length to detect polyG in the
+                                  read tail. Default: 10
+
+--fastp_G                       : Disable polyG tail trimming. Default: true
+
+--fastp_x                       : Enable polyX trimming in 3' ends. Default:
+                                  false
+
+--fastp_poly_x_min_len          : The minimum length to detect polyX in the
+                                  read tail. Default: 10
+
+--fastp_cut_front               : Move a sliding window from front (5') to
+                                  tail, drop the bases in the window if its
+                                  mean quality < threshold, stop otherwise.
+                                  Default: true
+
+--fastp_cut_tail                : Move a sliding window from tail (3') to
+                                  front, drop the bases in the window if its
+                                  mean quality < threshold, stop otherwise.
+                                  Default: false
+
+--fastp_cut_right               : Move a sliding window from tail, drop the
+                                  bases in the window and the right part if
+                                  its mean quality < threshold, and then stop
+                                  . Default: true
+
+--fastp_W                       : Sliding window size shared by --
+                                  fastp_cut_front, --fastp_cut_tail and --
+                                  fastp_cut_right. Default: 20
+
+--fastp_M                       : The mean quality requirement shared by --
+                                  fastp_cut_front, --fastp_cut_tail and --
+                                  fastp_cut_right. Default: 30
+
+--fastp_q                       : The quality value below which a base should
+                                  is not qualified. Default: 30
+
+--fastp_u                       : What percent of bases are allowed to be
+                                  unqualified. Default: 40
+
+--fastp_n                       : How many N's can a read have. Default: 5
+
+--fastp_e                       : If the full reads' average quality is below
+                                  this value, then it is discarded. Default
+                                  : 0
+
+--fastp_l                       : Reads shorter than this length will be
+                                  discarded. Default: 35
+
+--fastp_max_len                 : Reads longer than this length will be
+                                  discarded. Default: 0
+
+--fastp_y                       : Enable low complexity filter. The
+                                  complexity is defined as the percentage of
+                                  bases that are different from its next base
+                                  (base[i] != base[i+1]). Default: true
+
+--fastp_Y                       : The threshold for low complexity filter (0~
+                                  100). Ex: A value of 30 means 30%
+                                  complexity is required. Default: 30
+
+--fastp_U                       : Enable Unique Molecular Identifier (UMI)
+                                  pre-processing. Default: false
+
+--fastp_umi_loc                 : Specify the location of UMI, can be one of
+                                  index1/index2/read1/read2/per_index/
+                                  per_read. Default: false
+
+--fastp_umi_len                 : If the UMI is in read1 or read2, its length
+                                  should be provided. Default: false
+
+--fastp_umi_prefix              : If specified, an underline will be used to
+                                  connect prefix and UMI (i.e. prefix=UMI,
+                                  UMI=AATTCG, final=UMI_AATTCG). Default:
+                                  false
+
+--fastp_umi_skip                : If the UMI is in read1 or read2, fastp can
+                                  skip several bases following the UMI.
+                                  Default: false
+
+--fastp_p                       : Enable overrepresented sequence analysis.
+                                  Default: true
+
+--fastp_P                       : One in this many number of reads will be
+                                  computed for overrepresentation analysis (1
+                                  ~10000), smaller is slower. Default: 20
+
+--fastp_use_custom_adapaters    : Use custom adapter FASTA with fastp on top
+                                  of built-in adapter sequence auto-detection
+                                  . Enabling this option will attempt to find
+                                  and remove all possible Illumina adapter
+                                  and primer sequences but will make the
+                                  workflow run slow. Default: false
+
+--mashscreen_run                : Run `mash screen` tool. Default: true
+
+--mashscreen_w                  : Winner-takes-all strategy for identity
+                                  estimates. After counting hashes for each
+                                  query, hashes that appear in multiple
+                                  queries will be removed from all except the
+                                  one with the best identity (ties broken by
+                                  larger query), and other identities will
+                                  be reduced. This removes output redundancy
+                                  , providing a rough compositional outline
+                                  .  Default: false
+
+--mashscreen_i                  : Minimum identity to report. Inclusive
+                                  unless set to zero, in which case only
+                                  identities greater than zero (i.e. with at
+                                  least one shared hash) will be reported.
+                                  Set to -1 to output everything. (-1-1).
+                                  Default: false
+
+--mashscreen_v                  : Maximum p-value to report (0-1). Default:
+                                  false
+
+--tuspy_run                     : Run the get_top_unique_mash_hits_genomes.py
+                                  script. Default: true
+
+--tuspy_s                       : Absolute UNIX path to metadata text file
+                                  with the field separator, | and 5 fields:
+                                  serotype|asm_lvl|asm_url|snp_cluster_idEx:
+                                  serotype=Derby,antigen_formula=4:f,g:-|
+                                  Scaffold|402440|ftp://...|PDS000096654.2.
+                                  Mentioning this option will create a pickle
+                                  file for the provided metadata and exits.
+                                  Default: false
+
+--tuspy_m                       : Absolute UNIX path to mash screen results
+                                  file. Default: false
+
+--tuspy_ps                      : Absolute UNIX Path to serialized metadata
+                                  object in a pickle file. Default: /hpc/db/
+                                  bettercallsal/latest/index_metadata/
+                                  per_snp_cluster.ACC2SERO.pickle
+
+--tuspy_gd                      : Absolute UNIX Path to directory containing
+                                  gzipped genome FASTA files. Default: /hpc/
+                                  db/bettercallsal/latest/scaffold_genomes
+
+--tuspy_gds                     : Genome FASTA file suffix to search for in
+                                  the genome directory. Default:
+                                  _scaffolded_genomic.fna.gz
+
+--tuspy_n                       : Return up to this many number of top N
+                                  unique genome accession hits. Default: 10
+
+--sourmashsketch_run            : Run `sourmash sketch dna` tool. Default:
+                                  true
+
+--sourmashsketch_mode           : Select which type of signatures to be
+                                  created: dna, protein, fromfile or
+                                  translate. Default: dna
+
+--sourmashsketch_p              : Signature parameters to use. Default: abund
+                                  ,scaled=1000,k=51,k=61,k=71
+
+--sourmashsketch_file           : <path>  A text file containing a list of
+                                  sequence files to load. Default: false
+
+--sourmashsketch_f              : Recompute signatures even if the file
+                                  exists. Default: false
+
+--sourmashsketch_merge          : Merge all input files into one signature
+                                  file with the specified name. Default:
+                                  false
+
+--sourmashsketch_singleton      : Compute a signature for each sequence
+                                  record individually. Default: true
+
+--sourmashsketch_name           : Name the signature generated from each file
+                                  after the first record in the file.
+                                  Default: false
+
+--sourmashsketch_randomize      : Shuffle the list of input files randomly.
+                                  Default: false
+
+--sourmashgather_run            : Run `sourmash gather` tool. Default: true
+
+--sourmashgather_n              : Number of results to report. By default,
+                                  will terminate at --sourmashgather_thr_bp
+                                  value. Default: false
+
+--sourmashgather_thr_bp         : Reporting threshold (in bp) for estimated
+                                  overlap with remaining query. Default:
+                                  false
+
+--sourmashgather_ignoreabn      : Do NOT use k-mer abundances if present.
+                                  Default: false
+
+--sourmashgather_prefetch       : Use prefetch before gather. Default: false
+
+--sourmashgather_noprefetch     : Do not use prefetch before gather. Default
+                                  : false
+
+--sourmashgather_ani_ci         : Output confidence intervals for ANI
+                                  estimates. Default: true
+
+--sourmashgather_k              : The k-mer size to select. Default: 71
+
+--sourmashgather_protein        : Choose a protein signature. Default: false
+
+--sourmashgather_noprotein      : Do not choose a protein signature. Default
+                                  : false
+
+--sourmashgather_dayhoff        : Choose Dayhoff-encoded amino acid
+                                  signatures. Default: false
+
+--sourmashgather_nodayhoff      : Do not choose Dayhoff-encoded amino acid
+                                  signatures. Default: false
+
+--sourmashgather_hp             : Choose hydrophobic-polar-encoded amino acid
+                                  signatures. Default: false
+
+--sourmashgather_nohp           : Do not choose hydrophobic-polar-encoded
+                                  amino acid signatures. Default: false
+
+--sourmashgather_dna            : Choose DNA signature. Default: true
+
+--sourmashgather_nodna          : Do not choose DNA signature. Default: false
+
+--sourmashgather_scaled         : Scaled value should be between 100 and 1e6
+                                  . Default: false
+
+--sourmashgather_inc_pat        : Search only signatures that match this
+                                  pattern in name, filename, or md5. Default
+                                  : false
+
+--sourmashgather_exc_pat        : Search only signatures that do not match
+                                  this pattern in name, filename, or md5.
+                                  Default: false
+
+--sourmashsearch_run            : Run `sourmash search` tool. Default: false
+
+--sourmashsearch_n              : Number of results to report. By default,
+                                  will terminate at --sourmashsearch_thr
+                                  value. Default: false
+
+--sourmashsearch_thr            : Reporting threshold (similarity) to return
+                                  results. Default: 0
+
+--sourmashsearch_contain        : Score based on containment rather than
+                                  similarity. Default: false
+
+--sourmashsearch_maxcontain     : Score based on max containment rather than
+                                  similarity. Default: false
+
+--sourmashsearch_ignoreabn      : Do NOT use k-mer abundances if present.
+                                  Default: true
+
+--sourmashsearch_ani_ci         : Output confidence intervals for ANI
+                                  estimates. Default: false
+
+--sourmashsearch_k              : The k-mer size to select. Default: 71
+
+--sourmashsearch_protein        : Choose a protein signature. Default: false
+
+--sourmashsearch_noprotein      : Do not choose a protein signature. Default
+                                  : false
+
+--sourmashsearch_dayhoff        : Choose Dayhoff-encoded amino acid
+                                  signatures. Default: false
+
+--sourmashsearch_nodayhoff      : Do not choose Dayhoff-encoded amino acid
+                                  signatures. Default: false
+
+--sourmashsearch_hp             : Choose hydrophobic-polar-encoded amino acid
+                                  signatures. Default: false
+
+--sourmashsearch_nohp           : Do not choose hydrophobic-polar-encoded
+                                  amino acid signatures. Default: false
+
+--sourmashsearch_dna            : Choose DNA signature. Default: true
+
+--sourmashsearch_nodna          : Do not choose DNA signature. Default: false
+
+--sourmashsearch_scaled         : Scaled value should be between 100 and 1e6
+                                  . Default: false
+
+--sourmashsearch_inc_pat        : Search only signatures that match this
+                                  pattern in name, filename, or md5. Default
+                                  : false
+
+--sourmashsearch_exc_pat        : Search only signatures that do not match
+                                  this pattern in name, filename, or md5.
+                                  Default: false
+
+--sfhpy_run                     : Run the sourmash_filter_hits.py script.
+                                  Default: true
+
+--sfhpy_fcn                     : Column name by which filtering of rows
+                                  should be applied. Default: f_match
+
+--sfhpy_fcv                     : Remove genomes whose match with the query
+                                  FASTQ is less than this much. Default: 0.1
+
+--sfhpy_gt                      : Apply greather than or equal to condition
+                                  on numeric values of --sfhpy_fcn column.
+                                  Default: true
+
+--sfhpy_lt                      : Apply less than or equal to condition on
+                                  numeric values of --sfhpy_fcn column.
+                                  Default: false
+
+--kmaindex_run                  : Run kma index tool. Default: true
+
+--kmaindex_t_db                 : Add to existing DB. Default: false
+
+--kmaindex_k                    : k-mer size. Default: 31
+
+--kmaindex_m                    : Minimizer size. Default: false
+
+--kmaindex_hc                   : Homopolymer compression. Default: false
+
+--kmaindex_ML                   : Minimum length of templates. Defaults to --
+                                  kmaindex_k Default: false
+
+--kmaindex_ME                   : Mega DB. Default: false
+
+--kmaindex_Sparse               : Make Sparse DB. Default: false
+
+--kmaindex_ht                   : Homology template. Default: false
+
+--kmaindex_hq                   : Homology query. Default: false
+
+--kmaindex_and                  : Both homology thresholds have to reach.
+                                  Default: false
+
+--kmaindex_nbp                  : No bias print. Default: false
+
+--kmaalign_run                  : Run kma tool. Default: true
+
+--kmaalign_int                  : Input file has interleaved reads.  Default
+                                  : false
+
+--kmaalign_ef                   : Output additional features. Default: false
+
+--kmaalign_vcf                  : Output vcf file. 2 to apply FT. Default:
+                                  false
+
+--kmaalign_sam                  : Output SAM, 4/2096 for mapped/aligned.
+                                  Default: false
+
+--kmaalign_nc                   : No consensus file. Default: true
+
+--kmaalign_na                   : No aln file. Default: true
+
+--kmaalign_nf                   : No frag file. Default: true
+
+--kmaalign_a                    : Output all template mappings. Default:
+                                  false
+
+--kmaalign_and                  : Use both -mrs and p-value on consensus.
+                                  Default: false
+
+--kmaalign_oa                   : Use neither -mrs or p-value on consensus.
+                                  Default: false
+
+--kmaalign_bc                   : Minimum support to call bases. Default:
+                                  false
+
+--kmaalign_bcNano               : Altered indel calling for ONT data. Default
+                                  : false
+
+--kmaalign_bcd                  : Minimum depth to call bases. Default: false
+
+--kmaalign_bcg                  : Maintain insignificant gaps. Default: false
+
+--kmaalign_ID                   : Minimum consensus ID. Default: false
+
+--kmaalign_md                   : Minimum depth. Default: false
+
+--kmaalign_dense                : Skip insertion in consensus. Default: false
+
+--kmaalign_ref_fsa              : Use Ns on indels. Default: false
+
+--kmaalign_Mt1                  : Map everything to one template. Default:
+                                  false
+
+--kmaalign_1t1                  : Map one query to one template. Default:
+                                  false
+
+--kmaalign_mrs                  : Minimum relative alignment score. Default:
+                                  false
+
+--kmaalign_mrc                  : Minimum query coverage. Default: 0.99
+
+--kmaalign_mp                   : Minimum phred score of trailing and leading
+                                  bases. Default: 30
+
+--kmaalign_mq                   : Set the minimum mapping quality. Default:
+                                  false
+
+--kmaalign_eq                   : Minimum average quality score. Default: 30
+
+--kmaalign_5p                   : Trim 5 prime by this many bases. Default:
+                                  false
+
+--kmaalign_3p                   : Trim 3 prime by this many bases Default:
+                                  false
+
+--kmaalign_apm                  : Sets both -pm and -fpm Default: false
+
+--kmaalign_cge                  : Set CGE penalties and rewards Default:
+                                  false
+
+--salmonidx_run                 : Run `salmon index` tool. Default: true
+
+--salmonidx_k                   : The size of k-mers that should be used for
+                                  the  quasi index. Default: false
+
+--salmonidx_gencode             : This flag will expect the input transcript
+                                  FASTA to be in GENCODE format, and will
+                                  split the transcript name at the first `|`
+                                  character. These reduced names will be used
+                                  in the output and when looking for these
+                                  transcripts in a gene to transcript GTF.
+                                  Default: false
+
+--salmonidx_features            : This flag will expect the input reference
+                                  to be in the tsv file format, and will
+                                  split the feature name at the first `tab`
+                                  character. These reduced names will be used
+                                  in the output and when looking for the
+                                  sequence of the features. GTF. Default:
+                                  false
+
+--salmonidx_keepDuplicates      : This flag will disable the default indexing
+                                  behavior of discarding sequence-identical
+                                  duplicate transcripts. If this flag is
+                                  passed then duplicate transcripts that
+                                  appear in the input will be retained and
+                                  quantified separately. Default: false
+
+--salmonidx_keepFixedFasta      : Retain the fixed fasta file (without short
+                                  transcripts and duplicates, clipped, etc.)
+                                  generated during indexing. Default: false
+
+--salmonidx_filterSize          : The size of the Bloom filter that will be
+                                  used by TwoPaCo during indexing. The filter
+                                  will be of size 2^{filterSize}. A value of
+                                  -1 means that the filter size will be
+                                  automatically set based on the number of
+                                  distinct k-mers in the input, as estimated
+                                  by nthll. Default: false
+
+--salmonidx_sparse              : Build the index using a sparse sampling of
+                                  k-mer positions This will require less
+                                  memory (especially during quantification),
+                                  but will take longer to constructand can
+                                  slow down mapping / alignment. Default:
+                                  false
+
+--salmonidx_n                   : Do not clip poly-A tails from the ends of
+                                  target sequences. Default: false
+
+--gsrpy_run                     : Run the gen_salmon_res_table.py script.
+                                  Default: true
+
+--gsrpy_url                     : Generate an additional column in final
+                                  results table which links out to NCBI
+                                  Pathogens Isolate Browser.  Default: true
+
+Help options                    :
+
+--help                          : Display this message.
+
+```
author	kkonganti
date	Mon, 05 Jun 2023 18:48:51 -0400
parents
children