Mercurial > repos > kkonganti > cfsan_bettercallsal

# bettercallsal

`bettercallsal` is an automated workflow to assign Salmonella serotype based on [NCBI Pathogens Database](https://www.ncbi.nlm.nih.gov/pathogens). It uses `MASH` to reduce the search space followed by additional genome filtering with `sourmash`. It then performs genome based alignment with `kma` followed by count generation using `salmon`. This workflow is especially useful in a case where a sample is of multi-serovar mixture.

\
&nbsp;

<!-- TOC -->

- [Minimum Requirements](#minimum-requirements)
- [Usage and Examples](#usage-and-examples)
  - [Database](#database)
  - [Input](#input)
  - [Output](#output)
  - [Computational resources](#computational-resources)
  - [Runtime profiles](#runtime-profiles)
  - [your_institution.config](#your_institutionconfig)
  - [Cloud computing](#cloud-computing)
  - [Example data](#example-data)
- [Using sourmash](#using-sourmash)
- [bettercallsal CLI Help](#bettercallsal-cli-help)

<!-- /TOC -->

\
&nbsp;

## Minimum Requirements

1. [Nextflow version 22.10.0](https://github.com/nextflow-io/nextflow/releases/download/v22.10.0/nextflow).
    - Make the `nextflow` binary executable (`chmod 755 nextflow`) and also make sure that it is made available in your `$PATH`.
    - If your existing `JAVA` install does not support the newest **Nextflow** version, you can try **Amazon**'s `JAVA` (OpenJDK):  [Corretto](https://corretto.aws/downloads/latest/amazon-corretto-17-x64-linux-jdk.tar.gz).
2. Either of `micromamba` or `docker` or `singularity` installed and made available in your `$PATH`.
    - Running the workflow via `micromamba` software provisioning is **preferred** as it does not require any `sudo` or `admin` privileges or any other configurations with respect to the various container providers.
    - To install `micromamba` for your system type, please follow these [installation steps](https://mamba.readthedocs.io/en/latest/installation.html#manual-installation) and make sure that the `micromamba` binary is made available in your `$PATH`.
    - Just the `curl` step is sufficient to download the binary as far as running the workflows are concerned.
3. Minimum of 10 CPU cores and about 16 GBs for main workflow steps. More memory may be required if your **FASTQ** files are big.

\
&nbsp;

## Usage and Examples

Clone or download this repository and then call `cpipes`.

```bash
cpipes --pipeline bettercallsal [options]
```

\
&nbsp;

**Example**: Run the default `bettercallsal` pipeline in single-end mode.

```bash
cd /data/scratch/$USER
mkdir nf-cpipes
cd nf-cpipes
cpipes
      --pipeline bettercallsal \
      --input /path/to/illumina/fastq/dir \
      --output /path/to/output \
      --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db
```

\
&nbsp;

**Example**: Run the `bettercallsal` pipeline in paired-end mode. In this mode, the `R1` and `R2` files are concatenated. We have found that concatenated reads yields better calling rates. Please refer to the **Methods** and the **Results** section in our [preprint](https://www.biorxiv.org/content/10.1101/2023.04.06.535929v1.full) for more information. Users can still choose to use `bbmerge.sh` by adding the following options on the command-line: `--bbmerge_run true --bcs_concat_pe false`.

```bash
cd /data/scratch/$USER
mkdir nf-cpipes
cd nf-cpipes
cpipes \
      --pipeline bettercallsal \
      --input /path/to/illumina/fastq/dir \
      --output /path/to/output \
      --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db \
      --fq_single_end false \
      --fq_suffix '_R1_001.fastq.gz'
```

\
&nbsp;

### Database

---

The successful run of the workflow requires certain database flat files specific for the workflow.

Please refer to `bettercallsal_db` [README](./bettercallsal_db.md) if you would like to run the workflow on the latest version of the **PDG** release.

&nbsp;

### Input

---

The input to the workflow is a folder containing compressed (`.gz`) FASTQ files. Please note that the sample grouping happens automatically by the file name of the FASTQ file. If for example, a single sample is sequenced across multiple sequencing lanes, you can choose to group those FASTQ files into one sample by using the `--fq_filename_delim` and `--fq_filename_delim_idx` options. By default, `--fq_filename_delim` is set to `_` (underscore) and `--fq_filename_delim_idx` is set to 1.

For example, if the directory contains FASTQ files as shown below:

- KB-01_apple_L001_R1.fastq.gz
- KB-01_apple_L001_R2.fastq.gz
- KB-01_apple_L002_R1.fastq.gz
- KB-01_apple_L002_R2.fastq.gz
- KB-02_mango_L001_R1.fastq.gz
- KB-02_mango_L001_R2.fastq.gz
- KB-02_mango_L002_R1.fastq.gz
- KB-02_mango_L002_R2.fastq.gz

Then, to create 2 sample groups, `apple` and `mango`, we split the file name by the delimitor (underscore in the case, which is default) and group by the first 2 words (`--fq_filename_delim_idx 2`).

This goes without saying that all the FASTQ files should have uniform naming patterns so that `--fq_filename_delim` and `--fq_filename_delim_idx` options do not have any adverse effect in collecting and creating a sample metadata sheet.

\
&nbsp;

### Output

---

All the outputs for each step are stored inside the folder mentioned with the `--output` option. A `multiqc_report.html` file inside the `bettercallsal-multiqc` folder can be opened in any browser on your local workstation which contains a consolidated brief report.

\
&nbsp;

### Computational resources

---

The workflow `bettercallsal` requires at least a minimum of 16 GBs of memory to successfully finish the workflow. By default, `bettercallsal` uses 10 CPU cores where possible. You can change this behavior and adjust the CPU cores with `--max_cpus` option.

\
&nbsp;

Example:

```bash
cpipes \
    --pipeline bettercallsal \
    --input /path/to/bettercallsal_sim_reads \
    --output /path/to/bettercallsal_sim_reads_output \
    --bcs_root_dbdir /path/to/PDG000000002.2537
    --kmaalign_ignorequals \
    --max_cpus 5 \
    -profile stdkondagac \
    -resume
```

\
&nbsp;

### Runtime profiles

---

You can use different run time profiles that suit your specific compute environments i.e., you can run the workflow locally on your machine or in a grid computing infrastructure.

\
&nbsp;

Example:

```bash
cd /data/scratch/$USER
mkdir nf-cpipes
cd nf-cpipes
cpipes \
    --pipeline bettercallsal \
    --input /path/to/fastq_pass_dir \
    --output /path/to/where/output/should/go \
    -profile your_institution
```

The above command would run the pipeline and store the output at the location per the `--output` flag and the **NEXTFLOW** reports are always stored in the current working directory from where `cpipes` is run. For example, for the above command, a directory called `CPIPES-bettercallsal` would hold all the **NEXTFLOW** related logs, reports and trace files.

\
&nbsp;

### `your_institution.config`

---

In the above example, we can see that we have mentioned the run time profile as `your_institution`. For this to work, add the following lines at the end of [`computeinfra.config`](../conf/computeinfra.config) file which should be located inside the `conf` folder. For example, if your institution uses **SGE** or **UNIVA** for grid computing instead of **SLURM** and has a job queue named `normal.q`, then add these lines:

\
&nbsp;

```groovy
your_institution {
    process.executor = 'sge'
    process.queue = 'normal.q'
    singularity.enabled = false
    singularity.autoMounts = true
    docker.enabled = false
    params.enable_conda = true
    conda.enabled = true
    conda.useMicromamba = true
    params.enable_module = false
}
```

In the above example, by default, all the software provisioning choices are disabled except `conda`. You can also choose to remove the `process.queue` line altogether and the `bettercallsal` workflow will request the appropriate memory and number of CPU cores automatically, which ranges from 1 CPU, 1 GB and 1 hour for job completion up to 10 CPU cores, 1 TB and 120 hours for job completion.

\
&nbsp;

### Cloud computing

---

You can run the workflow in the cloud (works only with proper set up of AWS resources). Add new run time profiles with required parameters per [Nextflow docs](https://www.nextflow.io/docs/latest/executor.html):

\
&nbsp;

Example:

```groovy
my_aws_batch {
    executor = 'awsbatch'
    queue = 'my-batch-queue'
    aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
    aws.batch.region = 'us-east-1'
    singularity.enabled = false
    singularity.autoMounts = true
    docker.enabled = true
    params.conda_enabled = false
    params.enable_module = false
}
```

\
&nbsp;

### Example data

---

After you make sure that you have all the [minimum requirements](#minimum-requirements) to run the workflow, you can try the `bettercallsal` pipeline on some simulated reads. The following input dataset contains simulated reads for `Montevideo` and `I 4,[5],12:i:-` in about roughly equal proportions.

- Download simulated reads: [S3](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/bettercallsal_sim_reads.tar.bz2) (~ 3 GB).
- Download pre-formatted test database: [S3](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/PDG000000002.2491.test-db.tar.bz2) (~ 75 MB). This test database works only with the simulated reads.
- Download pre-formatted full database (**Optional**): If you would like to do a complete run with your own **FASTQ** datasets, you can either create your own [database](./bettercallsal_db.md) or use [PDG000000002.2537](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/PDG000000002.2537.tar.bz2) version of the database (~ 37 GB).
- After succesful run of the workflow, your **MultiQC** report should look something like [this](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/bettercallsal_sim_reads_mqc.html).

Now run the workflow by ignoring quality values since these are simulated base qualities:

\
&nbsp;

```bash
cpipes \
    --pipeline bettercallsal \
    --input /path/to/bettercallsal_sim_reads \
    --output /path/to/bettercallsal_sim_reads_output \
    --bcs_root_dbdir /path/to/PDG000000002.2537
    --kmaalign_ignorequals \
    -profile stdkondagac \
    -resume
```

Please note that the run time profile `stdkondagac` will run jobs locally using `micromamba` for software provisioning. The first time you run the command, a new folder called `kondagac_cache` will be created and subsequent runs should use this `conda` cache.

\
&nbsp;

## Using `sourmash`

Beginning with `v0.3.0` of `bettercallsal` workflow, `sourmash` sketching is used to further narrow down possible serotype hits. It is **ON** by default. This will enable the generation of **ANI Containment** matrix for **Samples** vs **Genomes**. There may be multiple hits for the same serotype in the final **MultiQC** report as multiple genome accessions can belong to a single serotype.

You can turn **OFF** this feature with `--sourmashsketch_run false` option.

\
&nbsp;

## `bettercallsal` CLI Help

```text
[Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal --help
N E X T F L O W  ~  version 22.10.0
Launching `./bettercallsal/cpipes` [awesome_chandrasekhar] DSL2 - revision: 8da4e11078
================================================================================
             (o)
  ___  _ __   _  _ __    ___  ___
 / __|| '_ \ | || '_ \  / _ \/ __|
| (__ | |_) || || |_) ||  __/\__ \
 \___|| .__/ |_|| .__/  \___||___/
      | |       | |
      |_|       |_|
--------------------------------------------------------------------------------
A collection of modular pipelines at CFSAN, FDA.
--------------------------------------------------------------------------------
Name                            : CPIPES
Author                          : Kranti Konganti
Version                         : 0.5.0
Center                          : CFSAN, FDA.
================================================================================

Workflow                        : bettercallsal

Author                          : Kranti Konganti

Version                         : 0.5.0


Usage                           : cpipes --pipeline bettercallsal [options]


Required                        :

--input                         : Absolute path to directory containing FASTQ
                                  files. The directory should contain only
                                  FASTQ files as all the files within the
                                  mentioned directory will be read. Ex: --
                                  input /path/to/fastq_pass

--output                        : Absolute path to directory where all the
                                  pipeline outputs should be stored. Ex: --
                                  output /path/to/output

Other options                   :

--metadata                      : Absolute path to metadata CSV file
                                  containing five mandatory columns: sample,
                                  fq1,fq2,strandedness,single_end. The fq1
                                  and fq2 columns contain absolute paths to
                                  the FASTQ files. This option can be used in
                                  place of --input option. This is rare. Ex
                                  : --metadata samplesheet.csv

--fq_suffix                     : The suffix of FASTQ files (Unpaired reads
                                  or R1 reads or Long reads) if an input
                                  directory is mentioned via --input option.
                                  Default: .fastq.gz

--fq2_suffix                    : The suffix of FASTQ files (Paired-end reads
                                  or R2 reads) if an input directory is
                                  mentioned via --input option. Default:
                                  _R2_001.fastq.gz

--fq_filter_by_len              : Remove FASTQ reads that are less than this
                                  many bases. Default: 0

--fq_strandedness               : The strandedness of the sequencing run.
                                  This is mostly needed if your sequencing
                                  run is RNA-SEQ. For most of the other runs
                                  , it is probably safe to use unstranded for
                                  the option. Default: unstranded

--fq_single_end                 : SINGLE-END information will be auto-
                                  detected but this option forces PAIRED-END
                                  FASTQ files to be treated as SINGLE-END so
                                  only read 1 information is included in auto
                                  -generated samplesheet. Default: true

--fq_filename_delim             : Delimiter by which the file name is split
                                  to obtain sample name. Default: _

--fq_filename_delim_idx         : After splitting FASTQ file name by using
                                  the --fq_filename_delim option, all
                                  elements before this index (1-based) will
                                  be joined to create final sample name.
                                  Default: 1

--bcs_concat_pe                 : Concatenate paired-end files. Default: true

--bbmerge_run                   : Run BBMerge tool. Default: false

--bbmerge_reads                 : Quit after this many read pairs (-1 means
                                  all) Default: -1

--bbmerge_adapters              : Absolute UNIX path pointing to the adapters
                                  file in FASTA format. Default: false

--bbmerge_ziplevel              : Set to 1 (lowest) through 9 (max) to change
                                  compression level; lower compression is
                                  faster. Default: 1

--bbmerge_ordered               : Output reads in the same order as input.
                                  Default: false

--bbmerge_qtrim                 : Trim read ends to remove bases with quality
                                  below --bbmerge_minq. Trims BEFORE merging
                                  . Values: t (trim both ends), f (neither
                                  end), r (right end only), l (left end only
                                  ). Default: true

--bbmerge_qtrim2                : May be specified instead of --bbmerge_qtrim
                                  to perform trimming only if merging is
                                  unsuccesful. then retry merging. Default:
                                  false

--bbmerge_trimq                 : Trim quality threshold. This may be comma-
                                  delimited list (ascending) to try multiple
                                  values. Default: 10

--bbmerge_minlength             : (ml) Reads shorter than this after trimming
                                  , but before merging, will be discarded.
                                  Pairs will be discarded onlyif both are
                                  shorter. Default: 1

--bbmerge_tbo                   : (trimbyoverlap). Trim overlapping reads to
                                  remove right most (3') non-overlaping
                                  portion instead of joining Default: false

--bbmerge_minavgquality         : (maq). Reads with average quality below
                                  this after trimming will not be attempted
                                  to merge. Default: 30

--bbmerge_trimpolya             : Trim trailing poly-A tail from adapter
                                  output. Only affects outadapter.  This also
                                  trims poly-A followed by poly-G, which
                                  occurs on NextSeq. Default: true

--bbmerge_pfilter               : Ban improbable overlaps. Higher is more
                                  strict. 0 will disable the filter; 1 will
                                  allow only perfect overlaps. Default: 1

--bbmerge_ouq                   : Calculate best overlap using quality values
                                  . Default: false

--bbmerge_owq                   : Calculate best overlap without using
                                  quality values. Default: true

--bbmerge_strict                : Decrease false positive rate and merging
                                  rate. Default: false

--bbmerge_verystrict            : Greatly decrease false positive rate and
                                  merging rate. Default: false

--bbmerge_ultrastrict           : Decrease false positive rate and merging
                                  rate even more. Default: true

--bbmerge_maxstrict             : Maxiamally decrease false positive rate and
                                  merging rate. Default: false

--bbmerge_loose                 : Increase false positive rate and merging
                                  rate. Default: false

--bbmerge_veryloose             : Greatly increase false positive rate and
                                  merging rate. Default: false

--bbmerge_ultraloose            : Increase false positive rate and merging
                                  rate even more. Default: false

--bbmerge_maxloose              : Maximally increase false positive rate and
                                  merging rate. Default: false

--bbmerge_fast                  : Fastest possible preset. Default: false

--bbmerge_k                     : Kmer length.  31 (or less) is fastest and
                                  uses the least memory, but higher values
                                  may be more accurate. 60 tends to work well
                                  for 150bp reads. Default: 60

--bbmerge_prealloc              : Pre-allocate memory rather than dynamically
                                  growing. Faster and more memory-efficient
                                  for large datasets. A float fraction (0-1)
                                  may be specified, default 1. Default: true

--fastp_run                     : Run fastp tool. Default: true

--fastp_failed_out              : Specify whether to store reads that cannot
                                  pass the filters. Default: false

--fastp_merged_out              : Specify whether to store merged output or
                                  not. Default: false

--fastp_overlapped_out          : For each read pair, output the overlapped
                                  region if it has no mismatched base.
                                  Default: false

--fastp_6                       : Indicate that the input is using phred64
                                  scoring (it'll be converted to phred33, so
                                  the output will still be phred33). Default
                                  : false

--fastp_reads_to_process        : Specify how many reads/pairs are to be
                                  processed. Default value 0 means process
                                  all reads. Default: 0

--fastp_fix_mgi_id              : The MGI FASTQ ID format is not compatible
                                  with many BAM operation tools, enable this
                                  option to fix it. Default: false

--fastp_A                       : Disable adapter trimming. On by default.
                                  Default: false

--fastp_adapter_fasta           : Specify a FASTA file to trim both read1 and
                                  read2 (if PE) by all the sequences in this
                                  FASTA file. Default: false

--fastp_f                       : Trim how many bases in front of read1.
                                  Default: 0

--fastp_t                       : Trim how many bases at the end of read1.
                                  Default: 0

--fastp_b                       : Max length of read1 after trimming. Default
                                  : 0

--fastp_F                       : Trim how many bases in front of read2.
                                  Default: 0

--fastp_T                       : Trim how many bases at the end of read2.
                                  Default: 0

--fastp_B                       : Max length of read2 after trimming. Default
                                  : 0

--fastp_dedup                   : Enable deduplication to drop the duplicated
                                  reads/pairs. Default: true

--fastp_dup_calc_accuracy       : Accuracy level to calculate duplication (1~
                                  6), higher level uses more memory (1G, 2G,
                                  4G, 8G, 16G, 24G). Default 1 for no-dedup
                                  mode, and 3 for dedup mode. Default: 6

--fastp_poly_g_min_len          : The minimum length to detect polyG in the
                                  read tail. Default: 10

--fastp_G                       : Disable polyG tail trimming. Default: true

--fastp_x                       : Enable polyX trimming in 3' ends. Default:
                                  false

--fastp_poly_x_min_len          : The minimum length to detect polyX in the
                                  read tail. Default: 10

--fastp_cut_front               : Move a sliding window from front (5') to
                                  tail, drop the bases in the window if its
                                  mean quality < threshold, stop otherwise.
                                  Default: true

--fastp_cut_tail                : Move a sliding window from tail (3') to
                                  front, drop the bases in the window if its
                                  mean quality < threshold, stop otherwise.
                                  Default: false

--fastp_cut_right               : Move a sliding window from tail, drop the
                                  bases in the window and the right part if
                                  its mean quality < threshold, and then stop
                                  . Default: true

--fastp_W                       : Sliding window size shared by --
                                  fastp_cut_front, --fastp_cut_tail and --
                                  fastp_cut_right. Default: 20

--fastp_M                       : The mean quality requirement shared by --
                                  fastp_cut_front, --fastp_cut_tail and --
                                  fastp_cut_right. Default: 30

--fastp_q                       : The quality value below which a base should
                                  is not qualified. Default: 30

--fastp_u                       : What percent of bases are allowed to be
                                  unqualified. Default: 40

--fastp_n                       : How many N's can a read have. Default: 5

--fastp_e                       : If the full reads' average quality is below
                                  this value, then it is discarded. Default
                                  : 0

--fastp_l                       : Reads shorter than this length will be
                                  discarded. Default: 35

--fastp_max_len                 : Reads longer than this length will be
                                  discarded. Default: 0

--fastp_y                       : Enable low complexity filter. The
                                  complexity is defined as the percentage of
                                  bases that are different from its next base
                                  (base[i] != base[i+1]). Default: true

--fastp_Y                       : The threshold for low complexity filter (0~
                                  100). Ex: A value of 30 means 30%
                                  complexity is required. Default: 30

--fastp_U                       : Enable Unique Molecular Identifier (UMI)
                                  pre-processing. Default: false

--fastp_umi_loc                 : Specify the location of UMI, can be one of
                                  index1/index2/read1/read2/per_index/
                                  per_read. Default: false

--fastp_umi_len                 : If the UMI is in read1 or read2, its length
                                  should be provided. Default: false

--fastp_umi_prefix              : If specified, an underline will be used to
                                  connect prefix and UMI (i.e. prefix=UMI,
                                  UMI=AATTCG, final=UMI_AATTCG). Default:
                                  false

--fastp_umi_skip                : If the UMI is in read1 or read2, fastp can
                                  skip several bases following the UMI.
                                  Default: false

--fastp_p                       : Enable overrepresented sequence analysis.
                                  Default: true

--fastp_P                       : One in this many number of reads will be
                                  computed for overrepresentation analysis (1
                                  ~10000), smaller is slower. Default: 20

--fastp_use_custom_adapaters    : Use custom adapter FASTA with fastp on top
                                  of built-in adapter sequence auto-detection
                                  . Enabling this option will attempt to find
                                  and remove all possible Illumina adapter
                                  and primer sequences but will make the
                                  workflow run slow. Default: false

--mashscreen_run                : Run `mash screen` tool. Default: true

--mashscreen_w                  : Winner-takes-all strategy for identity
                                  estimates. After counting hashes for each
                                  query, hashes that appear in multiple
                                  queries will be removed from all except the
                                  one with the best identity (ties broken by
                                  larger query), and other identities will
                                  be reduced. This removes output redundancy
                                  , providing a rough compositional outline
                                  .  Default: false

--mashscreen_i                  : Minimum identity to report. Inclusive
                                  unless set to zero, in which case only
                                  identities greater than zero (i.e. with at
                                  least one shared hash) will be reported.
                                  Set to -1 to output everything. (-1-1).
                                  Default: false

--mashscreen_v                  : Maximum p-value to report (0-1). Default:
                                  false

--tuspy_run                     : Run the get_top_unique_mash_hits_genomes.py
                                  script. Default: true

--tuspy_s                       : Absolute UNIX path to metadata text file
                                  with the field separator, | and 5 fields:
                                  serotype|asm_lvl|asm_url|snp_cluster_idEx:
                                  serotype=Derby,antigen_formula=4:f,g:-|
                                  Scaffold|402440|ftp://...|PDS000096654.2.
                                  Mentioning this option will create a pickle
                                  file for the provided metadata and exits.
                                  Default: false

--tuspy_m                       : Absolute UNIX path to mash screen results
                                  file. Default: false

--tuspy_ps                      : Absolute UNIX Path to serialized metadata
                                  object in a pickle file. Default: /hpc/db/
                                  bettercallsal/latest/index_metadata/
                                  per_snp_cluster.ACC2SERO.pickle

--tuspy_gd                      : Absolute UNIX Path to directory containing
                                  gzipped genome FASTA files. Default: /hpc/
                                  db/bettercallsal/latest/scaffold_genomes

--tuspy_gds                     : Genome FASTA file suffix to search for in
                                  the genome directory. Default:
                                  _scaffolded_genomic.fna.gz

--tuspy_n                       : Return up to this many number of top N
                                  unique genome accession hits. Default: 10

--sourmashsketch_run            : Run `sourmash sketch dna` tool. Default:
                                  true

--sourmashsketch_mode           : Select which type of signatures to be
                                  created: dna, protein, fromfile or
                                  translate. Default: dna

--sourmashsketch_p              : Signature parameters to use. Default: abund
                                  ,scaled=1000,k=51,k=61,k=71

--sourmashsketch_file           : <path>  A text file containing a list of
                                  sequence files to load. Default: false

--sourmashsketch_f              : Recompute signatures even if the file
                                  exists. Default: false

--sourmashsketch_merge          : Merge all input files into one signature
                                  file with the specified name. Default:
                                  false

--sourmashsketch_singleton      : Compute a signature for each sequence
                                  record individually. Default: true

--sourmashsketch_name           : Name the signature generated from each file
                                  after the first record in the file.
                                  Default: false

--sourmashsketch_randomize      : Shuffle the list of input files randomly.
                                  Default: false

--sourmashgather_run            : Run `sourmash gather` tool. Default: true

--sourmashgather_n              : Number of results to report. By default,
                                  will terminate at --sourmashgather_thr_bp
                                  value. Default: false

--sourmashgather_thr_bp         : Reporting threshold (in bp) for estimated
                                  overlap with remaining query. Default:
                                  false

--sourmashgather_ignoreabn      : Do NOT use k-mer abundances if present.
                                  Default: false

--sourmashgather_prefetch       : Use prefetch before gather. Default: false

--sourmashgather_noprefetch     : Do not use prefetch before gather. Default
                                  : false

--sourmashgather_ani_ci         : Output confidence intervals for ANI
                                  estimates. Default: true

--sourmashgather_k              : The k-mer size to select. Default: 71

--sourmashgather_protein        : Choose a protein signature. Default: false

--sourmashgather_noprotein      : Do not choose a protein signature. Default
                                  : false

--sourmashgather_dayhoff        : Choose Dayhoff-encoded amino acid
                                  signatures. Default: false

--sourmashgather_nodayhoff      : Do not choose Dayhoff-encoded amino acid
                                  signatures. Default: false

--sourmashgather_hp             : Choose hydrophobic-polar-encoded amino acid
                                  signatures. Default: false

--sourmashgather_nohp           : Do not choose hydrophobic-polar-encoded
                                  amino acid signatures. Default: false

--sourmashgather_dna            : Choose DNA signature. Default: true

--sourmashgather_nodna          : Do not choose DNA signature. Default: false

--sourmashgather_scaled         : Scaled value should be between 100 and 1e6
                                  . Default: false

--sourmashgather_inc_pat        : Search only signatures that match this
                                  pattern in name, filename, or md5. Default
                                  : false

--sourmashgather_exc_pat        : Search only signatures that do not match
                                  this pattern in name, filename, or md5.
                                  Default: false

--sourmashsearch_run            : Run `sourmash search` tool. Default: false

--sourmashsearch_n              : Number of results to report. By default,
                                  will terminate at --sourmashsearch_thr
                                  value. Default: false

--sourmashsearch_thr            : Reporting threshold (similarity) to return
                                  results. Default: 0

--sourmashsearch_contain        : Score based on containment rather than
                                  similarity. Default: false

--sourmashsearch_maxcontain     : Score based on max containment rather than
                                  similarity. Default: false

--sourmashsearch_ignoreabn      : Do NOT use k-mer abundances if present.
                                  Default: true

--sourmashsearch_ani_ci         : Output confidence intervals for ANI
                                  estimates. Default: false

--sourmashsearch_k              : The k-mer size to select. Default: 71

--sourmashsearch_protein        : Choose a protein signature. Default: false

--sourmashsearch_noprotein      : Do not choose a protein signature. Default
                                  : false

--sourmashsearch_dayhoff        : Choose Dayhoff-encoded amino acid
                                  signatures. Default: false

--sourmashsearch_nodayhoff      : Do not choose Dayhoff-encoded amino acid
                                  signatures. Default: false

--sourmashsearch_hp             : Choose hydrophobic-polar-encoded amino acid
                                  signatures. Default: false

--sourmashsearch_nohp           : Do not choose hydrophobic-polar-encoded
                                  amino acid signatures. Default: false

--sourmashsearch_dna            : Choose DNA signature. Default: true

--sourmashsearch_nodna          : Do not choose DNA signature. Default: false

--sourmashsearch_scaled         : Scaled value should be between 100 and 1e6
                                  . Default: false

--sourmashsearch_inc_pat        : Search only signatures that match this
                                  pattern in name, filename, or md5. Default
                                  : false

--sourmashsearch_exc_pat        : Search only signatures that do not match
                                  this pattern in name, filename, or md5.
                                  Default: false

--sfhpy_run                     : Run the sourmash_filter_hits.py script.
                                  Default: true

--sfhpy_fcn                     : Column name by which filtering of rows
                                  should be applied. Default: f_match

--sfhpy_fcv                     : Remove genomes whose match with the query
                                  FASTQ is less than this much. Default: 0.1

--sfhpy_gt                      : Apply greather than or equal to condition
                                  on numeric values of --sfhpy_fcn column.
                                  Default: true

--sfhpy_lt                      : Apply less than or equal to condition on
                                  numeric values of --sfhpy_fcn column.
                                  Default: false

--kmaindex_run                  : Run kma index tool. Default: true

--kmaindex_t_db                 : Add to existing DB. Default: false

--kmaindex_k                    : k-mer size. Default: 31

--kmaindex_m                    : Minimizer size. Default: false

--kmaindex_hc                   : Homopolymer compression. Default: false

--kmaindex_ML                   : Minimum length of templates. Defaults to --
                                  kmaindex_k Default: false

--kmaindex_ME                   : Mega DB. Default: false

--kmaindex_Sparse               : Make Sparse DB. Default: false

--kmaindex_ht                   : Homology template. Default: false

--kmaindex_hq                   : Homology query. Default: false

--kmaindex_and                  : Both homology thresholds have to reach.
                                  Default: false

--kmaindex_nbp                  : No bias print. Default: false

--kmaalign_run                  : Run kma tool. Default: true

--kmaalign_int                  : Input file has interleaved reads.  Default
                                  : false

--kmaalign_ef                   : Output additional features. Default: false

--kmaalign_vcf                  : Output vcf file. 2 to apply FT. Default:
                                  false

--kmaalign_sam                  : Output SAM, 4/2096 for mapped/aligned.
                                  Default: false

--kmaalign_nc                   : No consensus file. Default: true

--kmaalign_na                   : No aln file. Default: true

--kmaalign_nf                   : No frag file. Default: true

--kmaalign_a                    : Output all template mappings. Default:
                                  false

--kmaalign_and                  : Use both -mrs and p-value on consensus.
                                  Default: false

--kmaalign_oa                   : Use neither -mrs or p-value on consensus.
                                  Default: false

--kmaalign_bc                   : Minimum support to call bases. Default:
                                  false

--kmaalign_bcNano               : Altered indel calling for ONT data. Default
                                  : false

--kmaalign_bcd                  : Minimum depth to call bases. Default: false

--kmaalign_bcg                  : Maintain insignificant gaps. Default: false

--kmaalign_ID                   : Minimum consensus ID. Default: false

--kmaalign_md                   : Minimum depth. Default: false

--kmaalign_dense                : Skip insertion in consensus. Default: false

--kmaalign_ref_fsa              : Use Ns on indels. Default: false

--kmaalign_Mt1                  : Map everything to one template. Default:
                                  false

--kmaalign_1t1                  : Map one query to one template. Default:
                                  false

--kmaalign_mrs                  : Minimum relative alignment score. Default:
                                  false

--kmaalign_mrc                  : Minimum query coverage. Default: 0.99

--kmaalign_mp                   : Minimum phred score of trailing and leading
                                  bases. Default: 30

--kmaalign_mq                   : Set the minimum mapping quality. Default:
                                  false

--kmaalign_eq                   : Minimum average quality score. Default: 30

--kmaalign_5p                   : Trim 5 prime by this many bases. Default:
                                  false

--kmaalign_3p                   : Trim 3 prime by this many bases Default:
                                  false

--kmaalign_apm                  : Sets both -pm and -fpm Default: false

--kmaalign_cge                  : Set CGE penalties and rewards Default:
                                  false

--salmonidx_run                 : Run `salmon index` tool. Default: true

--salmonidx_k                   : The size of k-mers that should be used for
                                  the  quasi index. Default: false

--salmonidx_gencode             : This flag will expect the input transcript
                                  FASTA to be in GENCODE format, and will
                                  split the transcript name at the first `|`
                                  character. These reduced names will be used
                                  in the output and when looking for these
                                  transcripts in a gene to transcript GTF.
                                  Default: false

--salmonidx_features            : This flag will expect the input reference
                                  to be in the tsv file format, and will
                                  split the feature name at the first `tab`
                                  character. These reduced names will be used
                                  in the output and when looking for the
                                  sequence of the features. GTF. Default:
                                  false

--salmonidx_keepDuplicates      : This flag will disable the default indexing
                                  behavior of discarding sequence-identical
                                  duplicate transcripts. If this flag is
                                  passed then duplicate transcripts that
                                  appear in the input will be retained and
                                  quantified separately. Default: false

--salmonidx_keepFixedFasta      : Retain the fixed fasta file (without short
                                  transcripts and duplicates, clipped, etc.)
                                  generated during indexing. Default: false

--salmonidx_filterSize          : The size of the Bloom filter that will be
                                  used by TwoPaCo during indexing. The filter
                                  will be of size 2^{filterSize}. A value of
                                  -1 means that the filter size will be
                                  automatically set based on the number of
                                  distinct k-mers in the input, as estimated
                                  by nthll. Default: false

--salmonidx_sparse              : Build the index using a sparse sampling of
                                  k-mer positions This will require less
                                  memory (especially during quantification),
                                  but will take longer to constructand can
                                  slow down mapping / alignment. Default:
                                  false

--salmonidx_n                   : Do not clip poly-A tails from the ends of
                                  target sequences. Default: false

--gsrpy_run                     : Run the gen_salmon_res_table.py script.
                                  Default: true

--gsrpy_url                     : Generate an additional column in final
                                  results table which links out to NCBI
                                  Pathogens Isolate Browser.  Default: true

Help options                    :

--help                          : Display this message.

```
author	kkonganti
date	Mon, 05 Jun 2023 18:48:51 -0400
parents
children