# bettercallsal

`bettercallsal` is an automated workflow that assigns *Salmonella* serotypes based on the [NCBI Pathogens Database](https://www.ncbi.nlm.nih.gov/pathogens). It uses `MASH` to reduce the search space, followed by additional genome filtering with `sourmash`. It then performs genome-based alignment with `kma`, followed by count generation using `salmon`. This workflow is especially useful when a sample contains a mixture of multiple serovars.

- [Minimum Requirements](#minimum-requirements)
- [Usage and Examples](#usage-and-examples)
  - [Database](#database)
  - [Input](#input)
  - [Output](#output)
  - [Computational resources](#computational-resources)
  - [Runtime profiles](#runtime-profiles)
  - [your_institution.config](#your_institutionconfig)
  - [Cloud computing](#cloud-computing)
  - [Example data](#example-data)
- [Using sourmash](#using-sourmash)
- [bettercallsal CLI Help](#bettercallsal-cli-help)

## Minimum Requirements

1. [Nextflow version 22.10.0](https://github.com/nextflow-io/nextflow/releases/download/v22.10.0/nextflow).
    - Make the `nextflow` binary executable (`chmod 755 nextflow`) and make sure that it is available in your `$PATH`.
    - If your existing `JAVA` install does not support the newest **Nextflow** version, you can try **Amazon**'s `JAVA` (OpenJDK): [Corretto](https://corretto.aws/downloads/latest/amazon-corretto-17-x64-linux-jdk.tar.gz).
2. Either `micromamba`, `docker` or `singularity` installed and available in your `$PATH`.
    - Running the workflow via `micromamba` software provisioning is **preferred** as it does not require any `sudo` or `admin` privileges or any other configuration with respect to the various container providers.
    - To install `micromamba` for your system type, please follow these [installation steps](https://mamba.readthedocs.io/en/latest/installation.html#manual-installation) and make sure that the `micromamba` binary is available in your `$PATH`.
    - Just the `curl` step is sufficient to download the binary as far as running the workflows is concerned.
3. A minimum of 10 CPU cores and about 16 GB of memory for the main workflow steps. More memory may be required if your **FASTQ** files are big.

## Usage and Examples

Clone or download this repository and then call `cpipes`.

```bash
cpipes --pipeline bettercallsal [options]
```

**Example**: Run the default `bettercallsal` pipeline in single-end mode.

```bash
cd /data/scratch/$USER
mkdir nf-cpipes
cd nf-cpipes
cpipes \
    --pipeline bettercallsal \
    --input /path/to/illumina/fastq/dir \
    --output /path/to/output \
    --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db
```

**Example**: Run the `bettercallsal` pipeline in paired-end mode. In this mode, the `R1` and `R2` files are concatenated. We have found that concatenated reads yield better calling rates. Please refer to the **Methods** and **Results** sections in our [preprint](https://www.biorxiv.org/content/10.1101/2023.04.06.535929v1.full) for more information.
Users can still choose to use `bbmerge.sh` by adding the following options on the command line: `--bbmerge_run true --bcs_concat_pe false`.

```bash
cd /data/scratch/$USER
mkdir nf-cpipes
cd nf-cpipes
cpipes \
    --pipeline bettercallsal \
    --input /path/to/illumina/fastq/dir \
    --output /path/to/output \
    --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db \
    --fq_single_end false \
    --fq_suffix '_R1_001.fastq.gz'
```

### Database

---

A successful run of the workflow requires certain database flat files specific to the workflow.

Please refer to the `bettercallsal_db` [README](./bettercallsal_db.md) if you would like to run the workflow on the latest version of the **PDG** release.

### Input

---

The input to the workflow is a folder containing compressed (`.gz`) FASTQ files. Please note that sample grouping happens automatically based on the FASTQ file names. For example, if a single sample is sequenced across multiple sequencing lanes, you can group those FASTQ files into one sample by using the `--fq_filename_delim` and `--fq_filename_delim_idx` options. By default, `--fq_filename_delim` is set to `_` (underscore) and `--fq_filename_delim_idx` is set to 1.
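
The grouping rule can be sketched in a few lines of shell. This is only an illustration, not the workflow's own code: `sample_name` is a hypothetical helper, and it assumes that the sample name is formed by joining the first `--fq_filename_delim_idx` delimiter-separated fields of the file name.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of cpipes): derive a sample name from a FASTQ
# file name by keeping the first IDX fields after splitting on DELIM.
# Assumption: the workflow joins the first IDX fields with the same delimiter.
sample_name() {
    local file="$1" delim="${2:-_}" idx="${3:-1}"
    printf '%s\n' "$file" | cut -d "$delim" -f "1-${idx}"
}

# Default settings (delimiter "_", index 1) keep only the first field.
sample_name "KB-01_apple_L001_R1.fastq.gz"          # prints KB-01

# With index 2, files from different lanes map to the same sample name.
sample_name "KB-01_apple_L001_R1.fastq.gz" "_" 2    # prints KB-01_apple
sample_name "KB-01_apple_L002_R1.fastq.gz" "_" 2    # prints KB-01_apple
```

Files that produce the same name under this rule would end up in the same sample group, which is why uniform file naming matters.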

For example, if the directory contains FASTQ files as shown below:

- KB-01_apple_L001_R1.fastq.gz
- KB-01_apple_L001_R2.fastq.gz
- KB-01_apple_L002_R1.fastq.gz
- KB-01_apple_L002_R2.fastq.gz
- KB-02_mango_L001_R1.fastq.gz
- KB-02_mango_L001_R2.fastq.gz
- KB-02_mango_L002_R1.fastq.gz
- KB-02_mango_L002_R2.fastq.gz

Then, to create 2 sample groups, `apple` and `mango`, we split the file name by the delimiter (underscore in this case, which is the default) and group by the first 2 words (`--fq_filename_delim_idx 2`).

It goes without saying that all the FASTQ files should follow a uniform naming pattern so that the `--fq_filename_delim` and `--fq_filename_delim_idx` options do not have any adverse effect on collecting the files and creating the sample metadata sheet.

### Output

---

All the outputs of each step are stored inside the folder specified with the `--output` option. The `multiqc_report.html` file inside the `bettercallsal-multiqc` folder contains a brief consolidated report and can be opened in any browser on your local workstation.

### Computational resources

---

The `bettercallsal` workflow requires at least 16 GB of memory to finish successfully. By default, `bettercallsal` uses 10 CPU cores where possible. You can adjust the number of CPU cores with the `--max_cpus` option.

Example:

```bash
cpipes \
    --pipeline bettercallsal \
    --input /path/to/bettercallsal_sim_reads \
    --output /path/to/bettercallsal_sim_reads_output \
    --bcs_root_dbdir /path/to/PDG000000002.2537 \
    --kmaalign_ignorequals \
    --max_cpus 5 \
    -profile stdkondagac \
    -resume
```

### Runtime profiles

---

You can use different run time profiles to suit your specific compute environment, i.e., you can run the workflow locally on your machine or on a grid computing infrastructure.

Example:

```bash
cd /data/scratch/$USER
mkdir nf-cpipes
cd nf-cpipes
cpipes \
    --pipeline bettercallsal \
    --input /path/to/fastq_pass_dir \
    --output /path/to/where/output/should/go \
    -profile your_institution
```

The above command runs the pipeline and stores the output at the location given by the `--output` flag. The **NEXTFLOW** reports are always stored in the current working directory from which `cpipes` is run. For example, for the above command, a directory called `CPIPES-bettercallsal` would hold all the **NEXTFLOW** related logs, reports and trace files.

### `your_institution.config`

---

In the above example, we specified the run time profile as `your_institution`.
For this to work, add the following lines at the end of the [`computeinfra.config`](../conf/computeinfra.config) file, which should be located inside the `conf` folder. For example, if your institution uses **SGE** or **UNIVA** for grid computing instead of **SLURM** and has a job queue named `normal.q`, then add these lines:

```groovy
your_institution {
    process.executor = 'sge'
    process.queue = 'normal.q'
    singularity.enabled = false
    singularity.autoMounts = true
    docker.enabled = false
    params.enable_conda = true
    conda.enabled = true
    conda.useMicromamba = true
    params.enable_module = false
}
```

In the above example, all the software provisioning choices are disabled except `conda`. You can also remove the `process.queue` line altogether, and the `bettercallsal` workflow will request the appropriate memory and number of CPU cores automatically, ranging from 1 CPU core, 1 GB and 1 hour for job completion up to 10 CPU cores, 1 TB and 120 hours for job completion.

### Cloud computing

---

You can run the workflow in the cloud (works only with proper setup of AWS resources).
Add new run time profiles with the required parameters per the [Nextflow docs](https://www.nextflow.io/docs/latest/executor.html):

Example:

```groovy
my_aws_batch {
    executor = 'awsbatch'
    queue = 'my-batch-queue'
    aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
    aws.batch.region = 'us-east-1'
    singularity.enabled = false
    singularity.autoMounts = true
    docker.enabled = true
    params.conda_enabled = false
    params.enable_module = false
}
```

### Example data

---

After you make sure that you have all the [minimum requirements](#minimum-requirements) to run the workflow, you can try the `bettercallsal` pipeline on some simulated reads. The following input dataset contains simulated reads for `Montevideo` and `I 4,[5],12:i:-` in roughly equal proportions.

- Download simulated reads: [S3](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/bettercallsal_sim_reads.tar.bz2) (~ 3 GB).
- Download pre-formatted test database: [S3](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/PDG000000002.2491.test-db.tar.bz2) (~ 75 MB). This test database works only with the simulated reads.
- Download pre-formatted full database (**Optional**): If you would like to do a complete run with your own **FASTQ** datasets, you can either create your own [database](./bettercallsal_db.md) or use the [PDG000000002.2537](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/PDG000000002.2537.tar.bz2) version of the database (~ 37 GB).
- After a successful run of the workflow, your **MultiQC** report should look something like [this](https://cfsan-pub-xfer.s3.amazonaws.com/Kranti.Konganti/bettercallsal/bettercallsal_sim_reads_mqc.html).

Now run the workflow, ignoring quality values since these are simulated base qualities:

```bash
cpipes \
    --pipeline bettercallsal \
    --input /path/to/bettercallsal_sim_reads \
    --output /path/to/bettercallsal_sim_reads_output \
    --bcs_root_dbdir /path/to/PDG000000002.2537 \
    --kmaalign_ignorequals \
    -profile stdkondagac \
    -resume
```

Please note that the run time profile `stdkondagac` will run jobs locally using `micromamba` for software provisioning. The first time you run the command, a new folder called `kondagac_cache` will be created and subsequent runs should use this `conda` cache.

## Using `sourmash`

Beginning with `v0.3.0` of the `bettercallsal` workflow, `sourmash` sketching is used to further narrow down possible serotype hits. It is **ON** by default. This enables the generation of an **ANI Containment** matrix for **Samples** vs **Genomes**. There may be multiple hits for the same serotype in the final **MultiQC** report, as multiple genome accessions can belong to a single serotype.

You can turn this feature **OFF** with the `--sourmashsketch_run false` option.
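
As a minimal sketch, the snippet below reuses the simulated-reads example and adds `--sourmashsketch_run false`. The paths are placeholders, and the command is assembled in an array and only printed, so that it can be reviewed before it is actually launched.

```bash
#!/usr/bin/env bash
# Placeholder paths; replace them with your own locations.
cmd=(
    cpipes
    --pipeline bettercallsal
    --input /path/to/bettercallsal_sim_reads
    --output /path/to/bettercallsal_sim_reads_output
    --bcs_root_dbdir /path/to/PDG000000002.2537
    --kmaalign_ignorequals
    --sourmashsketch_run false
    -profile stdkondagac
    -resume
)

# Print the assembled command for review (dry run).
printf '%s ' "${cmd[@]}"
printf '\n'

# Uncomment the next line to actually launch the workflow:
# "${cmd[@]}"
```

Keeping the arguments in an array avoids the line-continuation mistakes that are easy to make with long multi-line commands.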

## `bettercallsal` CLI Help

```text
[Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal --help
N E X T F L O W  ~  version 22.10.0
Launching `./bettercallsal/cpipes` [awesome_chandrasekhar] DSL2 - revision: 8da4e11078
================================================================================
             (o)
  ___  _ __   _  _ __    ___  ___
 / __|| '_ \ | || '_ \  / _ \/ __|
| (__ | |_) || || |_) ||  __/\__ \
 \___|| .__/ |_|| .__/  \___||___/
      | |       | |
      |_|       |_|
--------------------------------------------------------------------------------
A collection of modular pipelines at CFSAN, FDA.
--------------------------------------------------------------------------------
Name                            : CPIPES
Author                          : Kranti Konganti
Version                         : 0.5.0
Center                          : CFSAN, FDA.
================================================================================

Workflow                        : bettercallsal

Author                          : Kranti Konganti

Version                         : 0.5.0


Usage                           : cpipes --pipeline bettercallsal [options]


Required                        :

--input                         : Absolute path to directory containing FASTQ
                                  files. The directory should contain only
                                  FASTQ files as all the files within the
                                  mentioned directory will be read. Ex: --
                                  input /path/to/fastq_pass

--output                        : Absolute path to directory where all the
                                  pipeline outputs should be stored.
                                  Ex: --
                                  output /path/to/output

Other options                   :

--metadata                      : Absolute path to metadata CSV file
                                  containing five mandatory columns: sample,
                                  fq1,fq2,strandedness,single_end. The fq1
                                  and fq2 columns contain absolute paths to
                                  the FASTQ files. This option can be used in
                                  place of --input option. This is rare. Ex
                                  : --metadata samplesheet.csv

--fq_suffix                     : The suffix of FASTQ files (Unpaired reads
                                  or R1 reads or Long reads) if an input
                                  directory is mentioned via --input option.
                                  Default: .fastq.gz

--fq2_suffix                    : The suffix of FASTQ files (Paired-end reads
                                  or R2 reads) if an input directory is
                                  mentioned via --input option. Default:
                                  _R2_001.fastq.gz

--fq_filter_by_len              : Remove FASTQ reads that are less than this
                                  many bases. Default: 0

--fq_strandedness               : The strandedness of the sequencing run.
                                  This is mostly needed if your sequencing
                                  run is RNA-SEQ. For most of the other runs
                                  , it is probably safe to use unstranded for
                                  the option. Default: unstranded

--fq_single_end                 : SINGLE-END information will be auto-
                                  detected but this option forces PAIRED-END
                                  FASTQ files to be treated as SINGLE-END so
                                  only read 1 information is included in auto
                                  -generated samplesheet. Default: true

--fq_filename_delim             : Delimiter by which the file name is split
                                  to obtain sample name.
                                  Default: _

--fq_filename_delim_idx         : After splitting FASTQ file name by using
                                  the --fq_filename_delim option, all
                                  elements before this index (1-based) will
                                  be joined to create final sample name.
                                  Default: 1

--bcs_concat_pe                 : Concatenate paired-end files. Default: true

--bbmerge_run                   : Run BBMerge tool. Default: false

--bbmerge_reads                 : Quit after this many read pairs (-1 means
                                  all) Default: -1

--bbmerge_adapters              : Absolute UNIX path pointing to the adapters
                                  file in FASTA format. Default: false

--bbmerge_ziplevel              : Set to 1 (lowest) through 9 (max) to change
                                  compression level; lower compression is
                                  faster. Default: 1

--bbmerge_ordered               : Output reads in the same order as input.
                                  Default: false

--bbmerge_qtrim                 : Trim read ends to remove bases with quality
                                  below --bbmerge_minq. Trims BEFORE merging
                                  . Values: t (trim both ends), f (neither
                                  end), r (right end only), l (left end only
                                  ). Default: true

--bbmerge_qtrim2                : May be specified instead of --bbmerge_qtrim
                                  to perform trimming only if merging is
                                  unsuccesful. then retry merging. Default:
                                  false

--bbmerge_trimq                 : Trim quality threshold. This may be comma-
                                  delimited list (ascending) to try multiple
                                  values. Default: 10

--bbmerge_minlength             : (ml) Reads shorter than this after trimming
                                  , but before merging, will be discarded.
                                  Pairs will be discarded onlyif both are
                                  shorter. Default: 1

--bbmerge_tbo                   : (trimbyoverlap).
                                  Trim overlapping reads to
                                  remove right most (3') non-overlaping
                                  portion instead of joining Default: false

--bbmerge_minavgquality         : (maq). Reads with average quality below
                                  this after trimming will not be attempted
                                  to merge. Default: 30

--bbmerge_trimpolya             : Trim trailing poly-A tail from adapter
                                  output. Only affects outadapter. This also
                                  trims poly-A followed by poly-G, which
                                  occurs on NextSeq. Default: true

--bbmerge_pfilter               : Ban improbable overlaps. Higher is more
                                  strict. 0 will disable the filter; 1 will
                                  allow only perfect overlaps. Default: 1

--bbmerge_ouq                   : Calculate best overlap using quality values
                                  . Default: false

--bbmerge_owq                   : Calculate best overlap without using
                                  quality values. Default: true

--bbmerge_strict                : Decrease false positive rate and merging
                                  rate. Default: false

--bbmerge_verystrict            : Greatly decrease false positive rate and
                                  merging rate. Default: false

--bbmerge_ultrastrict           : Decrease false positive rate and merging
                                  rate even more. Default: true

--bbmerge_maxstrict             : Maxiamally decrease false positive rate and
                                  merging rate. Default: false

--bbmerge_loose                 : Increase false positive rate and merging
                                  rate. Default: false

--bbmerge_veryloose             : Greatly increase false positive rate and
                                  merging rate. Default: false

--bbmerge_ultraloose            : Increase false positive rate and merging
                                  rate even more.
                                  Default: false

--bbmerge_maxloose              : Maximally increase false positive rate and
                                  merging rate. Default: false

--bbmerge_fast                  : Fastest possible preset. Default: false

--bbmerge_k                     : Kmer length. 31 (or less) is fastest and
                                  uses the least memory, but higher values
                                  may be more accurate. 60 tends to work well
                                  for 150bp reads. Default: 60

--bbmerge_prealloc              : Pre-allocate memory rather than dynamically
                                  growing. Faster and more memory-efficient
                                  for large datasets. A float fraction (0-1)
                                  may be specified, default 1. Default: true

--fastp_run                     : Run fastp tool. Default: true

--fastp_failed_out              : Specify whether to store reads that cannot
                                  pass the filters. Default: false

--fastp_merged_out              : Specify whether to store merged output or
                                  not. Default: false

--fastp_overlapped_out          : For each read pair, output the overlapped
                                  region if it has no mismatched base.
                                  Default: false

--fastp_6                       : Indicate that the input is using phred64
                                  scoring (it'll be converted to phred33, so
                                  the output will still be phred33). Default
                                  : false

--fastp_reads_to_process        : Specify how many reads/pairs are to be
                                  processed. Default value 0 means process
                                  all reads. Default: 0

--fastp_fix_mgi_id              : The MGI FASTQ ID format is not compatible
                                  with many BAM operation tools, enable this
                                  option to fix it. Default: false

--fastp_A                       : Disable adapter trimming. On by default.
                                  Default: false

--fastp_adapter_fasta           : Specify a FASTA file to trim both read1 and
                                  read2 (if PE) by all the sequences in this
                                  FASTA file. Default: false

--fastp_f                       : Trim how many bases in front of read1.
                                  Default: 0

--fastp_t                       : Trim how many bases at the end of read1.
                                  Default: 0

--fastp_b                       : Max length of read1 after trimming. Default
                                  : 0

--fastp_F                       : Trim how many bases in front of read2.
                                  Default: 0

--fastp_T                       : Trim how many bases at the end of read2.
                                  Default: 0

--fastp_B                       : Max length of read2 after trimming. Default
                                  : 0

--fastp_dedup                   : Enable deduplication to drop the duplicated
                                  reads/pairs. Default: true

--fastp_dup_calc_accuracy       : Accuracy level to calculate duplication (1~
                                  6), higher level uses more memory (1G, 2G,
                                  4G, 8G, 16G, 24G). Default 1 for no-dedup
                                  mode, and 3 for dedup mode. Default: 6

--fastp_poly_g_min_len          : The minimum length to detect polyG in the
                                  read tail. Default: 10

--fastp_G                       : Disable polyG tail trimming. Default: true

--fastp_x                       : Enable polyX trimming in 3' ends. Default:
                                  false

--fastp_poly_x_min_len          : The minimum length to detect polyX in the
                                  read tail. Default: 10

--fastp_cut_front               : Move a sliding window from front (5') to
                                  tail, drop the bases in the window if its
                                  mean quality < threshold, stop otherwise.
                                  Default: true

--fastp_cut_tail                : Move a sliding window from tail (3') to
                                  front, drop the bases in the window if its
                                  mean quality < threshold, stop otherwise.
                                  Default: false

--fastp_cut_right               : Move a sliding window from tail, drop the
                                  bases in the window and the right part if
                                  its mean quality < threshold, and then stop
                                  . Default: true

--fastp_W                       : Sliding window size shared by --
                                  fastp_cut_front, --fastp_cut_tail and --
                                  fastp_cut_right. Default: 20

--fastp_M                       : The mean quality requirement shared by --
                                  fastp_cut_front, --fastp_cut_tail and --
                                  fastp_cut_right. Default: 30

--fastp_q                       : The quality value below which a base should
                                  is not qualified. Default: 30

--fastp_u                       : What percent of bases are allowed to be
                                  unqualified. Default: 40

--fastp_n                       : How many N's can a read have. Default: 5

--fastp_e                       : If the full reads' average quality is below
                                  this value, then it is discarded. Default
                                  : 0

--fastp_l                       : Reads shorter than this length will be
                                  discarded. Default: 35

--fastp_max_len                 : Reads longer than this length will be
                                  discarded. Default: 0

--fastp_y                       : Enable low complexity filter. The
                                  complexity is defined as the percentage of
                                  bases that are different from its next base
                                  (base[i] != base[i+1]). Default: true

--fastp_Y                       : The threshold for low complexity filter (0~
                                  100). Ex: A value of 30 means 30%
                                  complexity is required.
                                  Default: 30

--fastp_U                       : Enable Unique Molecular Identifier (UMI)
                                  pre-processing. Default: false

--fastp_umi_loc                 : Specify the location of UMI, can be one of
                                  index1/index2/read1/read2/per_index/
                                  per_read. Default: false

--fastp_umi_len                 : If the UMI is in read1 or read2, its length
                                  should be provided. Default: false

--fastp_umi_prefix              : If specified, an underline will be used to
                                  connect prefix and UMI (i.e. prefix=UMI,
                                  UMI=AATTCG, final=UMI_AATTCG). Default:
                                  false

--fastp_umi_skip                : If the UMI is in read1 or read2, fastp can
                                  skip several bases following the UMI.
                                  Default: false

--fastp_p                       : Enable overrepresented sequence analysis.
                                  Default: true

--fastp_P                       : One in this many number of reads will be
                                  computed for overrepresentation analysis (1
                                  ~10000), smaller is slower. Default: 20

--fastp_use_custom_adapaters    : Use custom adapter FASTA with fastp on top
                                  of built-in adapter sequence auto-detection
                                  . Enabling this option will attempt to find
                                  and remove all possible Illumina adapter
                                  and primer sequences but will make the
                                  workflow run slow. Default: false

--mashscreen_run                : Run `mash screen` tool. Default: true

--mashscreen_w                  : Winner-takes-all strategy for identity
                                  estimates. After counting hashes for each
                                  query, hashes that appear in multiple
                                  queries will be removed from all except the
                                  one with the best identity (ties broken by
                                  larger query), and other identities will
                                  be reduced.
                                  This removes output redundancy
                                  , providing a rough compositional outline
                                  . Default: false

--mashscreen_i                  : Minimum identity to report. Inclusive
                                  unless set to zero, in which case only
                                  identities greater than zero (i.e. with at
                                  least one shared hash) will be reported.
                                  Set to -1 to output everything. (-1-1).
                                  Default: false

--mashscreen_v                  : Maximum p-value to report (0-1). Default:
                                  false

--tuspy_run                     : Run the get_top_unique_mash_hits_genomes.py
                                  script. Default: true

--tuspy_s                       : Absolute UNIX path to metadata text file
                                  with the field separator, | and 5 fields:
                                  serotype|asm_lvl|asm_url|snp_cluster_idEx:
                                  serotype=Derby,antigen_formula=4:f,g:-|
                                  Scaffold|402440|ftp://...|PDS000096654.2.
                                  Mentioning this option will create a pickle
                                  file for the provided metadata and exits.
                                  Default: false

--tuspy_m                       : Absolute UNIX path to mash screen results
                                  file. Default: false

--tuspy_ps                      : Absolute UNIX Path to serialized metadata
                                  object in a pickle file. Default: /hpc/db/
                                  bettercallsal/latest/index_metadata/
                                  per_snp_cluster.ACC2SERO.pickle

--tuspy_gd                      : Absolute UNIX Path to directory containing
                                  gzipped genome FASTA files. Default: /hpc/
                                  db/bettercallsal/latest/scaffold_genomes

--tuspy_gds                     : Genome FASTA file suffix to search for in
                                  the genome directory. Default:
                                  _scaffolded_genomic.fna.gz

--tuspy_n                       : Return up to this many number of top N
                                  unique genome accession hits.
                                  Default: 10

--sourmashsketch_run            : Run `sourmash sketch dna` tool. Default:
                                  true

--sourmashsketch_mode           : Select which type of signatures to be
                                  created: dna, protein, fromfile or
                                  translate. Default: dna

--sourmashsketch_p              : Signature parameters to use. Default: abund
                                  ,scaled=1000,k=51,k=61,k=71

--sourmashsketch_file           : A text file containing a list of
                                  sequence files to load. Default: false

--sourmashsketch_f              : Recompute signatures even if the file
                                  exists. Default: false

--sourmashsketch_merge          : Merge all input files into one signature
                                  file with the specified name. Default:
                                  false

--sourmashsketch_singleton      : Compute a signature for each sequence
                                  record individually. Default: true

--sourmashsketch_name           : Name the signature generated from each file
                                  after the first record in the file.
                                  Default: false

--sourmashsketch_randomize      : Shuffle the list of input files randomly.
                                  Default: false

--sourmashgather_run            : Run `sourmash gather` tool. Default: true

--sourmashgather_n              : Number of results to report. By default,
                                  will terminate at --sourmashgather_thr_bp
                                  value. Default: false

--sourmashgather_thr_bp         : Reporting threshold (in bp) for estimated
                                  overlap with remaining query. Default:
                                  false

--sourmashgather_ignoreabn      : Do NOT use k-mer abundances if present.
                                  Default: false

--sourmashgather_prefetch       : Use prefetch before gather.
                                Default: false

--sourmashgather_noprefetch   : Do not use prefetch before gather.
                                Default: false

--sourmashgather_ani_ci       : Output confidence intervals for ANI
                                estimates. Default: true

--sourmashgather_k            : The k-mer size to select. Default: 71

--sourmashgather_protein      : Choose a protein signature. Default: false

--sourmashgather_noprotein    : Do not choose a protein signature.
                                Default: false

--sourmashgather_dayhoff      : Choose Dayhoff-encoded amino acid
                                signatures. Default: false

--sourmashgather_nodayhoff    : Do not choose Dayhoff-encoded amino acid
                                signatures. Default: false

--sourmashgather_hp           : Choose hydrophobic-polar-encoded amino acid
                                signatures. Default: false

--sourmashgather_nohp         : Do not choose hydrophobic-polar-encoded
                                amino acid signatures. Default: false

--sourmashgather_dna          : Choose DNA signature. Default: true

--sourmashgather_nodna        : Do not choose DNA signature. Default: false

--sourmashgather_scaled       : Scaled value should be between 100 and 1e6.
                                Default: false

--sourmashgather_inc_pat      : Search only signatures that match this
                                pattern in name, filename, or md5.
                                Default: false

--sourmashgather_exc_pat      : Search only signatures that do not match
                                this pattern in name, filename, or md5.
                                Default: false

--sourmashsearch_run          : Run `sourmash search` tool. Default: false

--sourmashsearch_n            : Number of results to report. By default,
                                will terminate at --sourmashsearch_thr
                                value.
                                Default: false

--sourmashsearch_thr          : Reporting threshold (similarity) to return
                                results. Default: 0

--sourmashsearch_contain      : Score based on containment rather than
                                similarity. Default: false

--sourmashsearch_maxcontain   : Score based on max containment rather than
                                similarity. Default: false

--sourmashsearch_ignoreabn    : Do NOT use k-mer abundances if present.
                                Default: true

--sourmashsearch_ani_ci       : Output confidence intervals for ANI
                                estimates. Default: false

--sourmashsearch_k            : The k-mer size to select. Default: 71

--sourmashsearch_protein      : Choose a protein signature. Default: false

--sourmashsearch_noprotein    : Do not choose a protein signature.
                                Default: false

--sourmashsearch_dayhoff      : Choose Dayhoff-encoded amino acid
                                signatures. Default: false

--sourmashsearch_nodayhoff    : Do not choose Dayhoff-encoded amino acid
                                signatures. Default: false

--sourmashsearch_hp           : Choose hydrophobic-polar-encoded amino acid
                                signatures. Default: false

--sourmashsearch_nohp         : Do not choose hydrophobic-polar-encoded
                                amino acid signatures. Default: false

--sourmashsearch_dna          : Choose DNA signature. Default: true

--sourmashsearch_nodna        : Do not choose DNA signature. Default: false

--sourmashsearch_scaled       : Scaled value should be between 100 and 1e6.
                                Default: false

--sourmashsearch_inc_pat      : Search only signatures that match this
                                pattern in name, filename, or md5.
                                Default: false

--sourmashsearch_exc_pat      : Search only signatures that do not match
                                this pattern in name, filename, or md5.
                                Default: false

--sfhpy_run                   : Run the sourmash_filter_hits.py script.
                                Default: true

--sfhpy_fcn                   : Column name by which filtering of rows
                                should be applied. Default: f_match

--sfhpy_fcv                   : Remove genomes whose match with the query
                                FASTQ is less than this much. Default: 0.1

--sfhpy_gt                    : Apply greater than or equal to condition
                                on numeric values of --sfhpy_fcn column.
                                Default: true

--sfhpy_lt                    : Apply less than or equal to condition on
                                numeric values of --sfhpy_fcn column.
                                Default: false

--kmaindex_run                : Run kma index tool. Default: true

--kmaindex_t_db               : Add to existing DB. Default: false

--kmaindex_k                  : k-mer size. Default: 31

--kmaindex_m                  : Minimizer size. Default: false

--kmaindex_hc                 : Homopolymer compression. Default: false

--kmaindex_ML                 : Minimum length of templates. Defaults to
                                --kmaindex_k. Default: false

--kmaindex_ME                 : Mega DB. Default: false

--kmaindex_Sparse             : Make Sparse DB. Default: false

--kmaindex_ht                 : Homology template. Default: false

--kmaindex_hq                 : Homology query. Default: false

--kmaindex_and                : Both homology thresholds have to reach.
                                Default: false

--kmaindex_nbp                : No bias print. Default: false

--kmaalign_run                : Run kma tool.
                                Default: true

--kmaalign_int                : Input file has interleaved reads.
                                Default: false

--kmaalign_ef                 : Output additional features. Default: false

--kmaalign_vcf                : Output vcf file. 2 to apply FT. Default:
                                false

--kmaalign_sam                : Output SAM, 4/2096 for mapped/aligned.
                                Default: false

--kmaalign_nc                 : No consensus file. Default: true

--kmaalign_na                 : No aln file. Default: true

--kmaalign_nf                 : No frag file. Default: true

--kmaalign_a                  : Output all template mappings. Default:
                                false

--kmaalign_and                : Use both -mrs and p-value on consensus.
                                Default: false

--kmaalign_oa                 : Use neither -mrs nor p-value on consensus.
                                Default: false

--kmaalign_bc                 : Minimum support to call bases. Default:
                                false

--kmaalign_bcNano             : Altered indel calling for ONT data.
                                Default: false

--kmaalign_bcd                : Minimum depth to call bases. Default: false

--kmaalign_bcg                : Maintain insignificant gaps. Default: false

--kmaalign_ID                 : Minimum consensus ID. Default: false

--kmaalign_md                 : Minimum depth. Default: false

--kmaalign_dense              : Skip insertion in consensus. Default: false

--kmaalign_ref_fsa            : Use Ns on indels. Default: false

--kmaalign_Mt1                : Map everything to one template. Default:
                                false

--kmaalign_1t1                : Map one query to one template. Default:
                                false

--kmaalign_mrs                : Minimum relative alignment score.
                                Default: false

--kmaalign_mrc                : Minimum query coverage. Default: 0.99

--kmaalign_mp                 : Minimum phred score of trailing and leading
                                bases. Default: 30

--kmaalign_mq                 : Set the minimum mapping quality. Default:
                                false

--kmaalign_eq                 : Minimum average quality score. Default: 30

--kmaalign_5p                 : Trim 5 prime by this many bases. Default:
                                false

--kmaalign_3p                 : Trim 3 prime by this many bases. Default:
                                false

--kmaalign_apm                : Sets both -pm and -fpm. Default: false

--kmaalign_cge                : Set CGE penalties and rewards. Default:
                                false

--salmonidx_run               : Run `salmon index` tool. Default: true

--salmonidx_k                 : The size of k-mers that should be used for
                                the quasi index. Default: false

--salmonidx_gencode           : This flag will expect the input transcript
                                FASTA to be in GENCODE format, and will
                                split the transcript name at the first `|`
                                character. These reduced names will be used
                                in the output and when looking for these
                                transcripts in a gene to transcript GTF.
                                Default: false

--salmonidx_features          : This flag will expect the input reference
                                to be in the tsv file format, and will
                                split the feature name at the first `tab`
                                character. These reduced names will be used
                                in the output and when looking for the
                                sequence of the features. GTF. Default:
                                false

--salmonidx_keepDuplicates    : This flag will disable the default indexing
                                behavior of discarding sequence-identical
                                duplicate transcripts.
                                If this flag is
                                passed then duplicate transcripts that
                                appear in the input will be retained and
                                quantified separately. Default: false

--salmonidx_keepFixedFasta    : Retain the fixed fasta file (without short
                                transcripts and duplicates, clipped, etc.)
                                generated during indexing. Default: false

--salmonidx_filterSize        : The size of the Bloom filter that will be
                                used by TwoPaCo during indexing. The filter
                                will be of size 2^{filterSize}. A value of
                                -1 means that the filter size will be
                                automatically set based on the number of
                                distinct k-mers in the input, as estimated
                                by nthll. Default: false

--salmonidx_sparse            : Build the index using a sparse sampling of
                                k-mer positions. This will require less
                                memory (especially during quantification),
                                but will take longer to construct and can
                                slow down mapping / alignment. Default:
                                false

--salmonidx_n                 : Do not clip poly-A tails from the ends of
                                target sequences. Default: false

--gsrpy_run                   : Run the gen_salmon_res_table.py script.
                                Default: true

--gsrpy_url                   : Generate an additional column in final
                                results table which links out to NCBI
                                Pathogens Isolate Browser. Default: true

Help options                  :

--help                        : Display this message.

```
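
\
&nbsp;

**Example**: Any option listed in the help text above can be appended to the `cpipes` call to override its default. The following is a minimal sketch, not a recommended configuration: the values for `--tuspy_n`, `--sfhpy_fcv` and `--kmaalign_mrc` are illustrative only, the paths are placeholders, and `build_cmd` is a hypothetical helper that is not part of `cpipes`. It prints the assembled command for review instead of running it.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of cpipes): assemble a command line that
# overrides three defaults shown in the help text.
#   --tuspy_n       default 10   -> 20   (illustrative value)
#   --sfhpy_fcv     default 0.1  -> 0.2  (illustrative value)
#   --kmaalign_mrc  default 0.99 -> 0.95 (illustrative value)
build_cmd() {
    local -a args=(
        cpipes
        --pipeline bettercallsal
        --input /path/to/illumina/fastq/dir
        --output /path/to/output
        --tuspy_n 20
        --sfhpy_fcv 0.2
        --kmaalign_mrc 0.95
    )
    # Join the array elements with single spaces and print one line.
    printf '%s\n' "${args[*]}"
}

# Dry run: show the command rather than executing it.
build_cmd
```

Once the printed command looks right, it can be pasted into the shell, adding `--bcs_root_dbdir` as in the usage examples if the database is not at the default location.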