# cronology

`cronology` is an automated workflow for **_Cronobacter_** whole genome sequence assembly, subtyping and traceback, based on the [NCBI Pathogen Detection](https://www.ncbi.nlm.nih.gov/pathogens) Project for [Cronobacter](https://www.ncbi.nlm.nih.gov/pathogens/isolates/#taxgroup_name:%22Cronobacter%22). It uses `fastp` for read quality control, `shovill` and `polypolish` for **_de novo_** assembly and genome polishing, `prokka` for gene prediction and annotation, and `quast.py` for assembly quality metrics. User(s) can choose a gold-standard reference genome as a model during the gene prediction step with `prokka`. By default, `GCF_003516125` (**_Cronobacter sakazakii_**) is used.

In parallel, for each isolate, a whole genome-based traceback analysis (genome distances) is performed using `mash` and `mashtree`, and the results are saved as a phylogenetic tree in `newick` format. The accompanying metadata generated can be uploaded to [iTOL](https://itol.embl.de/) for tree visualization.

User(s) can also run a pangenome analysis using `pirate`, but this will considerably increase the run time of the pipeline if the input has more than ~50 samples.

\
&nbsp;

<!-- TOC -->

- [Minimum Requirements](#minimum-requirements)
- [CFSAN GalaxyTrakr](#cfsan-galaxytrakr)
- [Usage and Examples](#usage-and-examples)
  - [Database](#database)
  - [Input](#input)
  - [Output](#output)
  - [Computational resources](#computational-resources)
  - [Runtime profiles](#runtime-profiles)
  - [your_institution.config](#your_institutionconfig)
  - [Cloud computing](#cloud-computing)
  - [Example data](#example-data)
- [cronology CLI Help](#cronology-cli-help)

<!-- /TOC -->

\
&nbsp;

## Minimum Requirements

1. [Nextflow version 23.04.3](https://github.com/nextflow-io/nextflow/releases/download/v23.04.3/nextflow).
   - Make the `nextflow` binary executable (`chmod 755 nextflow`) and make sure that it is available in your `$PATH`.
   - If your existing `JAVA` install does not support the newest **Nextflow** version, you can try **Amazon**'s `JAVA` (OpenJDK): [Corretto](https://corretto.aws/downloads/latest/amazon-corretto-17-x64-linux-jdk.tar.gz).
2. Any one of `micromamba` (version `1.0.0`), `docker`, or `singularity` installed and available in your `$PATH` (see the setup sketch after this list).
   - Running the workflow via `micromamba` software provisioning is **preferred** as it does not require any `sudo` or `admin` privileges or any other configuration with respect to the various container providers.
   - To install `micromamba` for your system type, please follow these [installation steps](https://mamba.readthedocs.io/en/latest/installation/micromamba-installation.html#linux-and-macos) and make sure that the `micromamba` binary is available in your `$PATH`.
   - Just the `curl` step is sufficient to download the binary as far as running the workflows is concerned.
   - Once you have finished the installation, **it is important that you downgrade `micromamba` to version `1.0.0`**:

   ```bash
   micromamba self-update --version 1.0.0
   ```

3. A minimum of 10 CPU cores and about 60 GB of memory for the main workflow steps. More memory may be required if your **FASTQ** files are large.
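
For reference, setting up the requirements above might look like the following on a typical Linux machine. This is only a minimal sketch: it assumes you keep the binaries in `$HOME/bin` (an arbitrary choice) and that `micromamba` has already been installed by following the linked installation steps.

```bash
# Download the Nextflow 23.04.3 launcher (URL from step 1) and make it executable.
mkdir -p "$HOME/bin"
curl -L -o "$HOME/bin/nextflow" \
    https://github.com/nextflow-io/nextflow/releases/download/v23.04.3/nextflow
chmod 755 "$HOME/bin/nextflow"

# Make sure the binaries are on your $PATH (add this line to your shell rc file to persist it).
export PATH="$HOME/bin:$PATH"
nextflow -version

# After installing micromamba per the linked steps, pin it to version 1.0.0 (step 2).
micromamba self-update --version 1.0.0
```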

\
&nbsp;

## CFSAN GalaxyTrakr

The `cronology` pipeline is also available for use on the [Galaxy instance supported by CFSAN, FDA](https://galaxytrakr.org/). If you wish to run the analysis using **Galaxy**, please register for an account, after which you can run the workflow by selecting `cronology` under the [`Metagenomics:CPIPES`](../assets/cronology_on_galaxytrakr.PNG) tool section.

Please note that the pipeline on [CFSAN GalaxyTrakr](https://galaxytrakr.org) may in most cases be an older version than the one on **GitHub** due to testing prioritization.

\
&nbsp;

## Usage and Examples

Clone or download this repository and then call `cpipes`.

```bash
cpipes --pipeline cronology [options]
```

Alternatively, you can use `nextflow` to directly pull and run the pipeline.

```bash
nextflow pull CFSAN-Biostatistics/cronology
nextflow list
nextflow info CFSAN-Biostatistics/cronology
nextflow run CFSAN-Biostatistics/cronology --pipeline cronology_db --help
nextflow run CFSAN-Biostatistics/cronology --pipeline cronology --help
```

\
&nbsp;

**Example**: Run the default `cronology` pipeline in single-end mode.

```bash
cd /data/scratch/$USER
mkdir nf-cpipes
cd nf-cpipes
cpipes \
    --pipeline cronology \
    --input /path/to/illumina/fastq/dir \
    --output /path/to/output \
    --cronology_root_dbdir /data/Kranti_Konganti/cronology_db/PDG000000043.213 \
    --fq_single_end true
```

\
&nbsp;

**Example**: Run the `cronology` pipeline in paired-end mode.

```bash
cd /data/scratch/$USER
mkdir nf-cpipes
cd nf-cpipes
cpipes \
    --pipeline cronology \
    --input /path/to/illumina/fastq/dir \
    --output /path/to/output \
    --cronology_root_dbdir /data/Kranti_Konganti/cronology_db/PDG000000043.213 \
    --fq_single_end false
```

\
&nbsp;

### Database

---

Although users can choose to run the `cronology_db` pipeline themselves, it requires access to an HPC cluster or a similar cloud setting. Because the `GUNC` and `CheckM2` tools used to filter out low-quality assemblies each require their own databases, the runtime is longer than usual. Therefore, pre-formatted databases are provided for download.

- Download the `PDG000000043.213` version of the **NCBI Pathogens release** for **_Cronobacter_**: <https://research.foodsafetyrisk.org/cronology/PDG000000043.213.tar.bz2>. A download-and-extract sketch follows below.
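
A minimal download-and-extract sketch, assuming `wget` and `bzip2` are available and using an arbitrary `/path/to` location:

```bash
# Download the pre-formatted PDG000000043.213 database (URL from above) and unpack it.
cd /path/to
wget https://research.foodsafetyrisk.org/cronology/PDG000000043.213.tar.bz2
tar -xjf PDG000000043.213.tar.bz2

# The extracted directory is then passed to the pipeline via --cronology_root_dbdir,
# e.g. --cronology_root_dbdir /path/to/PDG000000043.213 (assuming it unpacks into
# a directory of that name).
```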

\
&nbsp;

### Input

---

The input to the workflow is a folder containing compressed (`.gz`) FASTQ files. Please note that sample grouping happens automatically based on the FASTQ file names. If, for example, a single sample is sequenced across multiple sequencing lanes, you can choose to group those FASTQ files into one sample by using the `--fq_filename_delim` and `--fq_filename_delim_idx` options. By default, `--fq_filename_delim` is set to `_` (underscore) and `--fq_filename_delim_idx` is set to 1.

For example, if the directory contains FASTQ files as shown below:

- KB-01_apple_L001_R1.fastq.gz
- KB-01_apple_L001_R2.fastq.gz
- KB-01_apple_L002_R1.fastq.gz
- KB-01_apple_L002_R2.fastq.gz
- KB-02_mango_L001_R1.fastq.gz
- KB-02_mango_L001_R2.fastq.gz
- KB-02_mango_L002_R1.fastq.gz
- KB-02_mango_L002_R2.fastq.gz

then, to create 2 sample groups (one for the `apple` sample and one for the `mango` sample), we split the file name by the delimiter (an underscore in this case, which is the default) and group by the first 2 fields (`--fq_filename_delim_idx 2`).

It goes without saying that all the FASTQ files should follow a uniform naming pattern so that the `--fq_filename_delim` and `--fq_filename_delim_idx` options do not have any adverse effect when collecting and creating the sample metadata sheet.
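
To illustrate the grouping logic only (this is not how the pipeline itself is implemented), the group each FASTQ file would fall into can be previewed from the command line:

```bash
# Preview the sample group for each FASTQ file by splitting the file name on "_"
# and keeping the first 2 fields (equivalent to --fq_filename_delim_idx 2).
for fq in /path/to/illumina/fastq/dir/*.fastq.gz; do
    group=$(basename "$fq" | cut -d '_' -f 1-2)
    printf '%s\t%s\n' "$group" "$(basename "$fq")"
done | sort
```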

\
&nbsp;

### Output

---

All the outputs for each step are stored inside the folder specified with the `--output` option. A `multiqc_report.html` file inside the `cronology-multiqc` folder contains a brief consolidated report and can be opened in any browser on your local workstation. The tree metadata, which can be uploaded to [iTOL](https://itol.embl.de/) for visualization, is located in the `cat_unique` folder.
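
For example, the key outputs described above can be located as follows (paths are illustrative and depend on the value passed to `--output`):

```bash
# Consolidated MultiQC report and iTOL-ready tree metadata.
ls /path/to/output/cronology-multiqc/multiqc_report.html
ls /path/to/output/cat_unique/
```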

\
&nbsp;

### Computational resources

---

The `cronology` workflow requires a minimum of about 60 GB of memory to finish successfully. By default, `cronology` uses 10 CPU cores where possible. You can change this behavior and adjust the number of CPU cores with the `--max_cpus` option.

\
&nbsp;

Example:

```bash
cpipes \
    --pipeline cronology \
    --input /path/to/cronology_sim_reads \
    --output /path/to/cronology_sim_reads_output \
    --cronology_root_dbdir /path/to/PDG000000043.213 \
    --max_cpus 5 \
    -profile stdkondagac \
    -resume
```

\
&nbsp;

### Runtime profiles

---

You can use different runtime profiles to suit your specific compute environment, i.e., you can run the workflow locally on your machine or on a grid computing infrastructure.

\
&nbsp;

Example:

```bash
cd /data/scratch/$USER
mkdir nf-cpipes
cd nf-cpipes
cpipes \
    --pipeline cronology \
    --input /path/to/fastq_pass_dir \
    --output /path/to/where/output/should/go \
    -profile your_institution
```

The above command runs the pipeline and stores the output at the location given by the `--output` flag, while the **NEXTFLOW** reports are always stored in the current working directory from which `cpipes` is run. For example, for the above command, a directory called `CPIPES-cronology` would hold all the **NEXTFLOW**-related logs, reports and trace files.

\
&nbsp;

### `your_institution.config`

---

In the above example, the runtime profile is given as `your_institution`. For this to work, add the following lines at the end of the [`computeinfra.config`](../conf/computeinfra.config) file, which is located inside the `conf` folder. For example, if your institution uses **SGE** or **UNIVA** for grid computing instead of **SLURM** and has a job queue named `normal.q`, then add these lines:

\
&nbsp;

```groovy
your_institution {
    process.executor = 'sge'
    process.queue = 'normal.q'
    singularity.enabled = false
    singularity.autoMounts = true
    docker.enabled = false
    params.enable_conda = true
    conda.enabled = true
    conda.useMicromamba = true
    params.enable_module = false
}
```

In the above example, all the software provisioning choices are disabled by default except `conda`. You can also choose to remove the `process.queue` line altogether, in which case the `cronology` workflow will automatically request the appropriate memory and number of CPU cores, ranging from 1 CPU core, 1 GB of memory and 1 hour for job completion up to 10 CPU cores, 1 TB of memory and 120 hours for job completion.

\
&nbsp;

### Cloud computing

---

You can run the workflow in the cloud (this works only with a proper setup of AWS resources). Add new runtime profiles with the required parameters per the [Nextflow docs](https://www.nextflow.io/docs/latest/executor.html):

\
&nbsp;

Example:

```groovy
my_aws_batch {
    executor = 'awsbatch'
    queue = 'my-batch-queue'
    aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
    aws.batch.region = 'us-east-1'
    singularity.enabled = false
    singularity.autoMounts = true
    docker.enabled = true
    params.conda_enabled = false
    params.enable_module = false
}
```
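
A hypothetical invocation using such a profile might then look like the following. The S3 bucket name and local paths are placeholders, the AWS Batch executor requires the Nextflow work directory to be on S3 (hence `-work-dir`), and whether local `--input`/`--output` paths are appropriate depends on how your AWS resources are set up.

```bash
# Run the pipeline with the AWS Batch profile defined above (placeholder paths).
cpipes \
    --pipeline cronology \
    --input /path/to/illumina/fastq/dir \
    --output /path/to/output \
    --cronology_root_dbdir /path/to/PDG000000043.213 \
    -profile my_aws_batch \
    -work-dir s3://my-bucket/cpipes-work
```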

\
&nbsp;

### Example data

---

`cronology` was tested on multiple internal sequencing runs and also on publicly available WGS run data. Please make sure that you have all the [minimum requirements](#minimum-requirements) to run the workflow.

- Download public SRA data for **_Cronobacter_**: [SRR List](../assets/runs_public_cronobacter.txt). You can download a minimized set of sequencing runs for testing purposes (a download sketch is shown further below).
- Download the pre-formatted full database for the **NCBI Pathogens release**: [PDG000000043.213](https://research.foodsafetyrisk.org/cronology/PDG000000043.213.tar.bz2) (~500 MB).
- After a successful run of the workflow, your **MultiQC** report should look something like [this](https://research.foodsafetyrisk.org/cronology/627_crono_multiqc_report.html).
- It is always best practice to use absolute UNIX paths and the real destinations of symbolic links during pipeline execution. For example, find out the real path(s) of your absolute UNIX path(s) and use them for the `--input` and `--output` options of the pipeline:

```bash
realpath /hpc/scratch/user/input
```
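
If you would like to fetch some of the public runs from the SRR list mentioned above, a minimal sketch using `sra-tools` (assumed to be installed separately, e.g. via `micromamba`) could look like this; the accessions shown are placeholders, so pick real ones from the SRR list:

```bash
# Download a few runs from runs_public_cronobacter.txt with sra-tools and
# compress them so they match the --fq_suffix '_1.fastq.gz' convention used below.
for acc in SRRXXXXXXX SRRYYYYYYY; do
    prefetch "$acc"
    fasterq-dump --split-files --outdir /path/to/sra_reads "$acc"
done
gzip /path/to/sra_reads/*.fastq
```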

Now, run the workflow:

\
&nbsp;

```bash
cpipes \
    --pipeline cronology \
    --input /path/to/sra_reads \
    --output /path/to/sra_reads_output \
    --cronology_root_dbdir /path/to/PDG000000043.213 \
    --fq_single_end false \
    --fq_suffix '_1.fastq.gz' --fq2_suffix '_2.fastq.gz' \
    -profile stdkondagac \
    -resume
```

Please note that the runtime profile `stdkondagac` will run jobs locally using `micromamba` for software provisioning. The first time you run the command, a new folder called `kondagac_cache` will be created, and subsequent runs should use this `conda` cache.

\
&nbsp;

## `cronology` CLI Help

```text
[Kranti_Konganti@my-unix-box ]$ cpipes --pipeline cronology --help
N E X T F L O W  ~  version 23.04.3
Launching `./cronology/cpipes` [jovial_colden] DSL2 - revision: 79ea031fad
================================================================================
             (o)
  ___  _ __   _  _ __    ___  ___
 / __|| '_ \ | || '_ \  / _ \/ __|
| (__ | |_) || || |_) ||  __/\__ \
 \___|| .__/ |_|| .__/  \___||___/
      | |       | |
      |_|       |_|
--------------------------------------------------------------------------------
A collection of modular pipelines at CFSAN, FDA.
--------------------------------------------------------------------------------
Name                            : CPIPES
Author                          : Kranti.Konganti@fda.hhs.gov
Version                         : 0.7.0
Center                          : CFSAN, FDA.
================================================================================


--------------------------------------------------------------------------------
Show configurable CLI options for each tool within cronology
--------------------------------------------------------------------------------
Ex: cpipes --pipeline cronology --help
Ex: cpipes --pipeline cronology --help fastp
Ex: cpipes --pipeline cronology --help fastp,polypolish
--------------------------------------------------------------------------------
--help dpubmlstpy               : Show dl_pubmlst_profiles_and_schemes.py CLI
                                  options
--help fastp                    : Show fastp CLI options
--help spades                   : Show spades CLI options
--help shovill                  : Show shovill CLI options
--help polypolish               : Show polypolish CLI options
--help quast                    : Show quast.py CLI options
--help prodigal                 : Show prodigal CLI options
--help prokka                   : Show prokka CLI options
--help pirate                   : Show pirate CLI options
--help mlst                     : Show mlst CLI options
--help mash                     : Show mash `screen` CLI options
--help tree                     : Show mashtree CLI options
--help abricate                 : Show abricate CLI options

```