comparison 0.2.0/readme/cronology.md @ 0:9e8b1c747a6a draft default tip

planemo upload
author galaxytrakr
date Fri, 29 May 2026 13:32:17 +0000
-1:000000000000 0:9e8b1c747a6a
1 # cronology
2
3 `cronology` is an automated workflow for **_Cronobacter_** whole genome sequence assembly, subtyping and traceback based on the [NCBI Pathogen Detection](https://www.ncbi.nlm.nih.gov/pathogens) Project for [Cronobacter](https://www.ncbi.nlm.nih.gov/pathogens/isolates/#taxgroup_name:%22Cronobacter%22). It uses `fastp` for read quality control, `shovill` and `polypolish` for **_de novo_** assembly and genome polishing, `prokka` for gene prediction and annotation, and `quast.py` for assembly quality metrics. User(s) can choose a gold standard reference genome as a model during the gene prediction step with `prokka`. By default, `GCF_003516125` (**_Cronobacter sakazakii_**) is used.
4
5 In parallel, for each isolate, whole genome based (genome distances) traceback analysis is performed using `mash` and `mashtree` and the results are saved as a phylogenetic tree in `newick` format. The accompanying metadata can be uploaded to [iTOL](https://itol.embl.de/) for tree visualization.
6
7 User(s) can also run pangenome analysis using `pirate` but this will considerably increase the run time of the pipeline if the input has more than ~50 samples.
8
9 \
10  
11
12 <!-- TOC -->
13
14 - [Minimum Requirements](#minimum-requirements)
15 - [CFSAN GalaxyTrakr](#cfsan-galaxytrakr)
16 - [Usage and Examples](#usage-and-examples)
17 - [Database](#database)
18 - [Input](#input)
19 - [Output](#output)
20 - [Sample Clustering](#sample-clustering)
21 - [Computational resources](#computational-resources)
22 - [Runtime profiles](#runtime-profiles)
23 - [your_institution.config](#your_institutionconfig)
24 - [Cloud computing](#cloud-computing)
25 - [Example data](#example-data)
26 - [cronology CLI Help](#cronology-cli-help)
27
28 <!-- /TOC -->
29
30 \
31 &nbsp;
32
33 ## Minimum Requirements
34
35 1. [Nextflow version 23.04.3](https://github.com/nextflow-io/nextflow/releases/download/v23.04.3/nextflow).
36 - Make the `nextflow` binary executable (`chmod 755 nextflow`) and also make sure that it is made available in your `$PATH`.
37 - If your existing `JAVA` install does not support the newest **Nextflow** version, you can try **Amazon**'s `JAVA` (OpenJDK): [Corretto](https://corretto.aws/downloads/latest/amazon-corretto-17-x64-linux-jdk.tar.gz).
38 2. One of `micromamba` (version `1.0.0`), `docker` or `singularity` installed and made available in your `$PATH`.
39 - Running the workflow via `micromamba` software provisioning is **preferred** as it does not require any `sudo` or `admin` privileges or any other configurations with respect to the various container providers.
40 - To install `micromamba` for your system type, please follow these [installation steps](https://mamba.readthedocs.io/en/latest/installation/micromamba-installation.html#linux-and-macos) and make sure that the `micromamba` binary is made available in your `$PATH`.
41 - Just the `curl` step is sufficient to download the binary as far as running the workflows is concerned.
42 - Once you have finished the installation, **it is important that you downgrade `micromamba` to version `1.0.0`**.
43
44 ```bash
45 micromamba self-update --version 1.0.0
46 ```
47
48 3. Minimum of 10 CPU cores and about 60 GB of memory for the main workflow steps. More memory may be required if your **FASTQ** files are large.
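
Before a first run, it can help to confirm that the tools listed above are actually visible on your `$PATH`. The following is a minimal sketch, not part of `cronology`; the helper names `have_tool` and `check_prereqs` are hypothetical and only use the shell built-in `command -v`.

```bash
#!/usr/bin/env bash

# Return 0 if the named command is found on $PATH, non-zero otherwise.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report missing prerequisites; return 0 only if nothing is missing.
check_prereqs() {
    local missing=0
    if ! have_tool nextflow; then
        echo "nextflow: not found in PATH"
        missing=1
    fi
    # Any one of the three software provisioning tools is sufficient.
    if ! have_tool micromamba && ! have_tool docker && ! have_tool singularity; then
        echo "none of micromamba, docker or singularity found in PATH"
        missing=1
    fi
    return "$missing"
}
```

Calling `check_prereqs` prints one line per missing requirement and returns a non-zero status if anything is missing. It does not check tool versions, CPU cores or memory.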
49
50 \
51 &nbsp;
52
53 ## CFSAN GalaxyTrakr
54
55 The `cronology` pipeline is also available for use on the [Galaxy instance supported by CFSAN, FDA](https://galaxytrakr.org/). If you wish to run the analysis using **Galaxy**, please register for an account, after which you can run the workflow by selecting `cronology` under the [`Metagenomics:CPIPES`](../assets/cronology_on_galaxytrakr.PNG) tool section.
56
57 Please note that the pipeline on [CFSAN GalaxyTrakr](https://galaxytrakr.org) in most cases may be a version older than the one on **GitHub** due to testing prioritization.
58
59 \
60 &nbsp;
61
62 ## Usage and Examples
63
64 Clone or download this repository and then call `cpipes`.
65
66 ```bash
67 cpipes --pipeline cronology [options]
68 ```
69
70 Alternatively, you can use `nextflow` to directly pull and run the pipeline.
71
72 ```bash
73 nextflow pull CFSAN-Biostatistics/cronology
74 nextflow list
75 nextflow info CFSAN-Biostatistics/cronology
76 nextflow run CFSAN-Biostatistics/cronology --pipeline cronology_db --help
77 nextflow run CFSAN-Biostatistics/cronology --pipeline cronology --help
78 ```
79
80 \
81 &nbsp;
82
83 **Example**: Run the default `cronology` pipeline in single-end mode.
84
85 ```bash
86 cd /data/scratch/$USER
87 mkdir nf-cpipes
88 cd nf-cpipes
89 cpipes \
90 --pipeline cronology \
91 --input /path/to/illumina/fastq/dir \
92 --output /path/to/output \
93 --cronology_root_dbdir /data/Kranti_Konganti/cronology_db/PDG000000043.213 \
94 --fq_single_end true
95 ```
96
97 \
98 &nbsp;
99
100 **Example**: Run the `cronology` pipeline in paired-end mode.
101
102 ```bash
103 cd /data/scratch/$USER
104 mkdir nf-cpipes
105 cd nf-cpipes
106 cpipes \
107 --pipeline cronology \
108 --input /path/to/illumina/fastq/dir \
109 --output /path/to/output \
110 --cronology_root_dbdir /data/Kranti_Konganti/cronology_db/PDG000000043.213 \
111 --fq_single_end false
112 ```
113
114 \
115 &nbsp;
116
117 ### Database
118
119 ---
120
121 Although users can choose to run the `cronology_db` pipeline, it requires access to an HPC cluster or a similar cloud setting. The `GUNC` and `CheckM2` tools used to filter out low quality assemblies require their own databases, so the runtime is longer than usual. Therefore, pre-formatted databases are provided for download.
122
123 - Download the `PDG000000043.213` version of **NCBI Pathogens release** for **_Cronobacter_**: <https://research.foodsafetyrisk.org/cronology/PDG000000043.213.tar.bz2>.
124
125 \
126 &nbsp;
127
128 ### Input
129
130 ---
131
132 The input to the workflow is a folder containing compressed (`.gz`) FASTQ files. Please note that sample grouping happens automatically based on the FASTQ file name. If, for example, a single sample is sequenced across multiple sequencing lanes, you can choose to group those FASTQ files into one sample by using the `--fq_filename_delim` and `--fq_filename_delim_idx` options. By default, `--fq_filename_delim` is set to `_` (underscore) and `--fq_filename_delim_idx` is set to 1.
133
134 For example, if the directory contains FASTQ files as shown below:
135
136 - KB-01_apple_L001_R1.fastq.gz
137 - KB-01_apple_L001_R2.fastq.gz
138 - KB-01_apple_L002_R1.fastq.gz
139 - KB-01_apple_L002_R2.fastq.gz
140 - KB-02_mango_L001_R1.fastq.gz
141 - KB-02_mango_L001_R2.fastq.gz
142 - KB-02_mango_L002_R1.fastq.gz
143 - KB-02_mango_L002_R2.fastq.gz
144
145 Then, to create 2 sample groups, `apple` and `mango`, we split the file name by the delimiter (underscore in this case, which is the default) and group by the first 2 words (`--fq_filename_delim_idx 2`).
146
147 It goes without saying that all the FASTQ files should have uniform naming patterns so that the `--fq_filename_delim` and `--fq_filename_delim_idx` options do not have any adverse effect in collecting and creating a sample metadata sheet.
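
To preview how a given delimiter and index would group your files before launching the pipeline, the splitting rule described above can be sketched with `cut`. This is an illustration only: the helpers `sample_id` and `list_samples` are hypothetical, and the assumption that the kept fields are rejoined with the same delimiter may not match exactly how `cronology` names samples internally.

```bash
#!/usr/bin/env bash

# Print the sample ID for one FASTQ file name: the first IDX fields
# of the name when split on DELIM (defaults mirror the documented
# defaults: "_" and 1). Hypothetical helper, not part of cpipes.
sample_id() {
    local name="$1" delim="${2:-_}" idx="${3:-1}"
    printf '%s\n' "$name" | cut -d "$delim" -f "1-$idx"
}

# Print the unique sample IDs for the file names given after DELIM and IDX.
list_samples() {
    local delim="$1" idx="$2"
    shift 2
    local f
    for f in "$@"; do
        sample_id "$f" "$delim" "$idx"
    done | sort -u
}
```

For the file names listed above, `list_samples _ 2 KB-01_apple_L001_R1.fastq.gz KB-01_apple_L002_R1.fastq.gz KB-02_mango_L001_R1.fastq.gz` prints `KB-01_apple` and `KB-02_mango`, that is, one group per fruit with both lanes merged. With the default index of 1 the groups would be `KB-01` and `KB-02`.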
148
149 \
150 &nbsp;
151
152 ### Output
153
154 ---
155
156 All the outputs for each step are stored inside the folder given with the `--output` option. The `multiqc_report.html` file inside the `cronology-multiqc` folder contains a brief consolidated report and can be opened in any browser on your local workstation. The tree metadata, which can be uploaded to [iTOL](https://itol.embl.de/) for visualization, will be located in the `cat_unique` folder.
157
158 \
159 &nbsp;
160
161 ### Sample clustering
162
163 ---
164 Since `v0.2.0`, `cronology` can automatically upload the `mashtree` generated output to [microreact.org](https://microreact.org). For this to work, create an account, [obtain your API access token from microreact.org](https://docs.microreact.org/api/access-tokens#obtain-your-api-access-token), put it in a file named `microreact_api.key` and save that file inside the [assets](../assets/) folder. If you do not wish to automatically upload the tree to [microreact.org](https://microreact.org), you can turn it off during the command call with the `--upload_microreact false` CLI option.
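
Because the key file holds a personal access token, one reasonable way to create it is with owner-only permissions. The sketch below is an assumption-laden illustration: the helper `save_microreact_key` is hypothetical, and it assumes the file should contain just the token on a single line, which the text above does not spell out.

```bash
#!/usr/bin/env bash

# Write TOKEN to ASSETS_DIR/microreact_api.key and restrict the file
# to its owner. Hypothetical helper; the one-line file layout is an assumption.
save_microreact_key() {
    local token="$1" assets_dir="$2"
    local key_file="$assets_dir/microreact_api.key"
    # The subshell keeps the umask change local to this write.
    ( umask 077 && printf '%s\n' "$token" > "$key_file" )
    # Also tighten permissions if the file already existed.
    chmod 600 "$key_file"
}
```

For example, `save_microreact_key "$MY_TOKEN" /path/to/cronology/assets` would write the key file into the `assets` folder of your local copy of the repository, where `MY_TOKEN` is a placeholder variable holding your own token.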
165
166 The tree URL generated will be stored inside the `upload_microreact` output folder.
167
168 Example: [https://microreact.org/project/c9GcC9pJ622FeX27f2LFRT-cronologyruntree](https://microreact.org/project/c9GcC9pJ622FeX27f2LFRT-cronologyruntree)
169
170 \
171 &nbsp;
172
173 ### Computational resources
174
175 ---
176
177 The `cronology` workflow requires a minimum of 60 GB of memory to finish successfully. By default, `cronology` uses 10 CPU cores where possible. You can change this behavior and adjust the CPU cores with the `--max_cpus` option.
178
179 \
180 &nbsp;
181
182 Example:
183
184 ```bash
185 cpipes \
186 --pipeline cronology \
187 --input /path/to/cronology_sim_reads \
188 --output /path/to/cronology_sim_reads_output \
189 --cronology_root_dbdir /path/to/PDG000000043.213 \
190 --max_cpus 5 \
191 -profile stdkondagac \
192 -resume
193 ```
194
195 \
196 &nbsp;
197
198 ### Runtime profiles
199
200 ---
201
202 You can use different run time profiles to suit your specific compute environment, i.e., you can run the workflow locally on your machine or in a grid computing infrastructure.
203
204 \
205 &nbsp;
206
207 Example:
208
209 ```bash
210 cd /data/scratch/$USER
211 mkdir nf-cpipes
212 cd nf-cpipes
213 cpipes \
214 --pipeline cronology \
215 --input /path/to/fastq_pass_dir \
216 --output /path/to/where/output/should/go \
217 -profile your_institution
218 ```
219
220 The above command runs the pipeline and stores the output at the location given by the `--output` flag. The **NEXTFLOW** reports are always stored in the current working directory from which `cpipes` is run. For example, for the above command, a directory called `CPIPES-cronology` would hold all the **NEXTFLOW** related logs, reports and trace files.
221
222 \
223 &nbsp;
224
225 ### `your_institution.config`
226
227 ---
228
229 The above example sets the run time profile to `your_institution`. For this to work, add the following lines at the end of the [`computeinfra.config`](../conf/computeinfra.config) file, which should be located inside the `conf` folder. For example, if your institution uses **SGE** or **UNIVA** for grid computing instead of **SLURM** and has a job queue named `normal.q`, then add these lines:
230
231 \
232 &nbsp;
233
234 ```groovy
235 your_institution {
236 process.executor = 'sge'
237 process.queue = 'normal.q'
238 singularity.enabled = false
239 singularity.autoMounts = true
240 docker.enabled = false
241 params.enable_conda = true
242 conda.enabled = true
243 conda.useMicromamba = true
244 params.enable_module = false
245 }
246 ```
247
248 In the above example, all the software provisioning choices are disabled except `conda`. You can also choose to remove the `process.queue` line altogether and the `cronology` workflow will request the appropriate memory and number of CPU cores automatically, ranging from 1 CPU core, 1 GB and 1 hour for job completion up to 10 CPU cores, 1 TB and 120 hours for job completion.
249
250 \
251 &nbsp;
252
253 ### Cloud computing
254
255 ---
256
257 You can run the workflow in the cloud (this works only with a proper setup of AWS resources). Add new run time profiles with the required parameters per the [Nextflow docs](https://www.nextflow.io/docs/latest/executor.html):
258
259 \
260 &nbsp;
261
262 Example:
263
264 ```groovy
265 my_aws_batch {
266 executor = 'awsbatch'
267 queue = 'my-batch-queue'
268 aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
269 aws.batch.region = 'us-east-1'
270 singularity.enabled = false
271 singularity.autoMounts = true
272 docker.enabled = true
273 params.enable_conda = false
274 params.enable_module = false
275 }
276 ```
277
278 \
279 &nbsp;
280
281 ### Example data
282
283 ---
284
285 `cronology` was tested on multiple internal sequencing runs and also on publicly available WGS run data. Please make sure that you have all the [minimum requirements](#minimum-requirements) to run the workflow.
286
287 - Download public SRA data for **_Cronobacter_**: [SRR List](../assets/runs_public_cronobacter.txt). You can download a minimized set of sequencing runs for testing purposes.
288 - Download pre-formatted full database for **NCBI Pathogens release**: [PDG000000043.213](https://research.foodsafetyrisk.org/cronology/PDG000000043.213.tar.bz2) (~500 MB).
289 - After a successful run of the workflow, your **MultiQC** report should look something like [this](https://research.foodsafetyrisk.org/cronology/627_crono_multiqc_report.html).
290 - It is best practice to use absolute UNIX paths and the real destinations of symbolic links during pipeline execution. For example, find out the real path(s) of your absolute UNIX path(s) and use them for the `--input` and `--output` options of the pipeline.
291
292 ```bash
293 realpath /hpc/scratch/user/input
294 ```
295
296 Now, run the workflow:
297
298 \
299 &nbsp;
300
301 ```bash
302 cpipes \
303 --pipeline cronology \
304 --input /path/to/sra_reads \
305 --output /path/to/sra_reads_output \
306 --cronology_root_dbdir /path/to/PDG000000043.213 \
307 --fq_single_end false \
308 --fq_suffix '_1.fastq.gz' --fq2_suffix '_2.fastq.gz' \
309 -profile stdkondagac \
310 -resume
311 ```
312
313 Please note that the run time profile `stdkondagac` will run jobs locally using `micromamba` for software provisioning. The first time you run the command, a new folder called `kondagac_cache` will be created and subsequent runs should use this `conda` cache.
314
315 \
316 &nbsp;
317
318 ## `cronology` CLI Help
319
320 ```text
321 [Kranti_Konganti@my-unix-box ]$ cpipes --pipeline cronology --help
322 N E X T F L O W ~ version 23.04.3
323 Launching `./cronology/cpipes` [jovial_colden] DSL2 - revision: 79ea031fad
324 ================================================================================
325 (o)
326 ___ _ __ _ _ __ ___ ___
327 / __|| '_ \ | || '_ \ / _ \/ __|
328 | (__ | |_) || || |_) || __/\__ \
329 \___|| .__/ |_|| .__/ \___||___/
330 | | | |
331 |_| |_|
332 --------------------------------------------------------------------------------
333 A collection of modular pipelines at CFSAN, FDA.
334 --------------------------------------------------------------------------------
335 Name : CPIPES
336 Author : Kranti.Konganti@fda.hhs.gov
337 Version : 0.7.0
338 Center : CFSAN, FDA.
339 ================================================================================
340
341
342 --------------------------------------------------------------------------------
343 Show configurable CLI options for each tool within cronology
344 --------------------------------------------------------------------------------
345 Ex: cpipes --pipeline cronology --help
346 Ex: cpipes --pipeline cronology --help fastp
347 Ex: cpipes --pipeline cronology --help fastp,polypolish
348 --------------------------------------------------------------------------------
349 --help dpubmlstpy : Show dl_pubmlst_profiles_and_schemes.py CLI
350 options CLI options
351 --help fastp : Show fastp CLI options
352 --help spades : Show spades CLI options
353 --help shovill : Show shovill CLI options
354 --help polypolish : Show polypolish CLI options
355 --help quast : Show quast.py CLI options
356 --help prodigal : Show prodigal CLI options
357 --help prokka : Show prokka CLI options
358 --help pirate : Show priate CLI options
359 --help mlst : Show mlst CLI options
360 --help mash : Show mash `screen` CLI options
361 --help tree : Show mashtree CLI options
362 --help abricate : Show abricate CLI options
363
364 ```