#! /usr/bin/env nextflow
nextflow.enable.dsl=2

// CSP2 Main Script
// Params are read in from command line or from nextflow.config and/or conf/profiles.config

// Check if help flag was passed
help1 = "${params.help}" == "nohelp" ? "nohelp" : "help"
help2 = "${params.h}" == "nohelp" ? "nohelp" : "help"

def printHelp() {
    println """
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    CSP2
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Global default params:

    --out           Set name for output folder/file prefixes (Default: CSP2_)
    --outroot       Set output parent directory (Default: CWD; Useful to hardset in nextflow.config if
                    you want all output to go to the same parent folder, with unique IDs set by --out)
    --tmp_dir       Manually specify a TMP directory for pybedtools output
    --help/--h      Display this help menu


    CSP2 can run in the following run modes:

    --runmode       Run mode for CSP2:

                    - assemble: Assemble read data (--reads/--ref_reads) into FASTA using SKESA

                    - align:    Given query data (--reads/--fasta) and reference data (--ref_reads/--ref_fasta),
                                run MUMmer alignment analysis for each query/ref combination

                    - screen:   Given query data (--reads/--fasta) and reference data (--ref_reads/--ref_fasta)
                                and/or MUMmer output (.snpdiffs), create a report for raw SNP
                                distances between each query and reference assembly

                    - snp:      Given query data (--reads/--fasta) and reference data (--ref_reads/--ref_fasta)
                                and/or MUMmer output (.snpdiffs), generate alignments and pairwise
                                distances for all queries based on each reference dataset

    Input Data:

    --fasta         Location for query isolate assembly data (.fasta/.fa/.fna). Can be a list of files, a path
                    to a single FASTA, or a path to a directory with assemblies.
    --ref_fasta     Location for reference isolate assembly data (.fasta/.fa/.fna). Can be a list of files, a
                    path to a single FASTA, or a path to a directory with assemblies.

    --reads         Directory or list of directories containing query isolate read data
    --readext       Read file extension (Default: fastq.gz)
    --forward       Forward read file suffix (Default: _1.fastq.gz)
    --reverse       Reverse read file suffix (Default: _2.fastq.gz)

    --ref_reads     Directory or list of directories containing reference isolate read data
    --ref_readext   Reference read file extension (Default: fastq.gz)
    --ref_forward   Reference forward read file suffix (Default: _1.fastq.gz)
    --ref_reverse   Reference reverse read file suffix (Default: _2.fastq.gz)

    --snpdiffs      Location for pre-generated snpdiffs files (List of snpdiffs files, directory with snpdiffs)

    --ref_id        IDs to specify reference sequences (Comma-separated list; e.g., Sample_A,Sample_B,Sample_C)

    --trim_name     A common string to remove from all sample IDs (Default: ''; Useful if all assemblies end in
                    something like "_contigs_skesa.fasta")

    --n_ref         If running in --runmode snp, the number of reference genomes for CSP2 to select if none are provided (Default: 1)

    --exclude       A comma-separated list of IDs to remove prior to analysis (Useful for removing low quality
                    isolates in combination with --snpdiffs)

    QC variables:

    --min_cov       Only consider queries if the reference genome is covered by at least N% (Default: 85)
    --min_len       Only consider SNPs from contig alignments longer than N bp (Default: 500)
    --min_iden      Only consider SNPs from alignments with at least N percent identity (Default: 99)
    --dwin          A comma-separated set of window sizes for SNP density filters (Default: 1000,125,15; Set --dwin 0 to disable density filtering)
    --wsnps         A comma-separated set of maximum SNP counts, one for each --dwin window above (Default: 3,2,1)
    --max_missing   If running in --runmode snp, mask SNPs where data is missing or purged from N% of isolates (Default: 50)

    Edge Trimming:

    --ref_edge      Don't include SNPs that fall within N bp of a reference contig edge (Default: 150)
    --query_edge    Don't include SNPs that fall within N bp of a query contig edge (Default: 150)
    --rescue        If flagged (Default: not flagged), sites that were filtered out due solely to query edge proximity are rescued if
                    the same reference position is covered more centrally by another query
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Example Commands:

    1) Run CSP2 in SNP Pipeline mode using all the FASTA files from /my/data/dir, and choose 3 references

    nextflow run CSP2.nf --runmode snp --fasta /my/data/dir --n_ref 3

    2) Screen all the paired-end .fastq files from /my/read/dir against the reference isolate in /my/reference/isolates.txt

    nextflow run CSP2.nf --runmode screen --ref_fasta /my/reference/isolates.txt --reads /my/read/dir --readext .fastq --forward _1.fastq --reverse _2.fastq

    3) Re-run the SNP pipeline using old snpdiffs files after changing the density filters and removing a bad sample

    nextflow run CSP2.nf --runmode snp --snpdiffs /my/old/analysis/snpdiffs --dwin 5000,2500,1000 --wsnps 6,4,2 --ref_id Sample_A --exclude Sample_Q --out HQ_Density

    4) Run in assembly mode and use HPC modules specified in profiles.config (NOTE: Setting the profile in nextflow uses a single hyphen (-) as compared to other arguments (--))

    nextflow run CSP2.nf -profile myHPC --runmode assemble --reads /my/read/dir --out Assemblies

    5) Run in SNP pipeline mode using SLURM and use the built in conda environment (NOTE: For local jobs using conda, use -profile standard_conda)

    nextflow run CSP2.nf -profile slurm_conda --runmode snp --fasta /my/data/dir --out CSP2_Conda

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    """
    System.exit(0)
}

if (help1 == "help") {
    printHelp()
} else if (help2 == "help") {
    printHelp()
}

// Assess run mode
if (params.runmode == "") {
    error "--runmode must be specified..."
} else if (!['align', 'assemble', 'screen', 'snp', 'conda_init'].contains(params.runmode)) {
    error "--runmode must be 'align', 'assemble', 'screen', or 'snp', not ${params.runmode}..."
}

// If runmode is conda_init, launch a local process to spur the generation of the conda environment and exit
if (params.runmode != "conda_init") {

    // Ensure necessary data is provided given the run mode
    // Runmode 'assemble'
    //  - Requires: --reads/--ref_reads
    //  - Runs SKESA and summarizes output FASTA
    if (params.runmode == "assemble") {
        if ((params.reads == "") && (params.ref_reads == "")) {
            error "Runmode is 'assemble' but no read data provided via --reads/--ref_reads"
        }
    }

    // Runmode 'align'
    //  - Requires: --reads/--fasta/--snpdiffs
    //  - Optional: --ref_reads/--ref_fasta/--ref_id
    //  - Runs MUMmer, generates .snpdiffs, and alignment summary.
    //      - If references are provided via --ref_reads/--ref_fasta/--ref_id, non-reference samples are aligned to each reference
    //      - If no references are provided, alignments are all-vs-all
    //      - If --snpdiffs are provided, their FASTAs will be autodetected and, if present, used as queries or references as specified by --ref_reads/--ref_fasta/--ref_id
    //      - Does NOT perform QC filtering

    else if (params.runmode == "align") {
        if ((params.fasta == "") && (params.reads == "") && (params.snpdiffs == "")) {
            error "Runmode is 'align' but no query data provided via --fasta/--reads/--snpdiffs"
        }
    }

    // Runmode 'screen'
    //  - Requires: --reads/--fasta/--snpdiffs
    //  - Optional: --ref_reads/--ref_fasta/--ref_id
    //  - Generates .snpdiffs files (if needed), applies QC, and generates alignment summaries and SNP distance estimates
    //      - If references are provided via --ref_reads/--ref_fasta/--ref_id, non-reference samples are aligned to each reference
    //      - If no references are provided, alignments are all-vs-all
    //      - If --snpdiffs are provided, (1) they will be QC filtered and included in the output report and (2) their FASTAs will be autodetected and, if present, used as queries or references as specified by --ref_reads/--ref_fasta/--ref_id

    else if (params.runmode == "screen") {
        if ((params.fasta == "") && (params.reads == "") && (params.snpdiffs == "")) {
            error "Runmode is 'screen' but no query data provided via --snpdiffs/--reads/--fasta"
        }
    }

    // Runmode 'snp'
    //  - Requires: --reads/--fasta/--snpdiffs
    //  - Optional: --ref_reads/--ref_fasta/--ref_id
    //  - If references are not provided, runs RefChooser using all FASTAs to choose references (--n_ref sets how many references to choose)
    //  - Each query is aligned to each reference, and pairwise SNP distances for all queries are generated based on that reference
    //  - Generates .snpdiffs files (if needed), applies QC, and generates SNP distance data between all queries based on their alignment to each reference
    else if (params.runmode == "snp") {
        if ((params.snpdiffs == "") && (params.fasta == "") && (params.reads == "")) {
            error "Runmode is 'snp' but no query data provided via --snpdiffs/--reads/--fasta"
        }
    }

    // Set directory structure
    if (params.outroot == "") {
        output_directory = file(params.out)
    } else {
        out_root = file(params.outroot)
        output_directory = file("${out_root}/${params.out}")
    }

    // If the output directory exists, create a new subdirectory with the default output name ("CSP2_