csp2: CSP2/README.md annotate

annotate CSP2/README.md @ 53:a21f63856acf

"planemo upload"

author	rliterman
date	Thu, 19 Dec 2024 08:53:33 -0500
parents	01431fa12065
children

rev	line source
rliterman@0	1 <p align="center">
rliterman@0	2 <img src="img/Temp_Logo.jpg" alt="drawing" width="400"/>
rliterman@0	3 </p>
rliterman@0	4
rliterman@0	5 # CFSAN SNP Pipeline 2 (CSP2)
rliterman@0	6 #### Dr. Robert Literman
rliterman@0	7 Office of Analytics and Outreach
rliterman@0	8 Center for Food Safety and Applied Nutrition
rliterman@0	9 US Food and Drug Administration
rliterman@0	10
rliterman@0	11 Current Release: [v.0.9.5 (Oct-17-2024)](https://github.com/CFSAN-Biostatistics/CSP2/releases/tag/v.0.9.5)
rliterman@0	12 Last Push: Oct-17-2024
rliterman@0	13
rliterman@0	14 Important Note: CSP2 is currently under development, and has not been validated for non-research purposes. Current workflows and data processing parameters may change prior to full release version.
rliterman@0	15
rliterman@0	16 ## CSP2 is a Nextflow pipeline for rapid, accurate SNP distance estimation from assembly data
rliterman@0	17
rliterman@0	18 CSP2 runs on Unix, with the handful of dependencies listed in the [Software Dependencies](#software-dependencies) section. CSP2 was developed to (1) improve on the speed of the [CFSAN SNP Pipeline (CSP)](https://peerj.com/articles/cs-20/?report=reader), (2) to reduce computational burden when analyzing larger isolate clusters, and (3) to remove the dependency for raw Illumina data. CSP2 relies on the accurate and rapid alignment of genome assemblies provided by [MUmmer](https://github.com/mummer4/mummer), which typically complete within seconds. This provides significant reductions in runtime compared to methods that rely on read mapping. The use of assemblies in place of sequencing data also means that:
rliterman@0	19
rliterman@0	20 - the amount storage needed can be substantially reduced,
rliterman@0	21 - significantly less computational resources are required,
rliterman@0	22 - as long as assemblies are available, isolates can be compared regardless of sequencing platform or whether publicly available sequence data even exists
rliterman@0	23
rliterman@0	24 CSP2 runs are managed via Nextflow, providing the user with an array of [customizations](#tips-for-configuring-csp2) while also facilitating module development and additions in future releases.
rliterman@0	25
rliterman@0	26 Important Note: The software continues to be focused on the analysis of groups of bacterial genomes with limited evolutionary differences (<1000 SNPs). Testing is underway to determine how the underlying cluster diversity impacts distances estimates.
rliterman@0	27
rliterman@0	28 ### CSP2 has two main run modes (See [Examples](#examples)):
rliterman@0	29
rliterman@0	30 #### 1) "Screening Mode" (--runmode screen): Used to determine whether query isolates are close to a set of reference isolates (e.g., lab control strains, strains related to an outbreak, etc.)
rliterman@0	31 Given one or more user-provided reference isolates (--ref_reads; --ref_fasta), get alignment statistics and SNP distances between all reference and query isolates (--reads; --fasta)
rliterman@0	32
rliterman@0	33 #### 2) "SNP Pipeline Mode" (--runmode snp): Used to generate pairwise distances and alignments for a set of query isolates
rliterman@0	34 Generate pairwise SNP distances and alignments for 2+ isolates (--reads; --fasta) based on comparisons to:
rliterman@0	35 - One or more user-provided references (--ref_reads; --ref_fasta), or
rliterman@0	36 - One or more reference isolates selected by RefChooser (--n_ref)
rliterman@0	37
rliterman@0	38 All CSP2 sequence comparisons happen at the assembly level, but if reads are provided CSP2 will perform a genome assembly using SKESA. In either case, CSP2 then calls MUMmer for alignment. If a sufficient portion of the reference genome is aligned (--min_cov), that data is passed through a set of filters that largely mimic those from the CFSAN SNP Pipeline, including the automated removal of:
rliterman@0	39 - Sites from short alignments (--min_len)
rliterman@0	40 - Sites from poorly aligned contigs (--min_iden)
rliterman@0	41 - Sites close to the contig edge (--query_edge/--ref_edge)
rliterman@0	42 - Sites from regions of high SNP density (--dwin/--wsnps)
rliterman@0	43 - Multiply aligned sites
rliterman@0	44 - Non-base sites (e.g., 'N' or '?')
rliterman@0	45 - Heterozygous sites
rliterman@0	46 - Indels (for now)
rliterman@0	47
rliterman@0	48 This final dataset is summarized into a .snpdiffs file, which contains:
rliterman@0	49 1. A one-line header with alignment statistics
rliterman@0	50 2. A BED file of contig mappings that pass QC
rliterman@0	51 3. Information about SNPs (if present)
rliterman@0	52
rliterman@0	53 To avoid unnecessary realignment, once a .snpdiffs file is generated under a particular set of QC parameters (which is hardcoded into the .snpdiffs file as the "QC_String") these files can be used in other CSP2 runs via the --snpdiffs argument (if using the same QC parameters).
rliterman@0	54
rliterman@0	55 ---
rliterman@0	56
rliterman@0	57 ## Software Dependencies
rliterman@0	58 The following software are required to run CSP2. Software version used during CSP2 development noted in parentheses.
rliterman@0	59
rliterman@0	60 - [Nextflow](https://www.nextflow.io/docs/latest/getstarted.html) (22.10.7)
rliterman@0	61 - [Python](https://www.python.org/downloads/) (3.8.1)
rliterman@0	62 - [pybedtools](https://pypi.org/project/pybedtools/)
rliterman@0	63 - [sklearn](https://scikit-learn.org/stable/)
rliterman@0	64 - [MASH](https://github.com/marbl/Mash/releases) (2.3)
rliterman@0	65 - [BEDTools](https://bedtools.readthedocs.io/en/latest/) (2.26.0)
rliterman@0	66 - [MUmmer](https://github.com/mummer4/mummer) (4.0.0)
rliterman@0	67 - [BBTools](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbmap-guide/) (38.94)
rliterman@0	68 - [SKESA](https://github.com/ncbi/SKESA) (2.5.0) [Only required if starting from raw reads]
rliterman@0	69
rliterman@0	70 ---
rliterman@0	71 ## Installing CSP2
rliterman@0	72 CSP2 can be installed by cloning the GitHub repo and configuring the [nextflow.config](nextflow.config) and [profiles.config](conf/profiles.config) to suit your needs
rliterman@0	73
rliterman@0	74 ```
rliterman@0	75 git clone https://github.com/CFSAN-Biostatistics/CSP2.git
rliterman@0	76 ```
rliterman@0	77
rliterman@0	78 ## Tips for configuring CSP2
rliterman@0	79 CSP2 options can be specified on the command line, or through the Nextflow configuration files detailed in the next section. Feel free to skip this section if you're familiar with editing Nextflow configuration files.
rliterman@0	80
rliterman@0	81 There are two main configuration files associated with CSP2:
rliterman@0	82
rliterman@0	83 - The profiles.config file is where you add custom information about your computing environment, but you can also set parameters here as well. An example configuration setup (slurmHPC) is provided as a model.
rliterman@0	84
rliterman@0	85 - In this example profile, access to the required programs relies on the loading of modules. However, there is no need to specify a module for Python, MUMmer, SKESA, bedtools, or MASH if those programs are already in your path.
rliterman@0	86
rliterman@0	87 ```
rliterman@0	88 profiles {
rliterman@0	89 standard {
rliterman@0	90 process.executor = 'local'
rliterman@0	91 params.cores = 1
rliterman@0	92 params.python_module = ""
rliterman@0	93 params.mummer_module = ""
rliterman@0	94 params.skesa_module = ""
rliterman@0	95 params.bedtools_module = ""
rliterman@0	96 params.mash_module = ""
rliterman@0	97 params.bbtools_module = ""
rliterman@0	98 }
rliterman@0	99 slurmHPC {
rliterman@0	100 process.executor = 'slurm'
rliterman@0	101 params.cores = 20
rliterman@0	102 params.python_module = "/nfs/software/modules/python/3.8.1"
rliterman@0	103 params.mummer_module = "/nfs/software/modules/mummer/4.0.0"
rliterman@0	104 params.skesa_module = "/nfs/software/modules/skesa/2.5.0"
rliterman@0	105 params.bedtools_module = "/nfs/sw/Modules/bedtools"
rliterman@0	106 params.bbtools_module = "/nfs/software/modules/bbtools/38.94"
rliterman@0	107 params.mash_module = "/nfs/software/modules/mash/2.3"
rliterman@0	108 params.trim_name = "_contigs_skesa"
rliterman@0	109 }
rliterman@0	110 }
rliterman@0	111 ```
rliterman@0	112 - If you plan to run CSP2 locally, be sure to edit params.cores in the standard profile to match the available cores on your system
rliterman@0	113 - If you add your own profile, be sure to note it on the command line (one hypen)
rliterman@0	114 ```
rliterman@0	115 nextflow run CSP2.nf -profile myNewProfile <args>
rliterman@0	116 ```
rliterman@0	117
rliterman@0	118 - The nextflow.config file is where you can change other aspects of the CSP2 run, including data location, QC parameters, and all the options listed below:
rliterman@0	119
rliterman@0	120 Options with defaults include:
rliterman@0	121
rliterman@0	122 \| Parameter \| Description \| Default Value \|
rliterman@0	123 \|------------------\|------------------------------------------------------------------------------------------------------------\|-------------------------------------------\|
rliterman@0	124 \| --outroot \| Base directory to create output folder \| $CWD \|
rliterman@0	125 \| --out \| Name of the output folder to create (must not exist) \| CSP2_(java.util.Date().getTime()) \|
rliterman@0	126 \| --forward \| Full file extension for forward/left reads of query \| _1.fastq.gz \|
rliterman@0	127 \| --reverse \| Full file extension for reverse/right reads of reference \| _2.fastq.gz \|
rliterman@0	128 \| --ref_forward \| Full file extension for forward/left reads of reference \| _1.fastq.gz \|
rliterman@0	129 \| --ref_reverse \| Full file extension for reverse/right reads of reference \| _2.fastq.gz \|
rliterman@0	130 \| --readext \| Extension for single-end reads for query \| fastq.gz \|
rliterman@0	131 \| --ref_readext \| Extension for single-end reads for reference \| fastq.gz \|
rliterman@0	132 \| --min_cov \| Do not analyze queries that cover less than <min_cov>% of the reference assembly \| 85 \|
rliterman@0	133 \| --min_iden \| Only consider alignments where the percent identity is at least <min_iden>% \| 99 \|
rliterman@0	134 \| --min_len \| Only consider alignments that span at least <min_len>bp \| 500 \|
rliterman@0	135 \| --dwin \| A comma-separated list of windows to check SNP densities \| 1000,125,15 \|
rliterman@0	136 \| --wsnps \| The maximum number of SNPs allowed in the corresponding window from --dwin \| 3,2,1 \|
rliterman@0	137 \| --query_edge \| Only consider SNPs that occur within <query_edge>bp of the end of a query contig \| 150 \|
rliterman@0	138 \| --ref_edge \| Only consider SNPs that occur within <query_edge>bp of the end of a reference contig \| 150 \|
rliterman@0	139 \| --n_ref \| The number of reference isolates to consider (only applied if CSP2 is choosing references) \| 1 \|
rliterman@0	140 \| --rescue \| If running in SNP Pipeline mode, rescue edge-filtered SNPs that are not edge filtered in 1+ query \| Unset (Do not rescue) \|
rliterman@0	141
rliterman@0	142 Options without defaults include:
rliterman@0	143 \| Parameter \| Description \|
rliterman@0	144 \|------------------------\|------------------------------------------------------------------------------------------------------------------------------------\|
rliterman@0	145 \| --reads \| Location of query read data (Path to directory, or path to file with multiple directories) \|
rliterman@0	146 \| --fasta \| Location of query assembly data (Path to directory containing FASTAs, path to FASTA, file with list of multiple FASTA paths) \|
rliterman@0	147 \| --ref_reads \| Location of reference read data (Path to directory, or path to file with multiple directories) \|
rliterman@0	148 \| --ref_fasta \| Location of reference assembly data (Path to directory containing FASTAs, path to FASTA, path to multiple FASTAs) \|
rliterman@0	149 \| --python_module \| Name of Python module if 'module load python' statement is required. \|
rliterman@0	150 \| --mummer_module \| Name of MUmmer module if 'module load mummer' statement is required. \|
rliterman@0	151 \| --skesa_module \| Name of SKESA module if 'module load skesa' statement is required. \|
rliterman@0	152 \| --bedtools_module \| Name of BEDTools module if 'module load bedtools' statement is required. \|
rliterman@0	153 \| --mash_module \| Name of MASH module if 'module load mash' statement is required. \|
rliterman@0	154 \| --trim_name \| A string in assembly file names that you want to remove from sample IDs (e.g., _contigs_skesa) \|
rliterman@0	155
rliterman@0	156 ---
rliterman@0	157
rliterman@0	158 ## Examples
rliterman@0	159
rliterman@0	160 The [CSP2 Test Data repo](https://github.com/CFSAN-Biostatistics/CSP2_TestData) contains small test datasets to ensure things are running as expected. Here are a few examples of how you can use CSP2 in screening mode or in SNP pipeline mode.
rliterman@0	161
rliterman@0	162 ## Screening Mode (Example)
rliterman@0	163 Situation: As part of a long-term microbiology experiment, you perform weekly WGS sequencing on isolates as they evolve under different selective conditions. As results, your DNA sequencing facility returns raw WGS reads and assembled genomes.
rliterman@0	164
rliterman@0	165 During Week 42, analyses start detecting high numbers of mutations, and assembly-based results are not concordant with read-based results. You suspect that either the reads or assembly you were given may be from their lab control strain, but you want to check first.
rliterman@0	166
rliterman@0	167 The data:
rliterman@0	168 - Read data
rliterman@0	169 - Week_42_Reads_1.fq.gz; Week_42_Reads_2.fq.gz
rliterman@0	170 - Assembled data:
rliterman@0	171 - Week_42_Assembly.fa
rliterman@0	172 - Lab_Control.fasta
rliterman@0	173
rliterman@0	174 In this case, we want to use --runmode screen, because we want to explicitly check if either dataset matches the reference strain.
rliterman@0	175
rliterman@0	176 - Note: By default, CSP2 expects read data as zipped fastqs (fastq.gz), with paired-end reads denoted as _1.fastq.gz and _2.fastg.gz. These settings can be changed:
rliterman@0	177 - Permanently in the nextflow.config file
rliterman@0	178 - Situationally in profiles.config
rliterman@0	179 - Directly on the command line, as in the example below:
rliterman@0	180 - Query Reads: --readext; --forward; --reverse
rliterman@0	181 - Reference Reads: --ref_readext; --ref_forward; --ref_reverse
rliterman@0	182
rliterman@0	183 To run this example locally, where Nextflow, SKESA, MUMmer, Python, and BEDTools are installed on your path, run:
rliterman@0	184
rliterman@0	185 ```
rliterman@0	186 nextflow run CSP2.nf --out Test_Output/Contamination_Screen --runmode screen --ref_fasta assets/Screen/Assembly/Lab_Control.fasta --fasta assets/Screen/Assembly/Week_42_Assembly.fasta --reads assets/Screen/Reads --forward _1.fq.gz --reverse _2.fq.gz --readext fq.gz
rliterman@0	187 ```
rliterman@0	188
rliterman@0	189 ```
rliterman@0	190 nextflow run CSP2.nf // Run CSP2
rliterman@0	191 --out Test_Output/Contamination_Screen // Save results to ./Test_Output/Contamination_Screen
rliterman@0	192 --runmode screen // Compare each query to the reference
rliterman@0	193 --ref_fasta assets/Screen/Assembly/Lab_Control.fasta // Compare all queries to this reference
rliterman@0	194 --fasta assets/Screen/Assembly/Week_42_Assembly.fasta // Include this assembly as a query
rliterman@0	195 --reads assets/Screen/Reads // Include any read datasets from this directory as queries
rliterman@0	196 --forward _1.fq.gz // Forward reads don't match the default '_1.fastq.gz'
rliterman@0	197 --reverse _2.fq.gz // Reverse reads don't match the default '_2.fastq.gz'
rliterman@0	198 --readext fq.gz // Reads don't match the default 'fastq.gz'
rliterman@0	199 ```
rliterman@0	200
rliterman@0	201 If you're running on an HPC and you need to load modules, you could include your custom profile:
rliterman@0	202
rliterman@0	203 ```
rliterman@0	204 # Load Nextflow module if necessary
rliterman@0	205 module load nextflow
rliterman@0	206
rliterman@0	207 nextflow run CSP2.nf -profile slurmHPC --out Test_Output/Contamination_Screen --runmode screen --ref_fasta assets/Screen/Assembly/Lab_Control.fasta --fasta assets/Screen/Assembly/Week_42_Assembly.fasta --reads assets/Screen/Reads --forward _1.fq.gz --reverse _2.fq.gz --readext fq.gz
rliterman@0	208 ```
rliterman@0	209 ```
rliterman@0	210 nextflow run CSP2.nf // Run CSP2
rliterman@0	211 -profile slurmHPC // Choose run profile (note single hyphen)
rliterman@0	212 --out Test_Output/Contamination_Screen // Save results to ./Test_Output/Contamination_Screen
rliterman@0	213 --runmode screen // Compare each query to the reference
rliterman@0	214 --ref_fasta assets/Screen/Assembly/Lab_Control.fasta // Compare all queries to this reference
rliterman@0	215 --fasta assets/Screen/Assembly/Week_42_Assembly.fasta // Include this assembly as a query
rliterman@0	216 --reads assets/Screen/Reads // Include any read datasets from this directory as queries
rliterman@0	217 --forward _1.fq.gz // Forward reads don't match the default '_1.fastq.gz'
rliterman@0	218 --reverse _2.fq.gz // Reverse reads don't match the default '_2.fastq.gz'
rliterman@0	219 --readext fq.gz // Reads don't match the default 'fastq.gz'
rliterman@0	220 ```
rliterman@0	221
rliterman@0	222 ### Output
rliterman@0	223
rliterman@0	224 If all went well, you should see something like this:
rliterman@0	225 <p align="center">
rliterman@0	226 <img src="img/Screen_Run.jpg" alt="Nextflow print output for screening run"/>
rliterman@0	227 </p>
rliterman@0	228
rliterman@0	229 From top to bottom, we can see that CSP2:
rliterman@0	230 - Found the paired-end reads
rliterman@0	231 - Assembled them using SKESA
rliterman@0	232 - Saved an assembly log
rliterman@0	233 - Aligned both queries against the reference using MUMmer
rliterman@0	234 - Ran the screening script to generate the output table
rliterman@0	235
rliterman@0	236 Let's take a look to see what was generated:
rliterman@0	237 ```
rliterman@0	238 ls Test_Output/Contamination_Screen
rliterman@0	239
rliterman@0	240 Assemblies/
rliterman@0	241 Isolate_Data.tsv
rliterman@0	242 logs/
rliterman@0	243 MUMmer_Output/
rliterman@0	244 Raw_MUMmer_Summary.tsv
rliterman@0	245 Screening_Results.tsv
rliterman@0	246 snpdiffs/
rliterman@0	247 ```
rliterman@0	248 -Note: This output is available to inspect in [assets/Screen/Output](assets/Screen/Output)
rliterman@0	249
rliterman@0	250 Directories
rliterman@0	251
rliterman@0	252 - Assemblies: Directory where any SKESA assemblies are stored
rliterman@0	253 - logs: Directory where run logs are stored
rliterman@0	254 - MUMmer_Output: Directory where raw MUMmer .snps, .report, .1coords are stored
rliterman@0	255 - snpdiffs: Directory where .snpdiffs files are stored
rliterman@0	256
rliterman@0	257 Files
rliterman@0	258 - Isolate_Data.tsv: A TSV file with basic FASTA stats for each isolate
rliterman@0	259
rliterman@0	260 \| Isolate_ID \| Isolate_Type \| Assembly_Path \| Contig_Count \| Assembly_Bases \| N50 \| N90 \| L50 \| L90 \| SHA256 \|
rliterman@0	261 \|------------------\|--------------\|-----------------------------------------------------------------------------------------------------------------------------------------\|--------------\|----------------\|-------\|-------\|-----\|-----\|------------------------------------------------------------------\|
rliterman@0	262 \| Lab_Control \| Reference \| /flash/storage/scratch/Robert.Literman/NextFlow/CSP2/assets/Screen/Assembly/Lab_Control.fasta \| 254 \| 4584986 \| 41813 \| 13112 \| 38 \| 110 \| aae3a07d055bff2fa66127ca77cae35dd5cce5cc42dafea481787d4010c7dbef \|
rliterman@0	263 \| Week_42_Reads \| Query \| /flash/storage/scratch/Robert.Literman/NextFlow/CSP2/Test_Output/241024_Test_Output/Contamination_Screen/Assemblies/Week_42_Reads.fasta \| 372 \| 4561747 \| 24332 \| 7894 \| 56 \| 171 \| 28eca0ecf14fcc1166ae7236c849acc08ad040cd011fc4331ba124db73601009 \|
rliterman@0	264 \| Week_42_Assembly \| Query \| /flash/storage/scratch/Robert.Literman/NextFlow/CSP2/assets/Screen/Assembly/Week_42_Assembly.fasta \| 747 \| 4473771 \| 12782 \| 3438 \| 105 \| 360 \| 85c216de6e1bb9ccf76e7e5d64931884375d66e765f4c4fe726f3be15eb91563 \|
rliterman@0	265
rliterman@0	266
rliterman@0	267 - Screening_Results_PassQC.tsv: A TSV file with screening results for all queries that made it through QC
rliterman@0	268
rliterman@0	269 \| Query_ID \| Reference_ID \| Screen_Category \| CSP2_Screen_SNPs \| Query_Percent_Aligned \| Reference_Percent_Aligned \| Query_Contigs \| Query_Bases \| Reference_Contigs \| Reference_Bases \| Raw_SNPs \| Purged_Length \| Purged_Identity \| Purged_LengthIdentity \| Purged_Invalid \| Purged_Indel \| Purged_Duplicate \| Purged_Het \| Purged_Density \| Filtered_Query_Edge \| Filtered_Ref_Edge \| Filtered_Both_Edge \| Kmer_Similarity \| Shared_Kmers \| Query_Unique_Kmers \| Reference_Unique_Kmers \| MUMmer_gSNPs \| MUMmer_gIndels \|
rliterman@0	270 \|------------------\|--------------\|-----------------\|------------------\|-----------------------\|---------------------------\|---------------\|-------------\|-------------------\|-----------------\|----------\|---------------\|-----------------\|-----------------------\|----------------\|--------------\|------------------\|------------\|----------------\|---------------------\|-------------------\|--------------------\|-----------------\|--------------\|--------------------\|------------------------\|--------------\|----------------\|
rliterman@0	271 \| Week_42_Assembly \| Lab_Control \| Pass \| 1 \| 99.1 \| 96.7 \| 747 \| 4473771 \| 254 \| 4584986 \| 1 \| 0 \| 0 \| 0 \| 0 \| 0 \| 0 \| 0 \| 0 \| 0 \| 0 \| 0 \| 97.32 \| 4449973 \| 746 \| 121740 \| 1 \| 0 \|
rliterman@0	272 \| Week_42_Reads \| Lab_Control \| Pass \| 50 \| 99.15 \| 98.65 \| 372 \| 4561747 \| 254 \| 4584986 \| 61 \| 0 \| 0 \| 12 \| 0 \| 2 \| 0 \| 0 \| 0 \| 1 \| 1 \| 0 \| 98.51 \| 4526310 \| 23072 \| 45403 \| 58 \| 1 \|
rliterman@0	273
rliterman@0	274 - Columns
rliterman@0	275
rliterman@0	276 \| Column \| Contents \|
rliterman@0	277 \|---------------------------\|----------------------------------------------------------\|
rliterman@0	278 \| Query_ID \| Name of query isolate \|
rliterman@0	279 \| Reference_ID \| Name of reference isolate \|
rliterman@0	280 \| Screen_Category \| Pass/Fail \|
rliterman@0	281 \| CSP2_Screen_SNPs \| SNP distance between query and reference \|
rliterman@0	282 \| Query_Percent_Aligned \| Percent of query genome aligned to reference \|
rliterman@0	283 \| Reference_Percent_Aligned \| Percent of reference genome aligned to query \|
rliterman@0	284 \| Query_Contigs \| Contigs in query assembly \|
rliterman@0	285 \| Query_Bases \| Bases in query assembly \|
rliterman@0	286 \| Reference_Contigs \| Contigs in reference assembly \|
rliterman@0	287 \| Reference_Bases \| Bases in reference assembly \|
rliterman@0	288 \| Raw_SNPs \| Raw SNPs prior to filtering \|
rliterman@0	289 \| Purged_Length \| SNPs purged due to contigs less than min_len \|
rliterman@0	290 \| Purged_Identity \| SNPs purged due to percent identity less than min_iden \|
rliterman@0	291 \| Purged_LengthIdentity \| SNPs purged to short + low quality alignments \|
rliterman@0	292 \| Purged_Invalid \| SNPs purged due to non-ACTG \|
rliterman@0	293 \| Purged_Indel \| SNPs purged as indels \|
rliterman@0	294 \| Purged_Duplicate \| SNPs purged as duplicate alignments \|
rliterman@0	295 \| Purged_Het \| SNPs purged with heterozygous signal \|
rliterman@0	296 \| Purged_Density \| SNPs purged for having more than wsnps in dwin \|
rliterman@0	297 \| Filtered_Query_Edge \| SNPs purged for being too close to query contig edge \|
rliterman@0	298 \| Filtered_Ref_Edge \| SNPs purged for being too close to reference contig edge \|
rliterman@0	299 \| Filtered_Both_Edge \| SNPs purged for being too close to both contig edges \|
rliterman@0	300 \| Kmer_Similarity \| Kmer similarity between query and reference \|
rliterman@0	301 \| Shared_Kmers \| Number of shared kmers between query and reference \|
rliterman@0	302 \| Query_Unique_Kmers \| Unique kmer count for query \|
rliterman@0	303 \| Reference_Unique_Kmers \| Unique kmer count for reference \|
rliterman@0	304 \| MUMmer_gSNPs \| Number of MUMmer gSNPs \|
rliterman@0	305 \| MUMmer_gIndels \| Number of MUMmer gIndels \|
rliterman@0	306
rliterman@0	307 - SNPDiffs Files
rliterman@0	308 - The .snpdiffs files generated by CSP2 have three main components:
rliterman@0	309
rliterman@0	310 1. Header: The header row of a .snpdiffs file contains all the same information as the screening results TSV.
rliterman@0	311 - To peek at a .snpdiffs header, try:
rliterman@0	312 ```
rliterman@0	313 head -1 (QUERY)__vs__(REF).snpdiffs \| tr "\t" "\n"
rliterman@0	314 ```
rliterman@0	315
rliterman@0	316 2. BED File: Assuming sufficient overlap in assemblies, the next section will be a BED file of all the 1-to-1 overlaps between the query and reference. This section is denoted by '##\t'
rliterman@0	317 ```
rliterman@0	318 grep "##" Week_42_Reads__vs__Lab_Control.snpdiffs \| head -10
rliterman@0	319 ```
rliterman@0	320
rliterman@0	321
rliterman@0	322 \| \| Reference_Contig \| Reference_Align_Start \| Reference_Align_End \| Reference_Contig_Length \| Reference_Contig_Aligned \| Query_Contig \| Query_Align_Start \| Query_Align_End \| Query_Contig_Length \| Query_Contig_Aligned \| Percent_Identity \|
rliterman@0	323 \|----\|-------------------------\|-----------------------\|---------------------\|-------------------------\|--------------------------\|--------------------\|-------------------\|-----------------\|---------------------\|----------------------\|------------------\|
rliterman@0	324 \| ## \| SRR16119110_100_32.8137 \| 0 \| 6519 \| 36207 \| 6519 \| Contig_277_13.5424 \| 4804 \| 11323 \| 11323 \| 6519 \| 100 \|
rliterman@0	325 \| ## \| SRR16119110_100_32.8137 \| 6586 \| 7524 \| 36207 \| 938 \| Contig_73_15.3932 \| 0 \| 938 \| 938 \| 938 \| 100 \|
rliterman@0	326 \| ## \| SRR16119110_100_32.8137 \| 7828 \| 36104 \| 36207 \| 28276 \| Contig_8_13.8951 \| 0 \| 28276 \| 28276 \| 28276 \| 100 \|
rliterman@0	327 \| ## \| SRR16119110_102_54.229 \| 225 \| 12863 \| 12876 \| 12638 \| Contig_232_18.7715 \| 0 \| 12638 \| 12638 \| 12638 \| 100 \|
rliterman@0	328 \| ## \| SRR16119110_103_30.5388 \| 143 \| 6585 \| 6585 \| 6442 \| Contig_355_13.8058 \| 6300 \| 12742 \| 12742 \| 6442 \| 100 \|
rliterman@0	329 \| ## \| SRR16119110_104_40.3209 \| 0 \| 12590 \| 19820 \| 12590 \| Contig_153_15.5804 \| 8748 \| 21338 \| 21338 \| 12590 \| 100 \|
rliterman@0	330 \| ## \| SRR16119110_104_40.3209 \| 12664 \| 19615 \| 19820 \| 6951 \| Contig_35_15.127 \| 0 \| 6951 \| 6951 \| 6951 \| 100 \|
rliterman@0	331 \| ## \| SRR16119110_105_32.917 \| 33 \| 11125 \| 11125 \| 11092 \| Contig_57_14.4746 \| 37635 \| 48727 \| 48727 \| 11092 \| 100 \|
rliterman@0	332 \| ## \| SRR16119110_106_22.2235 \| 0 \| 2959 \| 2959 \| 2959 \| Contig_166_11.9238 \| 0 \| 2959 \| 3512 \| 2959 \| 100 \|
rliterman@0	333 \| ## \| SRR16119110_107_44.0162 \| 139 \| 12900 \| 204844 \| 12761 \| Contig_174_15.2027 \| 0 \| 12761 \| 12761 \| 12761 \| 100 \|
rliterman@0	334
rliterman@0	335 3. SNP Data: Finally, the third section of a .snpdiffs file contains all the data on SNPs that passed and failed QC. Only SNPs with the Category "SNP" are counted in the final distance.
rliterman@0	336 ```
rliterman@0	337 grep -v "#" Week_42_Reads__vs__Lab_Control.snpdiffs \| head -10
rliterman@0	338 ```
rliterman@0	339
rliterman@0	340 \| Reference_Contig \| Reference_Loc_Start \| Reference_Loc_End \| Query_Contig \| Query_Loc_Start \| Query_Loc_End \| Ref_Loc_ID \| Query_Loc_ID \| Reference_Alignment_Start \| Reference_Alignment_End \| Query_Alignment_Start \| Query_Alignment_End \| Reference_Base \| Query_Base \| Distance_to_Reference_Contig_Edge \| Distance_to_Query_Contig_Edge \| Reference_Aligned \| Query_Aligned \| Direction \| Percent_Identity \| Category \|
rliterman@0	341 \|-------------------------\|---------------------\|-------------------\|--------------------\|-----------------\|---------------\|--------------------------------\|--------------------------\|---------------------------\|-------------------------\|-----------------------\|---------------------\|----------------\|------------\|-----------------------------------\|-------------------------------\|-------------------\|---------------\|-----------\|------------------\|----------\|
rliterman@0	342 \| SRR16119110_107_44.0162 \| 79353 \| 79354 \| Contig_138_15.1806 \| 38294 \| 38295 \| SRR16119110_107_44.0162/79354 \| Contig_138_15.1806/38295 \| 41059 \| 116706 \| 0 \| 75647 \| A \| T \| 79354 \| 37352 \| 75647 \| 75647 \| 1 \| 99.99 \| SNP \|
rliterman@0	343 \| SRR16119110_107_44.0162 \| 96797 \| 96798 \| Contig_138_15.1806 \| 55738 \| 55739 \| SRR16119110_107_44.0162/96798 \| Contig_138_15.1806/55739 \| 41059 \| 116706 \| 0 \| 75647 \| C \| T \| 96798 \| 19908 \| 75647 \| 75647 \| 1 \| 99.99 \| SNP \|
rliterman@0	344 \| SRR16119110_107_44.0162 \| 105821 \| 105822 \| Contig_138_15.1806 \| 64762 \| 64763 \| SRR16119110_107_44.0162/105822 \| Contig_138_15.1806/64763 \| 41059 \| 116706 \| 0 \| 75647 \| G \| A \| 99022 \| 10884 \| 75647 \| 75647 \| 1 \| 99.99 \| SNP \|
rliterman@0	345 \| SRR16119110_107_44.0162 \| 170608 \| 170609 \| Contig_93_15.202 \| 49811 \| 49812 \| SRR16119110_107_44.0162/170609 \| Contig_93_15.202/49812 \| 120797 \| 204844 \| 0 \| 84045 \| G \| A \| 34235 \| 47655 \| 84047 \| 84045 \| 1 \| 99.99 \| SNP \|
rliterman@0	346 \| SRR16119110_107_44.0162 \| 183055 \| 183056 \| Contig_93_15.202 \| 62258 \| 62259 \| SRR16119110_107_44.0162/183056 \| Contig_93_15.202/62259 \| 120797 \| 204844 \| 0 \| 84045 \| C \| T \| 21788 \| 35208 \| 84047 \| 84045 \| 1 \| 99.99 \| SNP \|
rliterman@0	347 \| SRR16119110_135_26.5159 \| 2370 \| 2371 \| Contig_215_11.8373 \| 2370 \| 2371 \| SRR16119110_135_26.5159/2371 \| Contig_215_11.8373/2371 \| 0 \| 2511 \| 0 \| 2511 \| G \| A \| 140 \| 789 \| 2511 \| 2511 \| 1 \| 99.96 \| SNP \|
rliterman@0	348 \| SRR16119110_150_30.823 \| 36612 \| 36613 \| Contig_168_14.004 \| 36883 \| 36884 \| SRR16119110_150_30.823/36613 \| Contig_168_14.004/36884 \| 0 \| 49916 \| 271 \| 50187 \| C \| T \| 13485 \| 13303 \| 49916 \| 49916 \| 1 \| 99.99 \| SNP \|
rliterman@0	349 \| SRR16119110_153_30.6937 \| 3685 \| 3686 \| Contig_233_12.8229 \| 3685 \| 3686 \| SRR16119110_153_30.6937/3686 \| Contig_233_12.8229/3686 \| 0 \| 7254 \| 0 \| 7254 \| G \| A \| 3568 \| 3594 \| 7254 \| 7254 \| 1 \| 99.99 \| SNP \|
rliterman@0	350 \| SRR16119110_156_32.1591 \| 12089 \| 12090 \| Contig_281_12.9967 \| 13905 \| 13906 \| SRR16119110_156_32.1591/12090 \| Contig_281_12.9967/13906 \| 0 \| 24120 \| 1816 \| 25936 \| A \| T \| 12030 \| 13591 \| 24120 \| 24120 \| 1 \| 99.99 \| SNP \|
rliterman@0	351 \| SRR16119110_156_32.1591 \| 19366 \| 19367 \| Contig_281_12.9967 \| 21182 \| 21183 \| SRR16119110_156_32.1591/19367 \| Contig_281_12.9967/21183 \| 0 \| 24120 \| 1816 \| 25936 \| A \| G \| 4753 \| 6314 \| 24120 \| 24120 \| 1 \| 99.99 \| SNP \|
rliterman@0	352
rliterman@0	353 Conclusions
rliterman@0	354
rliterman@0	355 Based on these results, the assembly provided by the DNA sequencing facility had only 1 SNP relative to the lab control strain, but the read data was over 40 SNPs away, suggesting that the wrong assembly was packaged with the read data.
rliterman@0	356
rliterman@0	357 ## SNP Pipeline Mode (Example)
rliterman@0	358 Situation: You dug 10 soil samples and isolated a single microbe from each. You want to check the relatedness of the cultured isolates.
rliterman@0	359
rliterman@0	360 The data:
rliterman@0	361 - Assemblies from soil isolates (Sample_(A-O).fasta)
rliterman@0	362
rliterman@0	363 In this case, we want to use --runmode snp, because we want to calculate the pairwise distances between all isolates and generate alignments. We will let RefChooser choose the best reference genome.
rliterman@0	364
rliterman@0	365 ```
rliterman@0	366 nextflow run CSP2.nf --out Test_Output/Soil_Analysis --runmode snp --fasta assets/SNP/
rliterman@0	367 ```
rliterman@0	368 ```
rliterman@0	369 nextflow run CSP2.nf // Run CSP2
rliterman@0	370 --out Test_Output/Soil_Analysis // Save results to ./Test_Output/Soil_Analysis
rliterman@0	371 --runmode snp // Compare all queries to each other
rliterman@0	372 --fasta assets/SNP // Gather query assemblies from this directory (you can also point to a text file containing multiple FASTA paths)
rliterman@0	373 ```
rliterman@0	374
rliterman@0	375 ### Output
rliterman@0	376
rliterman@0	377 If all went well, you should see something like this:
rliterman@0	378 <p align="center">
rliterman@0	379 <img src="img/SNP.jpg" alt="Nextflow print output for SNP Pipeline run"/>
rliterman@0	380 </p>
rliterman@0	381
rliterman@0	382 From top to bottom, we can see that CSP2:
rliterman@0	383 - Skipped assembly steps, as queries were already assembled
rliterman@0	384 - Ran MASH to choose a suitable reference genome
rliterman@0	385 - Aligned each query to the reference using MUMmer
rliterman@0	386 - Ran the SNP pipeline workflow on the .snpdiffs files to generate alignments and SNP distance data
rliterman@0	387
rliterman@0	388 Let's take a look to see what was generated:
rliterman@0	389 ```
rliterman@0	390 ls Test_Output/Soil_Analysis
rliterman@0	391
rliterman@0	392 Isolate_Data.tsv
rliterman@0	393 logs/
rliterman@0	394 MUMmer_Output/
rliterman@0	395 Raw_MUMmer_Summary.tsv
rliterman@0	396 SNP_Analysis/
rliterman@0	397 snpdiffs/
rliterman@0	398 ```
rliterman@0	399 -Note: This output is available to inspect in [assets/SNP/Output](assets/SNP/Output)
rliterman@0	400
rliterman@0	401 Directories
rliterman@0	402 - logs: Directory where run logs are stored
rliterman@0	403 - MUMmer_Output: Directory where raw MUMmer .snps, .report, .1coords are stored
rliterman@0	404 - snpdiffs: Directory where .snpdiffs files are stored
rliterman@0	405 - SNP_Analysis: Directory where SNP pipeline results are stored
rliterman@0	406
rliterman@0	407 Files
rliterman@0	408
rliterman@0	409 - The log file for any SNP pipeline run contains lots of useful information, including runtimes and output locations.
rliterman@0	410
rliterman@0	411 ```
rliterman@0	412 cat logs/SNP_Sample_A.log
rliterman@0	413
rliterman@0	414 CSP2 SNP Pipeline Analysis
rliterman@0	415 Reference Isolate: Sample_A
rliterman@0	416 2024-10-17 11:37:44
rliterman@0	417 -------------------------------------------------------
rliterman@0	418
rliterman@0	419 Reading in SNPDiffs files...Done!
rliterman@0	420 - Read in 14 SNPdiffs files
rliterman@0	421 -------------------------------------------------------
rliterman@0	422
rliterman@0	423 - Not performing SNP edge rescuing (any SNPs found within 150bp of a query contig edge will be purged)...
rliterman@0	424 -------------------------------------------------------
rliterman@0	425
rliterman@0	426 Screening all queries against reference...Done!
rliterman@0	427 - Reference screening data saved to /flash/storage/scratch/Robert.Literman/NextFlow/Test_CSP2/241024_Test_Output/Soil_Analysis/SNP_Analysis/Sample_A/Reference_Screening.tsv
rliterman@0	428 - Of 14 comparisons, 14 covered at least 85.0% of the reference genome after removing poor alignments
rliterman@0	429 -------------------------------------------------------
rliterman@0	430
rliterman@0	431 Compiling SNPs across 14 samples...
rliterman@0	432 - 220 total SNPs detected across all samples...
rliterman@0	433 - 167 unique SNPs passed QC filtering in at least one sample...
rliterman@0	434 - 1 unique SNPs were within 150bp of a reference contig end and were not considered in any query...
rliterman@0	435 - Skipping edge resucing...
rliterman@0	436 - 53 unique SNPs were purged in all queries they were found in, and were not considered in the final dataset...
rliterman@0	437 - Processed coverage information...
rliterman@0	438 - SNP coverage information: /flash/storage/scratch/Robert.Literman/NextFlow/Test_CSP2/241024_Test_Output/Soil_Analysis/SNP_Analysis/Sample_A/Locus_Categories.tsv
rliterman@0	439 - Query coverage information: /flash/storage/scratch/Robert.Literman/NextFlow/Test_CSP2/241024_Test_Output/Soil_Analysis/SNP_Analysis/Sample_A/Query_Coverage.tsv
rliterman@0	440 -------------------------------------------------------
rliterman@0	441
rliterman@0	442 Processing alignment data...Done!
rliterman@0	443 - Saved alignment of 167 SNPs to /flash/storage/scratch/Robert.Literman/NextFlow/Test_CSP2/241024_Test_Output/Soil_Analysis/SNP_Analysis/Sample_A/snpma.fasta
rliterman@0	444 - Saved ordered loc list to /flash/storage/scratch/Robert.Literman/NextFlow/Test_CSP2/241024_Test_Output/Soil_Analysis/SNP_Analysis/Sample_A/snplist.txt
rliterman@0	445 - Preserving SNPs with at most 50.0% missing data...
rliterman@0	446 - Of 167 SNPs, 167 SNPs pass the 50.0% missing data threshold...
rliterman@0	447 - Saved preserved alignment to /flash/storage/scratch/Robert.Literman/NextFlow/Test_CSP2/241024_Test_Output/Soil_Analysis/SNP_Analysis/Sample_A/snpma_preserved.fasta
rliterman@0	448 - Saved preserved ordered loc list to /flash/storage/scratch/Robert.Literman/NextFlow/Test_CSP2/241024_Test_Output/Soil_Analysis/SNP_Analysis/Sample_A/snplist_preserved.txt
rliterman@0	449 -------------------------------------------------------
rliterman@0	450
rliterman@0	451 Processing pairwise comparisons files...Done!
rliterman@0	452 - Saved raw pairwise distances to /flash/storage/scratch/Robert.Literman/NextFlow/Test_CSP2/241024_Test_Output/Soil_Analysis/SNP_Analysis/Sample_A/snp_distance_pairwise.tsv
rliterman@0	453 - Saved raw pairwise matrix to /flash/storage/scratch/Robert.Literman/NextFlow/Test_CSP2/241024_Test_Output/Soil_Analysis/SNP_Analysis/Sample_A/snp_distance_matrix.tsv
rliterman@0	454 - Saved preserved pairwise distances to /flash/storage/scratch/Robert.Literman/NextFlow/Test_CSP2/241024_Test_Output/Soil_Analysis/SNP_Analysis/Sample_A/snp_distance_pairwise_preserved.tsv
rliterman@0	455 - Saved preserved pairwise matrix to /flash/storage/scratch/Robert.Literman/NextFlow/Test_CSP2/241024_Test_Output/Soil_Analysis/SNP_Analysis/Sample_A/snp_distance_matrix_preserved.tsv
rliterman@0	456 Total Time: 1.45 seconds
rliterman@0	457 -------------------------------------------------------
rliterman@0	458 ```
rliterman@0	459
rliterman@0	460 - For each reference sample there will be a folder called SNP_Analysis/REFERENCE_ID, containing:
rliterman@0	461 - CSP2_SNP_Pipeline.log: Log file for SNP Pipeline run
rliterman@0	462 - Locus_Categories.tsv: A TSV file containing information about each SNP and how many isolates had missing or purged data
rliterman@0	463 - Query_Coverage.tsv: A TSV file containing information about each query and how many sites had missing or purged data
rliterman@0	464 - Reference_Screening.tsv: Output from CSP2 Screening mode
rliterman@0	465 - snpdiffs.txt: File containing a list of all snpdiffs files used in the analysis
rliterman@0	466 - snp_distance_matrix.tsv: SNP distances between isolates in matrix format
rliterman@0	467 - snp_distance_pairwise.tsv: SNP distances between isolates in column format
rliterman@0	468 - snplist.txt: List of reference loc IDs (Reference_Contig/End_Position)
rliterman@0	469 - snpma.fasta: Alignment of SNP data
rliterman@0	470
rliterman@0	471 - If using the --max_missing argument, CSP2 will also output files where SNPs with too many missing data are removed:
rliterman@0	472
rliterman@0	473 - snp_distance_matrix_preserved.tsv
rliterman@0	474 - snp_distance_pairwise_preserved.tsv
rliterman@0	475 - snplist_preserved.txt
rliterman@0	476 - snpma_preserved.fasta
rliterman@0	477
rliterman@0	478 To see if the reference genome has an impact on SNP distance estimation, you can test one or more via --ref_fasta / --ref_reads or have CSP2 choose --n_ref isolates
rliterman@0	479 ```
rliterman@0	480 nextflow run CSP2.nf --out Test_Output/Soil_Analysis --runmode snp --fasta assets/SNP/ --n_ref 3
rliterman@0	481 ```
rliterman@0	482 ```
rliterman@0	483 nextflow run CSP2.nf // Run CSP2
rliterman@0	484 --out Test_Output/Soil_Analysis // Save results to ./Test_Output/Soil_Analysis
rliterman@0	485 --runmode snp // Compare all queries to each other
rliterman@0	486 --fasta assets/SNP // Gather query assemblies from this directory
rliterman@0	487 --n_ref 3 // Choose the top 3 references and run 3 sepearate analyses
rliterman@0	488 ```

Mercurial > repos > rliterman > csp2

annotate CSP2/README.md @ 53:a21f63856acf