diff CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/mummer-3.23/docs/dnadiff.README @ 69:33d812a61356
planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author: jpayne
date: Tue, 18 Mar 2025 17:55:14 -0400
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/mummer-3.23/docs/dnadiff.README Tue Mar 18 17:55:14 2025 -0400
@@ -0,0 +1,224 @@
+--------------------------------------------------------------------------------
+dnadiff is a wrapper for nucmer and analysis utilities that provides
+detailed information on the differences between two genomes, and also
+provides a high level report file that quantifies the differences
+between the two inputs.
+
+Use Cases:
+
+   diff'ing two strains of the same species
+
+   diff'ing two assemblies of the same organism
+
+   diff'ing a draft assembly and a closely related finished genome
+
+If any of this code is used in any publication, please cite the following:
+
+   Versatile and open software for comparing large genomes.
+   S. Kurtz, A. Phillippy, A.L. Delcher,
+   M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg.
+   Genome Biology (2004), 5:R12.
+
+--------------------------------------------------------------------------------
+
+This manual is also available as HTML documentation included in this
+distribution, or at:
+
+   http://mummer.sourceforge.net
+   http://mummer.sourceforge.net/manual
+   http://mummer.sourceforge.net/examples
+
+
+-- DESCRIPTION --
+  dnadiff is a wrapper around nucmer that builds an alignment using
+default parameters, and runs many of nucmer's helper scripts to
+process the output and report alignment statistics, SNPs, breakpoints,
+etc. It is designed for evaluating the sequence and structural
+similarity of two highly similar sequence sets. E.g. comparing two
+different assemblies of the same organism, or comparing two strains of
+the same species.
+
+
+-- dnadiff EXAMPLE --
+To compare two strains of the same species, type:
+
+"dnadiff genome1.fna genome2.fna"
+
+Output will be...
+   out.report  - Summary of alignments, differences and SNPs
+   out.delta   - Standard nucmer alignment output
+   out.1delta  - 1-to-1 alignment from delta-filter -1
+   out.mdelta  - M-to-M alignment from delta-filter -m
+   out.1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
+   out.mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
+   out.snps    - SNPs from show-snps -rlTHC .1delta
+   out.rdiff   - Classified ref breakpoints from show-diff -rH .mdelta
+   out.qdiff   - Classified qry breakpoints from show-diff -qH .mdelta
+   out.unref   - Unaligned reference sequence IDs and lengths
+   out.unqry   - Unaligned query sequence IDs and lengths
+
+For more information on the formats and meanings of all the files
+produced, please see the documentation for the corresponding
+utility. This document serves to describe running the dnadiff script
+and interpreting the produced .report file.
+
+
+-- RUNNING 'dnadiff' --
+
+  USAGE: dnadiff [options] <Reference> <Query>
+    or   dnadiff [options] -d <Delta File>
+
+  DESCRIPTION:
+    Run comparative analysis of two sequence sets using nucmer and its
+    associated utilities with recommended parameters. See MUMmer
+    documentation for a more detailed description of the
+    output.
+    Produces the following output files:
+
+    .delta   - Standard nucmer alignment output
+    .1delta  - 1-to-1 alignment from delta-filter -1
+    .mdelta  - M-to-M alignment from delta-filter -m
+    .1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
+    .mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
+    .snps    - SNPs from show-snps -rlTHC .1delta
+    .rdiff   - Classified alignment breakpoints from show-diff -rH .mdelta
+    .qdiff   - Classified alignment breakpoints from show-diff -qH .mdelta
+    .report  - Summary of alignments, differences and SNPs
+    .unref   - Unaligned reference sequence IDs and lengths
+    .unqry   - Unaligned query sequence IDs and lengths
+
+  MANDATORY:
+    Reference       Set the input reference multi-FASTA filename
+    Query           Set the input query multi-FASTA filename
+      or
+    Delta File      Unfiltered .delta alignment file from nucmer
+
+  OPTIONS:
+    -d|delta        Provide precomputed delta file for analysis
+    -h
+    --help          Display help information and exit
+    -p|prefix       Set the prefix of the output files (default "out")
+    -V
+    --version       Display the version information and exit
+
+
+-- NOTES --
+  The -p option is recommended to avoid overwriting previous
+output. A simple naming convention, for files A.fna and B.fna, is to
+set "-p A_B". It is safest to let dnadiff run nucmer automatically, so
+avoid using the -d option unless the delta file was already generated
+with "nucmer --maxmatch" and has not been filtered.
+
+
+-- OUTPUT FILES --
+  dnadiff produces many outputs; however, all but one are produced by
+other utilities in the MUMmer package. Please see their corresponding
+documentation for more information. This section describes only
+the .report file generated by dnadiff and gives tips on interpreting it.
+
+
+  *** .report OUTPUT ***
+
+  Report statistics are broken into two columns - reference and
+query. Rows are grouped by themed alignment metrics and are described
+here.
+Summary counts are estimates and do not represent the exact
+number of occurrences of a particular evolutionary event. When reading
+a reference column, think number of XYZ in reference with regard to
+the query. When reading a query column, think number of XYZ in query
+with regard to the reference.
+
+[Sequences]     - Sequence-centric stats.
+TotalSeqs       - Total number of input sequences.
+AlignedSeqs     - Number of input sequences with at least one alignment.
+UnalignedSeqs   - Number of input sequences with no alignment.
+
+[Bases]         - Base-pair-centric stats.
+TotalBases      - Total number of bases in the input sequences.
+AlignedBases    - Total number of bases contained within an alignment.
+UnalignedBases  - Total number of unaligned bases. This is a rough
+                  measure for the amount of "unique" sequence in the
+                  reference and query.
+
+[Alignments]    - Alignment-centric stats.
+1-to-1          - Number of alignment blocks comprising the 1-to-1
+                  mapping of reference to query. This is a subset of
+                  the M-to-M mapping, with repeats removed.
+TotalLength     - Total length of 1-to-1 alignment blocks.
+AvgLength       - Average length of 1-to-1 alignment blocks.
+AvgIdentity     - Average identity of 1-to-1 alignment blocks.
+
+M-to-M          - Number of alignment blocks comprising the
+                  many-to-many mapping of reference to query. The
+                  M-to-M mapping represents the smallest set of
+                  alignments that maximize the coverage of both
+                  reference and query. This is a superset of the 1-to-1
+                  mapping.
+TotalLength     - Total length of M-to-M alignment blocks.
+AvgLength       - Average length of M-to-M alignment blocks.
+AvgIdentity     - Average identity of M-to-M alignment blocks.
+
+[Features]      - Structural alignment features, such as
+                  rearrangements. These counts are rough estimates
+                  based on an automated analysis of the
+                  alignments.
+                  Features are identified by scanning the
+                  reference (or query) from low to high, and noting the
+                  positions where the query alignments are
+                  inconsistently ordered or oriented with respect to
+                  the reference.
+Breakpoints     - Number of non-maximal alignment endpoints,
+                  i.e. endpoints that do not occur at the beginning or
+                  end of a sequence.
+Relocations     - Number of breaks in the alignment where adjacent
+                  1-to-1 alignment blocks are in the same sequence, but
+                  not consistently ordered. A separate feature is
+                  recorded for each end of a relocation, so this is
+                  really a count of relocation endpoints.
+Translocations  - Number of breaks in the alignment where adjacent
+                  1-to-1 alignment blocks are in different sequences. A
+                  separate feature is recorded for each end of a
+                  translocation, so this is really a count of
+                  translocation endpoints.
+Inversions      - Number of breaks in the alignment where adjacent
+                  1-to-1 alignment blocks are inverted with respect to
+                  one another. A separate feature is recorded for each
+                  end of an inversion, so this is really a count of
+                  inversion endpoints.
+
+Insertions      - Rough count of insertion events. Note that this is
+                  slightly different from "UnalignedBases" because it
+                  counts duplications as insertions, whereas
+                  UnalignedBases does not. Also, this count does not
+                  include sequences that have no alignments as
+                  insertions, whereas UnalignedBases does. Note that
+                  insertions in R can be viewed as deletions from Q.
+                  This number reports only "major" insertions, defined
+                  as insertions large enough to break an alignment.
+                  Nucmer will align through smaller insertions of less
+                  than ~60 bases. These smaller insertions are
+                  reported in the "Indels" count below.
+InsertionSum    - Rough sum of inserted sequence.
+InsertionAvg    - Average length of insertion.
+
+TandemIns       - Rough count of tandem duplication insertion
+                  events. Note that expansions in R can be viewed as
+                  collapses in Q.
+TandemInsSum    - Rough sum of tandem duplication insertions.
+TandemInsAvg    - Average length of tandem duplications.
+
+[SNPs]          - Single Nucleotide Polymorphism counts.
+TotalSNPs       - Total number of SNPs, same for both sequences.
+XY              - X-to-Y SNP. For reference column, this means
+                  reference 'X' to query 'Y'. For query column, this
+                  means query 'X' to reference 'Y'. The same
+                  convention applies below.
+
+TotalGSNPs      - Single Nucleotide Polymorphisms bounded by 20 exact
+                  base-pair matches on both sides.
+
+TotalIndels     - Single Nucleotide Insertions/Deletions.
+X.              - X insertion. For reference column, 'X.' means
+                  insertion of 'X' in the reference. For query column,
+                  'X.' means insertion of 'X' in the query. Nucmer will
+                  align through group insertions of up to ~60 bases.
+                  Each base of these group insertions will be reported
+                  in this count. Large insertions will be reported in
+                  the "Insertions" count above.
+
+TotalGIndels    - Single Nucleotide Insertions/Deletions bounded by 20
+                  exact base-pair matches on both sides.
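
-- EXAMPLE: READING A .report FILE (illustrative sketch) --
  The README describes the .report file as rows grouped under bracketed
headings such as [Sequences] and [Bases], with one reference value and
one query value per row. The Python sketch below shows one way such text
might be read. It is not part of MUMmer. It assumes each data row is a
label followed by two whitespace-separated values, and that a value may
carry a trailing annotation after its leading integer. The function
names and the SAMPLE text are hypothetical, and the numbers in SAMPLE
are made up for illustration. Rows are kept as an ordered list because
labels such as TotalLength occur once for the 1-to-1 block and once for
the M-to-M block.

```python
import re

# Made-up sample in the assumed layout; not real dnadiff output.
SAMPLE = """\
                   [REF]      [QRY]
[Sequences]
TotalSeqs              2          3
AlignedSeqs            2          2
UnalignedSeqs          0          1

[Bases]
TotalBases          1000       1200
AlignedBases         900        960
UnalignedBases       100        240
"""


def parse_report(text):
    """Return {group: [(label, ref_value, qry_value), ...]} from report text."""
    sections = {}
    rows = None
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # A lone bracketed token such as "[Bases]" starts a new group.
        if len(tokens) == 1 and re.fullmatch(r"\[\w+\]", tokens[0]):
            rows = sections.setdefault(tokens[0][1:-1], [])
        # Assumed data row: label, reference value, query value.
        elif rows is not None and len(tokens) == 3:
            rows.append((tokens[0], tokens[1], tokens[2]))
    return sections


def leading_int(value):
    """Extract the leading integer of a value, ignoring anything after it."""
    match = re.match(r"\d+", value)
    if match is None:
        raise ValueError(f"no leading integer in {value!r}")
    return int(match.group())


def unaligned_fraction(sections):
    """Return (reference, query) UnalignedBases / TotalBases fractions."""
    rows = {label: (ref, qry) for label, ref, qry in sections["Bases"]}
    fractions = []
    for column in (0, 1):
        total = leading_int(rows["TotalBases"][column])
        unaligned = leading_int(rows["UnalignedBases"][column])
        fractions.append(unaligned / total)
    return tuple(fractions)


if __name__ == "__main__":
    parsed = parse_report(SAMPLE)
    print(unaligned_fraction(parsed))
```

  To try this on a real run, replace SAMPLE with the contents of the
.report file (out.report by default) and check that the rows match the
assumed layout before relying on the numbers.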