--------------------------------------------------------------------------------
    dnadiff is a wrapper for nucmer and analysis utilities that provides
    detailed information on the differences between two genomes, and also
    provides a high level report file that quantifies the differences
    between the two inputs.

    Use Cases:
        + diff'ing two strains of the same species
        + diff'ing two assemblies of the same organism
        + diff'ing a draft assembly and a closely related finished genome

    If any of this code is used in any publication, please cite the following:

    Versatile and open software for comparing large genomes.
    S. Kurtz, A. Phillippy, A.L. Delcher,
    M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg.
    Genome Biology (2004), 5:R12.

--------------------------------------------------------------------------------

This manual is also available as HTML documentation included in this
distribution, or at:

    http://mummer.sourceforge.net
    http://mummer.sourceforge.net/manual
    http://mummer.sourceforge.net/examples


-- DESCRIPTION --
    dnadiff is a wrapper around nucmer that builds an alignment using
    default parameters, and runs many of nucmer's helper scripts to
    process the output and report alignment statistics, SNPs, breakpoints,
    etc. It is designed for evaluating the sequence and structural
    similarity of two highly similar sequence sets. E.g. comparing two
    different assemblies of the same organism, or comparing two strains of
    the same species.
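
    The wrapper behaviour described above can be sketched as a list of
    commands. This is an illustration only, not the dnadiff source: the
    helper invocations are the ones this manual lists next to each output
    file, the "--maxmatch" flag on nucmer is an assumption taken from the
    NOTES section, and the function names dnadiff_pipeline and as_shell
    are hypothetical. The sketch only builds the command lines; it does
    not run them.

```python
import shlex


def dnadiff_pipeline(reference, query, prefix="out"):
    """Return (argv, output_file) pairs approximating what dnadiff wraps.

    output_file is None when the tool names its own output (nucmer -p);
    otherwise the tool's standard output is assumed to be redirected there.
    The .report, .unref and .unqry files are written by dnadiff itself and
    are therefore not listed.
    """
    p = prefix
    return [
        # Assumption: nucmer is run with --maxmatch, as the NOTES imply.
        (["nucmer", "--maxmatch", "-p", p, reference, query], None),
        (["delta-filter", "-1", f"{p}.delta"], f"{p}.1delta"),
        (["delta-filter", "-m", f"{p}.delta"], f"{p}.mdelta"),
        (["show-coords", "-THrcl", f"{p}.1delta"], f"{p}.1coords"),
        (["show-coords", "-THrcl", f"{p}.mdelta"], f"{p}.mcoords"),
        (["show-snps", "-rlTHC", f"{p}.1delta"], f"{p}.snps"),
        (["show-diff", "-rH", f"{p}.mdelta"], f"{p}.rdiff"),
        (["show-diff", "-qH", f"{p}.mdelta"], f"{p}.qdiff"),
    ]


def as_shell(steps):
    """Render each step as a shell command line, with redirection if any."""
    lines = []
    for argv, out in steps:
        cmd = shlex.join(argv)
        lines.append(f"{cmd} > {shlex.quote(out)}" if out else cmd)
    return lines


if __name__ == "__main__":
    for line in as_shell(dnadiff_pipeline("A.fna", "B.fna", prefix="A_B")):
        print(line)
```

    Printing the commands rather than executing them keeps the sketch
    usable on a machine without MUMmer installed; in practice, running
    dnadiff itself is the safer route.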


-- dnadiff EXAMPLE --
    To compare two strains of the same species, type:

        "dnadiff genome1.fna genome2.fna"

    Output will be...
        out.report  - Summary of alignments, differences and SNPs
        out.delta   - Standard nucmer alignment output
        out.1delta  - 1-to-1 alignment from delta-filter -1
        out.mdelta  - M-to-M alignment from delta-filter -m
        out.1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
        out.mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
        out.snps    - SNPs from show-snps -rlTHC .1delta
        out.rdiff   - Classified ref breakpoints from show-diff -rH .mdelta
        out.qdiff   - Classified qry breakpoints from show-diff -qH .mdelta
        out.unref   - Unaligned reference sequence IDs and lengths
        out.unqry   - Unaligned query sequence IDs and lengths

    For more information on the formats and meanings of all the files
    produced, please see the documentation for the corresponding
    utility. This document describes how to run the dnadiff script
    and how to interpret the .report file it produces.


-- RUNNING 'dnadiff' --

    USAGE: dnadiff  [options]  <reference>  <query>
      or   dnadiff  [options]  -d <delta file>

    DESCRIPTION:
        Run comparative analysis of two sequence sets using nucmer and its
        associated utilities with recommended parameters. See MUMmer
        documentation for a more detailed description of the
        output.
        Produces the following output files:

        .delta    - Standard nucmer alignment output
        .1delta   - 1-to-1 alignment from delta-filter -1
        .mdelta   - M-to-M alignment from delta-filter -m
        .1coords  - 1-to-1 coordinates from show-coords -THrcl .1delta
        .mcoords  - M-to-M coordinates from show-coords -THrcl .mdelta
        .snps     - SNPs from show-snps -rlTHC .1delta
        .rdiff    - Classified alignment breakpoints from show-diff -rH .mdelta
        .qdiff    - Classified alignment breakpoints from show-diff -qH .mdelta
        .report   - Summary of alignments, differences and SNPs
        .unref    - Unaligned reference sequence IDs and lengths
        .unqry    - Unaligned query sequence IDs and lengths

    MANDATORY:
        Reference       Set the input reference multi-FASTA filename
        Query           Set the input query multi-FASTA filename
          or
        Delta File      Unfiltered .delta alignment file from nucmer

    OPTIONS:
        -d|delta        Provide precomputed delta file for analysis
        -h
        --help          Display help information and exit
        -p|prefix       Set the prefix of the output files (default "out")
        -V
        --version       Display the version information and exit


-- NOTES --
    The -p option is recommended to avoid overwriting previous
    output. A simple naming convention, for files A.fna and B.fna, is to
    set "-p A_B". It is safest to let dnadiff run nucmer automatically, so
    avoid using the -d option unless the delta file was already generated
    with "nucmer --maxmatch" and has not been filtered.


-- OUTPUT FILES --
    dnadiff produces many outputs; however, all but one are produced by
    other utilities in the MUMmer package. Please see their corresponding
    documentation for more information.
    This section describes only the .report file generated by dnadiff,
    with tips on interpreting it.


*** .report OUTPUT ***

    Report statistics are broken into two columns - reference and
    query. Rows are grouped by themed alignment metrics and are described
    here. Summary counts are estimates and do not represent the exact
    number of occurrences of a particular evolutionary event. When reading
    a reference column, think "number of XYZ in reference with regard to
    the query". When reading a query column, think "number of XYZ in query
    with regard to the reference".

    [Sequences]     - Sequence-centric stats.
    TotalSeqs       - Total number of input sequences.
    AlignedSeqs     - Number of input sequences with at least one alignment.
    UnalignedSeqs   - Number of input sequences with no alignment.

    [Bases]         - Base-pair-centric stats.
    TotalBases      - Total number of bases in the input sequences.
    AlignedBases    - Total number of bases contained within an alignment.
    UnalignedBases  - Total number of unaligned bases. This is a rough
                      measure for the amount of "unique" sequence in the
                      reference and query.

    [Alignments]    - Alignment-centric stats.
    1-to-1          - Number of alignment blocks comprising the 1-to-1
                      mapping of reference to query. This is a subset of
                      the M-to-M mapping, with repeats removed.
    TotalLength     - Total length of 1-to-1 alignment blocks.
    AvgLength       - Average length of 1-to-1 alignment blocks.
    AvgIdentity     - Average identity of 1-to-1 alignment blocks.

    M-to-M          - Number of alignment blocks comprising the
                      many-to-many mapping of reference to query.
                      The M-to-M mapping represents the smallest set of
                      alignments that maximizes the coverage of both
                      reference and query. This is a superset of the 1-to-1
                      mapping.
    TotalLength     - Total length of M-to-M alignment blocks.
    AvgLength       - Average length of M-to-M alignment blocks.
    AvgIdentity     - Average identity of M-to-M alignment blocks.

    [Features]      - Structural alignment features, such as
                      rearrangements. These counts are rough estimates
                      based on an automated analysis of the
                      alignments. Features are identified by scanning the
                      reference (or query) from low to high, and noting the
                      positions where the query alignments are
                      inconsistently ordered or oriented with respect to
                      the reference.
    Breakpoints     - Number of non-maximal alignment endpoints,
                      i.e. endpoints that do not occur at the beginning or
                      end of a sequence.
    Relocations     - Number of breaks in the alignment where adjacent
                      1-to-1 alignment blocks are in the same sequence, but
                      not consistently ordered. A separate feature is
                      recorded for each end of a relocation, so this is
                      really a count of relocation endpoints.
    Translocations  - Number of breaks in the alignment where adjacent
                      1-to-1 alignment blocks are in different sequences. A
                      separate feature is recorded for each end of a
                      translocation, so this is really a count of
                      translocation endpoints.
    Inversions      - Number of breaks in the alignment where adjacent
                      1-to-1 alignment blocks are inverted with respect to
                      one another. A separate feature is recorded for each
                      end of an inversion, so this is really a count of
                      inversion endpoints.

    Insertions      - Rough count of insertion events.
                      Note that this is
                      slightly different from "UnalignedBases" because it
                      counts duplications as insertions, whereas
                      UnalignedBases does not. Also, this count does not
                      include sequences that have no alignments as
                      insertions, whereas UnalignedBases does. Note that
                      insertions in R can be viewed as deletions from Q.
                      This number reports only "major" insertions, defined
                      as insertions large enough to break an alignment.
                      Nucmer will align through smaller insertions of less
                      than ~60 bases. These smaller insertions are
                      reported in the "Indels" count below.
    InsertionSum    - Rough sum of inserted sequence.
    InsertionAvg    - Average length of insertion.

    TandemIns       - Rough count of tandem duplication insertion
                      events. Note that expansions in R can be viewed as
                      collapses in Q.
    TandemInsSum    - Rough sum of tandem duplication insertions.
    TandemInsAvg    - Average length of tandem duplications.

    [SNPs]          - Single Nucleotide Polymorphism counts.
    TotalSNPs       - Total number of SNPs, same for both sequences.
    XY              - X-to-Y SNP. For reference column, this means
                      reference 'X' to query 'Y'. For query column, this
                      means query 'X' to reference 'Y'. The same
                      convention applies below.

    TotalGSNPs      - Single Nucleotide Polymorphisms bounded by 20 exact
                      base-pair matches on both sides.

    TotalIndels     - Single Nucleotide Insertions/Deletions.
    X.              - X insertion. For reference column, 'X.' means
                      insertion of 'X' in the reference. For query column,
                      'X.' means insertion of 'X' in the query. Nucmer will
                      align through group insertions of up to ~60 bases.
                      Each base of these group insertions will be reported
                      in this count.
                      Large insertions will be reported in
                      the "Insertions" count above.

    TotalGIndels    - Single Nucleotide Insertions/Deletions bounded by 20
                      exact base-pair matches on both sides.
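
    To work with the report programmatically, the two-column layout
    described in this section can be read into a nested dictionary. The
    sketch below rests on assumptions this manual does not state: that
    each section starts with a bracketed name alone on a line, that each
    data row is a label followed by exactly two whitespace-separated
    values (reference, then query), and that some values carry a
    percentage in parentheses. SAMPLE_REPORT is made-up illustrative
    data, and parse_report and split_value are hypothetical helper names.

```python
import re

# Made-up sample in the assumed layout; the numbers are not real results.
SAMPLE_REPORT = """\
                               [REF]                [QRY]
[Sequences]
TotalSeqs                          2                    3
AlignedSeqs               2(100.00%)            2(66.67%)
UnalignedSeqs               0(0.00%)            1(33.33%)

[Bases]
TotalBases                      1000                 1200
AlignedBases             950(95.00%)          960(80.00%)

[SNPs]
TotalSNPs                         12                   12
AG                         3(25.00%)            4(33.33%)
"""

SECTION = re.compile(r"^\[(\w+)\]$")
VALUE = re.compile(r"^(\d+(?:\.\d+)?)(?:\((\d+(?:\.\d+)?)%\))?$")


def parse_report(text):
    """Map section name -> row label -> (reference value, query value)."""
    report = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        match = SECTION.match(line)
        if match:
            # A bracketed name alone on a line opens a new section.
            section = match.group(1)
            report[section] = {}
            continue
        fields = line.split()
        # Rows before the first section (e.g. the column header) are skipped.
        if section is not None and len(fields) == 3:
            label, ref, qry = fields
            report[section][label] = (ref, qry)
    return report


def split_value(value):
    """Split '950(95.00%)' into (950.0, 95.0); plain numbers give (n, None)."""
    match = VALUE.match(value)
    if match is None:
        return None
    number, percent = match.groups()
    return float(number), (float(percent) if percent is not None else None)


if __name__ == "__main__":
    parsed = parse_report(SAMPLE_REPORT)
    ref, qry = parsed["Bases"]["AlignedBases"]
    print("reference aligned:", split_value(ref))
    print("query aligned:    ", split_value(qry))
```

    Values are kept as strings by parse_report so that nothing is lost if
    a real report contains a cell the VALUE pattern does not anticipate;
    split_value returns None in that case instead of guessing.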