jpayne@69: --------------------------------------------------------------------------------
jpayne@69: dnadiff is a wrapper for nucmer and analysis utilities that provides
jpayne@69: detailed information on the differences between two genomes, and also
jpayne@69: provides a high level report file that quantifies the differences
jpayne@69: between the two inputs.
jpayne@69: 
jpayne@69: Use Cases:
jpayne@69:     + diff'ing two strains of the same species
jpayne@69:     + diff'ing two assemblies of the same organism
jpayne@69:     + diff'ing a draft assembly and a closely related finished genome
jpayne@69: 
jpayne@69: If any of this code is used in any publication, please cite the following:
jpayne@69: 
jpayne@69:   Versatile and open software for comparing large genomes.
jpayne@69:   S. Kurtz, A. Phillippy, A.L. Delcher,
jpayne@69:   M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg.
jpayne@69:   Genome Biology (2004), 5:R12.
jpayne@69: 
jpayne@69: --------------------------------------------------------------------------------
jpayne@69: 
jpayne@69: This manual is also available as HTML documentation included in this
jpayne@69: distribution, or at:
jpayne@69: 
jpayne@69:    http://mummer.sourceforge.net
jpayne@69:    http://mummer.sourceforge.net/manual
jpayne@69:    http://mummer.sourceforge.net/examples
jpayne@69: 
jpayne@69: 
jpayne@69: -- DESCRIPTION --
jpayne@69:    dnadiff is a wrapper around nucmer that builds an alignment using
jpayne@69: default parameters, and runs many of nucmer's helper scripts to
jpayne@69: process the output and report alignment statistics, SNPs, breakpoints,
jpayne@69: etc. It is designed for evaluating the sequence and structural
jpayne@69: similarity of two highly similar sequence sets, for example two
jpayne@69: different assemblies of the same organism or two strains of the same
jpayne@69: species.
jpayne@69: 
jpayne@69: 
jpayne@69: -- dnadiff EXAMPLE --
jpayne@69: To compare two strains of the same species, type:
jpayne@69: 
jpayne@69: "dnadiff genome1.fna genome2.fna"
jpayne@69: 
jpayne@69: Output will be...
jpayne@69:    out.report  - Summary of alignments, differences and SNPs
jpayne@69:    out.delta   - Standard nucmer alignment output
jpayne@69:    out.1delta  - 1-to-1 alignment from delta-filter -1
jpayne@69:    out.mdelta  - M-to-M alignment from delta-filter -m
jpayne@69:    out.1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
jpayne@69:    out.mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
jpayne@69:    out.snps    - SNPs from show-snps -rlTHC .1delta
jpayne@69:    out.rdiff   - Classified ref breakpoints from show-diff -rH .mdelta
jpayne@69:    out.qdiff   - Classified qry breakpoints from show-diff -qH .mdelta
jpayne@69:    out.unref   - Unaligned reference sequence IDs and lengths
jpayne@69:    out.unqry   - Unaligned query sequence IDs and lengths
jpayne@69: 
jpayne@69: For more information on the formats and meanings of all the files
jpayne@69: produced, please see the documentation for the corresponding
jpayne@69: utility. This document serves to describe running the dnadiff script
jpayne@69: and interpreting the produced .report file.
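
As a rough convenience, the file listing above can be turned into a
small check that a run left behind every documented output. This is
only a sketch in Python. The names expected_outputs and
missing_outputs are made up for illustration and are not part of
MUMmer; the suffix list is copied from the listing above.

```python
import os

# Suffixes copied from the output listing above.
DNADIFF_SUFFIXES = [
    "report", "delta", "1delta", "mdelta", "1coords", "mcoords",
    "snps", "rdiff", "qdiff", "unref", "unqry",
]


def expected_outputs(prefix="out"):
    """Return the file names dnadiff is documented to write for a prefix."""
    return [f"{prefix}.{suffix}" for suffix in DNADIFF_SUFFIXES]


def missing_outputs(prefix="out", directory="."):
    """Return the documented output files that are absent from `directory`."""
    return [
        name
        for name in expected_outputs(prefix)
        if not os.path.exists(os.path.join(directory, name))
    ]
```

An empty list from missing_outputs("out") would suggest the run
completed and wrote all of its files.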
jpayne@69: 
jpayne@69: 
jpayne@69: -- RUNNING 'dnadiff' --
jpayne@69: 
jpayne@69:   USAGE: dnadiff  [options]  <Reference>  <Query>
jpayne@69:     or   dnadiff  [options]  -d <Delta File>
jpayne@69: 
jpayne@69:   DESCRIPTION:
jpayne@69:     Run comparative analysis of two sequence sets using nucmer and its
jpayne@69:     associated utilities with recommended parameters. See MUMmer
jpayne@69:     documentation for a more detailed description of the
jpayne@69:     output. Produces the following output files:
jpayne@69: 
jpayne@69:     .delta   - Standard nucmer alignment output
jpayne@69:     .1delta  - 1-to-1 alignment from delta-filter -1
jpayne@69:     .mdelta  - M-to-M alignment from delta-filter -m
jpayne@69:     .1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
jpayne@69:     .mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
jpayne@69:     .snps    - SNPs from show-snps -rlTHC .1delta
jpayne@69:     .rdiff   - Classified alignment breakpoints from show-diff -rH .mdelta
jpayne@69:     .qdiff   - Classified alignment breakpoints from show-diff -qH .mdelta
jpayne@69:     .report  - Summary of alignments, differences and SNPs
jpayne@69:     .unref   - Unaligned reference sequence IDs and lengths
jpayne@69:     .unqry   - Unaligned query sequence IDs and lengths
jpayne@69: 
jpayne@69:   MANDATORY:
jpayne@69:     Reference       Set the input reference multi-FASTA filename
jpayne@69:     Query           Set the input query multi-FASTA filename
jpayne@69:       or
jpayne@69:     Delta File      Unfiltered .delta alignment file from nucmer
jpayne@69: 
jpayne@69:   OPTIONS:
jpayne@69:     -d|delta        Provide precomputed delta file for analysis
jpayne@69:     -h
jpayne@69:     --help          Display help information and exit
jpayne@69:     -p|prefix       Set the prefix of the output files (default "out")
jpayne@69:     -V
jpayne@69:     --version       Display the version information and exit
jpayne@69: 
jpayne@69: 
jpayne@69: -- NOTES --
jpayne@69:    The -p option is recommended to avoid overwriting previous
jpayne@69: output. A simple naming convention for input files A.fna and B.fna is
jpayne@69: to set "-p A_B". It is safest to let dnadiff run nucmer automatically;
jpayne@69: use the -d option only if the delta file was generated with
jpayne@69: "nucmer --maxmatch" and has not been filtered.
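
The naming convention above is easy to automate. The following Python
sketch derives an "A_B" prefix from the two input file names and
builds the matching dnadiff argument list. The helper names
default_prefix and dnadiff_command are hypothetical; only the -p
option and the argument order come from the usage text above.

```python
import os


def default_prefix(ref_path, qry_path):
    """Build an "A_B" style prefix from two FASTA paths such as A.fna, B.fna."""
    ref_name = os.path.splitext(os.path.basename(ref_path))[0]
    qry_name = os.path.splitext(os.path.basename(qry_path))[0]
    return f"{ref_name}_{qry_name}"


def dnadiff_command(ref_path, qry_path, prefix=None):
    """Return the argument list for: dnadiff -p <prefix> <Reference> <Query>."""
    if prefix is None:
        prefix = default_prefix(ref_path, qry_path)
    return ["dnadiff", "-p", prefix, ref_path, qry_path]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works
    # even on a machine without MUMmer installed.
    print(" ".join(dnadiff_command("A.fna", "B.fna")))
```

Run as a script, this prints "dnadiff -p A_B A.fna B.fna". If dnadiff
is on the PATH, the returned list can be handed to
subprocess.run(cmd, check=True).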
jpayne@69: 
jpayne@69: 
jpayne@69: -- OUTPUT FILES --
jpayne@69:    dnadiff produces many outputs; however, all but one are produced by
jpayne@69: other utilities in the MUMmer package. Please see their corresponding
jpayne@69: documentation for more information. This section describes only
jpayne@69: the .report file generated by dnadiff and gives tips on interpreting it.
jpayne@69: 
jpayne@69: 
jpayne@69:  *** .report OUTPUT ***
jpayne@69: 
jpayne@69:    Report statistics are broken into two columns, reference and
jpayne@69: query. Rows are grouped by themed alignment metrics and are described
jpayne@69: here. Summary counts are estimates and do not represent the exact
jpayne@69: number of occurrences of a particular evolutionary event. Read a
jpayne@69: reference-column value as the number of XYZ in the reference with
jpayne@69: respect to the query, and a query-column value as the number of XYZ in
jpayne@69: the query with respect to the reference.
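
For scripted use, the two-column layout can be read into a simple list
of rows. The Python sketch below is not part of MUMmer. It ASSUMES
that each statistic sits on one line of three whitespace-separated
fields (name, reference value, query value) and that each group starts
with a bracketed header line such as "[Bases]". Check this against a
real .report file before relying on it. Rows are kept in order rather
than in a dictionary because names such as TotalLength occur under
both 1-to-1 and M-to-M.

```python
def parse_report(text):
    """Parse .report text into (section, name, ref_value, qry_value) rows.

    ASSUMPTION: statistics are lines of exactly three whitespace-separated
    fields, and sections start with a lone bracketed header like "[Bases]".
    Values are returned as strings, exactly as written in the file.
    """
    rows = []
    section = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # blank line between groups
        if len(fields) == 1 and fields[0].startswith("[") and fields[0].endswith("]"):
            section = fields[0].strip("[]")  # e.g. "[Bases]" -> "Bases"
        elif len(fields) == 3 and section is not None:
            rows.append((section, fields[0], fields[1], fields[2]))
    return rows
```

For example, parse_report(open("out.report").read()) would yield rows
such as ("Bases", "TotalBases", <ref value>, <qry value>) if the
assumed layout holds.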
jpayne@69: 
jpayne@69: [Sequences]    - Sequence-centric stats.
jpayne@69: TotalSeqs      - Total number of input sequences.
jpayne@69: AlignedSeqs    - Number of input sequences with at least one alignment.
jpayne@69: UnalignedSeqs  - Number of input sequences with no alignment.
jpayne@69: 
jpayne@69: [Bases]        - Base-pair-centric stats.
jpayne@69: TotalBases     - Total number of bases in the input sequences.
jpayne@69: AlignedBases   - Total number of bases contained within an alignment.
jpayne@69: UnalignedBases - Total number of unaligned bases. This is a rough
jpayne@69:                  measure for the amount of "unique" sequence in the
jpayne@69:                  reference and query.
jpayne@69: 
jpayne@69: [Alignments]   - Alignment-centric stats.
jpayne@69: 1-to-1         - Number of alignment blocks comprising the 1-to-1
jpayne@69:                  mapping of reference to query. This is a subset of
jpayne@69:                  the M-to-M mapping, with repeats removed.
jpayne@69: TotalLength    - Total length of 1-to-1 alignment blocks.
jpayne@69: AvgLength      - Average length of 1-to-1 alignment blocks.
jpayne@69: AvgIdentity    - Average identity of 1-to-1 alignment blocks.
jpayne@69: 
jpayne@69: M-to-M         - Number of alignment blocks comprising the
jpayne@69:                  many-to-many mapping of reference to query. The
jpayne@69:                  M-to-M mapping represents the smallest set of
jpayne@69:                  alignments that maximize the coverage of both
jpayne@69:                  reference and query. This is a superset of the 1-to-1
jpayne@69:                  mapping.
jpayne@69: TotalLength    - Total length of M-to-M alignment blocks.
jpayne@69: AvgLength      - Average length of M-to-M alignment blocks.
jpayne@69: AvgIdentity    - Average identity of M-to-M alignment blocks.
jpayne@69: 
jpayne@69: [Features]     - Structural alignment features, such as
jpayne@69:                  rearrangements. These counts are rough estimates
jpayne@69:                  based on an automated analysis of the
jpayne@69:                  alignments. Features are identified by scanning the
jpayne@69:                  reference (or query) from low to high, and noting the
jpayne@69:                  positions where the query alignments are
jpayne@69:                  inconsistently ordered or oriented with respect to
jpayne@69:                  the reference.
jpayne@69: Breakpoints    - Number of non-maximal alignment endpoints,
jpayne@69:                  i.e. endpoints that do not occur at the beginning or
jpayne@69:                  end of a sequence.
jpayne@69: Relocations    - Number of breaks in the alignment where adjacent
jpayne@69:                  1-to-1 alignment blocks are in the same sequence, but
jpayne@69:                  not consistently ordered. A separate feature is
jpayne@69:                  recorded for each end of a relocation, so this is
jpayne@69:                  really a count of relocation endpoints.
jpayne@69: Translocations - Number of breaks in the alignment where adjacent
jpayne@69:                  1-to-1 alignment blocks are in different sequences. A
jpayne@69:                  separate feature is recorded for each end of a
jpayne@69:                  translocation, so this is really a count of
jpayne@69:                  translocation endpoints.
jpayne@69: Inversions     - Number of breaks in the alignment where adjacent
jpayne@69:                  1-to-1 alignment blocks are inverted with respect to
jpayne@69:                  one another. A separate feature is recorded for each
jpayne@69:                  end of an inversion, so this is really a count of
jpayne@69:                  inversion endpoints.
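
To make the three definitions above concrete, here is a deliberately
simplified Python sketch that scans 1-to-1 blocks from low to high
reference position and labels each break between adjacent blocks. It
is an illustration of the definitions only, NOT the algorithm used by
show-diff or dnadiff. The Block type, its fields and both function
names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Block:
    ref_start: int
    ref_end: int
    qry_id: str     # name of the query sequence the block maps to
    qry_start: int  # lowest query coordinate of the block
    qry_end: int    # highest query coordinate of the block
    strand: int     # +1 forward, -1 reverse


def classify_break(prev, nxt):
    """Label the break between two blocks that are adjacent on the reference."""
    if prev.qry_id != nxt.qry_id:
        return "Translocation"  # adjacent blocks in different sequences
    if prev.strand != nxt.strand:
        return "Inversion"      # adjacent blocks inverted relative to each other
    if prev.strand > 0:
        in_order = nxt.qry_start > prev.qry_start
    else:
        in_order = nxt.qry_start < prev.qry_start
    return None if in_order else "Relocation"  # same sequence, wrong order


def count_features(blocks):
    """Count break labels over blocks from one reference sequence."""
    ordered = sorted(blocks, key=lambda b: b.ref_start)
    counts = Counter()
    for prev, nxt in zip(ordered, ordered[1:]):
        kind = classify_break(prev, nxt)
        if kind is not None:
            counts[kind] += 1
    return counts
```

A single inverted block in the middle of an otherwise collinear
alignment produces two breaks, one at each end, which matches the note
above that these counts are really counts of endpoints.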
jpayne@69: 
jpayne@69: Insertions     - Rough count of insertion events. Note that this is
jpayne@69:                  slightly different from "UnalignedBases" because it
jpayne@69:                  counts duplications as insertions, whereas
jpayne@69:                  UnalignedBases does not. Also, this count does not
jpayne@69:                  include sequences that have no alignments as
jpayne@69:                  insertions, whereas UnalignedBases does. Note that
jpayne@69:                  insertions in R can be viewed as deletions from Q.
jpayne@69:                  This number reports only "major" insertions defined
jpayne@69:                  as insertions large enough to break an alignment.
jpayne@69:                  Nucmer will align through smaller insertions of less
jpayne@69:                  than ~60 bases. These smaller insertions are
jpayne@69:                  reported in the "Indels" count below.
jpayne@69: InsertionSum   - Rough sum of inserted sequence.
jpayne@69: InsertionAvg   - Average length of insertion.
jpayne@69: 
jpayne@69: TandemIns      - Rough count of tandem duplication insertion
jpayne@69:                  events. Note that expansions in R can be viewed as
jpayne@69:                  collapses in Q.
jpayne@69: TandemInsSum   - Rough sum of tandem duplication insertions.
jpayne@69: TandemInsAvg   - Average length of tandem duplications.
jpayne@69: 
jpayne@69: [SNPs]         - Single Nucleotide Polymorphism counts.
jpayne@69: TotalSNPs      - Total number of SNPs, same for both sequences.
jpayne@69: XY             - X-to-Y SNP. For reference column, this means
jpayne@69:                  reference 'X' to query 'Y'. For query column, this
jpayne@69:                  means query 'X' to reference 'Y'. The same
jpayne@69:                  convention applies below.
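
The column convention can be illustrated with a short Python sketch
that tallies substitution types from (reference base, query base)
pairs. The function name tally_substitutions and the input format are
assumptions made for this example; dnadiff itself works from the .snps
file.

```python
from collections import Counter


def tally_substitutions(pairs, gap="."):
    """Tally XY substitution types for the reference and query columns.

    `pairs` is an iterable of (ref_base, qry_base). Pairs containing the
    gap character are indels, not SNPs, and are skipped here.
    """
    ref_counts = Counter()
    qry_counts = Counter()
    for ref_base, qry_base in pairs:
        if gap in (ref_base, qry_base):
            continue
        ref_counts[ref_base + qry_base] += 1  # reference 'X' to query 'Y'
        qry_counts[qry_base + ref_base] += 1  # query 'X' to reference 'Y'
    return ref_counts, qry_counts
```

Every SNP is counted once in each column, which is why the total is
the same for both sequences even though the per-type rows differ.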
jpayne@69: 
jpayne@69: TotalGSNPs     - Single Nucleotide Polymorphisms bounded by 20 exact
jpayne@69:                  base-pair matches on both sides.
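
The "20 exact matches on both sides" rule can be sketched in Python as
follows. The sketch takes two aligned strings of equal length with "."
as the gap character, which is an input format chosen for this
example. It illustrates the definition only and makes no claim about
how dnadiff computes the count internally. The function name good_snps
is hypothetical.

```python
def good_snps(ref_aln, qry_aln, flank=20, gap="."):
    """Return alignment columns holding a SNP with `flank` exact matches
    immediately to its left and to its right."""
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned strings must have the same length")
    # True where the column is an exact base-to-base match.
    match = [r == q and r != gap for r, q in zip(ref_aln, qry_aln)]
    good = []
    for i, (r, q) in enumerate(zip(ref_aln, qry_aln)):
        if r == q or gap in (r, q):
            continue  # a match or an indel, not a substitution
        left = match[max(0, i - flank):i]
        right = match[i + 1:i + 1 + flank]
        # A SNP too close to either end of the alignment lacks a full flank.
        if len(left) == flank and len(right) == flank and all(left) and all(right):
            good.append(i)
    return good
```

Two SNPs that lie within 20 bases of each other disqualify one
another under this rule, since each interrupts the other's flank.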
jpayne@69: 
jpayne@69: TotalIndels    - Single Nucleotide Insertions/Deletions.
jpayne@69: X.             - X insertion. For reference column, 'X.' means
jpayne@69:                  insertion of 'X' in the reference. For query column,
jpayne@69:                  'X.' means insertion of 'X' in the query. Nucmer will
jpayne@69:                  align through group insertions of up to ~60 bases.
jpayne@69:                  Each base of these group insertions will be reported
jpayne@69:                  in this count. Large insertions will be reported in
jpayne@69:                  the "Insertions" count above.
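
Because every base of a group insertion is counted separately, a
5-base insertion adds 5 to this count. The Python sketch below shows
that per-base tally for the 'X.' rows on two aligned strings, using
"." as the gap character. The function name tally_indels and the input
format are assumptions made for this example.

```python
from collections import Counter


def tally_indels(ref_aln, qry_aln, gap="."):
    """Tally 'X.' indel types per base for the reference and query columns."""
    ref_counts = Counter()
    qry_counts = Counter()
    for r, q in zip(ref_aln, qry_aln):
        if r != gap and q == gap:
            ref_counts[r + gap] += 1  # 'X.' row, reference column
        elif r == gap and q != gap:
            qry_counts[q + gap] += 1  # 'X.' row, query column
    return ref_counts, qry_counts
```

For the aligned pair "ACGTT" and "AC..T", the reference column gains
one "G." and one "T.", so a two-base insertion contributes two to the
total.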
jpayne@69: 
jpayne@69: TotalGIndels   - Single Nucleotide Insertions/Deletions bounded by 20
jpayne@69:                  exact base-pair matches on both sides.