
--------------------------------------------------------------------------------
dnadiff is a wrapper for nucmer and analysis utilities that provides
detailed information on the differences between two genomes, and also
provides a high level report file that quantifies the differences
between the two inputs.

Use Cases:
    + diff'ing two strains of the same species
    + diff'ing two assemblies of the same organism
    + diff'ing a draft assembly and a closely related finished genome

If any of this code is used in any publication, please cite the following:

  Versatile and open software for comparing large genomes.
  S. Kurtz, A. Phillippy, A.L. Delcher,
  M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg.
  Genome Biology (2004), 5:R12.

--------------------------------------------------------------------------------

This manual is also available as HTML documentation included in this
distribution, or at:

   http://mummer.sourceforge.net
   http://mummer.sourceforge.net/manual
   http://mummer.sourceforge.net/examples


-- DESCRIPTION --
   dnadiff is a wrapper around nucmer that builds an alignment using
default parameters, and runs many of nucmer's helper scripts to
process the output and report alignment statistics, SNPs, breakpoints,
etc. It is designed for evaluating the sequence and structural
similarity of two highly similar sequence sets, e.g. two different
assemblies of the same organism, or two strains of the same species.


-- dnadiff EXAMPLE --
To compare two strains of the same species, type:

"dnadiff genome1.fna genome2.fna"

Output will be...
   out.report  - Summary of alignments, differences and SNPs
   out.delta   - Standard nucmer alignment output
   out.1delta  - 1-to-1 alignment from delta-filter -1
   out.mdelta  - M-to-M alignment from delta-filter -m
   out.1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
   out.mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
   out.snps    - SNPs from show-snps -rlTHC .1delta
   out.rdiff   - Classified ref breakpoints from show-diff -rH .mdelta
   out.qdiff   - Classified qry breakpoints from show-diff -qH .mdelta
   out.unref   - Unaligned reference sequence IDs and lengths
   out.unqry   - Unaligned query sequence IDs and lengths

For more information on the formats and meanings of all the files
produced, please see the documentation for the corresponding
utility. This document serves to describe running the dnadiff script
and interpreting the produced .report file.
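
As a quick illustration of the naming scheme above, the following
Python sketch lists the files a run is documented to produce for a
given prefix and reports which of them are absent. The helper names
(expected_outputs, missing_outputs) are hypothetical and not part of
MUMmer; the suffix list is copied from the list above, and whether
every file is written on every run is an assumption.

```python
import os
from typing import List

# Suffixes copied from the output list in this document.
DNADIFF_SUFFIXES = [
    "report", "delta", "1delta", "mdelta", "1coords", "mcoords",
    "snps", "rdiff", "qdiff", "unref", "unqry",
]


def expected_outputs(prefix: str = "out") -> List[str]:
    """Return the file names dnadiff is documented to write for a prefix."""
    return ["{}.{}".format(prefix, suffix) for suffix in DNADIFF_SUFFIXES]


def missing_outputs(prefix: str = "out", directory: str = ".") -> List[str]:
    """Return the expected files that are not present in `directory`."""
    return [
        name for name in expected_outputs(prefix)
        if not os.path.exists(os.path.join(directory, name))
    ]
```

Calling missing_outputs("out") after a run gives a simple check that
the run completed before the .report file is inspected.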


-- RUNNING 'dnadiff' --

  USAGE: dnadiff  [options]  <Reference>  <Query>
    or   dnadiff  [options]  -d <Delta File>

  DESCRIPTION:
    Run comparative analysis of two sequence sets using nucmer and its
    associated utilities with recommended parameters. See MUMmer
    documentation for a more detailed description of the
    output. Produces the following output files:

    .delta   - Standard nucmer alignment output
    .1delta  - 1-to-1 alignment from delta-filter -1
    .mdelta  - M-to-M alignment from delta-filter -m
    .1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
    .mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
    .snps    - SNPs from show-snps -rlTHC .1delta
    .rdiff   - Classified alignment breakpoints from show-diff -rH .mdelta
    .qdiff   - Classified alignment breakpoints from show-diff -qH .mdelta
    .report  - Summary of alignments, differences and SNPs
    .unref   - Unaligned reference sequence IDs and lengths
    .unqry   - Unaligned query sequence IDs and lengths

  MANDATORY:
    Reference       Set the input reference multi-FASTA filename
    Query           Set the input query multi-FASTA filename
      or
    Delta File      Unfiltered .delta alignment file from nucmer

  OPTIONS:
    -d|delta        Provide precomputed delta file for analysis
    -h
    --help          Display help information and exit
    -p|prefix       Set the prefix of the output files (default "out")
    -V
    --version       Display the version information and exit


-- NOTES --
   The -p option is recommended to avoid overwriting previous
output. A simple naming convention, for input files A.fna and B.fna,
is to set "-p A_B". It is safest to let dnadiff run nucmer
automatically, so avoid using the -d option unless the delta file was
already generated with "nucmer --maxmatch" and has not been filtered.
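
The naming convention above can be automated. The sketch below
derives an "A_B" prefix from the two input file names and assembles
the command line following the USAGE synopsis. The function names are
hypothetical, and run_dnadiff assumes a dnadiff executable is on the
PATH.

```python
import os
import subprocess
from typing import List, Optional


def default_prefix(reference: str, query: str) -> str:
    """Derive an "A_B" style prefix from inputs such as A.fna and B.fna."""
    ref_name = os.path.splitext(os.path.basename(reference))[0]
    qry_name = os.path.splitext(os.path.basename(query))[0]
    return "{}_{}".format(ref_name, qry_name)


def build_dnadiff_command(reference: str, query: str,
                          prefix: Optional[str] = None) -> List[str]:
    """Build: dnadiff -p <prefix> <Reference> <Query>."""
    if prefix is None:
        prefix = default_prefix(reference, query)
    return ["dnadiff", "-p", prefix, reference, query]


def run_dnadiff(reference: str, query: str,
                prefix: Optional[str] = None) -> None:
    """Run dnadiff and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_dnadiff_command(reference, query, prefix), check=True)
```

For example, build_dnadiff_command("A.fna", "B.fna") yields the
argument list for "dnadiff -p A_B A.fna B.fna".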


-- OUTPUT FILES --
   dnadiff produces many outputs, but all except the .report file are
produced by other utilities in the MUMmer package. Please see their
corresponding documentation for more information. This section will
only describe the .report file generated by dnadiff and tips on
interpreting it.


 *** .report OUTPUT ***

   Report statistics are broken into two columns - reference and
query. Rows are grouped by themed alignment metrics and are described
here. Summary counts are estimates and do not represent the exact
number of occurrences of a particular evolutionary event. Read a
reference column value as the number of XYZ in the reference with
respect to the query, and a query column value as the number of XYZ
in the query with respect to the reference.
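
To work with the two-column layout programmatically, a small parser
along the following lines may help. It is a sketch under assumptions:
it assumes each group starts with a bracketed header line on its own
and each metric row holds three whitespace-separated fields (name,
reference value, query value). The exact layout may differ between
MUMmer versions, the sample values are made up, and parse_report is a
hypothetical name.

```python
from typing import List, Tuple

# (section, metric, reference value, query value); values stay as strings.
Row = Tuple[str, str, str, str]

# Made-up values, used only to show the assumed layout.
SAMPLE_REPORT = """\
                 [REF]   [QRY]
[Sequences]
TotalSeqs            2       3
AlignedSeqs          2       2

[Bases]
TotalBases        1000    1200
"""


def parse_report(text: str) -> List[Row]:
    """Collect metric rows together with the section they appear under."""
    rows = []
    section = ""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # A section header is a single bracketed label on its own line.
        if (stripped.startswith("[") and stripped.endswith("]")
                and stripped.count("[") == 1):
            section = stripped[1:-1]
            continue
        fields = stripped.split()
        if section and len(fields) == 3:
            rows.append((section, fields[0], fields[1], fields[2]))
    return rows
```

Rows are returned as a list rather than a dictionary because some
metric names, such as TotalLength, appear more than once in a report.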

[Sequences]    - Sequence-centric stats.
TotalSeqs      - Total number of input sequences.
AlignedSeqs    - Number of input sequences with at least one alignment.
UnalignedSeqs  - Number of input sequences with no alignment.

[Bases]        - Base-pair-centric stats.
TotalBases     - Total number of bases in the input sequences.
AlignedBases   - Total number of bases contained within an alignment.
UnalignedBases - Total number of unaligned bases. This is a rough
                 measure for the amount of "unique" sequence in the
                 reference and query.
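
To compare the amount of "unique" sequence between inputs of
different sizes, the unaligned base count can be expressed as a
fraction of the total. A minimal sketch, with a hypothetical function
name and made-up numbers in the usage note:

```python
def unaligned_fraction(total_bases: int, unaligned_bases: int) -> float:
    """Return UnalignedBases as a fraction of TotalBases for one column."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    if not 0 <= unaligned_bases <= total_bases:
        raise ValueError("unaligned_bases must lie between 0 and total_bases")
    return unaligned_bases / total_bases
```

For a column with TotalBases of 1000 and UnalignedBases of 50, this
gives 0.05, i.e. about 5% of that input has no alignment.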

[Alignments]   - Alignment-centric stats.
1-to-1         - Number of alignment blocks comprising the 1-to-1
                 mapping of reference to query. This is a subset of
                 the M-to-M mapping, with repeats removed.
TotalLength    - Total length of 1-to-1 alignment blocks.
AvgLength      - Average length of 1-to-1 alignment blocks.
AvgIdentity    - Average identity of 1-to-1 alignment blocks.

M-to-M         - Number of alignment blocks comprising the
                 many-to-many mapping of reference to query. The
                 M-to-M mapping represents the smallest set of
                 alignments that maximize the coverage of both
                 reference and query. This is a superset of the 1-to-1
                 mapping.
TotalLength    - Total length of M-to-M alignment blocks.
AvgLength      - Average length of M-to-M alignment blocks.
AvgIdentity    - Average identity of M-to-M alignment blocks.
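
The block count, total length and average length can be recomputed
from a .1coords or .mcoords file as a sanity check. The sketch below
assumes tab-delimited rows with no header in which the fifth and
sixth columns hold the alignment length in the reference and in the
query; that column order is an assumption about "show-coords -THrcl"
output and should be checked against the show-coords documentation.
summarize_coords is a hypothetical name, and identity is left out
because this document does not state how dnadiff averages it.

```python
from typing import Dict, Iterable


def summarize_coords(lines: Iterable[str]) -> Dict[str, float]:
    """Count alignment blocks and total/average their lengths."""
    count = 0
    ref_total = 0
    qry_total = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or malformed rows
        count += 1
        ref_total += int(fields[4])  # assumed: alignment length in reference
        qry_total += int(fields[5])  # assumed: alignment length in query
    return {
        "blocks": count,
        "ref_total_length": ref_total,
        "qry_total_length": qry_total,
        "ref_avg_length": ref_total / count if count else 0.0,
        "qry_avg_length": qry_total / count if count else 0.0,
    }
```

A typical call would be summarize_coords(open("out.1coords")), with
the result compared against the 1-to-1 rows of the report.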

[Features]     - Structural alignment features, such as
                 rearrangements. These counts are rough estimates
                 based on an automated analysis of the
                 alignments. Features are identified by scanning the
                 reference (or query) from low to high, and noting the
                 positions where the query alignments are
                 inconsistently ordered or oriented with respect to
                 the reference.
Breakpoints    - Number of non-maximal alignment endpoints,
                 i.e. endpoints that do not occur at the beginning or
                 end of a sequence.
Relocations    - Number of breaks in the alignment where adjacent
                 1-to-1 alignment blocks are in the same sequence, but
                 not consistently ordered. A separate feature is
                 recorded for each end of a relocation, so this is
                 really a count of relocation endpoints.
Translocations - Number of breaks in the alignment where adjacent
                 1-to-1 alignment blocks are in different sequences. A
                 separate feature is recorded for each end of a
                 translocation, so this is really a count of
                 translocation endpoints.
Inversions     - Number of breaks in the alignment where adjacent
                 1-to-1 alignment blocks are inverted with respect to
                 one another. A separate feature is recorded for each
                 end of an inversion, so this is really a count of
                 inversion endpoints.

Insertions     - Rough count of insertion events. Note that this is
                 slightly different from "UnalignedBases" because it
                 counts duplications as insertions, whereas
                 UnalignedBases does not. Also, this count does not
                 include sequences that have no alignments as
                 insertions, whereas UnalignedBases does. Note that
                 insertions in R can be viewed as deletions from Q.
                 This number reports only "major" insertions, defined
                 as insertions large enough to break an alignment.
                 Nucmer will align through smaller insertions of less
                 than ~60 bases. These smaller insertions are
                 reported in the "Indels" count below.
InsertionSum   - Rough sum of inserted sequence.
InsertionAvg   - Average length of insertion.

TandemIns      - Rough count of tandem duplication insertion
                 events. Note that expansions in R can be viewed as
                 collapses in Q.
TandemInsSum   - Rough sum of tandem duplication insertions.
TandemInsAvg   - Average length of tandem duplications.
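
The relocation, translocation and inversion definitions above can be
illustrated with a simplified classifier for adjacent 1-to-1 blocks.
This is a sketch of the idea only and not the algorithm used by
show-diff or dnadiff; the Block type, its fields and the function
names are hypothetical. Blocks are scanned along the reference from
low to high, and each junction between neighbouring blocks is
labelled once, so a segment inverted in the middle of a sequence
contributes two junctions, one at each end.

```python
from collections import Counter, namedtuple
from typing import Iterable, Optional

# forward is True when the query aligns in the same orientation as the reference.
Block = namedtuple("Block", "ref_seq ref_start qry_seq qry_start forward")


def classify_junction(a: Block, b: Block) -> Optional[str]:
    """Label the junction between two blocks adjacent in the reference."""
    if a.qry_seq != b.qry_seq:
        return "translocation"  # neighbours lie in different query sequences
    if a.forward != b.forward:
        return "inversion"      # neighbours differ in orientation
    # Same sequence and orientation: the query must advance in the
    # matching direction, otherwise the blocks are out of order.
    if a.forward:
        in_order = b.qry_start > a.qry_start
    else:
        in_order = b.qry_start < a.qry_start
    return None if in_order else "relocation"


def count_features(blocks: Iterable[Block]) -> Counter:
    """Count labelled junctions, scanning each reference sequence low to high."""
    counts = Counter()
    ordered = sorted(blocks, key=lambda blk: (blk.ref_seq, blk.ref_start))
    for a, b in zip(ordered, ordered[1:]):
        if a.ref_seq != b.ref_seq:
            continue  # only compare blocks within one reference sequence
        label = classify_junction(a, b)
        if label:
            counts[label] += 1
    return counts
```

Swapping the roles of reference and query in the Block records gives
the corresponding query-column view.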

[SNPs]         - Single Nucleotide Polymorphism counts.
TotalSNPs      - Total number of SNPs, same for both sequences.
XY             - X-to-Y SNP. For reference column, this means
                 reference 'X' to query 'Y'. For query column, this
                 means query 'X' to reference 'Y'. The same
                 convention applies below.
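
The XY convention can be made concrete with a short sketch that
tallies substitution types for both columns from a list of
(reference base, query base) pairs. The function name and the input
representation are hypothetical; extracting such pairs from the .snps
file is not shown here.

```python
from collections import Counter
from typing import Iterable, Tuple


def snp_type_counts(snps: Iterable[Tuple[str, str]]) -> Tuple[Counter, Counter]:
    """Return (reference column counts, query column counts) keyed by 'XY'."""
    ref_counts = Counter()
    qry_counts = Counter()
    for ref_base, qry_base in snps:
        # Reference column: reference 'X' to query 'Y'.
        ref_counts[ref_base + qry_base] += 1
        # Query column: query 'X' to reference 'Y'.
        qry_counts[qry_base + ref_base] += 1
    return ref_counts, qry_counts
```

A reference A aligned to a query G is therefore counted as "AG" in
the reference column and as "GA" in the query column, and both
columns sum to the same total.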

TotalGSNPs     - Single Nucleotide Polymorphisms bounded by 20 exact,
                 base-pair matches on both sides.

TotalIndels    - Single Nucleotide Insertions/Deletions.
X.             - X insertion. For reference column, 'X.' means
                 insertion of 'X' in the reference. For query column,
                 'X.' means insertion of 'X' in the query. Nucmer will
                 align through group insertions of up to ~60 bases.
                 Each base of these group insertions will be reported
                 in this count. Large insertions will be reported in
                 the "Insertions" count above.

TotalGIndels   - Single Nucleotide Insertions/Deletions bounded by 20
                 exact, base-pair matches on both sides.
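
The "bounded by 20 exact, base-pair matches on both sides" condition
used for TotalGSNPs and TotalGIndels can be sketched on a pair of
gapped alignment strings. This is an illustration of the criterion
only and not how dnadiff or show-snps compute it; the function name
and the use of '-' as the gap character are assumptions.

```python
from typing import List


def well_flanked_differences(ref_aln: str, qry_aln: str,
                             flank: int = 20) -> List[int]:
    """Return alignment columns that differ and have `flank` exact
    matches immediately on both sides."""
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned strings must have equal length")
    # A column matches when both bases are identical and neither is a gap.
    match = [r == q and r != "-" for r, q in zip(ref_aln, qry_aln)]
    hits = []
    for i, is_match in enumerate(match):
        if is_match:
            continue
        left = match[max(0, i - flank):i]
        right = match[i + 1:i + 1 + flank]
        if (len(left) == flank and len(right) == flank
                and all(left) and all(right)):
            hits.append(i)
    return hits
```

Under this sketch, a difference closer than 20 columns to the end of
the alignment, or to another SNP or indel, is not reported.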
author	jpayne
date	Tue, 18 Mar 2025 17:55:14 -0400