diff CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/mummer-3.23/docs/dnadiff.README @ 69:33d812a61356

planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author jpayne
date Tue, 18 Mar 2025 17:55:14 -0400
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/mummer-3.23/docs/dnadiff.README	Tue Mar 18 17:55:14 2025 -0400
@@ -0,0 +1,224 @@
+--------------------------------------------------------------------------------
+dnadiff is a wrapper for nucmer and analysis utilities that provides
+detailed information on the differences between two genomes, and also
+provides a high level report file that quantifies the differences
+between the two inputs.
+
+Use Cases:
+    + diff'ing two strains of the same species
+    + diff'ing two assemblies of the same organism
+    + diff'ing a draft assembly and a closely related finished genome
+
+If any of this code is used in any publication, please cite the following:
+
+  Versatile and open software for comparing large genomes.
+  S. Kurtz, A. Phillippy, A.L. Delcher,
+  M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg.
+  Genome Biology (2004), 5:R12.
+
+--------------------------------------------------------------------------------
+
+This manual is also available as HTML documentation included in this
+distribution, or at:
+
+   http://mummer.sourceforge.net
+   http://mummer.sourceforge.net/manual
+   http://mummer.sourceforge.net/examples
+
+
+-- DESCRIPTION --
+   dnadiff is a wrapper around nucmer that builds an alignment using
+default parameters, and runs many of nucmer's helper scripts to
+process the output and report alignment statistics, SNPs, breakpoints,
+etc. It is designed for evaluating the sequence and structural
+similarity of two highly similar sequence sets, for example when
+comparing two different assemblies of the same organism, or two
+strains of the same species.
+
+
+-- dnadiff EXAMPLE --
+To compare two strains of the same species, type:
+
+"dnadiff genome1.fna genome2.fna"
+
+Output will be...
+   out.report  - Summary of alignments, differences and SNPs
+   out.delta   - Standard nucmer alignment output
+   out.1delta  - 1-to-1 alignment from delta-filter -1
+   out.mdelta  - M-to-M alignment from delta-filter -m
+   out.1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
+   out.mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
+   out.snps    - SNPs from show-snps -rlTHC .1delta
+   out.rdiff   - Classified ref breakpoints from show-diff -rH .mdelta
+   out.qdiff   - Classified qry breakpoints from show-diff -qH .mdelta
+   out.unref   - Unaligned reference sequence IDs and lengths
+   out.unqry   - Unaligned query sequence IDs and lengths
+
+For more information on the formats and meanings of all the files
+produced, please see the documentation for the corresponding
+utility. This document serves to describe running the dnadiff script
+and interpreting the produced .report file.
+
+
+-- RUNNING 'dnadiff' --
+
+  USAGE: dnadiff  [options]  <Reference>  <Query>
+    or   dnadiff  [options]  -d <Delta File>
+
+  DESCRIPTION:
+    Run comparative analysis of two sequence sets using nucmer and its
+    associated utilities with recommended parameters. See MUMmer
+    documentation for a more detailed description of the
+    output. Produces the following output files:
+
+    .delta   - Standard nucmer alignment output
+    .1delta  - 1-to-1 alignment from delta-filter -1
+    .mdelta  - M-to-M alignment from delta-filter -m
+    .1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
+    .mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
+    .snps    - SNPs from show-snps -rlTHC .1delta
+    .rdiff   - Classified alignment breakpoints from show-diff -rH .mdelta
+    .qdiff   - Classified alignment breakpoints from show-diff -qH .mdelta
+    .report  - Summary of alignments, differences and SNPs
+    .unref   - Unaligned reference sequence IDs and lengths
+    .unqry   - Unaligned query sequence IDs and lengths
+
+  MANDATORY:
+    Reference       Set the input reference multi-FASTA filename
+    Query           Set the input query multi-FASTA filename
+      or
+    Delta File      Unfiltered .delta alignment file from nucmer
+
+  OPTIONS:
+    -d|delta        Provide precomputed delta file for analysis
+    -h
+    --help          Display help information and exit
+    -p|prefix       Set the prefix of the output files (default "out")
+    -V
+    --version       Display the version information and exit
+
+
+-- NOTES --
+   The -p option is recommended to avoid overwriting previous
+output. A simple naming convention, for files A.fna and B.fna, is to
+set "-p A_B". It is safest to let dnadiff run nucmer automatically, so
+avoid using the -d option unless the delta file was already generated
+with "nucmer --maxmatch" and has not been filtered.
+
+
+-- OUTPUT FILES --
+   dnadiff produces many output files, but all except one are produced
+by other utilities in the MUMmer package. Please see their corresponding
+documentation for more information. This section describes only
+the .report file generated by dnadiff and gives tips on interpreting it.
+
+
+ *** .report OUTPUT ***
+
+   Report statistics are broken into two columns - reference and
+query. Rows are grouped by themed alignment metrics and are described
+here. Summary counts are estimates and do not represent the exact
+number of occurrences of a particular evolutionary event. Read a value
+in the reference column as the number of XYZ in the reference with
+respect to the query, and a value in the query column as the number of
+XYZ in the query with respect to the reference.
+
+[Sequences]    - Sequence-centric stats.
+TotalSeqs      - Total number of input sequences.
+AlignedSeqs    - Number of input sequences with at least one alignment.
+UnalignedSeqs  - Number of input sequences with no alignment.
+
+[Bases]        - Base-pair-centric stats.
+TotalBases     - Total number of bases in the input sequences.
+AlignedBases   - Total number of bases contained within an alignment.
+UnalignedBases - Total number of unaligned bases. This is a rough
+                 measure for the amount of "unique" sequence in the
+                 reference and query.
+
+[Alignments]   - Alignment-centric stats.
+1-to-1         - Number of alignment blocks comprising the 1-to-1
+                 mapping of reference to query. This is a subset of
+                 the M-to-M mapping, with repeats removed.
+TotalLength    - Total length of 1-to-1 alignment blocks.
+AvgLength      - Average length of 1-to-1 alignment blocks.
+AvgIdentity    - Average identity of 1-to-1 alignment blocks.
+
+M-to-M         - Number of alignment blocks comprising the
+                 many-to-many mapping of reference to query. The
+                 M-to-M mapping represents the smallest set of
+                 alignments that maximize the coverage of both
+                 reference and query. This is a superset of the 1-to-1
+                 mapping.
+TotalLength    - Total length of M-to-M alignment blocks.
+AvgLength      - Average length of M-to-M alignment blocks.
+AvgIdentity    - Average identity of M-to-M alignment blocks.
+
+[Features]     - Structural alignment features, such as
+                 rearrangements. These counts are rough estimates
+                 based on an automated analysis of the
+                 alignments. Features are identified by scanning the
+                 reference (or query) from low to high, and noting the
+                 positions where the query alignments are
+                 inconsistently ordered or oriented with respect to
+                 the reference.
+Breakpoints    - Number of non-maximal alignment endpoints,
+                 i.e. endpoints that do not occur at the beginning or
+                 end of a sequence.
+Relocations    - Number of breaks in the alignment where adjacent
+                 1-to-1 alignment blocks are in the same sequence, but
+                 not consistently ordered. A separate feature is
+                 recorded for each end of a relocation, so this is
+                 really a count of relocation endpoints.
+Translocations - Number of breaks in the alignment where adjacent
+                 1-to-1 alignment blocks are in different sequences. A
+                 separate feature is recorded for each end of a
+                 translocation, so this is really a count of
+                 translocation endpoints.
+Inversions     - Number of breaks in the alignment where adjacent
+                 1-to-1 alignment blocks are inverted with respect to
+                 one another. A separate feature is recorded for each
+                 end of an inversion, so this is really a count of
+                 inversion endpoints.
+
+Insertions     - Rough count of insertion events. Note that this is
+                 slightly different from "UnalignedBases" because it
+                 counts duplications as insertions, whereas
+                 UnalignedBases does not. Also, this count does not
+                 include sequences that have no alignments as
+                 insertions, whereas UnalignedBases does. Note that
+                 insertions in R can be viewed as deletions from Q.
+                 This number reports only "major" insertions, defined
+                 as insertions large enough to break an alignment.
+                 Nucmer will align through smaller insertions of less
+                 than ~60 bases. These smaller insertions are
+                 reported in the "Indels" count below.
+InsertionSum   - Rough sum of inserted sequence.
+InsertionAvg   - Average length of insertion.
+
+TandemIns      - Rough count of tandem duplication insertion
+                 events. Note that expansions in R can be viewed as
+                 collapses in Q.
+TandemInsSum   - Rough sum of tandem duplication insertions.
+TandemInsAvg   - Average length of tandem duplications.
+
+[SNPs]         - Single Nucleotide Polymorphism counts.
+TotalSNPs      - Total number of SNPs, same for both sequences.
+XY             - X-to-Y SNP. For reference column, this means
+                 reference 'X' to query 'Y'. For query column, this
+                 means query 'X' to reference 'Y'. The same
+                 convention applies below.
+
+TotalGSNPs     - Single Nucleotide Polymorphisms bounded by 20 exact
+                 base-pair matches on both sides.
+
+TotalIndels    - Single Nucleotide Insertions/Deletions.
+X.             - X insertion. For reference column, 'X.' means
+                 insertion of 'X' in the reference. For query column,
+                 'X.' means insertion of 'X' in the query. Nucmer will
+                 align through group insertions of up to ~60 bases.
+                 Each base of these group insertions will be reported
+                 in this count. Large insertions will be reported in
+                 the "Insertions" count above.
+
+TotalGIndels   - Single Nucleotide Insertions/Deletions bounded by 20
+                 exact base-pair matches on both sides.
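
Reviewer note: the README describes the .report file as bracketed
section headers ([Sequences], [Bases], ...) followed by rows with a
reference column and a query column. A minimal Python sketch of reading
such a file is given here for illustration only. It is not part of
MUMmer. The helper name parse_report is hypothetical, the sample values
are made up, and the row layout (metric name, then two
whitespace-separated values) is an assumption that should be checked
against a real out.report. Values are kept as strings, and rows are
returned as a list rather than a dictionary because names such as
TotalLength appear under both the 1-to-1 and M-to-M groups.

```python
import re


def parse_report(text):
    """Return a list of (section, metric, ref_value, qry_value) tuples.

    Assumes section headers look like "[Name]" on a line of their own and
    that each metric row is "<metric> <ref> <qry>" separated by whitespace.
    """
    rows = []
    section = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        header = re.fullmatch(r"\[(\w+)\]", line)
        if header:
            section = header.group(1)
            continue
        fields = line.split()
        # Keep only rows that match the assumed three-field layout.
        if section is not None and len(fields) == 3:
            rows.append((section, fields[0], fields[1], fields[2]))
    return rows


# Made-up sample in the assumed layout, not real dnadiff output.
SAMPLE = """\
[Sequences]
TotalSeqs          2          3
AlignedSeqs        2          2

[Alignments]
1-to-1            10         10
TotalLength     5000       4990
M-to-M            14         14
TotalLength     6200       6150
"""

rows = parse_report(SAMPLE)
for section, metric, ref, qry in rows:
    print(section, metric, ref, qry)
```

Keeping the rows in file order means the TotalLength row that follows
"1-to-1" can be told apart from the one that follows "M-to-M".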