comparison CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/mummer-3.23/docs/dnadiff.README @ 69:33d812a61356

planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author jpayne
date Tue, 18 Mar 2025 17:55:14 -0400
1 --------------------------------------------------------------------------------
2 dnadiff is a wrapper for nucmer and analysis utilities that provides
3 detailed information on the differences between two genomes, and also
4 provides a high level report file that quantifies the differences
5 between the two inputs.
6
7 Use Cases:
8 + diff'ing two strains of the same species
9 + diff'ing two assemblies of the same organism
10 + diff'ing a draft assembly and a closely related finished genome
11
12 If any of this code is used in any publication, please cite the following:
13
14 Versatile and open software for comparing large genomes.
15 S. Kurtz, A. Phillippy, A.L. Delcher,
16 M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg.
17 Genome Biology (2004), 5:R12.
18
19 --------------------------------------------------------------------------------
20
21 This manual is also available as HTML documentation included in this
22 distribution, or at:
23
24 http://mummer.sourceforge.net
25 http://mummer.sourceforge.net/manual
26 http://mummer.sourceforge.net/examples
27
28
29 -- DESCRIPTION --
30 dnadiff is a wrapper around nucmer that builds an alignment using
31 default parameters, and runs many of nucmer's helper scripts to
32 process the output and report alignment statistics, SNPs, breakpoints,
33 etc. It is designed for evaluating the sequence and structural
34 similarity of two highly similar sequence sets, e.g. comparing two
35 different assemblies of the same organism, or comparing two strains of
36 the same species.
37
38
39 -- dnadiff EXAMPLE --
40 To compare two strains of the same species, type:
41
42 "dnadiff genome1.fna genome2.fna"
43
44 Output will be...
45 out.report - Summary of alignments, differences and SNPs
46 out.delta - Standard nucmer alignment output
47 out.1delta - 1-to-1 alignment from delta-filter -1
48 out.mdelta - M-to-M alignment from delta-filter -m
49 out.1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
50 out.mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
51 out.snps - SNPs from show-snps -rlTHC .1delta
52 out.rdiff - Classified ref breakpoints from show-diff -rH .mdelta
53 out.qdiff - Classified qry breakpoints from show-diff -qH .mdelta
54 out.unref - Unaligned reference sequence IDs and lengths
55 out.unqry - Unaligned query sequence IDs and lengths
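
When scripting around a run, it can help to know which files to expect for a given prefix. The sketch below is only an illustration: the helper names `expected_outputs` and `missing_outputs` are hypothetical and not part of MUMmer; only the suffixes are taken from the list above.

```python
from pathlib import Path

# Suffixes as listed in this README; the helpers themselves are hypothetical.
DNADIFF_SUFFIXES = [
    "report", "delta", "1delta", "mdelta", "1coords", "mcoords",
    "snps", "rdiff", "qdiff", "unref", "unqry",
]


def expected_outputs(prefix="out"):
    """Return the output paths dnadiff is documented to write for a prefix."""
    return [Path(f"{prefix}.{suffix}") for suffix in DNADIFF_SUFFIXES]


def missing_outputs(prefix="out"):
    """Return the documented outputs that do not exist on disk."""
    return [path for path in expected_outputs(prefix) if not path.exists()]
```

Calling `missing_outputs("out")` after a run gives a quick check that no expected file is absent.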
56
57 For more information on the formats and meanings of all the files
58 produced, please see the documentation for the corresponding
59 utility. This document serves to describe running the dnadiff script
60 and interpreting the produced .report file.
61
62
63 -- RUNNING 'dnadiff' --
64
65 USAGE: dnadiff [options] <Reference> <Query>
66 or dnadiff [options] -d <Delta File>
67
68 DESCRIPTION:
69 Run comparative analysis of two sequence sets using nucmer and its
70 associated utilities with recommended parameters. See MUMmer
71 documentation for a more detailed description of the
72 output. Produces the following output files:
73
74 .delta - Standard nucmer alignment output
75 .1delta - 1-to-1 alignment from delta-filter -1
76 .mdelta - M-to-M alignment from delta-filter -m
77 .1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
78 .mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
79 .snps - SNPs from show-snps -rlTHC .1delta
80 .rdiff - Classified alignment breakpoints from show-diff -rH .mdelta
81 .qdiff - Classified alignment breakpoints from show-diff -qH .mdelta
82 .report - Summary of alignments, differences and SNPs
83 .unref - Unaligned reference sequence IDs and lengths
84 .unqry - Unaligned query sequence IDs and lengths
85
86 MANDATORY:
87 Reference Set the input reference multi-FASTA filename
88 Query Set the input query multi-FASTA filename
89 or
90 Delta File Unfiltered .delta alignment file from nucmer
91
92 OPTIONS:
93 -d|delta Provide precomputed delta file for analysis
94 -h
95 --help Display help information and exit
96 -p|prefix Set the prefix of the output files (default "out")
97 -V
98 --version Display the version information and exit
99
100
101 -- NOTES --
102 The -p option is recommended to avoid overwriting previous
103 output. A simple naming convention is for files A.fna and B.fna, to
104 set "-p A_B". It is safest to let dnadiff run nucmer automatically, so
105 avoid using the -d option unless the delta file was already generated
106 with "nucmer --maxmatch" and has not been filtered.
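
The naming convention above can be sketched in a few lines of Python. The function names `default_prefix` and `dnadiff_command` are hypothetical helpers for illustration; the only dnadiff details used are the -p option and the argument order from the USAGE line.

```python
import shlex
from pathlib import Path


def default_prefix(reference, query):
    """Derive 'A_B' from 'A.fna' and 'B.fna', following the note above."""
    return f"{Path(reference).stem}_{Path(query).stem}"


def dnadiff_command(reference, query, prefix=None):
    """Build the argument list for: dnadiff -p <prefix> <Reference> <Query>."""
    if prefix is None:
        prefix = default_prefix(reference, query)
    return ["dnadiff", "-p", prefix, str(reference), str(query)]


cmd = dnadiff_command("A.fna", "B.fna")
print(shlex.join(cmd))  # dnadiff -p A_B A.fna B.fna
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a system where dnadiff is installed. Note that the derived prefix drops any directory part, so output lands in the current working directory.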
107
108
109 -- OUTPUT FILES --
110 dnadiff produces many outputs; however, all but one are produced by
111 other utilities in the MUMmer package. Please see their corresponding
112 documentation for more information. This section will only describe
113 the .report file generated by dnadiff and tips on interpreting it.
114
115
116 *** .report OUTPUT ***
117
118 Report statistics are broken into two columns - reference and
119 query. Rows are grouped by themed alignment metrics and are described
120 here. Summary counts are estimates and do not represent the exact
121 number of occurrences of a particular evolutionary event. Read a
122 reference column as the number of XYZ in the reference relative to
123 the query, and a query column as the number of XYZ in the query
124 relative to the reference.
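
For scripted inspection of a .report file, a minimal parsing sketch follows. It assumes that each section header sits on its own line in square brackets and that each metric row has three whitespace-separated fields (metric name, reference value, query value); that layout is an assumption of this sketch, so check it against a real report before relying on it. The function name `parse_report` and the sample numbers are made up for illustration.

```python
import re

SECTION = re.compile(r"^\[(\w+)\]$")


def parse_report(text):
    """Parse report text into {section: [(metric, ref_value, qry_value), ...]}.

    Lines that are neither a section header nor a three-field row are skipped.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        match = SECTION.match(line.strip())
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None and len(fields) == 3:
            sections[current].append(tuple(fields))
    return sections


# Illustrative text with made-up numbers, not output from a real run.
SAMPLE = """\
[Sequences]
TotalSeqs        2        3
AlignedSeqs      2        2
UnalignedSeqs    0        1

[Bases]
TotalBases    1000     1200
"""

report = parse_report(SAMPLE)
```

Rows are kept as an ordered list rather than a dictionary because metric names such as TotalLength appear more than once within the [Alignments] section. Values are left as strings so that no assumption is made about their numeric format.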
125
126 [Sequences] - Sequence-centric stats.
127 TotalSeqs - Total number of input sequences.
128 AlignedSeqs - Number of input sequences with at least one alignment.
129 UnalignedSeqs - Number of input sequences with no alignment.
130
131 [Bases] - Base-pair-centric stats.
132 TotalBases - Total number of bases in the input sequences.
133 AlignedBases - Total number of bases contained within an alignment.
134 UnalignedBases - Total number of unaligned bases. This is a rough
135 measure for the amount of "unique" sequence in the
136 reference and query.
137
138 [Alignments] - Alignment-centric stats.
139 1-to-1 - Number of alignment blocks comprising the 1-to-1
140 mapping of reference to query. This is a subset of
141 the M-to-M mapping, with repeats removed.
142 TotalLength - Total length of 1-to-1 alignment blocks.
143 AvgLength - Average length of 1-to-1 alignment blocks.
144 AvgIdentity - Average identity of 1-to-1 alignment blocks.
145
146 M-to-M - Number of alignment blocks comprising the
147 many-to-many mapping of reference to query. The
148 M-to-M mapping represents the smallest set of
149 alignments that maximize the coverage of both
150 reference and query. This is a superset of the 1-to-1
151 mapping.
152 TotalLength - Total length of M-to-M alignment blocks.
153 AvgLength - Average length of M-to-M alignment blocks.
154 AvgIdentity - Average identity of M-to-M alignment blocks.
155
156 [Features] - Structural alignment features, such as
157 rearrangements. These counts are rough estimates
158 based on an automated analysis of the
159 alignments. Features are identified by scanning the
160 reference (or query) from low to high, and noting the
161 positions where the query alignments are
162 inconsistently ordered or oriented with respect to
163 the reference.
164 Breakpoints - Number of non-maximal alignment endpoints,
165 i.e. endpoints that do not occur at the beginning or
166 end of a sequence.
167 Relocations - Number of breaks in the alignment where adjacent
168 1-to-1 alignment blocks are in the same sequence, but
169 not consistently ordered. A separate feature is
170 recorded for each end of a relocation, so this is
171 really a count of relocation endpoints.
172 Translocations - Number of breaks in the alignment where adjacent
173 1-to-1 alignment blocks are in different sequences. A
174 separate feature is recorded for each end of a
175 translocation, so this is really a count of
176 translocation endpoints.
177 Inversions - Number of breaks in the alignment where adjacent
178 1-to-1 alignment blocks are inverted with respect to
179 one another. A separate feature is recorded for each
180 end of an inversion, so this is really a count of
181 inversion endpoints.
182
183 Insertions - Rough count of insertion events. Note that this is
184 slightly different from "UnalignedBases" because it
185 counts duplications as insertions, whereas
186 UnalignedBases does not. Also, this count does not
187 include sequences that have no alignments as
188 insertions, whereas UnalignedBases does. Note that
189 insertions in R can be viewed as deletions from Q.
190 This number reports only "major" insertions defined
191 as insertions large enough to break an alignment.
192 Nucmer will align through smaller insertions of less
193 than ~60 bases. These smaller insertions are
194 reported in the "Indels" count below.
195 InsertionSum - Rough sum of inserted sequence.
196 InsertionAvg - Average length of insertion.
197
198 TandemIns - Rough count of tandem duplication insertion
199 events. Note that expansions in R can be viewed as
200 collapses in Q.
201 TandemInsSum - Rough sum of tandem duplication insertions.
202 TandemInsAvg - Average length of tandem duplications.
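
Two pieces of arithmetic follow from the [Features] descriptions and can be sketched as below. Both are assumptions rather than documented behavior: halving an endpoint count presumes every event contributes exactly two recorded endpoints, and the average presumes that the Avg rows are the Sum rows divided by the event counts. The function names are hypothetical.

```python
def estimated_events(endpoint_count):
    """Rough event estimate, assuming two recorded endpoints per event."""
    return endpoint_count / 2


def average_length(total_length, count):
    """Average length per event, or 0.0 when there are no events."""
    return total_length / count if count else 0.0
```

For example, a Relocations value of 6 would suggest roughly 3 relocation events under the two-endpoint assumption. Because the counts themselves are described as rough estimates, treat any derived figure the same way.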
203
204 [SNPs] - Single Nucleotide Polymorphism counts.
205 TotalSNPs - Total number of SNPs, same for both sequences.
206 XY - X-to-Y SNP. For reference column, this means
207 reference 'X' to query 'Y'. For query column, this
208 means query 'X' to reference 'Y'. The same
209 convention applies below.
210
211 TotalGSNPs - Single Nucleotide Polymorphisms bounded by 20 exact,
212 base-pair matches on both sides.
213
214 TotalIndels - Single Nucleotide Insertions/Deletions.
215 X. - X insertion. For reference column, 'X.' means
216 insertion of 'X' in the reference. For query column,
217 'X.' means insertion of 'X' in the query. Nucmer will
218 align through group insertions of up to ~60 bases.
219 Each base of these group insertions will be reported
220 in this count. Large insertions will be reported in
221 the "Insertions" count above.
222
223 TotalGIndels - Single Nucleotide Insertions/Deletions bounded by 20
224 exact, base-pair matches on both sides.
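
The two [SNPs] conventions, the XY column labels and the 20-base exact-match flank used for TotalGSNPs, can be illustrated with a short sketch. This is not how dnadiff or show-snps computes these values; it is a simplified model that assumes two equal-length, gap-free aligned strings, and the function names `snp_columns` and `is_gsnp` are hypothetical.

```python
from collections import Counter


def snp_columns(snps):
    """Tally X-to-Y substitutions for the reference and query columns.

    `snps` is an iterable of (reference_base, query_base) pairs.
    """
    ref_counts = Counter()
    qry_counts = Counter()
    for ref_base, qry_base in snps:
        ref_counts[ref_base + qry_base] += 1  # reference 'X' to query 'Y'
        qry_counts[qry_base + ref_base] += 1  # query 'X' to reference 'Y'
    return ref_counts, qry_counts


def is_gsnp(ref, qry, pos, flank=20):
    """Return True if the mismatch at `pos` has `flank` exact matches per side.

    `ref` and `qry` must be equal-length, gap-free aligned strings and `pos`
    a non-negative index (assumptions made for this sketch).
    """
    if len(ref) != len(qry):
        raise ValueError("aligned sequences must have equal length")
    if ref[pos] == qry[pos]:
        return False  # not a SNP at this position
    left_start = pos - flank
    right_end = pos + 1 + flank
    if left_start < 0 or right_end > len(ref):
        return False  # not enough flanking sequence on one side
    return (ref[left_start:pos] == qry[left_start:pos]
            and ref[pos + 1:right_end] == qry[pos + 1:right_end])


ref_counts, qry_counts = snp_columns([("A", "G"), ("A", "G"), ("C", "T")])
```

In this example the reference column would show AG = 2 and CT = 1, while the query column would show GA = 2 and TC = 1; the totals agree, which matches the note that TotalSNPs is the same for both sequences.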