--------------------------------------------------------------------------------
   dnadiff is a wrapper for nucmer and analysis utilities that provides
   detailed information on the differences between two genomes, and also
   provides a high level report file that quantifies the differences
   between the two inputs.

   Use Cases:
     + diff'ing two strains of the same species
     + diff'ing two assemblies of the same organism
     + diff'ing a draft assembly and a closely related finished genome

   If any of this code is used in any publication, please cite the following:

     Versatile and open software for comparing large genomes.
     S. Kurtz, A. Phillippy, A.L. Delcher,
     M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg.
     Genome Biology (2004), 5:R12.

--------------------------------------------------------------------------------

This manual is also available as HTML documentation included in this
distribution, or at:

    http://mummer.sourceforge.net
    http://mummer.sourceforge.net/manual
    http://mummer.sourceforge.net/examples


-- DESCRIPTION --
dnadiff is a wrapper around nucmer that builds an alignment using
default parameters, and runs many of nucmer's helper scripts to
process the output and report alignment statistics, SNPs, breakpoints,
etc. It is designed for evaluating the sequence and structural
similarity of two highly similar sequence sets, for example when
comparing two different assemblies of the same organism, or two
strains of the same species.


-- dnadiff EXAMPLE --
To compare two strains of the same species, type:

    "dnadiff genome1.fna genome2.fna"

Output will be...
    out.report  - Summary of alignments, differences and SNPs
    out.delta   - Standard nucmer alignment output
    out.1delta  - 1-to-1 alignment from delta-filter -1
    out.mdelta  - M-to-M alignment from delta-filter -m
    out.1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
    out.mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
    out.snps    - SNPs from show-snps -rlTHC .1delta
    out.rdiff   - Classified ref breakpoints from show-diff -rH .mdelta
    out.qdiff   - Classified qry breakpoints from show-diff -qH .mdelta
    out.unref   - Unaligned reference sequence IDs and lengths
    out.unqry   - Unaligned query sequence IDs and lengths

For more information on the formats and meanings of all the files
produced, please see the documentation for the corresponding
utility. This document serves to describe running the dnadiff script
and interpreting the produced .report file.
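
As a small illustration of the file list above, the following Python
sketch derives the expected output file names for a given prefix. The
names OUTPUT_SUFFIXES and expected_outputs are hypothetical helpers
introduced here; they are not part of dnadiff or MUMmer. The suffixes
are copied from the list above.

```python
# Suffixes copied from the output list in this document.
OUTPUT_SUFFIXES = (
    "report", "delta", "1delta", "mdelta", "1coords", "mcoords",
    "snps", "rdiff", "qdiff", "unref", "unqry",
)


def expected_outputs(prefix="out"):
    """Return the file names dnadiff is documented to write for `prefix`."""
    return [f"{prefix}.{suffix}" for suffix in OUTPUT_SUFFIXES]
```

A script could use this list after a run to check that every
documented file is present before moving on to later analysis.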


-- RUNNING 'dnadiff' --

  USAGE: dnadiff  [options]  <Reference>  <Query>
    or   dnadiff  [options]  -d <Delta File>

  DESCRIPTION:
    Run comparative analysis of two sequence sets using nucmer and its
    associated utilities with recommended parameters. See MUMmer
    documentation for a more detailed description of the
    output. Produces the following output files:

    .delta    - Standard nucmer alignment output
    .1delta   - 1-to-1 alignment from delta-filter -1
    .mdelta   - M-to-M alignment from delta-filter -m
    .1coords  - 1-to-1 coordinates from show-coords -THrcl .1delta
    .mcoords  - M-to-M coordinates from show-coords -THrcl .mdelta
    .snps     - SNPs from show-snps -rlTHC .1delta
    .rdiff    - Classified alignment breakpoints from show-diff -rH .mdelta
    .qdiff    - Classified alignment breakpoints from show-diff -qH .mdelta
    .report   - Summary of alignments, differences and SNPs
    .unref    - Unaligned reference sequence IDs and lengths
    .unqry    - Unaligned query sequence IDs and lengths

  MANDATORY:
    Reference       Set the input reference multi-FASTA filename
    Query           Set the input query multi-FASTA filename
      or
    Delta File      Unfiltered .delta alignment file from nucmer

  OPTIONS:
    -d|delta        Provide precomputed delta file for analysis
    -h
    --help          Display help information and exit
    -p|prefix       Set the prefix of the output files (default "out")
    -V
    --version       Display the version information and exit


-- NOTES --
The -p option is recommended to avoid overwriting previous
output. A simple naming convention, for files A.fna and B.fna, is to
set "-p A_B". It is safest to let dnadiff run nucmer automatically, so
avoid using the -d option unless the delta file was already generated
with "nucmer --maxmatch" and has not been filtered.
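
The naming convention above can be sketched in Python as follows. This
is only an illustration: default_prefix and dnadiff_command are
hypothetical helper names, not part of dnadiff, and the sketch only
builds the argument list without running anything.

```python
from pathlib import Path


def default_prefix(reference, query):
    """Derive "A_B" from "A.fna" and "B.fna", per the convention above."""
    return f"{Path(reference).stem}_{Path(query).stem}"


def dnadiff_command(reference, query, prefix=None):
    """Build the argument list for: dnadiff -p <prefix> <Reference> <Query>."""
    if prefix is None:
        prefix = default_prefix(reference, query)
    return ["dnadiff", "-p", prefix, str(reference), str(query)]


# Example: dnadiff_command("A.fna", "B.fna")
# gives ['dnadiff', '-p', 'A_B', 'A.fna', 'B.fna']
```

The returned list could be handed to subprocess.run, which assumes a
dnadiff executable is available on the PATH.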


-- OUTPUT FILES --
dnadiff produces many outputs; however, all but one are produced by
other utilities in the MUMmer package. Please see their corresponding
documentation for more information. This section describes only
the .report file generated by dnadiff and gives tips on interpreting it.


*** .report OUTPUT ***

Report statistics are broken into two columns - reference and
query. Rows are grouped by themed alignment metrics and are described
here. Summary counts are estimates and do not represent the exact
number of occurrences of a particular evolutionary event. Read a value
in the reference column as "number of XYZ in the reference with regard
to the query", and a value in the query column as "number of XYZ in
the query with regard to the reference".

[Sequences]     - Sequence-centric stats.
TotalSeqs       - Total number of input sequences.
AlignedSeqs     - Number of input sequences with at least one alignment.
UnalignedSeqs   - Number of input sequences with no alignment.

[Bases]         - Base-pair-centric stats.
TotalBases      - Total number of bases in the input sequences.
AlignedBases    - Total number of bases contained within an alignment.
UnalignedBases  - Total number of unaligned bases. This is a rough
                  measure for the amount of "unique" sequence in the
                  reference and query.
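
The [Sequences] and [Bases] rows above could be read programmatically
with a sketch like the one below. The exact layout of the .report file
is not specified in this document, so the parser rests on assumptions:
that each section header such as "[Bases]" sits on its own line, and
that each statistic row has exactly three whitespace-separated fields
(name, reference value, query value), where a count may carry a
trailing percentage such as "950(95.00%)". The names parse_report and
leading_int are hypothetical, and the sample numbers are invented.

```python
import re


def parse_report(text):
    """Parse .report text into {section: {row_name: (ref, qry)}}.

    Assumes "[Section]" header lines and three-field statistic rows.
    Lines that fit neither shape are skipped.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 1 and re.fullmatch(r"\[\w+\]", fields[0]):
            current = sections.setdefault(fields[0].strip("[]"), {})
        elif len(fields) == 3 and current is not None:
            name, ref, qry = fields
            current[name] = (ref, qry)
    return sections


def leading_int(value):
    """Return the integer at the start of a value such as "950(95.00%)"."""
    return int(re.match(r"\d+", value).group())


# Invented sample in the assumed layout, for illustration only.
SAMPLE = """\
[Sequences]
TotalSeqs                1            2
AlignedSeqs     1(100.00%)    1(50.00%)
UnalignedSeqs     0(0.00%)    1(50.00%)

[Bases]
TotalBases            1000         1200
AlignedBases   950(95.00%)  900(75.00%)
UnalignedBases   50(5.00%)  300(25.00%)
"""

report = parse_report(SAMPLE)
ref_total = leading_int(report["Bases"]["TotalBases"][0])
ref_aligned = leading_int(report["Bases"]["AlignedBases"][0])
ref_aligned_fraction = ref_aligned / ref_total
```

Keeping the raw strings in the parsed structure leaves it to the
caller to decide whether a row holds a count, a percentage or an
average.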

[Alignments]    - Alignment-centric stats.
1-to-1          - Number of alignment blocks comprising the 1-to-1
                  mapping of reference to query. This is a subset of
                  the M-to-M mapping, with repeats removed.
TotalLength     - Total length of 1-to-1 alignment blocks.
AvgLength       - Average length of 1-to-1 alignment blocks.
AvgIdentity     - Average identity of 1-to-1 alignment blocks.

M-to-M          - Number of alignment blocks comprising the
                  many-to-many mapping of reference to query. The
                  M-to-M mapping represents the smallest set of
                  alignments that maximize the coverage of both
                  reference and query. This is a superset of the 1-to-1
                  mapping.
TotalLength     - Total length of M-to-M alignment blocks.
AvgLength       - Average length of M-to-M alignment blocks.
AvgIdentity     - Average identity of M-to-M alignment blocks.
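
The relationship between the [Alignments] rows above can be written
out as a short Python sketch. It follows the definitions given here
(average length is total length over number of blocks, and the 1-to-1
mapping is a subset of the M-to-M mapping); the function names are
hypothetical, and any rounding dnadiff itself applies is not covered.

```python
def average_block_length(total_length, block_count):
    """AvgLength as defined above: TotalLength divided by block count."""
    if block_count == 0:
        return 0.0
    return total_length / block_count


def counts_are_consistent(one_to_one_blocks, m_to_m_blocks):
    """1-to-1 is a subset of M-to-M, so it cannot have more blocks."""
    return one_to_one_blocks <= m_to_m_blocks
```

A check like counts_are_consistent is mainly useful as a sanity test
when report values are copied or parsed by hand.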

[Features]      - Structural alignment features, such as
                  rearrangements. These counts are rough estimates
                  based on an automated analysis of the
                  alignments. Features are identified by scanning the
                  reference (or query) from low to high, and noting the
                  positions where the query alignments are
                  inconsistently ordered or oriented with respect to
                  the reference.
Breakpoints     - Number of non-maximal alignment endpoints,
                  i.e. endpoints that do not occur at the beginning or
                  end of a sequence.
Relocations     - Number of breaks in the alignment where adjacent
                  1-to-1 alignment blocks are in the same sequence, but
                  not consistently ordered. A separate feature is
                  recorded for each end of a relocation, so this is
                  really a count of relocation endpoints.
Translocations  - Number of breaks in the alignment where adjacent
                  1-to-1 alignment blocks are in different sequences. A
                  separate feature is recorded for each end of a
                  translocation, so this is really a count of
                  translocation endpoints.
Inversions      - Number of breaks in the alignment where adjacent
                  1-to-1 alignment blocks are inverted with respect to
                  one another. A separate feature is recorded for each
                  end of an inversion, so this is really a count of
                  inversion endpoints.

Insertions      - Rough count of insertion events. Note that this is
                  slightly different from "UnalignedBases" because it
                  counts duplications as insertions, whereas
                  UnalignedBases does not. Also, this count does not
                  include sequences that have no alignments as
                  insertions, whereas UnalignedBases does. Note that
                  insertions in R can be viewed as deletions from Q.
                  This number reports only "major" insertions, defined
                  as insertions large enough to break an alignment.
                  Nucmer will align through smaller insertions of less
                  than ~60 bases. These smaller insertions are
                  reported in the "Indels" count below.
InsertionSum    - Rough sum of inserted sequence.
InsertionAvg    - Average length of insertion.

TandemIns       - Rough count of tandem duplication insertion
                  events. Note that expansions in R can be viewed as
                  collapses in Q.
TandemInsSum    - Rough sum of tandem duplication insertions.
TandemInsAvg    - Average length of tandem duplications.
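
Because Relocations, Translocations and Inversions are described above
as endpoint counts, one way to turn them into a rough event estimate
is to halve them. The sketch below does that and also shows the
average implied by InsertionSum and Insertions. Both helpers are
hypothetical and only restate the definitions in this section; they
are not how dnadiff computes its report.

```python
def estimated_events(endpoint_count):
    """Halve an endpoint count (two endpoints per event) for a rough
    number of relocation, translocation or inversion events."""
    return endpoint_count / 2


def insertion_average(insertion_sum, insertions):
    """InsertionAvg as defined above: InsertionSum over Insertions."""
    if insertions == 0:
        return 0.0
    return insertion_sum / insertions
```

Since the underlying counts are themselves rough estimates, the halved
value is best treated as an order-of-magnitude guide.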

[SNPs]          - Single Nucleotide Polymorphism counts.
TotalSNPs       - Total number of SNPs, same for both sequences.
XY              - X-to-Y SNP. For reference column, this means
                  reference 'X' to query 'Y'. For query column, this
                  means query 'X' to reference 'Y'. The same
                  convention applies below.

TotalGSNPs      - Single Nucleotide Polymorphisms bounded by 20 exact,
                  base-pair matches on both sides.

TotalIndels     - Single Nucleotide Insertions/Deletions.
X.              - X insertion. For reference column, 'X.' means
                  insertion of 'X' in the reference. For query column,
                  'X.' means insertion of 'X' in the query. Nucmer will
                  align through group insertions of up to ~60 bases.
                  Each base of these group insertions will be reported
                  in this count. Large insertions will be reported in
                  the "Insertions" count above.

TotalGIndels    - Single Nucleotide Insertions/Deletions bounded by 20
                  exact, base-pair matches on both sides.
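
The "XY" naming convention and the "bounded by 20 exact matches" rule
above can be illustrated with a minimal Python sketch. It assumes two
equal-length, gap-free aligned strings, which is a simplification made
for this example. The names mirror_snp_key and is_gsnp are
hypothetical, and this is not the code dnadiff uses to produce its
counts.

```python
def mirror_snp_key(key):
    """Map a reference-column key to the query-column key for the same
    SNP: reference 'X' to query 'Y' is query 'Y' to reference 'X'."""
    x, y = key
    return y + x


def is_gsnp(ref, qry, pos, flank=20):
    """Return True if position `pos` is a mismatch with `flank` exact
    matches on both sides. `ref` and `qry` must be equal-length,
    gap-free aligned strings (an assumption of this sketch)."""
    if len(ref) != len(qry):
        raise ValueError("aligned strings must have equal length")
    if ref[pos] == qry[pos]:
        return False  # identical bases, so not a SNP at all
    if pos < flank or pos + flank >= len(ref):
        return False  # not enough sequence for a full flank
    left_ok = ref[pos - flank:pos] == qry[pos - flank:pos]
    right_ok = ref[pos + 1:pos + 1 + flank] == qry[pos + 1:pos + 1 + flank]
    return left_ok and right_ok
```

With this convention, a SNP counted as "AG" in the reference column
corresponds to "GA" in the query column, and two SNPs closer together
than the flank length would both fail the is_gsnp test.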