comparison CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/mummer-3.23/docs/dnadiff.README @ 69:33d812a61356
planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author | jpayne |
---|---|
date | Tue, 18 Mar 2025 17:55:14 -0400 |
parents | |
children |
67:0e9998148a16 | 69:33d812a61356 |
---|---|
1 -------------------------------------------------------------------------------- | |
2 dnadiff is a wrapper for nucmer and analysis utilities that provides | |
3 detailed information on the differences between two genomes, and also | |
4 provides a high level report file that quantifies the differences | |
5 between the two inputs. | |
6 | |
7 Use Cases: | |
8 + diff'ing two strains of the same species | |
9 + diff'ing two assemblies of the same organism | |
10 + diff'ing a draft assembly and a closely related finished genome | |
11 | |
12 If any of this code is used in any publication, please cite the following: | |
13 | |
14 Versatile and open software for comparing large genomes. | |
15 S. Kurtz, A. Phillippy, A.L. Delcher, | |
16 M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg. | |
17 Genome Biology (2004), 5:R12. | |
18 | |
19 -------------------------------------------------------------------------------- | |
20 | |
21 This manual is also available as HTML documentation included in this | |
22 distribution, or at: | |
23 | |
24 http://mummer.sourceforge.net | |
25 http://mummer.sourceforge.net/manual | |
26 http://mummer.sourceforge.net/examples | |
27 | |
28 | |
29 -- DESCRIPTION -- | |
30 dnadiff is a wrapper around nucmer that builds an alignment using | |
31 default parameters, and runs many of nucmer's helper scripts to | |
32 process the output and report alignment statistics, SNPs, breakpoints, | |
33 etc. It is designed for evaluating the sequence and structural | |
34 similarity of two highly similar sequence sets, for example two | |
35 different assemblies of the same organism, or two strains of | |
36 the same species. | |
37 | |
38 | |
39 -- dnadiff EXAMPLE -- | |
40 To compare two strains of the same species, type: | |
41 | |
42 "dnadiff genome1.fna genome2.fna" | |
43 | |
44 Output will be... | |
45 out.report - Summary of alignments, differences and SNPs | |
46 out.delta - Standard nucmer alignment output | |
47 out.1delta - 1-to-1 alignment from delta-filter -1 | |
48 out.mdelta - M-to-M alignment from delta-filter -m | |
49 out.1coords - 1-to-1 coordinates from show-coords -THrcl .1delta | |
50 out.mcoords - M-to-M coordinates from show-coords -THrcl .mdelta | |
51 out.snps - SNPs from show-snps -rlTHC .1delta | |
52 out.rdiff - Classified ref breakpoints from show-diff -rH .mdelta | |
53 out.qdiff - Classified qry breakpoints from show-diff -qH .mdelta | |
54 out.unref - Unaligned reference sequence IDs and lengths | |
55 out.unqry - Unaligned query sequence IDs and lengths | |
56 | |
57 For more information on the formats and meanings of all the files | |
58 produced, please see the documentation for the corresponding | |
59 utility. This document serves to describe running the dnadiff script | |
60 and interpreting the produced .report file. | |
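
As a quick check after a run, the file list above can be turned into a small helper. This is only a sketch: the suffixes are copied from the list in this README, and the function name expected_outputs is hypothetical, not part of MUMmer.

```python
# Suffixes copied from the output list in this README.
DNADIFF_SUFFIXES = [
    "report", "delta", "1delta", "mdelta", "1coords", "mcoords",
    "snps", "rdiff", "qdiff", "unref", "unqry",
]


def expected_outputs(prefix="out"):
    """Return the file names dnadiff is documented to write for `prefix`."""
    return [f"{prefix}.{suffix}" for suffix in DNADIFF_SUFFIXES]


if __name__ == "__main__":
    # Print the names for the default prefix, one per line.
    for name in expected_outputs():
        print(name)
```

Comparing this list against the contents of the working directory is one way to spot a run that stopped early.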
61 | |
62 | |
63 -- RUNNING 'dnadiff' -- | |
64 | |
65 USAGE: dnadiff [options] <Reference> <Query> | |
66 or dnadiff [options] -d <Delta File> | |
67 | |
68 DESCRIPTION: | |
69 Run comparative analysis of two sequence sets using nucmer and its | |
70 associated utilities with recommended parameters. See MUMmer | |
71 documentation for a more detailed description of the | |
72 output. Produces the following output files: | |
73 | |
74 .delta - Standard nucmer alignment output | |
75 .1delta - 1-to-1 alignment from delta-filter -1 | |
76 .mdelta - M-to-M alignment from delta-filter -m | |
77 .1coords - 1-to-1 coordinates from show-coords -THrcl .1delta | |
78 .mcoords - M-to-M coordinates from show-coords -THrcl .mdelta | |
79 .snps - SNPs from show-snps -rlTHC .1delta | |
80 .rdiff - Classified alignment breakpoints from show-diff -rH .mdelta | |
81 .qdiff - Classified alignment breakpoints from show-diff -qH .mdelta | |
82 .report - Summary of alignments, differences and SNPs | |
83 .unref - Unaligned reference sequence IDs and lengths | |
84 .unqry - Unaligned query sequence IDs and lengths | |
85 | |
86 MANDATORY: | |
87 Reference Set the input reference multi-FASTA filename | |
88 Query Set the input query multi-FASTA filename | |
89 or | |
90 Delta File Unfiltered .delta alignment file from nucmer | |
91 | |
92 OPTIONS: | |
93 -d|delta Provide precomputed delta file for analysis | |
94 -h | |
95 --help Display help information and exit | |
96 -p|prefix Set the prefix of the output files (default "out") | |
97 -V | |
98 --version Display the version information and exit | |
99 | |
100 | |
101 -- NOTES -- | |
102 The -p option is recommended to avoid overwriting previous | |
103 output. A simple naming convention for input files A.fna and B.fna is | |
104 to set "-p A_B". It is safest to let dnadiff run nucmer automatically, so | |
105 avoid using the -d option unless the delta file was already generated | |
106 with "nucmer --maxmatch" and has not been filtered. | |
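
The two ways of running dnadiff described in these notes can be sketched as command lists in Python. The helper names below are hypothetical, and the "_nucmer" suffix used for the intermediate delta file is an arbitrary choice made here so that the nucmer output and the dnadiff outputs cannot collide; only the -p and -d options and "nucmer --maxmatch" come from this README.

```python
from pathlib import Path


def default_prefix(reference, query):
    """Build the "A_B" style prefix from two input file names."""
    return f"{Path(reference).stem}_{Path(query).stem}"


def dnadiff_command(reference, query):
    """Recommended path: dnadiff runs nucmer itself."""
    return ["dnadiff", "-p", default_prefix(reference, query), reference, query]


def precomputed_delta_commands(reference, query):
    """Alternative path: unfiltered 'nucmer --maxmatch' delta, then dnadiff -d."""
    prefix = default_prefix(reference, query)
    nucmer_prefix = prefix + "_nucmer"  # hypothetical name for the nucmer run
    return [
        ["nucmer", "--maxmatch", "-p", nucmer_prefix, reference, query],
        ["dnadiff", "-p", prefix, "-d", nucmer_prefix + ".delta"],
    ]


if __name__ == "__main__":
    print(" ".join(dnadiff_command("A.fna", "B.fna")))
```

Each list can be passed to subprocess.run(cmd, check=True) on a machine where MUMmer is installed.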
107 | |
108 | |
109 -- OUTPUT FILES -- | |
110 dnadiff produces many outputs; however, all but one are produced by | |
111 other utilities in the MUMmer package. Please see their corresponding | |
112 documentation for more information. This section will only describe | |
113 the .report file generated by dnadiff and tips on interpreting it. | |
114 | |
115 | |
116 *** .report OUTPUT *** | |
117 | |
118 Report statistics are broken into two columns - reference and | |
119 query. Rows are grouped by themed alignment metrics and are described | |
120 here. Summary counts are estimates and do not represent the exact | |
121 number of occurrences of a particular evolutionary event. Read a value | |
122 in the reference column as "number of XYZ in the reference with respect | |
123 to the query", and a value in the query column as "number of XYZ in the | |
124 query with respect to the reference". | |
125 | |
126 [Sequences] - Sequence-centric stats. | |
127 TotalSeqs - Total number of input sequences. | |
128 AlignedSeqs - Number of input sequences with at least one alignment. | |
129 UnalignedSeqs - Number of input sequences with no alignment. | |
130 | |
131 [Bases] - Base-pair-centric stats. | |
132 TotalBases - Total number of bases in the input sequences. | |
133 AlignedBases - Total number of bases contained within an alignment. | |
134 UnalignedBases - Total number of unaligned bases. This is a rough | |
135 measure for the amount of "unique" sequence in the | |
136 reference and query. | |
137 | |
138 [Alignments] - Alignment-centric stats. | |
139 1-to-1 - Number of alignment blocks comprising the 1-to-1 | |
140 mapping of reference to query. This is a subset of | |
141 the M-to-M mapping, with repeats removed. | |
142 TotalLength - Total length of 1-to-1 alignment blocks. | |
143 AvgLength - Average length of 1-to-1 alignment blocks. | |
144 AvgIdentity - Average identity of 1-to-1 alignment blocks. | |
145 | |
146 M-to-M - Number of alignment blocks comprising the | |
147 many-to-many mapping of reference to query. The | |
148 M-to-M mapping represents the smallest set of | |
149 alignments that maximize the coverage of both | |
150 reference and query. This is a superset of the 1-to-1 | |
151 mapping. | |
152 TotalLength - Total length of M-to-M alignment blocks. | |
153 AvgLength - Average length of M-to-M alignment blocks. | |
154 AvgIdentity - Average identity of M-to-M alignment blocks. | |
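
The arithmetic behind the [Alignments] rows can be illustrated with a short Python sketch. The Block type and the summarize function are hypothetical. The identity average here is a plain unweighted mean, which is an assumption: this README does not say whether dnadiff weights identity by block length.

```python
from dataclasses import dataclass


@dataclass
class Block:
    ref_len: int     # length of the block on the reference
    qry_len: int     # length of the block on the query
    identity: float  # percent identity of the block


def summarize(blocks):
    """Return count, total and average length (ref, qry), and mean identity."""
    count = len(blocks)
    if count == 0:
        return {"count": 0, "total_length": (0, 0),
                "avg_length": (0.0, 0.0), "avg_identity": 0.0}
    ref_total = sum(b.ref_len for b in blocks)
    qry_total = sum(b.qry_len for b in blocks)
    return {
        "count": count,
        "total_length": (ref_total, qry_total),
        "avg_length": (ref_total / count, qry_total / count),
        # Assumption: unweighted mean over blocks.
        "avg_identity": sum(b.identity for b in blocks) / count,
    }


if __name__ == "__main__":
    print(summarize([Block(100, 102, 99.0), Block(300, 298, 97.0)]))
```

The same function applies to either mapping; feeding it the 1-to-1 blocks or the M-to-M blocks mirrors the two groups of rows in the report.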
155 | |
156 [Features] - Structural alignment features, such as | |
157 rearrangements. These counts are rough estimates | |
158 based on an automated analysis of the | |
159 alignments. Features are identified by scanning the | |
160 reference (or query) from low to high, and noting the | |
161 positions where the query alignments are | |
162 inconsistently ordered or oriented with respect to | |
163 the reference. | |
164 Breakpoints - Number of non-maximal alignment endpoints, | |
165 i.e. endpoints that do not occur at the beginning or | |
166 end of a sequence. | |
167 Relocations - Number of breaks in the alignment where adjacent | |
168 1-to-1 alignment blocks are in the same sequence, but | |
169 not consistently ordered. A separate feature is | |
170 recorded for each end of a relocation, so this is | |
171 really a count of relocation endpoints. | |
172 Translocations - Number of breaks in the alignment where adjacent | |
173 1-to-1 alignment blocks are in different sequences. A | |
174 separate feature is recorded for each end of a | |
175 translocation, so this is really a count of | |
176 translocation endpoints. | |
177 Inversions - Number of breaks in the alignment where adjacent | |
178 1-to-1 alignment blocks are inverted with respect to | |
179 one another. A separate feature is recorded for each | |
180 end of an inversion, so this is really a count of | |
181 inversion endpoints. | |
182 | |
183 Insertions - Rough count of insertion events. Note that this is | |
184 slightly different from "UnalignedBases" because it | |
185 counts duplications as insertions, whereas | |
186 UnalignedBases does not. Also, this count does not | |
187 include sequences that have no alignments as | |
188 insertions, whereas UnalignedBases does. Note that | |
189 insertions in R can be viewed as deletions from Q. | |
190 This number reports only "major" insertions defined | |
191 as insertions large enough to break an alignment. | |
192 Nucmer will align through smaller insertions of less | |
193 than ~60 bases. These smaller insertions are | |
194 reported in the "Indels" count below. | |
195 InsertionSum - Rough sum of inserted sequence. | |
196 InsertionAvg - Average length of insertion. | |
197 | |
198 TandemIns - Rough count of tandem duplication insertion | |
199 events. Note that expansions in R can be viewed as | |
200 collapses in Q. | |
201 TandemInsSum - Rough sum of tandem duplication insertions. | |
202 TandemInsAvg - Average length of tandem duplications. | |
203 | |
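
The distinction between relocations, translocations and inversions described above can be sketched as a small classifier over two adjacent 1-to-1 blocks. This is a simplified illustration, not dnadiff's actual procedure: the Block fields and the classify_break function are hypothetical, and the sketch labels a junction rather than counting endpoints as the report does.

```python
from dataclasses import dataclass


@dataclass
class Block:
    ref_seq: str    # reference sequence ID
    qry_seq: str    # query sequence ID
    qry_start: int  # lowest query coordinate covered by the block
    forward: bool   # True if the query aligns on the forward strand


def classify_break(prev, nxt):
    """Label the junction between two blocks adjacent on one reference sequence."""
    if prev.ref_seq != nxt.ref_seq:
        raise ValueError("blocks must lie on the same reference sequence")
    if prev.qry_seq != nxt.qry_seq:
        return "translocation"  # adjacent blocks in different sequences
    if prev.forward != nxt.forward:
        return "inversion"      # blocks inverted with respect to one another
    if prev.forward:
        in_order = nxt.qry_start >= prev.qry_start
    else:
        in_order = nxt.qry_start <= prev.qry_start
    # Same sequence and orientation: out-of-order blocks are a relocation.
    return "colinear" if in_order else "relocation"


if __name__ == "__main__":
    a = Block("chr1", "ctg1", 100, True)
    b = Block("chr1", "ctg1", 10, True)
    print(classify_break(a, b))
```

The blocks are assumed to be sorted by reference position before they are compared pairwise.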
204 [SNPs] - Single Nucleotide Polymorphism counts. | |
205 TotalSNPs - Total number of SNPs, same for both sequences. | |
206 XY - X-to-Y SNP. For reference column, this means | |
207 reference 'X' to query 'Y'. For query column, this | |
208 means query 'X' to reference 'Y'. The same | |
209 convention applies below. | |
210 | |
211 TotalGSNPs - Single Nucleotide Polymorphisms bounded by 20 exact | |
212 base-pair matches on both sides. | |
213 | |
214 TotalIndels - Single Nucleotide Insertions/Deletions. | |
215 X. - X insertion. For reference column, 'X.' means | |
216 insertion of 'X' in the reference. For query column, | |
217 'X.' means insertion of 'X' in the query. Nucmer will | |
218 align through group insertions of up to ~60 bases. | |
219 Each base of these group insertions will be reported | |
220 in this count. Large insertions will be reported in | |
221 the "Insertions" count above. | |
222 | |
223 TotalGIndels - Single Nucleotide Insertions/Deletions bounded by 20 | |
224 exact base-pair matches on both sides. | |
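
Two of the conventions in the [SNPs] group can be illustrated with a minimal Python sketch. The function names are hypothetical, and the representation of an alignment as two equal-length strings with "-" for gaps is an assumption made for this example; dnadiff itself works from show-snps output. The first function builds the row label for each column, and the second checks the 20-base flanking rule.

```python
def column_keys(ref_base, qry_base):
    """Return (reference-column label, query-column label) for one difference."""
    # Reference column reads ref-to-query; query column reads query-to-ref.
    return ref_base + qry_base, qry_base + ref_base


def is_gsnp(ref_aln, qry_aln, pos, flank=20):
    """True if the difference at `pos` has `flank` exact matches on each side."""
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned strings must have equal length")
    if ref_aln[pos] == qry_aln[pos]:
        return False  # not a difference at all
    if pos - flank < 0 or pos + flank >= len(ref_aln):
        return False  # not enough flanking sequence on one side
    neighbours = list(range(pos - flank, pos)) + list(range(pos + 1, pos + flank + 1))
    # Every flanking column must be an exact, ungapped match.
    return all(ref_aln[i] == qry_aln[i] and ref_aln[i] != "-" for i in neighbours)


if __name__ == "__main__":
    ref = "A" * 20 + "C" + "A" * 20
    qry = "A" * 20 + "G" + "A" * 20
    print(column_keys(ref[20], qry[20]), is_gsnp(ref, qry, 20))
```

For a reference 'A' aligned to a query 'G', the reference column counts the difference under "AG" and the query column counts it under "GA".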