--------------------------------------------------------------------------------
NUCmer3.0:
    An extension of the MUMmer package that calculates alignments
    between two DNA multi-FASTA files using the raw DNA sequence.

    Use Cases:
      + aligning two unfinished shotgun sequencing assemblies
      + aligning an unfinished sequencing assembly to a finished genome
      + comparing two fairly similar genomes that have large rearrangements

    If any of this code is used in any publication, please cite the following:

    Versatile and open software for comparing large genomes.
    S. Kurtz, A. Phillippy, A.L. Delcher,
    M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg.
    Genome Biology (2004), 5:R12.

--------------------------------------------------------------------------------

** NOTE **
This manual is outdated; please refer to the HTML documentation included in
this distribution or at:

    http://mummer.sourceforge.net
    http://mummer.sourceforge.net/manual
    http://mummer.sourceforge.net/examples

-- DESCRIPTION --
NUCmer3.0 (NUCleotide MUMmer) is a suite of programs to modify and refine
the basic output of the MUMmer3.0 matching program 'mummer'. NUCmer
preprocesses the DNA multi-FASTA input files so that they can be examined by
the match finding routine. The matches are then clustered, and the matches
within clusters are extended via Smith-Waterman techniques in order to expand
the total alignment coverage and close the gaps between clustered MUMs. The
"out.delta" output file contains the final alignment data, encoded in a
format called delta encoding. Any of the 'show-*' programs can parse this
file and present its information in a human-readable format.


-- NUCmer3.0 EXAMPLE --
To compare a set of assembly contigs "asmbl.fasta" to an already completed,
related genome "genome.fasta", type:

    nucmer -o -p output genome.fasta asmbl.fasta

Output will be:
    output.delta    // alignment data encoded with delta encoding
    output.coords   // list of alignments, % identity, etc.

To generate more output, investigate the options of any of the 'show-*'
programs. These programs can interpret the .delta output of NUCmer and provide
useful information regarding the alignment. In addition, dotplots can be
generated (if you have gnuplot installed) via the 'mummerplot' script. Also,
the 'delta-filter' utility is very useful for removing chance and repeat-induced
alignments. It can significantly reduce the number of alignments in the nucmer
output, making it easier to interpret (see the HTML manual for more
information).


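When nucmer is driven from a script, the example command can be assembled
programmatically. The following is a minimal Python sketch, not part of the
MUMmer distribution: the function name build_nucmer_command is hypothetical,
and it uses only the -o and -p switches and the argument order shown in the
example. It builds the argument list without running anything.

```python
import shlex


def build_nucmer_command(reference, query, prefix="out", coords=False):
    """Return the nucmer argument list for one run (nothing is executed).

    Hypothetical helper: only the -o and -p switches from this README are used.
    """
    args = ["nucmer"]
    if coords:
        args.append("-o")        # also write <prefix>.coords via show-coords
    args += ["-p", prefix]       # prefix of the output files
    args += [reference, query]   # reference first, then query
    return args


cmd = build_nucmer_command("genome.fasta", "asmbl.fasta",
                           prefix="output", coords=True)
print(shlex.join(cmd))  # nucmer -o -p output genome.fasta asmbl.fasta
```

If nucmer is on the PATH, the list could then be handed to
subprocess.run(cmd, check=True). Choosing a distinct prefix per run avoids
overwriting earlier output files.
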
-- RUNNING 'nucmer' --

USAGE: nucmer [options] <Reference> <Query>


MANDATORY:
    Reference       Set the input reference multi-FASTA file to "Reference"
    Query           Set the input query multi-FASTA file to "Query"


OPTIONS:
    --mum           Use only maximal exact matches that are unique in both the
                    query and reference sequences as the alignment anchors.

    --mumreference  Use only maximal exact matches that are unique in the
                    reference sequences as the alignment anchors.

    --maxmatch      Use all maximal exact matches as the alignment anchors.

    -b|breakLen     Set the distance an alignment extension will attempt to
                    extend poor scoring regions before giving up. The default
                    distance is 200. This distance is measured in DNA bases,
                    and it affects the tolerance to error of the alignment
                    extensions. A higher value will result in greater
                    tolerance to error in hopes of finding good alignments on
                    the other side of a poorly scoring region.

    -c|mincluster   Set the minimum length of a cluster. The default value is
                    65. The length of a match cluster is determined by the sum
                    of the lengths of the matches within. A higher value will
                    decrease the sensitivity of the alignment, but will also
                    result in more confident results.

    --[no]delta     Toggle the creation of the delta file. The default
                    behavior is --delta, but disabling the delta file will
                    speed up the finishing stage by not creating alignments.
                    Disabling the delta file (--nodelta) implies --noextend.

    --depend        Print the dependency information and exit.

    -d|diagfactor   Set the clustering fraction of separation for diagonal
                    difference. The default value is .12. A higher value will
                    increase the tolerance of the clustering algorithm and
                    allow for more indels in a cluster.

    --[no]extend    Toggle the outward extension of alignments from their
                    anchoring clusters. The default behavior is --extend, but
                    disabling the extensions will speed up the finishing stage
                    by not extending alignments. Clusters will still be fused
                    into alignments, but they will not be expanded outward.

    -f
    --forward       Use only the forward strand of the Query sequences. The
                    default behavior is to use both the forward and reverse
                    strands.

    -g|maxgap       Set the maximum gap between two adjacent matches in a
                    cluster. The default value is 90. A smaller value will
                    result in smaller (but more) clusters; a larger value will
                    result in larger (but fewer) clusters.

    -h
    --help          Display help information and exit.

    -l|minmatch     Set the minimum length of a single match. The default value
                    is 20. Reducing this value may increase the sensitivity of
                    the alignment, but it will also allow for chance or "noise"
                    matches. Take note that lowering this value will
                    significantly increase runtime.

    -o
    --coords        Automatically generate the "out.coords" file using the
                    'show-coords' program. This file lists all the alignments
                    sorted by their reference coordinate in a user-friendly
                    format, without requiring the user to run 'show-coords'
                    independently of nucmer.

    --[no]optimize  Toggle alignment score optimization, i.e. if an alignment
                    extension reaches the end of a sequence, it will backtrack
                    to optimize the alignment score instead of terminating the
                    alignment at the end of the sequence. By turning this
                    option off, alignments within -b bases of the sequence end
                    will be forced to extend to the end. The default behavior
                    is --optimize; --nooptimize will result in longer
                    alignments but may lead to lower alignment scores.

    -p|prefix       Set the prefix of the output files. The default prefix is
                    "out". Take note that nucmer will overwrite existing files
                    without warning, so a unique prefix should be used for
                    each subsequent run of nucmer to avoid data loss.

    -r
    --reverse       Use only the reverse complement of the Query sequences. The
                    default behavior is to use both the forward and reverse
                    strands.

    --[no]simplify  Simplify alignments by removing shadowed clusters. This
                    is the default behavior; however, it can be turned off if a
                    sequence is being aligned to itself in order to find
                    inexact repeats.

    -V
    --version       Display the version information and exit.


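To make the interplay of -g (maxgap) and -c (mincluster) concrete, here is a
deliberately simplified Python sketch. It is a toy model, not the algorithm
used by nucmer: it assumes forward-strand, non-overlapping matches given as
hypothetical (ref_start, qry_start, length) tuples, ignores the diagonal
check controlled by -d, and all names are illustrative.

```python
def cluster_matches(matches, maxgap=90, mincluster=65):
    """Toy clustering of (ref_start, qry_start, length) matches.

    A match joins the current cluster when the unmatched stretch before it is
    at most maxgap bases in both sequences. Clusters whose summed match length
    is below mincluster are dropped. Diagonal difference (-d) is ignored.
    """
    clusters, current = [], []
    for match in sorted(matches):
        if current:
            prev_ref, prev_qry, prev_len = current[-1]
            gap_ref = match[0] - (prev_ref + prev_len)
            gap_qry = match[1] - (prev_qry + prev_len)
            if max(gap_ref, gap_qry) > maxgap:
                clusters.append(current)   # gap too large: close the cluster
                current = []
        current.append(match)
    if current:
        clusters.append(current)
    # Cluster length is the sum of the lengths of the matches within.
    return [c for c in clusters
            if sum(length for _, _, length in c) >= mincluster]


# Hypothetical matches, made up for illustration only.
matches = [(100, 100, 30), (150, 150, 40), (400, 400, 25), (1000, 1000, 70)]
default_clusters = cluster_matches(matches)            # maxgap=90
tight_clusters = cluster_matches(matches, maxgap=10)   # stricter gap limit
```

With the default values the first two matches fuse into one cluster of length
70 and the lone 25-base match is discarded. With maxgap=10 no matches fuse,
so only the single 70-base match survives the mincluster filter.
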
-- NOTES --
When comparing two entire genomes, it is very helpful to mask the
"uninteresting" regions of input using a utility such as "nseg" or "dust".
This will allow the program to focus solely on aligning the regions of
interest. Because only the characters A, C, G, and T are matched, any other
alphabetic character used to mask the sequence will not be matched.

Since NUCmer runs so quickly, it can be useful to run it numerous times
with different parameters to fine-tune the resulting alignment and include or
exclude missed or chance matches. It is also helpful to try the different
uniqueness switches (--mum, --mumreference, --maxmatch) to attain the
appropriate level of detail in the resulting output.



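The masking idea from the notes can be illustrated in a few lines of Python.
This is only a sketch: the function name mask_regions and the region list are
hypothetical, and in practice the regions would come from a tool such as
"nseg" or "dust" rather than being typed by hand.

```python
def mask_regions(sequence, regions, mask_char="N"):
    """Replace 1-based, inclusive (start, end) regions with mask_char.

    Any alphabetic character other than A, C, G, or T keeps the region
    from being matched; "N" is used here purely by convention.
    """
    bases = list(sequence)
    for start, end in regions:
        for k in range(start - 1, end):   # convert to 0-based indices
            bases[k] = mask_char
    return "".join(bases)


masked = mask_regions("ACGTACGTAC", [(3, 5)])
print(masked)  # ACNNNCGTAC
```
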
-- OUTPUT FILES --

*** .delta OUTPUT ***

This output file is a representation of the all-vs-all alignment between
the sequences contained in the multi-FASTA input files. It catalogs the
coordinates of aligned regions and the distance between insertions and deletions
contained in these alignment regions. The first two lines of the file are
identical to the .cluster output. The first line lists the two original input
files separated by a space, and the second line specifies the alignment data
type, either "NUCMER" or "PROMER". Every grouping of alignment regions has
a header, just like the cluster's header in the .cluster file. This is a
FASTA-style header that lists, after a '>', the two sequences that produced
the following alignments, separated by a space, followed by the lengths of
those sequences in the same order. An example header might look like:

    >tagA1 tagB1 500 2000000

Following this sequence header is the alignment data. Each alignment region
has a header that describes the start and end coordinates of the alignment in
each sequence. These coordinates are inclusive and reference the forward strand
of the current sequence. Thus, if the start coordinate is greater than the end
coordinate, the alignment is on the reverse strand. The first four integers are
the start and end in the reference sequence, followed by the start and end in
the query sequence. These coordinates are always measured in DNA bases
regardless of the alignment data type. The three integers after the starts
and stops are the number of errors (non-identities), similarity errors
(non-positive match scores), and non-alpha characters in the sequence (used to
count stop codons in promer data). An example header might look like:

    5198 22885 5389 23089 20 20 0

Each of these headers is followed by a series of signed integers, one per line,
with the final line before the next header equaling 0 (zero). Each integer
represents the distance to the next insertion in the reference (positive int)
or deletion in the reference (negative int), as measured in DNA bases or amino
acids depending on the alignment data type. For example, with 'nucmer' the
delta sequence (1, -3, 4, 0) would represent an insertion at positions 1 and 7
in the reference sequence and an insertion at position 3 in the query sequence.
Or with letters:

    A = acgtagctgag$
    B = cggtagtgag$
    Delta = (1, -3, 4, 0)
    A = acg.tagctgag$
    B = .cggtag.tgag$

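The letter example can be reproduced with a short Python sketch. It is an
illustration of the delta encoding as described here, not code taken from the
MUMmer sources: the function name apply_delta is hypothetical, and the
trailing '$' end marker from the example is left out.

```python
def apply_delta(ref, qry, deltas, gap="."):
    """Rebuild the two gapped alignment strings from a delta sequence.

    A positive value d means: copy d-1 aligned columns, then one reference
    base opposite a gap in the query. A negative value means: copy |d|-1
    aligned columns, then one query base opposite a gap in the reference.
    A value of 0 ends the delta sequence.
    """
    a, b = [], []
    i = j = 0                       # current positions in ref and qry
    for d in deltas:
        if d == 0:
            break
        run = abs(d) - 1            # aligned columns before the indel
        a.extend(ref[i:i + run])
        b.extend(qry[j:j + run])
        i += run
        j += run
        if d > 0:                   # extra base in the reference
            a.append(ref[i])
            b.append(gap)
            i += 1
        else:                       # extra base in the query
            a.append(gap)
            b.append(qry[j])
            j += 1
    a.extend(ref[i:])               # the rest aligns column by column
    b.extend(qry[j:])
    return "".join(a), "".join(b)


aligned_a, aligned_b = apply_delta("acgtagctgag", "cggtagtgag", [1, -3, 4, 0])
print(aligned_a)  # acg.tagctgag
print(aligned_b)  # .cggtag.tgag
```
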
Using this delta information, it is possible to regenerate the alignment
calculated by 'nucmer' or 'promer', as is done in the 'show-coords' program.
This allows various utilities to be crafted to process and analyze the
alignment data using a universal format. Below is what a .delta file might
look like:

    /home/username/reference.fasta /home/username/query.fasta
    NUCMER
    >tagA1 tagB1 500 2000000
    88 198 1641558 1641668 0 0 0
    0
    167 4877 1 4714 15 15 0
    2456
    1
    -11
    769
    950
    1
    1
    -142
    -1
    0
    >tagA2 tagB4 50000 30000
    5198 22885 5389 23089 18 18 0
    -6
    -32
    -1
    -1
    -1
    7
    1130
    0



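As an illustration of how a utility might read this format, here is a small
Python sketch of a .delta reader based only on the description in this
section. It is not the parser shipped with MUMmer; the function and key names
are hypothetical, and it assumes the input file paths contain no spaces.

```python
def parse_delta(text):
    """Parse .delta text into (files, data_type, records).

    Each record is a dict for one '>' header; each alignment is a dict
    holding its seven header integers and its list of delta values.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    files = lines[0].split()        # the two original input files
    data_type = lines[1]            # "NUCMER" or "PROMER"
    records, alignment = [], None
    for line in lines[2:]:
        fields = line.split()
        if fields[0].startswith(">"):
            records.append({"ref": fields[0][1:], "qry": fields[1],
                            "ref_len": int(fields[2]),
                            "qry_len": int(fields[3]),
                            "alignments": []})
        elif len(fields) == 7:      # alignment header
            keys = ("ref_start", "ref_end", "qry_start", "qry_end",
                    "errors", "sim_errors", "stops")
            alignment = dict(zip(keys, map(int, fields)))
            # start > end means the alignment is on the reverse strand
            alignment["qry_reverse"] = (alignment["qry_start"]
                                        > alignment["qry_end"])
            alignment["deltas"] = []
            records[-1]["alignments"].append(alignment)
        else:                       # one signed delta value per line
            value = int(fields[0])
            if value != 0:          # 0 terminates the delta list
                alignment["deltas"].append(value)
    return files, data_type, records


example = """\
/home/username/reference.fasta /home/username/query.fasta
NUCMER
>tagA1 tagB1 500 2000000
88 198 1641558 1641668 0 0 0
0
167 4877 1 4714 15 15 0
2456
1
-11
769
950
1
1
-142
-1
0
>tagA2 tagB4 50000 30000
5198 22885 5389 23089 18 18 0
-6
-32
-1
-1
-1
7
1130
0
"""
files, data_type, records = parse_delta(example)
```

For the example file this yields two records: the first holds two alignments
(one with no indels, one with nine delta values) and the second holds one
alignment with seven delta values.
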
*** .cluster OUTPUT ***

This output format is for debugging purposes and is now only available by
using the -d switch for the 'postnuc' program.

This output file is a list of the match clusters that were generated by the
'mgaps' MUMmer3.0 program. It is primarily a 5 column list, with the exception
of the headers to be described later. Two example rows could read:

    1788 1622 59 - -
    1857 1691 23 10 10

The first column is the start coordinate of the match in the reference
sequence, the second column is the start coordinate of the match in the query
sequence, the third column is the length of the match, and the final two
columns are the distance between the previous match's end and the current
match's start (the gap distance) in each sequence. A '-' in these columns
marks the first match of a cluster, which has no preceding match. All
coordinates reference the forward strand of each sequence, regardless of match
direction, and are ALWAYS measured in DNA bases regardless of alignment data
type (DNA or amino acid).

Each individual cluster is preceded by two integers (1 or -1). These two
integers represent the direction of the cluster, either forward or reverse
complement, in each sequence. A " 1 -1" would represent a match on the forward
strand of the reference and the reverse strand of the query, while a " 1 1"
would represent a forward match on each strand. Take note that since the
match coordinates reference the forward strand, forward matches will have
ascending coordinates and reverse matches will have descending coordinates.
Also, since the query is the only sequence ever reverse complemented, expect
the first integer on the cluster header to always be 1.

There are also three other types of headers. The first line of each .cluster
file lists the two original input files separated by a space. The second line
of each .cluster file lists the type of alignment data, either "NUCMER" or
"PROMER". The third type of header resembles a FASTA header, and lists the
two sequences that produced the following clusters after a '>', followed by
their respective lengths, all separated by whitespace. Note that each of these
headers is unique, so all clusters/matches between any two sequences will
appear under a single header identifying those two sequences. Below is a short
example of what a .cluster file might look like:

    /home/username/reference.fasta /home/username/query.fasta
    NUCMER
    >tagA1 tagB1 1000 2000000
    1 1
    88 1641558 111 - -
    1 1
    183 17 22 - -
    238 72 108 33 33
    347 181 92 1 1
    458 292 50 19 19
    509 343 35 1 1
    >tagA2 tagB1 100000 2000000
    1 -1
    86855 102105 23 - -
    86882 102078 77 4 4
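The gap columns can be cross-checked against the start and length columns.
The Python sketch below does this for rows taken from the example file. The
convention it uses (the gap is the number of unmatched bases lying strictly
between two consecutive matches, counted downward in the query for
reverse-complement clusters) is inferred from the example rows above, not
taken from the MUMmer sources, and the function names are hypothetical.

```python
def parse_cluster_rows(rows):
    """Turn 5-column match rows into tuples; '-' gaps become None."""
    parsed = []
    for row in rows:
        ref, qry, length, gap_ref, gap_qry = row.split()
        parsed.append((int(ref), int(qry), int(length),
                       None if gap_ref == "-" else int(gap_ref),
                       None if gap_qry == "-" else int(gap_qry)))
    return parsed


def expected_gaps(prev, cur, qry_dir):
    """Unmatched bases between two consecutive matches of one cluster."""
    prev_ref, prev_qry, prev_len = prev[:3]
    cur_ref, cur_qry = cur[:2]
    gap_ref = cur_ref - (prev_ref + prev_len)
    if qry_dir == 1:                 # forward: query coordinates ascend
        gap_qry = cur_qry - (prev_qry + prev_len)
    else:                            # reverse: query coordinates descend
        gap_qry = (prev_qry - prev_len) - cur_qry
    return gap_ref, gap_qry


forward = parse_cluster_rows(["183 17 22 - -", "238 72 108 33 33",
                              "347 181 92 1 1", "458 292 50 19 19",
                              "509 343 35 1 1"])
reverse = parse_cluster_rows(["86855 102105 23 - -", "86882 102078 77 4 4"])

forward_ok = all(expected_gaps(p, c, 1) == c[3:]
                 for p, c in zip(forward, forward[1:]))
reverse_ok = all(expected_gaps(p, c, -1) == c[3:]
                 for p, c in zip(reverse, reverse[1:]))
```

Both checks hold for the example clusters, which supports reading the last
two columns as the reference gap and the query gap, in that order.
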