comparison CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/mummer-3.23/docs/nucmer.README @ 69:33d812a61356

planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author jpayne
date Tue, 18 Mar 2025 17:55:14 -0400
1 --------------------------------------------------------------------------------
2 NUCmer3.0:
3 An extension of the MUMmer package that calculates alignments
4 between two DNA multi-fasta files using the raw DNA sequence.
5
6 Use Cases:
7 + aligning two unfinished shotgun sequencing assemblies
8 + aligning an unfinished sequencing assembly to a finished genome
9 + comparing two fairly similar genomes that have large rearrangements
10
11 If any of this code is used in any publication, please cite the following:
12
13 Versatile and open software for comparing large genomes.
14 S. Kurtz, A. Phillippy, A.L. Delcher,
15 M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg.
16 Genome Biology (2004), 5:R12.
17
18 --------------------------------------------------------------------------------
19
20 ** NOTE **
21 This manual is outdated; please refer to the HTML documentation included in
22 this distribution or at:
23
24 http://mummer.sourceforge.net
25 http://mummer.sourceforge.net/manual
26 http://mummer.sourceforge.net/examples
27
28 -- DESCRIPTION --
29 NUCmer3.0 (NUCleotide MUMmer) is a suite of programs to modify and refine
30 the basic output of the MUMmer3.0 matching program 'mummer'. NUCmer pre-
31 processes the DNA multi-FASTA input files so that they can be examined by the
32 match finding routine. The matches are then clustered, and the matches
33 within clusters are extended via Smith-Waterman techniques in order to expand
34 the total alignment coverage and close the gaps between clustered MUMs. The
35 "out.delta" output file contains the final alignment data, encoded
36 with a style called delta encoding. Any of the 'show-*' programs are able to
37 parse this file and present its information in a human readable format.
38
39
40 -- NUCmer3.0 EXAMPLE --
41 To compare a set of assembly contigs "asmbl.fasta" to an already completed,
42 related genome "genome.fasta" type:
43
44 "nucmer -o -p output genome.fasta asmbl.fasta"
45
46 Output will be...
47 output.delta // alignment data encoded with delta encoding
48 output.coords // list of alignments, % identity, etc...
49
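As a rough sketch of the example above, the same command line can be assembled
programmatically, here in Python. The helper names build_nucmer_command and
expected_outputs are hypothetical and not part of MUMmer; only the -o and -p
switches and the output file names are taken from this README. The sketch only
builds the argument list; it does not run nucmer.

```python
import shlex


def build_nucmer_command(reference, query, prefix="out", coords=False):
    """Return the argument list for a nucmer run (hypothetical helper)."""
    args = ["nucmer"]
    if coords:
        args.append("-o")          # also write <prefix>.coords via show-coords
    args += ["-p", prefix]         # prefix of the output files
    args += [reference, query]     # reference first, then query
    return args


def expected_outputs(prefix="out", coords=False):
    """List the files this README says nucmer will write for a given prefix."""
    files = [prefix + ".delta"]
    if coords:
        files.append(prefix + ".coords")
    return files


cmd = build_nucmer_command("genome.fasta", "asmbl.fasta",
                           prefix="output", coords=True)
print(shlex.join(cmd))
print(expected_outputs("output", coords=True))
```

If nucmer is installed and on the PATH, the list could be handed to
subprocess.run(cmd, check=True) to perform the actual run.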
50 To generate more output, investigate the options of the 'show-*' programs.
51 These programs can interpret the .delta output of NUCmer and provide
52 useful information regarding the alignment. In addition, dotplots can be
53 generated (if you have gnuplot installed) via the 'mummerplot' script. Also,
54 the 'delta-filter' utility is very useful for removing chance and repeat-induced
55 alignments. It can significantly reduce the number of alignments in the nucmer
56 output, making it easier to interpret (see html manual for more information).
57
58
59 -- RUNNING 'nucmer' --
60
61 USAGE: nucmer [options] <Reference> <Query>
62
63
64 MANDATORY:
65 Reference Set the input reference multi-FASTA file to "Reference"
66 Query Set the input query multi-FASTA file to "Query"
67
68
69 OPTIONS:
70 --mum Use only maximal exact matches that are unique in both the
71 query and reference sequences as the alignment anchors.
72
73 --mumreference Use only maximal exact matches that are unique in the
74 reference sequences as the alignment anchors.
75
76 --maxmatch Use all maximal exact matches as the alignment anchors.
77
78 -b|breakLen Set the distance an alignment extension will attempt to
79 extend poor scoring regions before giving up. The default
80 distance is 200. This distance should be measured in DNA
81 bases, and it affects the tolerance to error of the
82 alignment extensions. A higher value will result in greater
83 tolerance to error in hopes of finding good alignments on
84 the other side of a poorly scoring region.
85
86 -c|mincluster Sets the minimum length of a cluster. The default value is
87 65. The length of a match cluster is determined by the sum
88 of the lengths of the matches within. A higher value will
89 decrease the sensitivity of the alignment, but will also
90 result in more confident results.
91
92 --[no]delta Toggles the creation of the delta file. The default
93 behavior is --delta, but disabling the delta file will
94 speed up the finishing stage by not creating alignments.
95 This option implies --noextend.
96
97 --depend Print the dependency information and exit.
98
99 -d|diagfactor Set the clustering fraction of separation for diagonal
100 difference. The default value is .12. A higher value will
101 increase the tolerance of the clustering algorithm and
102 allow for more indels in a cluster.
103
104 --[no]extend Toggles the outward extension of alignments from their
105 anchoring clusters. The default behavior is --extend, but
106 disabling the extensions will speed up the finishing stage
107 by not extending alignments. Clusters will still be fused
108 into alignments, but they will not be expanded outward.
109
110 -f
111 --forward Use only the forward strand of the Query sequences. The
112 default behavior is to use both the forward and reverse
113 strands.
114
115 -g|maxgap Set the maximum gap between two adjacent matches in a
116 cluster. The default value is 90. A smaller value will
117 result in smaller (but more) clusters; a larger value will
118 result in larger (but fewer) clusters.
119
120 -h
121 --help Display help information and exit.
122
123 -l|minmatch Set the minimum length of a single match. The default value
124 is 20. Reducing this value will possibly increase the
125 sensitivity of the alignment, but it will also allow for
126 chance or "noise" matches. Take note that lowering this
127 value will significantly increase runtime.
128
129 -o
130 --coords Automatically generate the "out.coords" file using the
131 'show-coords' program. This file lists all the alignments
132 sorted by their reference coordinate in a user friendly
133 format, without requiring the user to run 'show-coords'
134 independently of nucmer.
135
136 --[no]optimize Toggle alignment score optimization, i.e. if an alignment
137 extension reaches the end of a sequence, it will backtrack
138 to optimize the alignment score instead of terminating the
139 alignment at the end of the sequence. By turning this
140 option off, alignments within -b bases of the sequence end
141 will be forced to extend to the end. Default behavior is
142 --optimize, --nooptimize will result in longer alignments
143 but may lead to lower alignment scores.
144
145 -p|prefix Set the prefix of the output files. The default prefix is
146 "out". Take note that nucmer will allow the user to
147 overwrite existing files, so a unique prefix should be used
148 for each subsequent run of nucmer to avoid data loss.
149
150 -r
151 --reverse Use only the reverse complement of the Query sequences. The
152 default behavior is to use both the forward and reverse
153 strands.
154
155 --[no]simplify Simplify alignments by removing shadowed clusters. This
156 is the default behavior; however, it can be turned off if a
157 sequence is being aligned to itself in order to find inexact
158 repeats.
159
160 -V
161 --version Display the version information and exit.
162
163
164
165 -- NOTES --
166 When comparing two entire genomes, it is very helpful to mask the
167 "uninteresting" regions of input using a utility such as "nseg" or "dust".
168 This will allow the program to focus solely on aligning the regions of
169 interest. Since only ACGT's will be matched, any other alpha character used
170 to mask the sequence will not be matched.
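The masking idea can be illustrated with a small Python sketch. It is not a
replacement for "nseg" or "dust": the function name mask_intervals, the choice
of 'X' as the mask character, and the 1-based inclusive interval convention are
all assumptions made for illustration. It relies only on the statement above
that characters other than ACGT are not matched.

```python
def mask_intervals(sequence, intervals, mask_char="X"):
    """Overwrite 1-based, inclusive intervals of a sequence with mask_char.

    Hypothetical helper: the regions to mask must come from elsewhere,
    for example from a low-complexity filter.
    """
    chars = list(sequence)
    for start, end in intervals:
        if not 1 <= start <= end <= len(chars):
            raise ValueError(f"interval ({start}, {end}) is out of range")
        # Replace the interval with the same number of mask characters so
        # that downstream coordinates stay unchanged.
        chars[start - 1:end] = mask_char * (end - start + 1)
    return "".join(chars)


print(mask_intervals("ACGTACGTAC", [(3, 5)]))
```

Keeping the masked sequence the same length as the original means alignment
coordinates reported for the masked input still refer to the unmasked sequence.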
171 Since NUCmer runs so quickly, it can be useful to run it numerous times
172 with different parameters to fine-tune the resulting alignment and include or
173 exclude missed or chance matches. It is also helpful to try the different
174 uniqueness switches to attain the appropriate level of detail in the resulting
175 output.
176
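One possible way to organise such repeated runs is sketched below in Python.
The function name sweep_commands and the prefix naming scheme are hypothetical;
the -l, -c and -p switches are the ones described under OPTIONS. Each run gets
its own prefix because nucmer overwrites existing output files. The sketch only
generates the command lines and does not execute them.

```python
from itertools import product


def sweep_commands(reference, query, minmatch_values, mincluster_values):
    """Build one nucmer argument list per (minmatch, mincluster) combination."""
    commands = []
    for minmatch, mincluster in product(minmatch_values, mincluster_values):
        # A unique prefix per run avoids overwriting earlier results.
        prefix = f"run_l{minmatch}_c{mincluster}"
        commands.append([
            "nucmer",
            "-l", str(minmatch),
            "-c", str(mincluster),
            "-p", prefix,
            reference, query,
        ])
    return commands


for cmd in sweep_commands("genome.fasta", "asmbl.fasta", [20, 15], [65, 40]):
    print(" ".join(cmd))
```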
177
178
179 -- OUTPUT FILES --
180
181 *** .delta OUTPUT ***
182
183 This output file is a representation of the all-vs-all alignment between
184 the sequences contained in the multi-FASTA input files. It catalogs the
185 coordinates of aligned regions and the distance between insertions and deletions
186 contained in these alignment regions. The first two lines of the file are
187 identical to the .cluster output. The first line lists the two original input
188 files separated by a space, and the second line specifies the alignment data
189 type, either "NUCMER" or "PROMER". Every grouping of alignment regions has
190 a header, just like the cluster's header in the .cluster file. This is a FASTA
191 style header that lists, after a '>', the two sequences that produced the
192 following alignments, separated by a space, followed by the lengths of those
193 sequences in the same order. An example header might look like:
194
195 >tagA1 tagB1 500 2000000
196
197 Following this sequence header is the alignment data. Each alignment region
198 has a header that describes the start and end coordinates of the alignment in
199 each sequence. These coordinates are inclusive and reference the forward strand
200 of the current sequence. Thus, if the start coordinate is greater than the end
201 coordinate, the alignment is on the reverse strand. The first four numbers are
202 the start and end in the reference sequence, followed by the start and end in
203 the query sequence. These coordinates are always measured in DNA
204 bases regardless of the alignment data type. The three numbers after the starts
205 and stops are the number of errors (non-identities), similarity errors
206 (non-positive match scores) and non-alpha characters in the sequence (used to
207 count stop codons in promer data). An example header might look like:
208
209 5198 22885 5389 23089 20 20 0
210
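The seven fields of such an alignment header can be read with a few lines of
Python. This is a minimal sketch, not the parser used by the 'show-*' programs;
the names AlignmentHeader, parse_alignment_header, strand and span are
hypothetical. The strand rule (start greater than end means reverse strand) and
the inclusive coordinates follow the description above.

```python
from typing import NamedTuple


class AlignmentHeader(NamedTuple):
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int
    errors: int             # non-identities
    similarity_errors: int  # non-positive match scores
    non_alpha: int          # non-alpha characters (stop codons in promer data)


def parse_alignment_header(line):
    """Parse one seven-field alignment header line of a .delta file."""
    fields = [int(token) for token in line.split()]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    return AlignmentHeader(*fields)


def strand(start, end):
    """Coordinates reference the forward strand; start > end means reverse."""
    return "reverse" if start > end else "forward"


def span(start, end):
    """Number of bases covered, counting both inclusive end points."""
    return abs(end - start) + 1


header = parse_alignment_header("5198 22885 5389 23089 20 20 0")
print(header)
print(strand(header.qry_start, header.qry_end), span(header.ref_start, header.ref_end))
```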
211 Each of these headers is followed by a series of signed integers, one per line,
212 with the final line before the next header equaling 0 (zero). Each integer
213 represents the distance to the next insertion in the reference (positive int)
214 or deletion in the reference (negative int), as measured in DNA bases or amino
215 acids depending on the alignment data type. For example, with 'nucmer' the
216 delta sequence (1, -3, 4, 0) would represent an insertion at positions 1 and 7
217 in the reference sequence and an insertion at position 3 in the query sequence.
218 Or with letters:
219
220 A = acgtagctgag$
221 B = cggtagtgag$
222 Delta = (1, -3, 4, 0)
223 A = acg.tagctgag$
224 B = .cggtag.tgag$
225
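The letter example above can be reproduced with a short Python sketch. The
function name apply_delta is hypothetical, and the sketch assumes that both
strings are already the aligned sub-sequences, both given in forward
orientation (a reverse-strand query would first need to be reverse
complemented). A positive value d means |d| - 1 aligned columns followed by a
column where only the reference has a base; a negative value means the same
with only the query having a base.

```python
def apply_delta(ref, qry, deltas, gap="."):
    """Rebuild the gapped alignment of ref and qry from a delta sequence."""
    ref_out, qry_out = [], []
    i = j = 0  # current positions in ref and qry
    for d in deltas:
        if d == 0:  # zero terminates the delta sequence
            break
        run = abs(d) - 1
        # Copy the aligned columns that precede the indel.
        ref_out.append(ref[i:i + run])
        qry_out.append(qry[j:j + run])
        i += run
        j += run
        if d > 0:   # extra base in the reference, gap in the query
            ref_out.append(ref[i])
            qry_out.append(gap)
            i += 1
        else:       # extra base in the query, gap in the reference
            ref_out.append(gap)
            qry_out.append(qry[j])
            j += 1
    # Whatever remains after the last indel is aligned column by column.
    ref_out.append(ref[i:])
    qry_out.append(qry[j:])
    return "".join(ref_out), "".join(qry_out)


a, b = apply_delta("acgtagctgag", "cggtagtgag", [1, -3, 4, 0])
print(a)
print(b)
```

With the inputs from the example (without the trailing '$'), the two returned
strings correspond to the gapped A and B lines shown above.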
226 Using this delta information, it is possible to re-generate the alignment
227 calculated by 'nucmer' or 'promer' as is done in the 'show-coords' program. This
228 allows various utilities to be crafted to process and analyze the alignment
229 data using a universal format. Below is what a .delta file might look like:
230
231 /home/username/reference.fasta /home/username/query.fasta
232 NUCMER
233 >tagA1 tagB1 500 2000000
234 88 198 1641558 1641668 0 0 0
235 0
236 167 4877 1 4714 15 15 0
237 2456
238 1
239 -11
240 769
241 950
242 1
243 1
244 -142
245 -1
246 0
247 >tagA2 tagB4 50000 30000
248 5198 22885 5389 23089 18 18 0
249 -6
250 -32
251 -1
252 -1
253 -1
254 7
255 1130
256 0
257
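The layout described in this section can be read back with a small Python
sketch. It is an illustration only, not the parser shipped with MUMmer; the
function name parse_delta and the dictionary layout are hypothetical, and the
sketch assumes the input file paths on the first line contain no spaces. The
sample text is the example file shown above.

```python
SAMPLE_DELTA = """\
/home/username/reference.fasta /home/username/query.fasta
NUCMER
>tagA1 tagB1 500 2000000
88 198 1641558 1641668 0 0 0
0
167 4877 1 4714 15 15 0
2456
1
-11
769
950
1
1
-142
-1
0
>tagA2 tagB4 50000 30000
5198 22885 5389 23089 18 18 0
-6
-32
-1
-1
-1
7
1130
0
"""


def parse_delta(text):
    """Parse .delta text into nested dicts (hypothetical layout)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    result = {"files": lines[0].split(), "type": lines[1], "pairs": []}
    alignment = None
    for ln in lines[2:]:
        if ln.startswith(">"):  # sequence-pair header
            ref_id, qry_id, ref_len, qry_len = ln[1:].split()
            result["pairs"].append({
                "ref": ref_id, "qry": qry_id,
                "ref_len": int(ref_len), "qry_len": int(qry_len),
                "alignments": [],
            })
            alignment = None
            continue
        numbers = [int(token) for token in ln.split()]
        if len(numbers) == 7:  # alignment header
            alignment = {
                "ref": (numbers[0], numbers[1]),
                "qry": (numbers[2], numbers[3]),
                "errors": numbers[4],
                "similarity_errors": numbers[5],
                "non_alpha": numbers[6],
                "deltas": [],
            }
            result["pairs"][-1]["alignments"].append(alignment)
        elif len(numbers) == 1 and alignment is not None:
            if numbers[0] == 0:  # zero closes the current alignment
                alignment = None
            else:
                alignment["deltas"].append(numbers[0])
        else:
            raise ValueError(f"unexpected line: {ln!r}")
    return result


parsed = parse_delta(SAMPLE_DELTA)
for pair in parsed["pairs"]:
    print(pair["ref"], pair["qry"], len(pair["alignments"]))
```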
258
259
260 *** .cluster OUTPUT ***
261
262 This output format is for debugging purposes and is now only available by
263 using the -d switch for the 'postnuc' program.
264
265 This output file is a list of the match clusters that were generated by the
266 'mgaps' MUMmer3.0 program. It is primarily a 5 column list, with the exception
267 of the headers to be described later. Two example rows could read:
268
269 1788 1622 59 - -
270 1857 1691 23 10 10
271
272 Where the first column is the start coordinate of the match in the reference
273 sequence, the second column is the start coordinate of the match in the query
274 sequence, the third column is the length of the match, and the two final
275 columns are the distance between the previous match's end and the current
276 match's start (the gap distance). All coordinates reference the forward strand
277 of each sequence, regardless of match direction, and are ALWAYS measured in
278 DNA bases regardless of alignment data type (DNA or amino acid).
279 Each individual cluster is preceded by two digits (1 or -1). These two
280 digits represent the direction of the cluster, either forward or reverse
281 complement, in each sequence. A " 1 -1" would represent a match on the forward
282 strand of the reference and the reverse strand of the query, while a " 1 1"
283 would represent a forward match on each strand. Take note that since the
284 match coordinates reference the forward strand, forward matches will have
285 ascending matches and a reverse matches will have descending matches. Also,
286 since the query is the only sequence every reverse complemented, expect the
287 first digit on the cluster header to always be 1.
288 There are also 3 other types of headers. The first line of each .cluster
289 file lists the two original input files separated by a space. The second line
290 of each .cluster file lists the type of alignment data, either "NUCMER" or
291 "PROMER". The third type of header resembles a FASTA header, and lists the
292 two sequences that produced the following clusters after a '>' and their
293 respective lengths separated by whitespace. Note that each of these headers
294 is unique, so all clusters/matches between any two sequences will appear under
295 a single header identifying those two sequences. Below is a short example of
296 what a .cluster file might look like:
297
298 /home/username/reference.fasta /home/username/query.fasta
299 NUCMER
300 >tagA1 tagB1 1000 2000000
301 1 1
302 88 1641558 111 - -
303 1 1
304 183 17 22 - -
305 238 72 108 33 33
306 347 181 92 1 1
307 458 292 50 19 19
308 509 343 35 1 1
309 >tagA2 tagB1 100000 2000000
310 1 -1
311 86855 102105 23 - -
312 86882 102078 77 4 4
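
The .cluster layout can be read in a similar way. The following Python sketch
is illustrative only; the function names parse_clusters and cluster_length and
the dictionary layout are hypothetical, and file paths are assumed to contain
no spaces. The cluster length is computed as the sum of the match lengths,
which is how the -c|mincluster option above defines it. The sample text is the
example file shown above.

```python
SAMPLE_CLUSTER = """\
/home/username/reference.fasta /home/username/query.fasta
NUCMER
>tagA1 tagB1 1000 2000000
1 1
88 1641558 111 - -
1 1
183 17 22 - -
238 72 108 33 33
347 181 92 1 1
458 292 50 19 19
509 343 35 1 1
>tagA2 tagB1 100000 2000000
1 -1
86855 102105 23 - -
86882 102078 77 4 4
"""


def parse_clusters(text):
    """Parse .cluster text into nested dicts (hypothetical layout)."""
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    result = {"files": rows[0], "type": rows[1][0], "pairs": []}
    for tokens in rows[2:]:
        if tokens[0].startswith(">"):  # sequence-pair header
            result["pairs"].append({
                "ref": tokens[0][1:], "qry": tokens[1],
                "ref_len": int(tokens[2]), "qry_len": int(tokens[3]),
                "clusters": [],
            })
        elif len(tokens) == 2:  # cluster header: direction in each sequence
            result["pairs"][-1]["clusters"].append({
                "ref_dir": int(tokens[0]), "qry_dir": int(tokens[1]),
                "matches": [],
            })
        elif len(tokens) == 5:  # match row; '-' marks the first match
            gaps = [None if t == "-" else int(t) for t in tokens[3:]]
            result["pairs"][-1]["clusters"][-1]["matches"].append({
                "ref_start": int(tokens[0]), "qry_start": int(tokens[1]),
                "length": int(tokens[2]), "gaps": gaps,
            })
        else:
            raise ValueError(f"unexpected line: {' '.join(tokens)!r}")
    return result


def cluster_length(cluster):
    """Sum of the lengths of the matches within a cluster."""
    return sum(match["length"] for match in cluster["matches"])


parsed = parse_clusters(SAMPLE_CLUSTER)
for pair in parsed["pairs"]:
    print(pair["ref"], pair["qry"], [cluster_length(c) for c in pair["clusters"]])
```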