comparison CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/mummer-3.23/docs/nucmer.README @ 69:33d812a61356
planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author | jpayne |
---|---|
date | Tue, 18 Mar 2025 17:55:14 -0400 |
67:0e9998148a16 | 69:33d812a61356 |
---|---|
1 -------------------------------------------------------------------------------- | |
2 NUCmer3.0: | |
3 An extension of the MUMmer package that calculates alignments | |
4 between two DNA multi-fasta files using the raw DNA sequence. | |
5 | |
6 Use Cases: | |
7 + aligning two unfinished shotgun sequencing assemblies | |
8 + aligning an unfinished sequencing assembly to a finished genome | |
9 + comparing two fairly similar genomes that have large rearrangements | |
10 | |
11 If any of this code is used in any publication, please cite the following: | |
12 | |
13 Versatile and open software for comparing large genomes. | |
14 S. Kurtz, A. Phillippy, A.L. Delcher, | |
15 M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg. | |
16 Genome Biology (2004), 5:R12. | |
17 | |
18 -------------------------------------------------------------------------------- | |
19 | |
20 ** NOTE ** | |
21 This manual is outdated; please refer to the HTML documentation included in | |
22 this distribution or at: | |
23 | |
24 http://mummer.sourceforge.net | |
25 http://mummer.sourceforge.net/manual | |
26 http://mummer.sourceforge.net/examples | |
27 | |
28 -- DESCRIPTION -- | |
29 NUCmer3.0 (NUCleotide MUMmer) is a suite of programs to modify and refine | |
30 the basic output of the MUMmer3.0 matching program 'mummer'. NUCmer pre- | |
31 processes the DNA multi-FASTA input files so that they can be examined by the | |
32 match-finding routine. The matches are then clustered, and the matches | |
33 within clusters are extended via Smith-Waterman techniques in order to expand | |
34 the total alignment coverage and close the gaps between clustered MUMs. The | |
35 "out.delta" output file contains the final alignment data, encoded | |
36 with a style called delta encoding. Any of the 'show-*' programs are able to | |
37 parse this file and present its information in a human readable format. | |
38 | |
39 | |
40 -- NUCmer3.0 EXAMPLE -- | |
41 To compare a set of assembly contigs "asmbl.fasta" to an already completed, | |
42 related genome "genome.fasta" type: | |
43 | |
44 "nucmer -o -p output genome.fasta asmbl.fasta" | |
45 | |
46 Output will be... | |
47 output.delta // alignment data encoded with delta encoding | |
48 output.coords // list of alignments, % identity, etc... | |
49 | |
50 To generate more output, investigate the options of any of the 'show-*' | |
51 programs. These programs can interpret the .delta output of NUCmer and provide | |
52 useful information regarding the alignment. In addition, dotplots can be | |
53 generated (if you have gnuplot installed) via the 'mummerplot' script. Also, | |
54 the 'delta-filter' utility is very useful for removing chance and repeat-induced | |
55 alignments. It can significantly reduce the number of alignments in the nucmer | |
56 output, making it easier to interpret (see html manual for more information). | |
57 | |
58 | |
59 -- RUNNING 'nucmer' -- | |
60 | |
61 USAGE: nucmer [options] <Reference> <Query> | |
62 | |
63 | |
64 MANDATORY: | |
65 Reference Set the input reference multi-FASTA file to "Reference" | |
66 Query Set the input query multi-FASTA file to "Query" | |
67 | |
68 | |
69 OPTIONS: | |
70 --mum Use only maximal exact matches that are unique in both the | |
71 query and reference sequences as the alignment anchors. | |
72 | |
73 --mumreference Use only maximal exact matches that are unique in the | |
74 reference sequences as the alignment anchors. | |
75 | |
76 --maxmatch Use all maximal exact matches as the alignment anchors. | |
77 | |
78 -b|breakLen Set the distance an alignment extension will attempt to | |
79 extend poor scoring regions before giving up. The default | |
80 distance is 200. This distance should be measured in DNA | |
81 bases, and it affects the tolerance to error of the | |
82 alignment extensions. A higher value will result in greater | |
83 tolerance to error in hopes of finding good alignments on | |
84 the other side of a poorly scoring region. | |
85 | |
86 -c|mincluster Sets the minimum length of a cluster. The default value is | |
87 65. The length of a match cluster is determined by the sum | |
88 of the lengths of the matches within. A higher value will | |
89 decrease the sensitivity of the alignment, but will also | |
90 result in more confident results. | |
91 | |
92 --[no]delta Toggles the creation of the delta file. The default | |
93 behavior is --delta, but disabling the delta file will | |
94 speed up the finishing stage by not creating alignments. | |
95 Specifying --nodelta implies --noextend. | |
96 | |
97 --depend Print the dependency information and exit. | |
98 | |
99 -d|diagfactor Set the clustering fraction of separation for diagonal | |
100 difference. The default value is .12. A higher value will | |
101 increase the tolerance of the clustering algorithm and | |
102 allow for more indels in a cluster. | |
103 | |
104 --[no]extend Toggles the outward extension of alignments from their | |
105 anchoring clusters. The default behavior is --extend, but | |
106 disabling the extensions will speed up the finishing stage | |
107 by not extending alignments. Clusters will still be fused | |
108 into alignments, but they will not be expanded outward. | |
109 | |
110 -f | |
111 --forward Use only the forward strand of the Query sequences. The | |
112 default behavior is to use both the forward and reverse | |
113 strands. | |
114 | |
115 -g|maxgap Set the maximum gap between two adjacent matches in a | |
116 cluster. The default value is 90. A smaller value will | |
117 result in smaller (but more) clusters; a larger value will | |
118 result in larger (but fewer) clusters. | |
119 | |
120 -h | |
121 --help Display help information and exit. | |
122 | |
123 -l|minmatch Set the minimum length of a single match. The default value | |
124 is 20. Reducing this value will possibly increase the | |
125 sensitivity of the alignment, but it will also allow for | |
126 chance or "noise" matches. Take note that lowering this | |
127 value will significantly increase runtime. | |
128 | |
129 -o | |
130 --coords Automatically generate the "out.coords" file using the | |
131 'show-coords' program. This file lists all the alignments | |
132 sorted by their reference coordinate in a user friendly | |
133 format, without requiring the user to run 'show-coords' | |
134 independently of nucmer. | |
135 | |
136 --[no]optimize Toggle alignment score optimization, i.e. if an alignment | |
137 extension reaches the end of a sequence, it will backtrack | |
138 to optimize the alignment score instead of terminating the | |
139 alignment at the end of the sequence. By turning this | |
140 option off, alignments within -b bases of the sequence end | |
141 will be forced to extend to the end. Default behavior is | |
142 --optimize; --nooptimize will result in longer alignments | |
143 but may lead to lower alignment scores. | |
144 | |
145 -p|prefix Set the prefix of the output files. The default prefix is | |
146 "out". Take note that nucmer will allow the user to | |
147 overwrite existing files, so a unique prefix should be used | |
148 for each subsequent run of nucmer to avoid data loss. | |
149 | |
150 -r | |
151 --reverse Use only the reverse complement of the Query sequences. The | |
152 default behavior is to use both the forward and reverse | |
153 strands. | |
154 | |
155 --[no]simplify Simplify alignments by removing shadowed clusters. This | |
156 is the default behavior; however, it can be turned off if a | |
157 sequence is being aligned to itself in order to find inexact | |
158 repeats. | |
159 | |
160 -V | |
161 --version Display the version information and exit. | |
162 | |
163 | |
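The options above can be combined programmatically. The following is a minimal Python sketch, not part of the MUMmer distribution, that assembles a nucmer argument list from a few of the documented switches (-o, -p, -l, -c, -g and the three anchor switches). The function name build_nucmer_command and its parameter names are hypothetical; the sketch only builds the list and does not run nucmer.

```python
def build_nucmer_command(reference, query, prefix="out", coords=False,
                         anchors=None, minmatch=None, mincluster=None,
                         maxgap=None):
    """Return an argv list for nucmer using switches listed in this README.

    All names here are illustrative; only the switches themselves
    (--mum, --mumreference, --maxmatch, -o, -p, -l, -c, -g) come from
    the OPTIONS section above.
    """
    cmd = ["nucmer"]
    if anchors is not None:
        if anchors not in ("mum", "mumreference", "maxmatch"):
            raise ValueError("unknown anchor switch: %r" % anchors)
        cmd.append("--" + anchors)
    if coords:
        cmd.append("-o")              # also write <prefix>.coords
    cmd += ["-p", prefix]             # output file prefix
    for flag, value in (("-l", minmatch), ("-c", mincluster), ("-g", maxgap)):
        if value is not None:
            cmd += [flag, str(value)]
    cmd += [reference, query]         # mandatory arguments come last
    return cmd


if __name__ == "__main__":
    # Reproduces the command from the NUCmer3.0 EXAMPLE section.
    print(" ".join(build_nucmer_command("genome.fasta", "asmbl.fasta",
                                        prefix="output", coords=True)))
```

The returned list can be handed to subprocess.run() if nucmer is on the PATH. Building the list separately makes it easy to try several parameter sets, as suggested in the NOTES section.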
164 | |
165 -- NOTES -- | |
166 When comparing two entire genomes, it is very helpful to mask the | |
167 "uninteresting" regions of input using a utility such as "nseg" or "dust". | |
168 This will allow the program to focus solely on aligning the regions of | |
169 interest. Since only ACGT's will be matched, any other alpha character used | |
170 to mask the sequence will not be matched. | |
171 Since NUCmer runs so quickly, it can be useful to run it numerous times | |
172 with different parameters to fine-tune the resulting alignment and include or | |
173 exclude missed or chance matches. It is also helpful to try the different | |
174 uniqueness switches to attain the appropriate level of detail in the resulting | |
175 output. | |
176 | |
177 | |
178 | |
179 -- OUTPUT FILES -- | |
180 | |
181 *** .delta OUTPUT *** | |
182 | |
183 This output file is a representation of the all-vs-all alignment between | |
184 the sequences contained in the multi-FASTA input files. It catalogs the | |
185 coordinates of aligned regions and the distance between insertions and deletions | |
186 contained in these alignment regions. The first two lines of the file are | |
187 identical to the .cluster output. The first line lists the two original input | |
188 files separated by a space, and the second line specifies the alignment data | |
189 type, either "NUCMER" or "PROMER". Every grouping of alignment regions has | |
190 a header, just like the cluster's header in the .cluster file. This is a FASTA | |
191 style header and lists the two sequences that produced the following alignments | |
192 after a '>' and separated by a space. After the two sequences are the lengths | |
193 of those sequences in the same order. An example header might look like: | |
194 | |
195 >tagA1 tagB1 500 2000000 | |
196 | |
197 Following this sequence header is the alignment data. Each alignment region | |
198 has a header that describes the start and end coordinates of the alignment in | |
199 each sequence. These coordinates are inclusive and reference the forward strand | |
200 of the current sequence. Thus, if the start coordinate is greater than the end | |
201 coordinate, the alignment is on the reverse strand. The four numbers are the | |
202 start and end in the reference sequence respectively and the start and end in | |
203 the query sequence respectively. These coordinates are always measured in DNA | |
204 bases regardless of the alignment data type. The three numbers after the starts | |
205 and stops are the number of errors (non-identities), similarity errors (non- | |
206 positive match scores) and non-alpha characters in the sequence (used to count | |
207 stop codons in promer data). An example header might look like: | |
208 | |
209 5198 22885 5389 23089 20 20 0 | |
210 | |
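As an illustration of the alignment header described above, here is a small Python sketch that splits the seven fields, computes the inclusive length of each coordinate pair, and applies the "start greater than end means reverse strand" rule. The names AlignmentHeader, parse_alignment_header, span and is_reverse are hypothetical and not part of MUMmer.

```python
from collections import namedtuple

# Field names are illustrative labels for the seven header columns.
AlignmentHeader = namedtuple(
    "AlignmentHeader",
    "ref_start ref_end qry_start qry_end errors sim_errors stops",
)


def parse_alignment_header(line):
    """Parse one seven-column alignment header line into integers."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError("expected 7 fields, got %d" % len(fields))
    return AlignmentHeader(*(int(f) for f in fields))


def span(start, end):
    """Length of an inclusive coordinate pair, in either orientation."""
    return abs(end - start) + 1


def is_reverse(header):
    """True if either coordinate pair runs from high to low."""
    return (header.ref_start > header.ref_end
            or header.qry_start > header.qry_end)


if __name__ == "__main__":
    h = parse_alignment_header("5198 22885 5389 23089 20 20 0")
    print(span(h.ref_start, h.ref_end), span(h.qry_start, h.qry_end),
          is_reverse(h))
```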
211 Each of these headers is followed by a series of signed integers, one per line, | |
212 with the final line before the next header equaling 0 (zero). Each integer | |
213 represents the distance to the next insertion in the reference (positive int) | |
214 or deletion in the reference (negative int), as measured in DNA bases or amino | |
215 acids depending on the alignment data type. For example, with 'nucmer' the | |
216 delta sequence (1, -3, 4, 0) would represent an insertion at positions 1 and 7 | |
217 in the reference sequence and an insertion at position 3 in the query sequence. | |
218 Or with letters: | |
219 | |
220 A = acgtagctgag$ | |
221 B = cggtagtgag$ | |
222 Delta = (1, -3, 4, 0) | |
223 A = acg.tagctgag$ | |
224 B = .cggtag.tgag$ | |
225 | |
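The letter example above can be reproduced with a short Python sketch. It assumes the reading of the delta values given in this section: a positive value places a gap in the query (B) opposite a reference base, a negative value places a gap in the reference (A) opposite a query base, and each value counts from the previous indel. The function name apply_delta is hypothetical, and the '$' terminators are left out of the inputs.

```python
def apply_delta(ref, qry, deltas, gap="."):
    """Rebuild a gapped alignment from two ungapped strings and a delta list.

    ref and qry are the aligned regions without gaps; deltas is the
    sequence of signed integers, optionally ending with 0.
    """
    ref_out, qry_out = [], []
    i = j = 0                       # next unread position in ref and qry
    for d in deltas:
        if d == 0:                  # 0 marks the end of the delta sequence
            break
        run = abs(d) - 1            # aligned columns before the next indel
        ref_out.append(ref[i:i + run])
        qry_out.append(qry[j:j + run])
        i += run
        j += run
        if d > 0:                   # extra base in the reference
            ref_out.append(ref[i])
            qry_out.append(gap)
            i += 1
        else:                       # extra base in the query
            ref_out.append(gap)
            qry_out.append(qry[j])
            j += 1
    ref_out.append(ref[i:])         # remaining columns contain no indels
    qry_out.append(qry[j:])
    return "".join(ref_out), "".join(qry_out)


if __name__ == "__main__":
    a, b = apply_delta("acgtagctgag", "cggtagtgag", [1, -3, 4, 0])
    print(a)   # acg.tagctgag
    print(b)   # .cggtag.tgag
```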
226 Using this delta information, it is possible to re-generate the alignment | |
227 calculated by 'nucmer' or 'promer' as is done in the 'show-coords' program. This | |
228 allows various utilities to be crafted to process and analyze the alignment | |
229 data using a universal format. Below is what a .delta file might look like: | |
230 | |
231 /home/username/reference.fasta /home/username/query.fasta | |
232 NUCMER | |
233 >tagA1 tagB1 500 2000000 | |
234 88 198 1641558 1641668 0 0 0 | |
235 0 | |
236 167 4877 1 4714 15 15 0 | |
237 2456 | |
238 1 | |
239 -11 | |
240 769 | |
241 950 | |
242 1 | |
243 1 | |
244 -142 | |
245 -1 | |
246 0 | |
247 >tagA2 tagB4 50000 30000 | |
248 5198 22885 5389 23089 18 18 0 | |
249 -6 | |
250 -32 | |
251 -1 | |
252 -1 | |
253 -1 | |
254 7 | |
255 1130 | |
256 0 | |
257 | |
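A .delta file laid out as in the example above could be read with a sketch like the following. It is not the parser used by the 'show-*' programs; it assumes that input file paths contain no spaces, that alignment headers always have seven fields, and that delta values are one integer per line. The function name parse_delta and the dictionary keys are hypothetical.

```python
def parse_delta(text):
    """Parse the text of a .delta file into (files, data_type, pairs).

    pairs is a list of dicts, one per '>' header, each holding a list of
    alignments with their coordinates, error counts and delta values.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("a .delta file needs at least two header lines")
    files = tuple(lines[0].split())      # the two original input files
    data_type = lines[1]                 # "NUCMER" or "PROMER"
    keys = ("ref_start", "ref_end", "qry_start", "qry_end",
            "errors", "sim_errors", "stops")
    pairs = []
    for ln in lines[2:]:
        fields = ln.split()
        if ln.startswith(">"):           # sequence pair header
            pairs.append({"ref_id": fields[0][1:], "qry_id": fields[1],
                          "ref_len": int(fields[2]),
                          "qry_len": int(fields[3]),
                          "alignments": []})
        elif len(fields) == 7:           # alignment header
            aln = dict(zip(keys, (int(f) for f in fields)))
            aln["deltas"] = []
            pairs[-1]["alignments"].append(aln)
        elif len(fields) == 1:           # one delta value
            value = int(fields[0])
            if value != 0:               # 0 only terminates the list
                pairs[-1]["alignments"][-1]["deltas"].append(value)
        else:
            raise ValueError("unrecognized line: %r" % ln)
    return files, data_type, pairs
```

For instance, iterating over pairs and printing len(aln["deltas"]) for each alignment gives the number of indels recorded for it.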
258 | |
259 | |
260 *** .cluster OUTPUT *** | |
261 | |
262 This output format is for debugging purposes and is now only available by | |
263 using the -d switch for the 'postnuc' program. | |
264 | |
265 This output file is a list of the match clusters that were generated by the | |
266 'mgaps' MUMmer3.0 program. It is primarily a 5 column list, with the exception | |
267 of the headers to be described later. Two example rows could read: | |
268 | |
269 1788 1622 59 - - | |
270 1857 1691 23 10 10 | |
271 | |
272 Where the first column is the start coordinate of the match in the reference | |
273 sequence, the second column is the start coordinate of the match in the query | |
274 sequence, the third column is the length of the match, and the two final | |
275 columns are the distance between the previous match's end and the current | |
276 match's start (the gap distance). All coordinates reference the forward strand | |
277 of each sequence, regardless of match direction, and are ALWAYS measured in | |
278 DNA bases regardless of alignment data type (DNA or amino acid). | |
279 Each individual cluster is preceded by two digits (1 or -1). These two | |
280 digits represent the direction of the cluster, either forward or reverse | |
281 complement, in each sequence. A " 1 -1" would represent a match on the forward | |
282 strand of the reference and the reverse strand of the query, while a " 1 1" | |
283 would represent a forward match on each strand. Take note that since the | |
284 match coordinates reference the forward strand, forward matches will have | |
285 ascending coordinates and reverse matches will have descending coordinates. Also, | |
286 since the query is the only sequence ever reverse complemented, expect the | |
287 first digit on the cluster header to always be 1. | |
288 There are also 3 other types of headers. The first line of each .cluster | |
289 file lists the two original input files separated by a space. The second line | |
290 of each .cluster file lists the type of alignment data, either "NUCMER" or | |
291 "PROMER". The third type of header resembles a FASTA header, and lists the | |
292 two sequences that produced the following clusters after a '>' and their | |
293 respective lengths separated by whitespace. Note that each of these headers | |
294 is unique, so all clusters/matches between any two sequences will appear under | |
295 a single header identifying those two sequences. Below is a short example of | |
296 what a .cluster file might look like: | |
297 | |
298 /home/username/reference.fasta /home/username/query.fasta | |
299 NUCMER | |
300 >tagA1 tagB1 1000 2000000 | |
301 1 1 | |
302 88 1641558 111 - - | |
303 1 1 | |
304 183 17 22 - - | |
305 238 72 108 33 33 | |
306 347 181 92 1 1 | |
307 458 292 50 19 19 | |
308 509 343 35 1 1 | |
309 >tagA2 tagB1 100000 2000000 | |
310 1 -1 | |
311 86855 102105 23 - - | |
312 86882 102078 77 4 4 |
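The .cluster layout described above can be read with a sketch along the following lines. It assumes the three kinds of lines can be told apart by their shape alone: '>' headers, two-field direction lines, and five-field match rows. The helper names parse_cluster, cluster_length and reference_gaps are hypothetical. cluster_length sums the match lengths, which is how the -c|mincluster description defines the length of a cluster, and reference_gaps recomputes the reference-side gap from the start coordinates and lengths.

```python
def parse_cluster(text):
    """Parse the text of a .cluster file into (files, data_type, pairs)."""
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    files, data_type = tuple(rows[0]), rows[1][0]
    pairs = []
    for fields in rows[2:]:
        if fields[0].startswith(">"):    # sequence pair header
            pairs.append({"ref_id": fields[0][1:], "qry_id": fields[1],
                          "ref_len": int(fields[2]),
                          "qry_len": int(fields[3]),
                          "clusters": []})
        elif len(fields) == 2:           # cluster direction line
            pairs[-1]["clusters"].append({"ref_dir": int(fields[0]),
                                          "qry_dir": int(fields[1]),
                                          "matches": []})
        else:                            # match row; gap columns ignored
            ref_start, qry_start, length = (int(f) for f in fields[:3])
            pairs[-1]["clusters"][-1]["matches"].append(
                (ref_start, qry_start, length))
    return files, data_type, pairs


def cluster_length(cluster):
    """Sum of the lengths of the matches in one cluster."""
    return sum(length for _, _, length in cluster["matches"])


def reference_gaps(cluster):
    """Reference-side distance from each match's end to the next start."""
    m = cluster["matches"]
    return [m[k][0] - (m[k - 1][0] + m[k - 1][2]) for k in range(1, len(m))]
```

For the second cluster of the tagA1/tagB1 pair in the example, reference_gaps returns 33, 1, 19 and 1, the same values that appear in that cluster's gap columns.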