--------------------------------------------------------------------------------
    PROmer3.0:
        An extension of the MUMmer package that calculates alignments
    between two DNA multi-FASTA files using all 6 translated amino acid
    reading frames.

    Use Cases:
      + comparing two fairly divergent genomes that have large rearrangements
        and may only be similar on the protein level
      + comparative genome annotation, i.e. using an already annotated genome
        to help in the annotation of a newly sequenced genome
      + identifying syntenic regions between highly divergent genomes

    If any of this code is used in any publication, please cite the following:

      Versatile and open software for comparing large genomes.
      S. Kurtz, A. Phillippy, A.L. Delcher,
      M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg.
      Genome Biology (2004), 5:R12.

--------------------------------------------------------------------------------

** NOTE **
This manual is outdated. Please refer to the HTML documentation included in
this distribution or at:

    http://mummer.sourceforge.net
    http://mummer.sourceforge.net/manual
    http://mummer.sourceforge.net/examples

-- DESCRIPTION --
    PROmer3.0 (PROtein MUMmer) is a suite of programs that modify and refine
the basic output of the MUMmer3.0 matching program 'mummer'. PROmer
preprocesses the DNA multi-FASTA input files and translates them in all 6
amino acid reading frames so that they can be examined by the match-finding
routine. The matches are then clustered, and the matches within each cluster
are extended via Smith-Waterman techniques in order to expand the total
alignment coverage and close the gaps between clustered MUMs. The "out.delta"
file contains the final alignment data, encoded in a style called delta
encoding. Any of the 'show-*' programs can parse this file and present its
information in a human-readable format.


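    The six-frame translation step described above can be illustrated with a
short Python sketch. This is not PROmer's own code: the function names are
hypothetical, the standard genetic code is assumed, and the frame numbering
(1..3 forward, -1..-3 read from the start of the reverse complement) is an
assumption that may not match PROmer's internal convention. Codons containing
unrecognized characters are translated to 'X'.

```python
# Illustrative only: hypothetical helpers, not part of the MUMmer package.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unrecognized codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Map frame number (1, 2, 3, -1, -2, -3) to its translation."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames
```

For instance, six_frames("ATGGCC")[1] is "MA", and a masked codon such as
"NNN" translates to 'X', which cannot take part in an exact amino acid match.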
-- PROmer3.0 EXAMPLE --
    To compare two eukaryotic genomes, genome1.fasta and genome2.fasta
(all chromosomes vs. all chromosomes), type:

    "promer -o -p output genome1.fasta genome2.fasta"

Output will be...
    output.delta     // alignment data encoded with delta encoding
    output.coords    // list of alignments, % identity, etc...

    To generate more output, investigate the options of the 'show-*'
programs. These programs can interpret the .delta output of PROmer and provide
useful information regarding the alignment. In addition, dotplots can be
generated (if you have gnuplot installed) via the 'mummerplot' script. The
'delta-filter' utility is also very useful for removing chance and
repeat-induced alignments. It can significantly reduce the number of alignments
in the promer output, making it easier to interpret (see the HTML manual for
more information).


-- RUNNING 'promer' --

    USAGE: promer [options] <Reference> <Query>


MANDATORY:
    Reference       Set the input reference multi-FASTA DNA file to "Reference"
    Query           Set the input query multi-FASTA DNA file to "Query"


OPTIONS:
    --mum           Use only maximal exact matches that are unique in both the
                    query and reference sequences as the alignment anchors.

    --mumreference  Use only maximal exact matches that are unique in the
                    reference sequences as the alignment anchors.

    --maxmatch      Use all maximal exact matches as the alignment anchors.

    -b breakLen     Set the distance an alignment extension will attempt to
                    extend poor scoring regions before giving up. The default
                    distance is 60. This distance is measured in amino acids,
                    and it affects the tolerance to error of the alignment
                    extensions. A higher value will result in greater
                    tolerance to error in hopes of finding good alignments on
                    the other side of a poorly scoring region.

    -c|mincluster   Set the minimum length of a cluster. The default value is
                    20. This length is measured in amino acids, and the
                    length of a match cluster is the sum of the lengths of
                    the matches within it. A higher value will decrease the
                    sensitivity of the alignment, but will also result in
                    more confident results.

    --[no]delta     Toggles the creation of the delta file. The default
                    behavior is --delta, but disabling the delta file will
                    speed up the finishing stage by not creating alignments.
                    This option implies --noextend.

    --depend        Print the dependency information and exit.

    -d|diagfactor   Set the clustering fraction of separation for diagonal
                    difference. The default value is .11. A higher value will
                    increase the tolerance of the clustering algorithm and
                    allow for more indels in a cluster.

    --[no]extend    Toggles the outward extension of alignments from their
                    anchoring clusters. The default behavior is --extend, but
                    disabling the extensions will speed up the finishing stage
                    by not extending alignments. Clusters will still be fused
                    into alignments, but they will not be expanded outward.

    -g|maxgap       Set the maximum gap between two adjacent matches in a
                    cluster. The default value is 30. This gap distance is
                    measured in amino acids. A smaller value will result in
                    smaller (but more) clusters; a larger value will result in
                    larger (but fewer) clusters.

    -h
    --help          Display help information and exit.

    -l|minmatch     Set the minimum length of a single match. The default value
                    is 6. This value is measured in amino acids. Reducing this
                    value may increase the sensitivity of the alignment, but
                    it will also allow more chance or "noise" matches. Take
                    note that lowering this value will significantly increase
                    runtime.

    -o
    --coords        Automatically generate the "out.coords" file using the
                    'show-coords' program. This file lists all the alignments
                    sorted by their reference coordinate in a user-friendly
                    format, without requiring the user to run 'show-coords'
                    independently of promer.

    --[no]optimize  Toggle alignment score optimization, i.e. if an alignment
                    extension reaches the end of a sequence, it will backtrack
                    to optimize the alignment score instead of terminating the
                    alignment at the end of the sequence. By turning this
                    option off, alignments within -b AAs of the sequence end
                    will be forced to extend to the end. The default behavior
                    is --optimize; --nooptimize will result in longer
                    alignments but may lead to lower alignment scores.

    -p|prefix       Set the prefix of the output files. The default prefix is
                    "out". Take note that promer will allow the user to
                    overwrite existing files, so a unique prefix should be used
                    for each subsequent run of promer to avoid data loss.

    -V
    --version       Display the version information and exit.

    -x|matrix       Set the BLOSUM matrix number. The default value is "2"
                    (BLOSUM 62); other available choices are "1" (BLOSUM 45)
                    and "3" (BLOSUM 80).


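    When promer is driven from a script, it can help to assemble the command
line in one place. The sketch below is only an illustration: the function name
and its keyword arguments are hypothetical, and it uses only switches listed
in the OPTIONS section above. It builds an argument list and does not run
anything itself.

```python
import shlex


def build_promer_command(reference, query, prefix="out", coords=False,
                         anchors=None, minmatch=None, mincluster=None,
                         maxgap=None):
    """Return a promer argument list built from the documented options.

    Hypothetical helper: names and defaults here are assumptions made for
    illustration, not part of the MUMmer distribution.
    """
    cmd = ["promer"]
    if anchors is not None:
        if anchors not in ("mum", "mumreference", "maxmatch"):
            raise ValueError("unknown anchor mode: %r" % anchors)
        cmd.append("--" + anchors)
    # -l, -c and -g are all measured in amino acids.
    for flag, value in (("-l", minmatch), ("-c", mincluster), ("-g", maxgap)):
        if value is not None:
            cmd += [flag, str(value)]
    if coords:
        cmd.append("-o")  # also write <prefix>.coords
    cmd += ["-p", prefix, reference, query]
    return cmd


example = build_promer_command("genome1.fasta", "genome2.fasta",
                               prefix="output", coords=True)
print(shlex.join(example))
```

The resulting list can be passed to subprocess.run(cmd, check=True) if promer
is on the PATH. Because promer overwrites existing files, a caller would
normally choose a fresh prefix for every run.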
-- NOTES --
    When comparing two entire genomes, it is very helpful to mask the
"uninteresting" regions of input using a utility such as "nseg" or "dust".
This allows the program to focus solely on aligning the regions of interest.
Unrecognized codons are never matched, so almost any masking character is
appropriate; we recommend 'N' or 'X'.
    Since 'promer' runs so quickly, it can be useful to run it numerous times
with different parameters to fine-tune the resulting alignment and include or
exclude missed or chance matches. It is also helpful to try the different
uniqueness switches (--mum, --mumreference, --maxmatch) to attain the
appropriate level of detail in the resulting output.



-- OUTPUT FILES --

*** .delta OUTPUT ***

    This output file is a representation of the all-vs-all alignment between
the sequences contained in the multi-FASTA input files. It catalogs the
coordinates of aligned regions and the distance between insertions and deletions
contained in these alignment regions. The first two lines of the file are
identical to the .cluster output. The first line lists the two original input
files separated by a space, and the second line specifies the alignment data
type, either "NUCMER" or "PROMER". Every grouping of alignment regions has
a header, just like the cluster's header in the .cluster file. This is a
FASTA-style header that lists, after a '>', the two sequences that produced the
following alignments, separated by a space, followed by the lengths of those
sequences in the same order. An example header might look like:

    >tagA1 tagB1 500 2000000

    Following this sequence header is the alignment data. Each alignment region
has a header that describes the start and end coordinates of the alignment in
each sequence. These coordinates are inclusive and reference the forward strand
of the current sequence. Thus, if the start coordinate is greater than the end
coordinate, the alignment is on the reverse strand. The first four numbers are
the start and end in the reference sequence, followed by the start and end in
the query sequence. These coordinates are ALWAYS measured in DNA bases
regardless of the alignment data type. The three numbers after the starts and
stops are the number of errors (non-identities), similarity errors
(non-positive match scores) and stop codons. An example header might look like:

    2631 3401 2464 3234 15 15 2

    Notice that the start coordinate points to the first base in the first codon,
and the end coordinate points to the last base in the last codon, so that the
number of bases covered, |end - start| + 1, is always a multiple of 3.

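    The alignment header rules above (seven integers, inclusive coordinates,
reverse strand when start > end, codon-aligned spans) can be checked with a
few lines of Python. This is a minimal sketch; the function and key names are
hypothetical and are not part of any MUMmer tool.

```python
def parse_alignment_header(line):
    """Parse a PROMER alignment header line into a dictionary.

    Hypothetical helper for illustration. Expects the seven integers
    described above: ref start/end, query start/end, errors,
    similarity errors and stop codons.
    """
    fields = [int(x) for x in line.split()]
    if len(fields) != 7:
        raise ValueError("expected 7 integers, got %d" % len(fields))
    ref_start, ref_end, qry_start, qry_end, errors, sim_errors, stops = fields

    def span(start, end):
        # Coordinates are inclusive, and start > end on the reverse strand.
        return abs(end - start) + 1

    return {
        "ref_strand": "+" if ref_start <= ref_end else "-",
        "qry_strand": "+" if qry_start <= qry_end else "-",
        "ref_bases": span(ref_start, ref_end),
        "qry_bases": span(qry_start, qry_end),
        "errors": errors,
        "sim_errors": sim_errors,
        "stops": stops,
    }


def is_codon_aligned(header):
    """True if both spans cover a whole number of codons."""
    return header["ref_bases"] % 3 == 0 and header["qry_bases"] % 3 == 0
```

Applied to the example header "2631 3401 2464 3234 15 15 2", both spans are
771 bases (257 codons) on the forward strand.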
    Each of these headers is followed by a string of signed integers, one per
line, with the final line before the next header equaling 0 (zero). Each integer
represents the distance to the next insertion in the reference (positive int)
or deletion in the reference (negative int), as measured in DNA bases OR amino
acids depending on the alignment data type. For example, with 'promer' the
delta sequence (1, -3, 4, 0) would represent an insertion at positions 1 and 7
in the translated reference sequence and an insertion at position 3 in the
translated query sequence. Or with letters:

    A = VBPWVPBWPVP$
    B = BPPWVPWPVP$
    Delta = (1, -3, 4, 0)
    A = VBP.WVPBWPVP$
    B = .BPPWVP.WPVP$

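    The letter example above can be reproduced mechanically. The sketch below
is one straightforward reading of the delta rules, with a hypothetical
function name: each value d copies |d| - 1 aligned positions and then places a
gap in the query (d > 0, extra residue in the reference) or in the reference
(d < 0, extra residue in the query).

```python
def apply_delta(ref, qry, deltas, gap="."):
    """Rebuild a gapped alignment of ref and qry from a delta sequence.

    Hypothetical helper for illustration; ref and qry are the already
    extracted (and, for PROMER data, translated) aligned regions.
    """
    out_ref, out_qry = [], []
    i = j = 0  # next unread position in ref and qry
    for d in deltas:
        if d == 0:  # a zero terminates the delta sequence
            break
        run = abs(d) - 1  # aligned positions before the next indel
        out_ref.append(ref[i:i + run])
        out_qry.append(qry[j:j + run])
        i += run
        j += run
        if d > 0:  # insertion in the reference: gap in the query
            out_ref.append(ref[i])
            out_qry.append(gap)
            i += 1
        else:      # deletion in the reference: gap in the reference
            out_ref.append(gap)
            out_qry.append(qry[j])
            j += 1
    # Everything after the last indel aligns position by position.
    out_ref.append(ref[i:])
    out_qry.append(qry[j:])
    return "".join(out_ref), "".join(out_qry)


a, b = apply_delta("VBPWVPBWPVP$", "BPPWVPWPVP$", [1, -3, 4, 0])
print(a)
print(b)
```

With the inputs shown, the two printed lines match the gapped A and B strings
of the example.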
    Using this delta information, it is possible to re-generate the alignment
calculated by 'nucmer' or 'promer' as is done in the 'show-coords' program. This
allows various utilities to be crafted to process and analyze the alignment
data using a universal format. Below is what a .delta file might look like:

/home/username/reference.fasta /home/username/query.fasta
PROMER
>tagA1 tagB1 3000000 2000000
1667803 1667078 1641506 1640769 14 7 2
-145
-3
-1
-40
0
1667804 1667079 1641507 1640770 10 5 3
-146
-1
-1
-34
0
>tagA2 tagB4 4000 3000
2631 3401 2464 3234 4 0 0
0
2608 3402 2456 3235 10 5 0
7
1
1
1
1
0



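    As an illustration of how a utility might read this format, here is a
minimal Python sketch of a .delta reader. The class and function names are
hypothetical, and the parser makes simplifying assumptions that are stated in
the comments (for example, that the input file paths contain no spaces). It is
not the parser used by the 'show-*' programs.

```python
from dataclasses import dataclass, field


@dataclass
class Alignment:
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int
    errors: int
    sim_errors: int
    stops: int
    deltas: list = field(default_factory=list)  # terminating 0 not stored


def parse_delta(text):
    """Parse .delta text into (ref_file, qry_file, data_type, records).

    Hypothetical helper. Assumes well-formed input, file paths without
    spaces, and that any 7-field line is an alignment header while any
    other line under a '>' header is a single delta value.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    ref_file, qry_file = lines[0].split()
    data_type = lines[1]  # "NUCMER" or "PROMER"
    records = []
    current = None
    aln = None
    for ln in lines[2:]:
        if ln.startswith(">"):
            ref_id, qry_id, ref_len, qry_len = ln[1:].split()
            current = {"ref": ref_id, "qry": qry_id,
                       "ref_len": int(ref_len), "qry_len": int(qry_len),
                       "alignments": []}
            records.append(current)
            continue
        fields = ln.split()
        if len(fields) == 7:
            aln = Alignment(*(int(x) for x in fields))
            current["alignments"].append(aln)
        else:
            value = int(fields[0])
            if value != 0:
                aln.deltas.append(value)
    return ref_file, qry_file, data_type, records
```

For the example file above, this yields two sequence-pair records with two
alignments each. In the last alignment the five positive deltas mean the
reference region is five amino acids longer than the query region, which
agrees with the coordinates (795 vs. 780 bases).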
*** .cluster OUTPUT ***

    This output format is for debugging purposes and is now only available by
using the -d switch for the 'postnuc' program.

    This output file is a list of the match clusters that were generated by the
'mgaps' MUMmer3.0 program. It is primarily a 5-column list, with the exception
of the headers described later. Two example rows could read:

    1788 1622 59 - -
    1857 1691 23 10 10

    The first column is the start coordinate of the match in the reference
sequence, the second column is the start coordinate of the match in the query
sequence, and the third column is the length of the match. The two final
columns are the distances between the previous match's end and the current
match's start (the gap distance), in the reference and in the query
respectively. All coordinates reference the forward strand of each sequence,
regardless of match direction, and are ALWAYS measured in DNA bases regardless
of alignment data type (DNA or amino acid). Therefore, when running 'promer',
all the numbers in the length column must be multiples of three.
    Each individual cluster is preceded by two numbers, each one of -3, -2, -1,
1, 2 or 3. These two numbers represent the reading frame of the cluster, either
forward or reverse with offsets of 1, 2 or 3. A " 3 -1" would represent a match
on the forward 3rd reading frame in the reference and on the reverse 1st
reading frame in the query sequence. Take note that since the match coordinates
reference the forward DNA strand, forward matches will have ascending
coordinates and reverse matches will have descending coordinates. The reference
may also be reversed in this file, so expect the first number to sometimes be
negative.
    There are also 3 other types of headers. The first line of each .cluster
file lists the two original input files separated by a space. The second line
of each .cluster file lists the type of alignment data, either "NUCMER" or
"PROMER". The third type of header resembles a FASTA header: after a '>', it
lists the two sequences that produced the following clusters and their
respective lengths, separated by whitespace. Note that each of these headers
is unique, so all clusters/matches between any two sequences will appear under
a single header identifying those two sequences. Below is a short example of
what a .cluster file might look like:

/home/username/reference.fasta /home/username/query.fasta
PROMER
>tagA1 tagB1 1000 2000000
 1  3
184 18 21 - -
223 57 123 18 18
 3  2
168 2 30 - -
288 122 51 90 90
354 188 84 15 15
483 317 24 45 45
558 392 81 51 51
642 476 144 3 3
>tagA2 tagB1 2000000 2000000
-3 -2
1665663 1641799 18 - -
1665585 1641712 21 60 69
1665546 1641673 39 18 18

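    The gap columns in the example above can be recomputed from the start and
length columns. The following sketch shows one way to do so. The function name
is hypothetical, and the reverse-frame arithmetic (treating the start of a
reverse match as its highest forward-strand coordinate) is an assumption
inferred from the example rows rather than taken from the 'mgaps' source.

```python
def expected_gaps(ref_frame, qry_frame, matches):
    """Recompute the two gap columns for the matches of one cluster.

    Hypothetical helper. `matches` is a list of (ref_start, qry_start,
    length) tuples in DNA bases, in file order. A positive frame means
    coordinates ascend; a negative frame means they descend (assumption
    inferred from the example data).
    """
    gaps = []
    for prev, cur in zip(matches, matches[1:]):
        row = []
        for col, frame in ((0, ref_frame), (1, qry_frame)):
            if frame > 0:
                # Previous match covers prev .. prev + length - 1.
                row.append(cur[col] - (prev[col] + prev[2]))
            else:
                # Previous match covers prev down to prev - length + 1.
                row.append((prev[col] - prev[2]) - cur[col])
        gaps.append(tuple(row))
    return gaps


reverse_cluster = [(1665663, 1641799, 18),
                   (1665585, 1641712, 21),
                   (1665546, 1641673, 39)]
print(expected_gaps(-3, -2, reverse_cluster))
```

For the "-3 -2" cluster this reproduces the gap columns of the example
(60 and 69, then 18 and 18), and the forward clusters check out the same way.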