comparison CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/mummer-3.23/docs/promer.README @ 69:33d812a61356

planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author jpayne
date Tue, 18 Mar 2025 17:55:14 -0400
67:0e9998148a16 69:33d812a61356
1 --------------------------------------------------------------------------------
2 PROmer3.0:
3 An extension of the MUMmer package that calculates alignments
4 between two DNA multi-fasta files using all 6 translated amino acid
5 reading frames.
6
7 Use Cases:
8 + comparing two fairly divergent genomes that have large rearrangements
9 and may only be similar on the protein level
10 + comparative genome annotation, i.e. using an already annotated genome
11 to help in the annotation of a newly sequenced genome
12 + identifying syntenic regions between highly divergent genomes
13
14 If any of this code is used in any publication, please cite the following:
15
16 Versatile and open software for comparing large genomes.
17 S. Kurtz, A. Phillippy, A.L. Delcher,
18 M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg.
19 Genome Biology (2004), 5:R12.
20
21 --------------------------------------------------------------------------------
22
23 ** NOTE **
24 This manual is outdated; please refer to the HTML documentation included in
25 this distribution or at:
26
27 http://mummer.sourceforge.net
28 http://mummer.sourceforge.net/manual
29 http://mummer.sourceforge.net/examples
30
31 -- DESCRIPTION --
32 PROmer3.0 (PROtein MUMmer) is a suite of programs to modify and refine
33 the basic output of the MUMmer3.0 matching program 'mummer'. PROmer pre-
34 processes the DNA multi-FASTA input files, and translates them in all 6
35 amino acid reading frames so that they can be examined by the match finding
36 routine. The matches are then clustered, and the matches within
37 clusters are extended via Smith-Waterman techniques in order to expand
38 the total alignment coverage and close the gaps between clustered MUMs. The
39 "out.delta" file contains the final alignment data, encoded
40 with a style called delta encoding. Any of the 'show-*' programs can
41 parse this file and present its information in a human-readable format.
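
The six-frame translation step described above can be illustrated with a
short Python sketch. This is not PROmer's own code: the frame numbering
(1, 2, 3 forward and -1, -2, -3 on the reverse complement), the use of the
standard genetic code, and the choice to render unrecognized codons as 'X'
are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unrecognized codons become 'X' (assumption)."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Map frame number (1, 2, 3, -1, -2, -3) to its amino acid translation."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames
```

For example, six_frames("ATGGCC")[1] is "MA", and a masked codon such as
"NNN" translates to 'X', which would not match any real amino acid.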
42
43
44 -- PROmer3.0 EXAMPLE --
45 To compare two eukaryotic genomes, genome1.fasta and genome2.fasta
46 (all chromosomes vs all chromosomes), type:
47
48 "promer -o -p output genome1.fasta genome2.fasta"
49
50 Output will be...
51 output.delta // alignment data encoded with delta encoding
52 output.coords // list of alignments, % identity, etc...
53
54 To generate more output, investigate the options of any of the 'show-*'
55 programs. These programs can interpret the .delta output of PROmer and provide
56 useful information regarding the alignment. In addition, dotplots can be
57 generated (if you have gnuplot installed) via the 'mummerplot' script. Also,
58 the 'delta-filter' utility is very useful for removing chance and repeat-induced
59 alignments. It can significantly reduce the number of alignments in the promer
60 output, making it easier to interpret (see the HTML manual for more information).
61
62
63 -- RUNNING 'promer' --
64
65 USAGE: promer [options] <Reference> <Query>
66
67
68 MANDATORY:
69 Reference Set the input reference multi-FASTA DNA file to "Reference"
70 Query Set the input query multi-FASTA DNA file to "Query"
71
72
73 OPTIONS:
74 --mum Use only maximal exact matches that are unique in both the
75 query and reference sequences as the alignment anchors.
76
77 --mumreference Use only maximal exact matches that are unique in the
78 reference sequences as the alignment anchors.
79
80 --maxmatch Use all maximal exact matches as the alignment anchors.
81
82 -b breakLen Set the distance an alignment extension will attempt to
83 extend poor scoring regions before giving up. The default
84 distance is 60. This distance should be measured in amino
85 acids, and it affects the tolerance to error of the
86 alignment extensions. A higher value will result in greater
87 tolerance to error in hopes of finding good alignments on
88 the other side of a poorly scoring region.
89
90 -c|mincluster Sets the minimum length of a cluster. The default value is
91 20. This length should be measured in amino acids, and the
92 length of a match cluster is determined by the sum of the
93 lengths of the matches within. A higher value will decrease
94 the sensitivity of the alignment, but will also result in
95 more confident results.
96
97 --[no]delta Toggles the creation of the delta file. The default
98 behavior is --delta, but disabling the delta file will
99 speed up the finishing stage by not creating alignments.
100 Note that --nodelta implies --noextend.
101
102 --depend Print the dependency information and exit.
103
104 -d|diagfactor Set the clustering fraction of separation for diagonal
105 difference. The default value is .11. A higher value will
106 increase the tolerance of the clustering algorithm and
107 allow for more indels in a cluster.
108
109 --[no]extend Toggles the outward extension of alignments from their
110 anchoring clusters. The default behavior is --extend, but
111 disabling the extensions will speed up the finishing stage
112 by not extending alignments. Clusters will still be fused
113 into alignments, but they will not be expanded outward.
114
115 -g|maxgap Set the maximum gap between two adjacent matches in a
116 cluster. The default value is 30. This gap distance should
117 be measured in amino acids. A smaller value will result in
118 smaller (but more) clusters; a larger value will result in
119 larger (but fewer) clusters.
120
121 -h
122 --help Display help information and exit.
123
124 -l|minmatch Set the minimum length of a single match. The default value
125 is 6. This value should be measured in amino acids.
126 Reducing this value will possibly increase the sensitivity
127 of the alignment, but it will also allow for chance or
128 "noise" matches. Take note that lowering this value will
129 significantly increase runtime.
130
131 -o
132 --coords Automatically generate the "out.coords" file using the
133 'show-coords' program. This file lists all the alignments
134 sorted by their reference coordinate in a user-friendly
135 format, without requiring the user to run 'show-coords'
136 independently of promer.
137
138 --[no]optimize Toggle alignment score optimization, i.e. if an alignment
139 extension reaches the end of a sequence, it will backtrack
140 to optimize the alignment score instead of terminating the
141 alignment at the end of the sequence. By turning this
142 option off, alignments within -b AAs of the sequence end
143 will be forced to extend to the end. Default behavior is
144 --optimize; --nooptimize will result in longer alignments
145 but may lead to lower alignment scores.
146
147 -p|prefix Set the prefix of the output files. The default prefix is
148 "out". Take note that promer will allow the user to
149 overwrite existing files, so a unique prefix should be used
150 for each subsequent run of promer to avoid data loss.
151
152 -V
153 --version Display the version information and exit
154
155 -x|matrix Set the BLOSUM matrix number. The default
156 value is "2" (BLOSUM 62), other available choices include
157 "1" (BLOSUM 45) and "3" (BLOSUM 80).
158
159
160 -- NOTES --
161 When comparing two entire genomes, it is very helpful to mask the
162 "uninteresting" regions of input using a utility such as "nseg" or "dust".
163 This will allow the program to focus solely on aligning the regions of
164 interest. Unrecognized codons are never matched, so almost any masking
165 character is appropriate; we recommend 'N' or 'X'.
166 Since 'promer' runs so quickly, it can be useful to run it numerous times
167 with different parameters to fine-tune the resulting alignment and include or
168 exclude missed or chance matches. It is also helpful to try the different
169 uniqueness switches to attain the appropriate level of detail in the resulting
170 output.
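
As a sketch of the parameter tuning suggested above, the following Python
helper assembles promer command lines that differ in their -l (minmatch)
value and gives each run its own -p prefix, so that no run overwrites
another's output. The helper names and the "run_l<N>" prefix scheme are
hypothetical; only the flags themselves come from the option list in this
README.

```python
ANCHOR_FLAGS = ("--mum", "--mumreference", "--maxmatch")


def promer_command(reference, query, prefix,
                   minmatch=6, mincluster=20, maxgap=30, anchor_flag=None):
    """Build an argv list for one promer run (hypothetical helper)."""
    argv = ["promer", "-p", prefix,
            "-l", str(minmatch), "-c", str(mincluster), "-g", str(maxgap)]
    if anchor_flag is not None:
        if anchor_flag not in ANCHOR_FLAGS:
            raise ValueError(f"unknown uniqueness switch: {anchor_flag}")
        argv.append(anchor_flag)
    argv += [reference, query]
    return argv


def minmatch_sweep(reference, query, minmatch_values):
    """One command per minmatch value, each with a unique output prefix."""
    return [
        promer_command(reference, query, prefix=f"run_l{value}", minmatch=value)
        for value in minmatch_values
    ]
```

Each returned list can be handed to subprocess.run(argv, check=True),
assuming promer is installed and on the PATH.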
171
172
173
174 -- OUTPUT FILES --
175
176 *** .delta OUTPUT ***
177
178 This output file is a representation of the all-vs-all alignment between
179 the sequences contained in the multi-FASTA input files. It catalogs the
180 coordinates of aligned regions and the distance between insertions and deletions
181 contained in these alignment regions. The first two lines of the file are
182 identical to the .cluster output. The first line lists the two original input
183 files separated by a space, and the second line specifies the alignment data
184 type, either "NUCMER" or "PROMER". Every grouping of alignment regions has
185 a header, just like the cluster's header in the .cluster file. This is a FASTA
186 style header that lists, after a '>', the two sequences that produced the
187 following alignments, separated by a space. After the two sequence names are
188 the lengths of those sequences in the same order. An example header might look like:
189
190 >tagA1 tagB1 500 2000000
191
192 Following this sequence header is the alignment data. Each alignment region
193 has a header that describes the start and end coordinates of the alignment in
194 each sequence. These coordinates are inclusive and reference the forward strand
195 of the current sequence. Thus, if the start coordinate is greater than the end
196 coordinate, the alignment is on the reverse strand. The first four numbers are
197 the start and end in the reference sequence, followed by the start and end in
198 the query sequence. These coordinates are ALWAYS measured in DNA
199 bases regardless of the alignment data type. The three numbers after the starts
200 and stops are the number of errors (non-identities), similarity errors
201 (non-positive match scores) and stop codons. An example header might look like:
202
203 2631 3401 2464 3234 15 15 2
204
205 Notice that the start coordinate points to the first base in the first codon,
206 and the end coordinate points to the last base in the last codon. As a result,
207 (|end - start| + 1) % 3 = 0; the absolute value covers reverse-strand alignments.
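
The header layout just described can be checked with a few lines of Python.
This is a minimal sketch, not part of the MUMmer distribution; the function
name and the dictionary keys are made up for illustration.

```python
def parse_alignment_header(line):
    """Split a 7-field alignment header and derive strand and codon counts."""
    fields = [int(token) for token in line.split()]
    if len(fields) != 7:
        raise ValueError("expected 7 integers in an alignment header")
    ref_start, ref_end, qry_start, qry_end, errors, sim_errors, stops = fields

    # Coordinates are inclusive DNA positions on the forward strand.
    ref_bases = abs(ref_end - ref_start) + 1
    qry_bases = abs(qry_end - qry_start) + 1
    return {
        "ref": (ref_start, ref_end),
        "qry": (qry_start, qry_end),
        "ref_reverse": ref_start > ref_end,
        "qry_reverse": qry_start > qry_end,
        "ref_bases": ref_bases,
        "qry_bases": qry_bases,
        # For PROMER data both spans should cover whole codons.
        "whole_codons": ref_bases % 3 == 0 and qry_bases % 3 == 0,
        "errors": errors,
        "similarity_errors": sim_errors,
        "stop_codons": stops,
    }
```

Applied to the example header above, the reference span is 771 bases (257
codons) on the forward strand.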
208 Each of these headers is followed by a list of signed integers, one per line,
209 with the final line before the next header equaling 0 (zero). Each integer
210 represents the distance to the next insertion in the reference (positive int)
211 or deletion in the reference (negative int), as measured in DNA bases OR amino
212 acids depending on the alignment data type. For example, with 'promer' the
213 delta sequence (1, -3, 4, 0) would represent an insertion at positions 1 and 7
214 in the translated reference sequence and an insertion at position 3 in the
215 translated query sequence.
216 Or with letters:
217
218 A = VBPWVPBWPVP$
219 B = BPPWVPWPVP$
220 Delta = (1, -3, 4, 0)
221 A = VBP.WVPBWPVP$
222 B = .BPPWVP.WPVP$
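
The letter example above can be reproduced with a small Python sketch that
walks the delta values and inserts gap characters. It is written from the
description in this README rather than from the MUMmer sources, and the
function name is hypothetical.

```python
def apply_delta(reference, query, deltas, gap="."):
    """Rebuild the gapped alignment of two sequences from delta values."""
    ref_out, qry_out = [], []
    i = j = 0  # next unread position in reference / query
    for delta in deltas:
        if delta == 0:  # a zero terminates the delta list
            break
        run = abs(delta) - 1  # aligned columns before the next indel
        ref_out.append(reference[i:i + run])
        qry_out.append(query[j:j + run])
        i += run
        j += run
        if delta > 0:
            # Extra character in the reference: gap in the query.
            ref_out.append(reference[i])
            qry_out.append(gap)
            i += 1
        else:
            # Extra character in the query: gap in the reference.
            ref_out.append(gap)
            qry_out.append(query[j])
            j += 1
    # Whatever remains after the last indel aligns column by column.
    ref_out.append(reference[i:])
    qry_out.append(query[j:])
    return "".join(ref_out), "".join(qry_out)
```

Calling apply_delta("VBPWVPBWPVP$", "BPPWVPWPVP$", [1, -3, 4, 0]) returns
the two gapped strings shown above.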
223
224 Using this delta information, it is possible to re-generate the alignment
225 calculated by 'nucmer' or 'promer' as is done in the 'show-coords' program. This
226 allows various utilities to be crafted to process and analyze the alignment
227 data using a universal format. Below is what a .delta file might look like:
228
229 /home/username/reference.fasta /home/username/query.fasta
230 PROMER
231 >tagA1 tagB1 3000000 2000000
232 1667803 1667078 1641506 1640769 14 7 2
233 -145
234 -3
235 -1
236 -40
237 0
238 1667804 1667079 1641507 1640770 10 5 3
239 -146
240 -1
241 -1
242 -34
243 0
244 >tagA2 tagB4 4000 3000
245 2631 3401 2464 3234 4 0 0
246 0
247 2608 3402 2456 3235 10 5 0
248 7
249 1
250 1
251 1
252 1
253 0
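
A file laid out like the example above could be read with a parser along
these lines. This is a minimal sketch under stated assumptions: the input
file paths contain no spaces, blank lines can be ignored, and every
alignment is closed by a 0 line. The function and key names are
hypothetical.

```python
def parse_delta(lines):
    """Parse .delta text (an iterable of lines) into a dictionary."""
    rows = [line.strip() for line in lines if line.strip()]
    ref_path, qry_path = rows[0].split()
    result = {"reference": ref_path, "query": qry_path,
              "type": rows[1], "alignments": []}

    pair = None      # current '>' sequence header
    current = None   # alignment still waiting for its terminating 0
    for row in rows[2:]:
        if row.startswith(">"):
            ref_id, qry_id, ref_len, qry_len = row[1:].split()
            pair = (ref_id, qry_id, int(ref_len), int(qry_len))
            continue
        numbers = [int(token) for token in row.split()]
        if len(numbers) == 7:
            current = {"pair": pair,
                       "coords": tuple(numbers[:4]),
                       "errors": tuple(numbers[4:]),
                       "deltas": []}
        elif len(numbers) == 1 and current is not None:
            if numbers[0] == 0:
                result["alignments"].append(current)
                current = None
            else:
                current["deltas"].append(numbers[0])
        else:
            raise ValueError(f"unexpected line: {row!r}")
    return result
```

Passing an open file object works as well, since iterating over it yields
lines: parse_delta(open("out.delta")).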
254
255
256
257 *** .cluster OUTPUT ***
258
259 This output format is for debugging purposes and is now only available by
260 using the -d switch for the 'postnuc' program.
261
262 This output file is a list of the match clusters that were generated by the
263 'mgaps' MUMmer3.0 program. It is primarily a 5-column list, with the exception
264 of the headers to be described later. Two example rows could read:
265
266 1788 1622 59 - -
267 1857 1691 23 10 10
268
269 Where the first column is the start coordinate of the match in the reference
270 sequence, the second column is the start coordinate of the match in the query
271 sequence, the third column is the length of the match, and the two final
272 columns are the distance between the previous match's end and the current
273 match's start (the gap distance). All coordinates reference the forward strand
274 of each sequence, regardless of match direction, and are ALWAYS measured in
275 DNA bases regardless of alignment data type (DNA or amino acid). Therefore,
276 when running 'promer', all the numbers in the length column must be multiples
277 of three.
278 Each individual cluster is preceded by two numbers, each one of -3, -2, -1, 1, 2 or 3. These
279 two numbers represent the reading frames of the cluster, either forward or
280 reverse with offsets of 1,2 or 3. A " 3 -1" would represent a match on the
281 forward 3rd reading frame in the reference and on the reverse 1st reading frame
282 in the query sequence. Take note that since the match coordinates reference the
283 forward DNA strand, forward matches will have ascending coordinates and reverse
284 matches will have descending coordinates. The reference may also be reversed in this
285 file, so expect the first number to sometimes be negative.
286 There are also 3 other types of headers. The first line of each .cluster
287 file lists the two original input files separated by a space. The second line
288 of each .cluster file lists the type of alignment data, either "NUCMER" or
289 "PROMER". The third type of header resembles a FASTA header, and lists the
290 two sequences that produced the following clusters after a '>' and their
291 respective lengths, separated by whitespace. Note that each of these headers
292 is unique, so all clusters/matches between any two sequences will appear under
293 a single header identifying those two sequences. Below is a short example of
294 what a .cluster file might look like:
295
296 /home/username/reference.fasta /home/username/query.fasta
297 PROMER
298 >tagA1 tagB1 1000 2000000
299 1 3
300 184 18 21 - -
301 223 57 123 18 18
302 3 2
303 168 2 30 - -
304 288 122 51 90 90
305 354 188 84 15 15
306 483 317 24 45 45
307 558 392 81 51 51
308 642 476 144 3 3
309 >tagA2 tagB1 2000000 2000000
310 -3 -2
311 1665663 1641799 18 - -
312 1665585 1641712 21 60 69
313 1665546 1641673 39 18 18
314
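
The .cluster layout shown above can be read in much the same way. The sketch
below is an illustration only, assuming file paths without spaces and the
three row shapes described in this section (a '>' sequence header, a
two-number reading frame line, and a 5-column match row). Function and key
names are hypothetical.

```python
def parse_clusters(lines):
    """Parse .cluster text (an iterable of lines) into a dictionary."""
    rows = [line.strip() for line in lines if line.strip()]
    result = {"files": tuple(rows[0].split()), "type": rows[1], "clusters": []}

    pair = None      # current '>' sequence header
    cluster = None   # cluster currently collecting match rows
    for row in rows[2:]:
        fields = row.split()
        if row.startswith(">"):
            pair = (fields[0][1:], fields[1], int(fields[2]), int(fields[3]))
        elif len(fields) == 2:
            # Reading frames of the reference and the query.
            cluster = {"pair": pair,
                       "frames": (int(fields[0]), int(fields[1])),
                       "matches": []}
            result["clusters"].append(cluster)
        elif len(fields) == 5 and cluster is not None:
            ref_start, qry_start, length = (int(f) for f in fields[:3])
            # The first match of a cluster has '-' in both gap columns.
            gaps = None if fields[3] == "-" else (int(fields[3]), int(fields[4]))
            cluster["matches"].append((ref_start, qry_start, length, gaps))
        else:
            raise ValueError(f"unexpected line: {row!r}")
    return result
```

With PROMER data, every match length returned by this parser should be a
multiple of three, which makes a convenient sanity check.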