comparison CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/mummer-3.23/docs/promer.README @ 69:33d812a61356
planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author | jpayne |
---|---|
date | Tue, 18 Mar 2025 17:55:14 -0400 |
parents | |
children |
67:0e9998148a16 | 69:33d812a61356 |
---|---|
1 -------------------------------------------------------------------------------- | |
2 PROmer3.0: | |
3 An extension of the MUMmer package that calculates alignments | |
4 between two DNA multi-fasta files using all 6 translated amino acid | |
5 reading frames. | |
6 | |
7 Use Cases: | |
8 + comparing two fairly divergent genomes that have large rearrangements | |
9 and may only be similar on the protein level | |
10 + comparative genome annotation, i.e. using an already annotated genome | |
11 to help in the annotation of a newly sequenced genome | |
12 + identifying syntenic regions between highly divergent genomes | |
13 | |
14 If any of this code is used in any publication, please cite the following: | |
15 | |
16 Versatile and open software for comparing large genomes. | |
17 S. Kurtz, A. Phillippy, A.L. Delcher, | |
18 M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg. | |
19 Genome Biology (2004), 5:R12. | |
20 | |
21 -------------------------------------------------------------------------------- | |
22 | |
23 ** NOTE ** | |
24 This manual is outdated. Please refer to the HTML documentation included in | |
25 this distribution or at: | |
26 | |
27 http://mummer.sourceforge.net | |
28 http://mummer.sourceforge.net/manual | |
29 http://mummer.sourceforge.net/examples | |
30 | |
31 -- DESCRIPTION -- | |
32 PROmer3.0 (PROtein MUMmer) is a suite of programs to modify and refine | |
33 the basic output of the MUMmer3.0 matching program 'mummer'. PROmer pre- | |
34 processes the DNA multi-FASTA input files, and translates them in all 6 | |
35 amino acid reading frames so that they can be examined by the match finding | |
36 routine. The matches are then clustered, and the matches within | |
37 clusters are extended via Smith-Waterman techniques in order to expand | |
38 the total alignment coverage and close the gaps between clustered MUMs. The | |
39 "out.delta" file contains the final alignment data, encoded | |
40 with a style called delta encoding. Any of the 'show-*' programs are able to | |
41 parse this file and present its information in a human readable format. | |
42 | |
43 | |
44 -- PROmer3.0 EXAMPLE -- | |
45 To compare two eukaryotic genomes, genome1.fasta and genome2.fasta, | |
46 (all chromosomes vs all chromosomes) type: | |
47 | |
48 "promer -o -p output genome1.fasta genome2.fasta" | |
49 | |
50 Output will be... | |
51 output.delta // alignment data encoded with delta encoding | |
52 output.coords // list of alignments, % identity, etc... | |
53 | |
54 To generate more output, investigate the options of any of the 'show-*' | |
55 programs. These programs can interpret the .delta output of PROmer and provide | |
56 useful information regarding the alignment. In addition, dotplots can be | |
57 generated (if you have gnuplot installed) via the 'mummerplot' script. Also, | |
58 the 'delta-filter' utility is very useful for removing chance and repeat-induced | |
59 alignments. It can significantly reduce the number of alignments in the promer | |
60 output, making it easier to interpret (see the HTML manual for more information). | |
61 | |
62 | |
63 -- RUNNING 'promer' -- | |
64 | |
65 USAGE: promer [options] <Reference> <Query> | |
66 | |
67 | |
68 MANDATORY: | |
69 Reference Set the input reference multi-FASTA DNA file to "Reference" | |
70 Query Set the input query multi-FASTA DNA file to "Query" | |
71 | |
72 | |
73 OPTIONS: | |
74 --mum Use only maximal exact matches that are unique in both the | |
75 query and reference sequences as the alignment anchors. | |
76 | |
77 --mumreference Use only maximal exact matches that are unique in the | |
78 reference sequences as the alignment anchors. | |
79 | |
80 --maxmatch Use all maximal exact matches as the alignment anchors. | |
81 | |
82 -b breakLen Set the distance an alignment extension will attempt to | |
83 extend poor scoring regions before giving up. The default | |
84 distance is 60. This distance should be measured in amino | |
85 acids, and it affects the tolerance to error of the | |
86 alignment extensions. A higher value will result in greater | |
87 tolerance to error in hopes of finding good alignments on | |
88 the other side of a poorly scoring region. | |
89 | |
90 -c|mincluster Sets the minimum length of a cluster. The default value is | |
91 20. This length should be measured in amino acids, and the | |
92 length of a match cluster is determined by the sum of the | |
93 lengths of the matches within. A higher value will decrease | |
94 the sensitivity of the alignment, but will also result in | |
95 more confident results. | |
96 | |
97 --[no]delta Toggles the creation of the delta file. The default | |
98 behavior is --delta, but disabling the delta file will | |
99 speed up the finishing stage by not creating alignments. | |
100 Specifying --nodelta implies --noextend. | |
101 | |
102 --depend Print the dependency information and exit. | |
103 | |
104 -d|diagfactor Set the clustering fraction of separation for diagonal | |
105 difference. The default value is .11. A higher value will | |
106 increase the tolerance of the clustering algorithm and | |
107 allow for more indels in a cluster. | |
108 | |
109 --[no]extend Toggles the outward extension of alignments from their | |
110 anchoring clusters. The default behavior is --extend, but | |
111 disabling the extensions will speed up the finishing stage | |
112 by not extending alignments. Clusters will still be fused | |
113 into alignments, but they will not be expanded outward. | |
114 | |
115 -g|maxgap Set the maximum gap between two adjacent matches in a | |
116 cluster. The default value is 30. This gap distance should | |
117 be measured in amino acids. A smaller value will result in | |
118 smaller (but more) clusters, a larger value will result in | |
119 larger (but fewer) clusters. | |
120 | |
121 -h | |
122 --help Display help information and exit. | |
123 | |
124 -l|minmatch Set the minimum length of a single match. The default value | |
125 is 6. This value should be measured in amino acids. | |
126 Reducing this value will possibly increase the sensitivity | |
127 of the alignment, but it will also allow for chance or | |
128 "noise" matches. Take note that lowering this value will | |
129 significantly increase runtime. | |
130 | |
131 -o | |
132 --coords Automatically generate the "out.coords" file using the | |
133 'show-coords' program. This file lists all the alignments | |
134 sorted by their reference coordinate in a user friendly | |
135 format, without requiring the user to run 'show-coords' | |
136 independently of promer. | |
137 | |
138 --[no]optimize Toggle alignment score optimization, i.e. if an alignment | |
139 extension reaches the end of a sequence, it will backtrack | |
140 to optimize the alignment score instead of terminating the | |
141 alignment at the end of the sequence. By turning this | |
142 option off, alignments within -b AAs of the sequence end | |
143 will be forced to extend to the end. Default behavior is | |
144 --optimize, --nooptimize will result in longer alignments | |
145 but may lead to lower alignment scores. | |
146 | |
147 -p|prefix Set the prefix of the output files. The default prefix is | |
148 "out". Take note that promer will allow the user to | |
149 overwrite existing files, so a unique prefix should be used | |
150 for each subsequent run of promer to avoid data loss. | |
151 | |
152 -V | |
153 --version Display the version information and exit | |
154 | |
155 -x|matrix Set the BLOSUM matrix number. The default | |
156 value is "2" (BLOSUM 62); other available choices include | |
157 "1" (BLOSUM 45) and "3" (BLOSUM 80). | |
158 | |
159 | |
160 -- NOTES -- | |
161 When comparing two entire genomes, it is very helpful to mask the | |
162 "uninteresting" regions of input using a utility such as "nseg" or "dust". | |
163 This will allow the program to focus solely on aligning the regions of | |
164 interest. Unrecognized codons are never matched, so almost any masking | |
165 character is appropriate; we recommend 'N' or 'X'. | |
166 Since 'promer' runs so quickly, it can be useful to run it numerous times | |
167 with different parameters to fine-tune the resulting alignment and include or | |
168 exclude missed or chance matches. It is also helpful to try the different | |
169 uniqueness switches to attain the appropriate level of detail in the resulting | |
170 output. | |
171 | |
172 | |
173 | |
174 -- OUTPUT FILES -- | |
175 | |
176 *** .delta OUTPUT *** | |
177 | |
178 This output file is a representation of the all-vs-all alignment between | |
179 the sequences contained in the multi-FASTA input files. It catalogs the | |
180 coordinates of aligned regions and the distance between insertions and deletions | |
181 contained in these alignment regions. The first two lines of the file are | |
182 identical to the .cluster output. The first line lists the two original input | |
183 files separated by a space, and the second line specifies the alignment data | |
184 type, either "NUCMER" or "PROMER". Every grouping of alignment regions has | |
185 a header, just like the cluster's header in the .cluster file. This is a FASTA | |
186 style header that lists, after a '>', the two sequences that produced the | |
187 following alignments, separated by a space, followed by the lengths of those | |
188 sequences in the same order. An example header might look like: | |
189 | |
190 >tagA1 tagB1 500 2000000 | |
191 | |
192 Following this sequence header is the alignment data. Each alignment region | |
193 has a header that describes the start and end coordinates of the alignment in | |
194 each sequence. These coordinates are inclusive and reference the forward strand | |
195 of the current sequence. Thus, if the start coordinate is greater than the end | |
196 coordinate, the alignment is on the reverse strand. The first four numbers are | |
197 the start and end in the reference sequence followed by the start and end in | |
198 the query sequence. These coordinates are ALWAYS measured in DNA bases | |
199 regardless of the alignment data type. The three numbers after the starts and | |
200 stops count the errors (non-identities), similarity errors (non-positive match | |
201 scores) and stop codons. An example header might look like: | |
202 | |
203 2631 3401 2464 3234 15 15 2 | |
204 | |
205 Notice that the start coordinate points to the first base in the first codon, | |
206 and the end coordinate points to the last base in the last codon. Therefore, | |
207 (abs(end - start) + 1) % 3 = 0, on either strand. | |
208 Each of these headers is followed by a series of signed integers, one per line, | |
209 with the final line before the next header equaling 0 (zero). Each integer | |
210 represents the distance to the next insertion in the reference (positive int) | |
211 or deletion in the reference (negative int), as measured in DNA bases OR amino | |
212 acids depending on the alignment data type. For example, with 'promer' the | |
213 delta sequence (1, -3, 4, 0) would represent an insertion at positions 1 and 7 | |
214 in the translated reference sequence and an insertion at position 3 in the | |
215 translated query sequence. | |
216 Or with letters: | |
217 | |
218 A = VBPWVPBWPVP$ | |
219 B = BPPWVPWPVP$ | |
220 Delta = (1, -3, 4, 0) | |
221 A = VBP.WVPBWPVP$ | |
222 B = .BPPWVP.WPVP$ | |
223 | |
224 Using this delta information, it is possible to re-generate the alignment | |
225 calculated by 'nucmer' or 'promer' as is done in the 'show-coords' program. This | |
226 allows various utilities to be crafted to process and analyze the alignment | |
227 data using a universal format. Below is what a .delta file might look like: | |
228 | |
229 /home/username/reference.fasta /home/username/query.fasta | |
230 PROMER | |
231 >tagA1 tagB1 3000000 2000000 | |
232 1667803 1667078 1641506 1640769 14 7 2 | |
233 -145 | |
234 -3 | |
235 -1 | |
236 -40 | |
237 0 | |
238 1667804 1667079 1641507 1640770 10 5 3 | |
239 -146 | |
240 -1 | |
241 -1 | |
242 -34 | |
243 0 | |
244 >tagA2 tagB4 4000 3000 | |
245 2631 3401 2464 3234 4 0 0 | |
246 0 | |
247 2608 3402 2456 3235 10 5 0 | |
248 7 | |
249 1 | |
250 1 | |
251 1 | |
252 1 | |
253 0 | |
254 | |
255 | |
256 | |
257 *** .cluster OUTPUT *** | |
258 | |
259 This output format is for debugging purposes and is now only available by | |
260 using the -d switch for the 'postnuc' program. | |
261 | |
262 This output file is a list of the match clusters that were generated by the | |
263 'mgaps' MUMmer3.0 program. It is primarily a 5 column list, with the exception | |
264 of the headers to be described later. Two example rows could read: | |
265 | |
266 1788 1622 59 - - | |
267 1857 1691 23 10 10 | |
268 | |
269 Where the first column is the start coordinate of the match in the reference | |
270 sequence, the second column is the start coordinate of the match in the query | |
271 sequence, the third column is the length of the match, and the two final | |
272 columns are the distance between the previous match's end and the current | |
273 match's start (the gap distance). All coordinates reference the forward strand | |
274 of each sequence, regardless of match direction, and are ALWAYS measured in | |
275 DNA bases regardless of alignment data type (DNA or amino acid). Therefore, | |
276 when running 'promer', all the numbers in the length column must be multiples | |
277 of three. | |
278 Each individual cluster is preceded by two numbers, each one of -3, -2, -1, 1, | |
279 2 or 3. These two numbers give the reading frame of the cluster, either forward | |
280 or reverse with offsets of 1, 2 or 3. A " 3 -1" would represent a match on the | |
281 forward 3rd reading frame in the reference and on the reverse 1st reading frame | |
282 in the query sequence. Take note that since the match coordinates reference the | |
283 forward DNA strand, forward matches will have ascending coordinates and reverse | |
284 matches will have descending coordinates. The reference may also be reversed in | |
285 this file, so expect the first number to sometimes be negative. | |
286 There are also 3 other types of headers. The first line of each .cluster | |
287 file lists the two original input files separated by a space. The second line | |
288 of each .cluster file lists the type of alignment data, either "NUCMER" or | |
289 "PROMER". The third type of header resembles a FASTA header, and lists the | |
290 two sequences that produced the following clusters after a '>' and their | |
291 respective lengths separated by whitespace. Note that each of these headers | |
292 is unique, so all clusters/matches between any two sequences will appear under | |
293 a single header identifying those two sequences. Below is a short example of | |
294 what a .cluster file might look like: | |
295 | |
296 /home/username/reference.fasta /home/username/query.fasta | |
297 PROMER | |
298 >tagA1 tagB1 1000 2000000 | |
299 1 3 | |
300 184 18 21 - - | |
301 223 57 123 18 18 | |
302 3 2 | |
303 168 2 30 - - | |
304 288 122 51 90 90 | |
305 354 188 84 15 15 | |
306 483 317 24 45 45 | |
307 558 392 81 51 51 | |
308 642 476 144 3 3 | |
309 >tagA2 tagB1 2000000 2000000 | |
310 -3 -2 | |
311 1665663 1641799 18 - - | |
312 1665585 1641712 21 60 69 | |
313 1665546 1641673 39 18 18 | |
314 |
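
The .delta layout described under "-- OUTPUT FILES --" is simple enough to read
with a short script. The Python sketch below is not part of the MUMmer
distribution; it is an illustration built only from the description in this
README. The names Alignment, parse_delta and apply_delta are hypothetical, and
the parser assumes that the input file paths on the first line contain no
spaces. parse_delta collects the alignment headers and their delta lists per
sequence pair, and apply_delta rebuilds a gapped alignment from two ungapped
strings and a delta list, following the rule that a positive value marks an
insertion in the reference and a negative value marks a deletion in the
reference.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Alignment:
    """One alignment header line plus its delta list (hypothetical container)."""
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int
    errors: int
    sim_errors: int
    stops: int
    deltas: List[int] = field(default_factory=list)


def parse_delta(text: str):
    """Parse .delta text into (input files, data type, alignments per sequence pair)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    files = tuple(lines[0].split())      # assumes paths without spaces
    data_type = lines[1]                 # "NUCMER" or "PROMER"
    records: Dict[Tuple[str, str], List[Alignment]] = {}
    pair = None
    current = None
    for ln in lines[2:]:
        if ln.startswith(">"):
            # FASTA-style header: >refId qryId refLen qryLen (lengths not kept here)
            ref_id, qry_id, _ref_len, _qry_len = ln[1:].split()
            pair = (ref_id, qry_id)
            records.setdefault(pair, [])
            current = None
            continue
        parts = ln.split()
        if len(parts) == 7 and pair is not None:
            # Alignment header: four coordinates followed by three counts
            current = Alignment(*map(int, parts))
            records[pair].append(current)
        elif len(parts) == 1 and current is not None:
            value = int(parts[0])
            if value != 0:               # a 0 line terminates the delta list
                current.deltas.append(value)
        else:
            raise ValueError("unexpected line in delta data: " + ln)
    return files, data_type, records


def apply_delta(ref: str, qry: str, deltas: List[int], gap: str = "."):
    """Rebuild the gapped reference and query strings from a delta list."""
    ref_out, qry_out = [], []
    i = j = 0
    for d in deltas:
        if d == 0:
            break
        run = abs(d) - 1                 # aligned positions before the next indel
        ref_out.append(ref[i:i + run])
        qry_out.append(qry[j:j + run])
        i += run
        j += run
        if d > 0:                        # insertion in the reference: gap in query
            ref_out.append(ref[i])
            qry_out.append(gap)
            i += 1
        else:                            # deletion in the reference: gap in reference
            ref_out.append(gap)
            qry_out.append(qry[j])
            j += 1
    ref_out.append(ref[i:])              # remainder aligns position by position
    qry_out.append(qry[j:])
    return "".join(ref_out), "".join(qry_out)
```

Called on the letter example from the .delta section,
apply_delta("VBPWVPBWPVP$", "BPPWVPWPVP$", [1, -3, 4, 0]) returns the two
gapped strings shown there, "VBP.WVPBWPVP$" and ".BPPWVP.WPVP$". For 'promer'
data the delta values count amino acids while the header coordinates count DNA
bases, so any code that maps gaps back onto coordinates has to convert between
the two.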