csp2: CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/mummer-3.23/docs/promer.README annotate

annotate CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/mummer-3.23/docs/promer.README @ 69:33d812a61356

planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d

author	jpayne
date	Tue, 18 Mar 2025 17:55:14 -0400

rev	line source
jpayne@69	1 --------------------------------------------------------------------------------
jpayne@69	2 PROmer3.0:
jpayne@69	3 An extension of the MUMmer package that calculates alignments
jpayne@69	4 between two DNA multi-fasta files using all 6 translated amino acid
jpayne@69	5 reading frames.
jpayne@69	6
jpayne@69	7 Use Cases:
jpayne@69	8 + comparing two fairly divergent genomes that have large rearrangements
jpayne@69	9 and may only be similar on the protein level
jpayne@69	10 + comparative genome annotation, i.e. using an already annotated genome
jpayne@69	11 to help in the annotation of a newly sequenced genome
jpayne@69	12 + identifying syntenic regions between highly divergent genomes
jpayne@69	13
jpayne@69	14 If any of this code is used in any publication, please cite the following:
jpayne@69	15
jpayne@69	16 Versatile and open software for comparing large genomes.
jpayne@69	17 S. Kurtz, A. Phillippy, A.L. Delcher,
jpayne@69	18 M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg.
jpayne@69	19 Genome Biology (2004), 5:R12.
jpayne@69	20
jpayne@69	21 --------------------------------------------------------------------------------
jpayne@69	22
jpayne@69	23 NOTE
jpayne@69	24 This manual is outdated; please refer to the HTML documentation included in
jpayne@69	25 this distribution or at:
jpayne@69	26
jpayne@69	27 http://mummer.sourceforge.net
jpayne@69	28 http://mummer.sourceforge.net/manual
jpayne@69	29 http://mummer.sourceforge.net/examples
jpayne@69	30
jpayne@69	31 -- DESCRIPTION --
jpayne@69	32 PROmer3.0 (PROtein MUMmer) is a suite of programs to modify and refine
jpayne@69	33 the basic output of the MUMmer3.0 matching program 'mummer'. PROmer pre-
jpayne@69	34 processes the DNA multi-FASTA input files, and translates them in all 6
jpayne@69	35 amino acid reading frames so that they can be examined by the match finding
jpayne@69	36 routine. The matches are then clustered, and the matches within
jpayne@69	37 clusters are extended via Smith-Waterman techniques in order to expand
jpayne@69	38 the total alignment coverage and close the gaps between clustered MUMs. The
jpayne@69	39 "out.delta" file contains the final alignment data, encoded
jpayne@69	40 with a style called delta encoding. Any of the 'show-*' programs are able to
jpayne@69	41 parse this file and present its information in a human readable format.
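The six-frame translation step described above can be illustrated with a short Python sketch. This is not PROmer's own code: the function names, the use of the standard genetic code, the 'X' placeholder for unrecognized codons, and the frame numbering (1..3 forward, -1..-3 on the reverse complement) are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unrecognized codons (e.g. masked 'N') become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Map an illustrative frame number (1..3, -1..-3) to its translation."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames
```

For example, six_frames("ATGGCC") yields "MA" for frame 1 and "GH" for frame -1 under these assumptions.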
jpayne@69	42
jpayne@69	43
jpayne@69	44 -- PROmer3.0 EXAMPLE --
jpayne@69	45 To compare two eukaryotic genomes, genome1.fasta and genome2.fasta,
jpayne@69	46 (all chromosomes vs all chromosomes) type:
jpayne@69	47
jpayne@69	48 "promer -o -p output genome1.fasta genome2.fasta"
jpayne@69	49
jpayne@69	50 Output will be...
jpayne@69	51 output.delta // alignment data encoded with delta encoding
jpayne@69	52 output.coords // list of alignments, % identity, etc...
jpayne@69	53
jpayne@69	54 To generate more output, investigate the options of the 'show-*' programs.
jpayne@69	55 These programs can interpret the .delta output of PROmer and provide
jpayne@69	56 useful information regarding the alignment. In addition, dotplots can be
jpayne@69	57 generated (if you have gnuplot installed) via the 'mummerplot' script. Also,
jpayne@69	58 the 'delta-filter' utility is very useful for removing chance and repeat-induced
jpayne@69	59 alignments. It can significantly reduce the number of alignments in the promer
jpayne@69	60 output, making it easier to interpret (see the HTML manual for more information).
jpayne@69	61
jpayne@69	62
jpayne@69	63 -- RUNNING 'promer' --
jpayne@69	64
jpayne@69	65 USAGE: promer [options] <Reference> <Query>
jpayne@69	66
jpayne@69	67
jpayne@69	68 MANDATORY:
jpayne@69	69 Reference Set the input reference multi-FASTA DNA file to "Reference"
jpayne@69	70 Query Set the input query multi-FASTA DNA file to "Query"
jpayne@69	71
jpayne@69	72
jpayne@69	73 OPTIONS:
jpayne@69	74 --mum Use only maximal exact matches that are unique in both the
jpayne@69	75 query and reference sequences as the alignment anchors.
jpayne@69	76
jpayne@69	77 --mumreference Use only maximal exact matches that are unique in the
jpayne@69	78 reference sequences as the alignment anchors.
jpayne@69	79
jpayne@69	80 --maxmatch Use all maximal exact matches as the alignment anchors.
jpayne@69	81
jpayne@69	82 -b breakLen Set the distance an alignment extension will attempt to
jpayne@69	83 extend poor scoring regions before giving up. The default
jpayne@69	84 distance is 60. This distance should be measured in amino
jpayne@69	85 acids, and it affects the tolerance to error of the
jpayne@69	86 alignment extensions. A higher value will result in greater
jpayne@69	87 tolerance to error in hopes of finding good alignments on
jpayne@69	88 the other side of a poorly scoring region.
jpayne@69	89
jpayne@69	90 -c|mincluster Sets the minimum length of a cluster. The default value is
jpayne@69	91 20. This length should be measured in amino acids, and the
jpayne@69	92 length of a match cluster is determined by the sum of the
jpayne@69	93 lengths of the matches within. A higher value will decrease
jpayne@69	94 the sensitivity of the alignment, but will also result in
jpayne@69	95 more confident results.
jpayne@69	96
jpayne@69	97 --[no]delta Toggles the creation of the delta file. The default
jpayne@69	98 behavior is --delta, but disabling the delta file will
jpayne@69	99 speed up the finishing stage by not creating alignments.
jpayne@69	100 This option implies --noextend.
jpayne@69	101
jpayne@69	102 --depend Print the dependency information and exit.
jpayne@69	103
jpayne@69	104 -d|diagfactor Set the clustering fraction of separation for diagonal
jpayne@69	105 difference. The default value is .11. A higher value will
jpayne@69	106 increase the tolerance of the clustering algorithm and
jpayne@69	107 allow for more indels in a cluster.
jpayne@69	108
jpayne@69	109 --[no]extend Toggles the outward extension of alignments from their
jpayne@69	110 anchoring clusters. The default behavior is --extend, but
jpayne@69	111 disabling the extensions will speed up the finishing stage
jpayne@69	112 by not extending alignments. Clusters will still be fused
jpayne@69	113 into alignments, but they will not be expanded outward.
jpayne@69	114
jpayne@69	115 -g|maxgap Set the maximum gap between two adjacent matches in a
jpayne@69	116 cluster. The default value is 30. This gap distance should
jpayne@69	117 be measured in amino acids. A smaller value will result in
jpayne@69	118 smaller (but more) clusters, a larger value will result in
jpayne@69	119 larger (but fewer) clusters.
jpayne@69	120
jpayne@69	121 -h
jpayne@69	122 --help Display help information and exit.
jpayne@69	123
jpayne@69	124 -l|minmatch Set the minimum length of a single match. The default value
jpayne@69	125 is 6. This value should be measured in amino acids.
jpayne@69	126 Reducing this value will possibly increase the sensitivity
jpayne@69	127 of the alignment, but it will also allow for chance or
jpayne@69	128 "noise" matches. Take note that lowering this value will
jpayne@69	129 significantly increase runtime.
jpayne@69	130
jpayne@69	131 -o
jpayne@69	132 --coords Automatically generate the "out.coords" file using the
jpayne@69	133 'show-coords' program. This file lists all the alignments
jpayne@69	134 sorted by their reference coordinate in a user friendly
jpayne@69	135 format, without requiring the user to run 'show-coords'
jpayne@69	136 independently of promer.
jpayne@69	137
jpayne@69	138 --[no]optimize Toggle alignment score optimization, i.e. if an alignment
jpayne@69	139 extension reaches the end of a sequence, it will backtrack
jpayne@69	140 to optimize the alignment score instead of terminating the
jpayne@69	141 alignment at the end of the sequence. By turning this
jpayne@69	142 option off, alignments within -b AAs of the sequence end
jpayne@69	143 will be forced to extend to the end. Default behavior is
jpayne@69	144 --optimize, --nooptimize will result in longer alignments
jpayne@69	145 but may lead to lower alignment scores.
jpayne@69	146
jpayne@69	147 -p|prefix Set the prefix of the output files. The default prefix is
jpayne@69	148 "out". Take note that promer will allow the user to
jpayne@69	149 overwrite existing files, so a unique prefix should be used
jpayne@69	150 for each subsequent run of promer to avoid data loss.
jpayne@69	151
jpayne@69	152 -V
jpayne@69	153 --version Display the version information and exit
jpayne@69	154
jpayne@69	155 -x|matrix Set the BLOSUM matrix number. The default
jpayne@69	156 value is "2" (BLOSUM 62), other available choices include
jpayne@69	157 "1" (BLOSUM 45) and "3" (BLOSUM 80).
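As a rough illustration of how the options above combine on the command line, the following Python sketch assembles an argument list. The helper name build_promer_command and its keyword interface are hypothetical; only the flags and default values are taken from the option list above. The sketch builds the command without running it.

```python
# Defaults as documented in the option list above.
DEFAULTS = {"b": 60, "c": 20, "d": 0.11, "g": 30, "l": 6, "x": 2}
ANCHOR_MODES = ("mum", "mumreference", "maxmatch")


def build_promer_command(reference, query, prefix="out", coords=False,
                         anchors=None, **overrides):
    """Return a promer argument list (hypothetical helper, nothing is executed)."""
    cmd = ["promer"]
    if anchors is not None:
        if anchors not in ANCHOR_MODES:
            raise ValueError("unknown anchor mode: %r" % anchors)
        cmd.append("--" + anchors)
    for flag, value in overrides.items():
        if flag not in DEFAULTS:
            raise ValueError("unknown option: -%s" % flag)
        if flag == "x" and value not in (1, 2, 3):
            raise ValueError("matrix must be 1, 2 or 3")
        if value != DEFAULTS[flag]:  # only pass values that differ from the default
            cmd += ["-" + flag, str(value)]
    if coords:
        cmd.append("-o")
    cmd += ["-p", prefix, reference, query]
    return cmd
```

The resulting list could be handed to subprocess.run once promer is on the PATH. Choosing a distinct prefix per run matters because promer overwrites existing output files.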
jpayne@69	158
jpayne@69	159
jpayne@69	160 -- NOTES --
jpayne@69	161 When comparing two entire genomes, it is very helpful to mask the
jpayne@69	162 "uninteresting" regions of input using a utility such as "nseg" or "dust".
jpayne@69	163 This will allow the program to focus solely on aligning the regions of
jpayne@69	164 interest. Unrecognized codons are never matched, so almost any masking
jpayne@69	165 character is appropriate; we recommend 'N' or 'X'.
jpayne@69	166 Since 'promer' runs so quickly, it can be useful to run it numerous times
jpayne@69	167 with different parameters to fine-tune the resulting alignment and include or
jpayne@69	168 exclude missed or chance matches. It is also helpful to try the different
jpayne@69	169 uniqueness switches to attain the appropriate level of detail in the resulting
jpayne@69	170 output.
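The masking idea above can be sketched in Python. This assumes the "uninteresting" intervals have already been found by some other tool and are given as 1-based inclusive coordinates. The function name and the interval format are assumptions for illustration, not part of MUMmer.

```python
def mask_intervals(sequence, intervals, mask_char="N"):
    """Replace each 1-based inclusive (start, end) interval with mask_char."""
    chars = list(sequence)
    for start, end in intervals:
        if start < 1 or end < start:
            raise ValueError("invalid interval: %r" % ((start, end),))
        # Clamp to the sequence length so an overlong interval is harmless.
        for i in range(start - 1, min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)
```

Because codons containing the mask character are not recognized, the masked regions cannot seed matches.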
jpayne@69	171
jpayne@69	172
jpayne@69	173
jpayne@69	174 -- OUTPUT FILES --
jpayne@69	175
jpayne@69	176 * .delta OUTPUT *
jpayne@69	177
jpayne@69	178 This output file is a representation of the all-vs-all alignment between
jpayne@69	179 the sequences contained in the multi-FASTA input files. It catalogs the
jpayne@69	180 coordinates of aligned regions and the distance between insertions and deletions
jpayne@69	181 contained in these alignment regions. The first two lines of the file are
jpayne@69	182 identical to the .cluster output. The first line lists the two original input
jpayne@69	183 files separated by a space, and the second line specifies the alignment data
jpayne@69	184 type, either "NUCMER" or "PROMER". Every grouping of alignment regions has
jpayne@69	185 a header, just like the cluster's header in the .cluster file. This is a FASTA
jpayne@69	186 style header and lists the two sequences that produced the following alignments
jpayne@69	187 after a '>' and separated by a space; following the two sequence names are the lengths
jpayne@69	188 of those sequences in the same order. An example header might look like:
jpayne@69	189
jpayne@69	190 >tagA1 tagB1 500 2000000
jpayne@69	191
jpayne@69	192 Following this sequence header is the alignment data. Each alignment region
jpayne@69	193 has a header that describes the start and end coordinates of the alignment in
jpayne@69	194 each sequence. These coordinates are inclusive and reference the forward strand
jpayne@69	195 of the current sequence. Thus, if the start coordinate is greater than the end
jpayne@69	196 coordinate, the alignment is on the reverse strand. The first four numbers are the
jpayne@69	197 start and end in the reference sequence, followed by the start and end in
jpayne@69	198 the query sequence. These coordinates are ALWAYS measured in DNA
jpayne@69	199 bases regardless of the alignment data type. The three numbers after the starts
jpayne@69	200 and stops are the number of errors (non-identities), similarity errors (non-
jpayne@69	201 positive match scores) and stop codons. An example header might look like:
jpayne@69	202
jpayne@69	203 2631 3401 2464 3234 15 15 2
jpayne@69	204
jpayne@69	205 Notice that the start coordinate points to the first base in the first codon,
jpayne@69	206 and the end coordinate points to the last base in the last codon. Therefore,
jpayne@69	207 (|end - start| + 1) % 3 = 0.
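The header layout just described can be checked with a small Python sketch. The class and function names are hypothetical; the field order follows the description above. The absolute value in span() covers reverse-strand alignments, where the start coordinate is greater than the end coordinate.

```python
from typing import NamedTuple


class AlignmentHeader(NamedTuple):
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int
    errors: int             # non-identities
    similarity_errors: int  # non-positive match scores
    stop_codons: int


def parse_alignment_header(line):
    """Parse one seven-field alignment header line from a .delta file."""
    fields = [int(x) for x in line.split()]
    if len(fields) != 7:
        raise ValueError("expected 7 fields, got %d" % len(fields))
    return AlignmentHeader(*fields)


def span(start, end):
    """Number of DNA bases covered by inclusive coordinates, on either strand."""
    return abs(end - start) + 1


def is_reverse(start, end):
    """A start coordinate greater than the end coordinate marks the reverse strand."""
    return start > end
```

With the example header above, span(2631, 3401) is 771 bases, which is 257 complete codons.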
jpayne@69	208 Each of these headers is followed by a list of signed integers, one per line,
jpayne@69	209 with the final line before the next header equaling 0 (zero). Each integer
jpayne@69	210 represents the distance to the next insertion in the reference (positive int)
jpayne@69	211 or deletion in the reference (negative int), as measured in DNA bases OR amino
jpayne@69	212 acids depending on the alignment data type. For example, with 'promer' the
jpayne@69	213 delta sequence (1, -3, 4, 0) would represent an insertion at positions 1 and 7
jpayne@69	214 in the translated reference sequence and an insertion at position 3 in the
jpayne@69	215 translated query sequence.
jpayne@69	216 Or with letters:
jpayne@69	217
jpayne@69	218 A = VBPWVPBWPVP$
jpayne@69	219 B = BPPWVPWPVP$
jpayne@69	220 Delta = (1, -3, 4, 0)
jpayne@69	221 A = VBP.WVPBWPVP$
jpayne@69	222 B = .BPPWVP.WPVP$
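The decoding rule illustrated by the letter example can be written as a minimal Python sketch. The function name apply_delta and the '.' gap character are choices made here for illustration. The sketch assumes both sequences are already in the alignment's own units (amino acids for 'promer') and oriented as aligned.

```python
def apply_delta(ref, qry, deltas, gap="."):
    """Rebuild a gapped alignment from two ungapped sequences and a delta list.

    A positive delta d means: copy d-1 aligned positions, then one reference
    character opposite a gap in the query. A negative delta means the same,
    with the gap placed in the reference. A zero ends the list.
    """
    aligned_ref, aligned_qry = [], []
    i = j = 0
    for d in deltas:
        if d == 0:
            break
        run = abs(d) - 1
        aligned_ref.extend(ref[i:i + run])
        aligned_qry.extend(qry[j:j + run])
        i += run
        j += run
        if d > 0:
            aligned_ref.append(ref[i])
            aligned_qry.append(gap)
            i += 1
        else:
            aligned_ref.append(gap)
            aligned_qry.append(qry[j])
            j += 1
    # Everything after the last indel is aligned position by position.
    aligned_ref.extend(ref[i:])
    aligned_qry.extend(qry[j:])
    return "".join(aligned_ref), "".join(aligned_qry)
```

Calling apply_delta("VBPWVPBWPVP$", "BPPWVPWPVP$", [1, -3, 4, 0]) reproduces the two gapped strings shown in the letter example.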
jpayne@69	223
jpayne@69	224 Using this delta information, it is possible to re-generate the alignment
jpayne@69	225 calculated by 'nucmer' or 'promer' as is done in the 'show-coords' program. This
jpayne@69	226 allows various utilities to be crafted to process and analyze the alignment
jpayne@69	227 data using a universal format. Below is what a .delta file might look like:
jpayne@69	228
jpayne@69	229 /home/username/reference.fasta /home/username/query.fasta
jpayne@69	230 PROMER
jpayne@69	231 >tagA1 tagB1 3000000 2000000
jpayne@69	232 1667803 1667078 1641506 1640769 14 7 2
jpayne@69	233 -145
jpayne@69	234 -3
jpayne@69	235 -1
jpayne@69	236 -40
jpayne@69	237 0
jpayne@69	238 1667804 1667079 1641507 1640770 10 5 3
jpayne@69	239 -146
jpayne@69	240 -1
jpayne@69	241 -1
jpayne@69	242 -34
jpayne@69	243 0
jpayne@69	244 >tagA2 tagB4 4000 3000
jpayne@69	245 2631 3401 2464 3234 4 0 0
jpayne@69	246 0
jpayne@69	247 2608 3402 2456 3235 10 5 0
jpayne@69	248 7
jpayne@69	249 1
jpayne@69	250 1
jpayne@69	251 1
jpayne@69	252 1
jpayne@69	253 0
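A file laid out like the example above could be read with a Python sketch along these lines. The function name and the record layout are assumptions for illustration, and the parser assumes a well-formed file whose input paths contain no spaces.

```python
def parse_delta(text):
    """Parse .delta text into (ref_file, qry_file, data_type, records).

    Each record is a dict holding the sequence pair from the '>' header,
    the seven alignment header fields, and the deltas without the final 0.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    ref_file, qry_file = lines[0].split()
    data_type = lines[1]
    records = []
    pair = None
    current = None
    for ln in lines[2:]:
        if ln.startswith(">"):
            ref_id, qry_id, ref_len, qry_len = ln[1:].split()
            pair = (ref_id, qry_id, int(ref_len), int(qry_len))
            continue
        fields = [int(x) for x in ln.split()]
        if len(fields) == 7:
            # Alignment header: four coordinates plus three error counts.
            current = {"pair": pair, "header": fields, "deltas": []}
            records.append(current)
        elif len(fields) == 1 and fields[0] != 0:
            current["deltas"].append(fields[0])
        # A lone 0 terminates the delta list, so there is nothing to store.
    return ref_file, qry_file, data_type, records
```

One consistency check that holds for every alignment in the example: the reference span minus the query span, measured in amino acids, equals the number of positive deltas minus the number of negative deltas.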
jpayne@69	254
jpayne@69	255
jpayne@69	256
jpayne@69	257 * .cluster OUTPUT *
jpayne@69	258
jpayne@69	259 This output format is for debugging purposes and is now only available by
jpayne@69	260 using the -d switch for the 'postnuc' program.
jpayne@69	261
jpayne@69	262 This output file is a list of the match clusters that were generated by the
jpayne@69	263 'mgaps' MUMmer3.0 program. It is primarily a 5 column list, with the exception
jpayne@69	264 of the headers to be described later. Two example rows could read:
jpayne@69	265
jpayne@69	266 1788 1622 59 - -
jpayne@69	267 1857 1691 23 10 10
jpayne@69	268
jpayne@69	269 Where the first column is the start coordinate of the match in the reference
jpayne@69	270 sequence, the second column is the start coordinate of the match in the query
jpayne@69	271 sequence, the third column is the length of the match, and the two final
jpayne@69	272 columns are the distance between the previous match's end and the current
jpayne@69	273 match's start (the gap distance). All coordinates reference the forward strand
jpayne@69	274 of each sequence, regardless of match direction, and are ALWAYS measured in
jpayne@69	275 DNA bases regardless of alignment data type (DNA or amino acid). Therefore,
jpayne@69	276 when running 'promer', all the numbers in the length column must be multiples
jpayne@69	277 of three.
jpayne@69	278 Each individual cluster is preceded by two digits (-1,-2,-3, 1, 2, 3). These
jpayne@69	279 two digits represent the reading frame of the cluster, either forward or
jpayne@69	280 reverse with offsets of 1,2 or 3. A " 3 -1" would represent a match on the
jpayne@69	281 forward 3rd reading frame in the reference and on the reverse 1st reading frame
jpayne@69	282 in the query sequence. Take note that since the match coordinates reference the
jpayne@69	283 forward DNA strand, forward matches will have ascending coordinates and reverse
jpayne@69	284 matches will have descending coordinates. The reference may also be reversed in this
jpayne@69	285 file, so expect the first number to sometimes be negative.
jpayne@69	286 There are also 3 other types of headers. The first line of each .cluster
jpayne@69	287 file lists the two original input files separated by a space. The second line
jpayne@69	288 of each .cluster file lists the type of alignment data, either "NUCMER" or
jpayne@69	289 "PROMER". The third type of header resembles a FASTA header, and lists the
jpayne@69	290 two sequences that produced the following clusters after a '>' and their
jpayne@69	291 respective lengths separated by whitespace. Note that each of these headers
jpayne@69	292 is unique, so all clusters/matches between any two sequences will appear under
jpayne@69	293 a single header identifying those two sequences. Below is a short example of
jpayne@69	294 what a .cluster file might look like:
jpayne@69	295
jpayne@69	296 /home/username/reference.fasta /home/username/query.fasta
jpayne@69	297 PROMER
jpayne@69	298 >tagA1 tagB1 1000 2000000
jpayne@69	299 1 3
jpayne@69	300 184 18 21 - -
jpayne@69	301 223 57 123 18 18
jpayne@69	302 3 2
jpayne@69	303 168 2 30 - -
jpayne@69	304 288 122 51 90 90
jpayne@69	305 354 188 84 15 15
jpayne@69	306 483 317 24 45 45
jpayne@69	307 558 392 81 51 51
jpayne@69	308 642 476 144 3 3
jpayne@69	309 >tagA2 tagB1 2000000 2000000
jpayne@69	310 -3 -2
jpayne@69	311 1665663 1641799 18 - -
jpayne@69	312 1665585 1641712 21 60 69
jpayne@69	313 1665546 1641673 39 18 18
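The relationship between the start, length, and gap columns in the example above can be checked with a short Python sketch. The gap rule used here (ascending coordinates for a positive frame, descending for a negative frame) is inferred from the example clusters rather than taken from the 'mgaps' source, and the function names are hypothetical.

```python
def gap(prev_start, curr_start, prev_length, frame):
    """Distance in DNA bases between the previous match's end and the current start."""
    if frame > 0:
        # Forward frame: coordinates ascend.
        return curr_start - (prev_start + prev_length)
    # Reverse frame: coordinates descend on the forward strand.
    return (prev_start - prev_length) - curr_start


def check_cluster(ref_frame, qry_frame, matches):
    """Check one cluster's gap columns and promer's multiple-of-three lengths.

    matches: list of (ref_start, qry_start, length, ref_gap, qry_gap) tuples,
    with None in both gap columns for the first match ('-' in the file).
    """
    for prev, curr in zip(matches, matches[1:]):
        expected = (
            gap(prev[0], curr[0], prev[2], ref_frame),
            gap(prev[1], curr[1], prev[2], qry_frame),
        )
        if expected != (curr[3], curr[4]):
            return False
    return all(m[2] % 3 == 0 for m in matches)
```

For the last cluster in the example (frames -3 and -2), the second row gives 1665663 - 18 - 1665585 = 60 for the reference and 1641799 - 18 - 1641712 = 69 for the query, which matches the listed gap columns.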
jpayne@69	314
