--------------------------------------------------------------------------------
  PROmer3.0:
    An extension of the MUMmer package that calculates alignments
    between two DNA multi-fasta files using all 6 translated amino acid
    reading frames.

  Use Cases:
    + comparing two fairly divergent genomes that have large rearrangements
      and may only be similar on the protein level
    + comparative genome annotation, i.e. using an already annotated genome
      to help in the annotation of a newly sequenced genome
    + identifying syntenic regions between highly divergent genomes

  If any of this code is used in any publication, please cite the following:

    Versatile and open software for comparing large genomes.
    S. Kurtz, A. Phillippy, A.L. Delcher,
    M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg.
    Genome Biology (2004), 5:R12.

--------------------------------------------------------------------------------

** NOTE **
This manual is outdated; please refer to the HTML documentation included in
this distribution or at:

    http://mummer.sourceforge.net
    http://mummer.sourceforge.net/manual
    http://mummer.sourceforge.net/examples

-- DESCRIPTION --
PROmer3.0 (PROtein MUMmer) is a suite of programs to modify and refine
the basic output of the MUMmer3.0 matching program 'mummer'. PROmer
preprocesses the DNA multi-FASTA input files and translates them in all 6
amino acid reading frames so that they can be examined by the match finding
routine.
The matches are then clustered, and the matches within clusters are
extended via Smith-Waterman techniques in order to expand the total
alignment coverage and close the gaps between clustered MUMs. The
"out.delta" file contains the final alignment data, encoded in a format
called delta encoding. Any of the 'show-*' programs can parse this file
and present its information in a human readable format.


-- PROmer3.0 EXAMPLE --
To compare two eukaryotic genomes, genome1.fasta and genome2.fasta,
(all chromosomes vs all chromosomes) type:

    "promer -o -p output genome1.fasta genome2.fasta"

Output will be...
    output.delta    // alignment data encoded with delta encoding
    output.coords   // list of alignments, % identity, etc...

To generate more output, investigate the options of any of the 'show-*'
programs; these programs can interpret the .delta output of PROmer and provide
useful information regarding the alignment. In addition, dotplots can be
generated (if you have gnuplot installed) via the 'mummerplot' script. Also,
the 'delta-filter' utility is very useful for removing chance and repeat-induced
alignments. It can significantly reduce the number of alignments in the promer
output, making it easier to interpret (see html manual for more information).
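
The example invocation above can also be assembled from a script. The
following is a minimal Python sketch, not part of the MUMmer distribution:
the helper name promer_command is hypothetical, and it only uses the -o and
-p switches described in this document. It builds the argument list and the
names of the files that the run is expected to produce.

```python
def promer_command(reference, query, prefix="out", coords=True):
    """Return (argv, expected_outputs) for a single promer run.

    Hypothetical helper: it only assembles the command line shown in the
    example above and does not execute anything.
    """
    argv = ["promer"]
    if coords:
        argv.append("-o")          # also request the <prefix>.coords file
    argv += ["-p", prefix, reference, query]

    outputs = [prefix + ".delta"]  # the delta file is written by default
    if coords:
        outputs.append(prefix + ".coords")
    return argv, outputs


argv, outputs = promer_command("genome1.fasta", "genome2.fasta", prefix="output")
print(" ".join(argv))
print(outputs)
```

To actually launch the run, the list could be passed to
subprocess.run(argv, check=True), assuming 'promer' is on the PATH.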


-- RUNNING 'promer' --

USAGE: promer  [options]  Reference  Query


MANDATORY:
    Reference       Set the input reference multi-FASTA DNA file to "Reference"
    Query           Set the input query multi-FASTA DNA file to "Query"


OPTIONS:
    --mum           Use only maximal exact matches that are unique in both the
                    query and reference sequences as the alignment anchors.

    --mumreference  Use only maximal exact matches that are unique in the
                    reference sequences as the alignment anchors.

    --maxmatch      Use all maximal exact matches as the alignment anchors.

    -b|breaklen     Set the distance an alignment extension will attempt to
                    extend poor scoring regions before giving up. The default
                    distance is 60. This distance should be measured in amino
                    acids, and it affects the tolerance to error of the
                    alignment extensions. A higher value will result in greater
                    tolerance to error in hopes of finding good alignments on
                    the other side of a poorly scoring region.

    -c|mincluster   Set the minimum length of a cluster. The default value is
                    20. This length should be measured in amino acids, and the
                    length of a match cluster is determined by the sum of the
                    lengths of the matches within. A higher value will decrease
                    the sensitivity of the alignment, but will also result in
                    more confident results.

    --[no]delta     Toggle the creation of the delta file. The default
                    behavior is --delta, but disabling the delta file will
                    speed up the finishing stage by not creating alignments.
                    This option implies --noextend.

    --depend        Print the dependency information and exit.

    -d|diagfactor   Set the clustering fraction of separation for diagonal
                    difference. The default value is 0.11. A higher value will
                    increase the tolerance of the clustering algorithm and
                    allow for more indels in a cluster.

    --[no]extend    Toggle the outward extension of alignments from their
                    anchoring clusters. The default behavior is --extend, but
                    disabling the extensions will speed up the finishing stage
                    by not extending alignments. Clusters will still be fused
                    into alignments, but they will not be expanded outward.

    -g|maxgap       Set the maximum gap between two adjacent matches in a
                    cluster. The default value is 30. This gap distance should
                    be measured in amino acids. A smaller value will result in
                    smaller (but more) clusters; a larger value will result in
                    larger (but fewer) clusters.

    -h
    --help          Display help information and exit.

    -l|minmatch     Set the minimum length of a single match. The default value
                    is 6. This value should be measured in amino acids.
                    Reducing this value may increase the sensitivity of the
                    alignment, but it will also allow for chance or "noise"
                    matches. Take note that lowering this value will
                    significantly increase runtime.

    -o
    --coords        Automatically generate the "out.coords" file using the
                    'show-coords' program. This file lists all the alignments
                    sorted by their reference coordinate in a user friendly
                    format, without requiring the user to run 'show-coords'
                    independently of promer.

    --[no]optimize  Toggle alignment score optimization, i.e.
                    if an alignment extension reaches the end of a sequence,
                    it will backtrack to optimize the alignment score instead
                    of terminating the alignment at the end of the sequence.
                    By turning this option off, alignments within -b AAs of
                    the sequence end will be forced to extend to the end.
                    The default behavior is --optimize; --nooptimize will
                    result in longer alignments but may lead to lower
                    alignment scores.

    -p|prefix       Set the prefix of the output files. The default prefix is
                    "out". Take note that promer will allow the user to
                    overwrite existing files, so a unique prefix should be used
                    for each subsequent run of promer to avoid data loss.

    -V
    --version       Display the version information and exit.

    -x|matrix       Set the BLOSUM matrix number. The default value is "2"
                    (BLOSUM 62); other available choices are "1" (BLOSUM 45)
                    and "3" (BLOSUM 80).


-- NOTES --
When comparing two entire genomes, it is very helpful to mask the
"uninteresting" regions of input using a utility such as "nseg" or "dust".
This will allow the program to focus solely on aligning the regions of
interest. Unrecognized codons are never matched, so almost any masking
character is appropriate; we recommend 'N' or 'X'.
Since 'promer' runs so quickly, it can be useful to run it numerous times
with different parameters to fine-tune the resulting alignment and include or
exclude missed or chance matches. It is also helpful to try the different
uniqueness switches to attain the appropriate level of detail in the resulting
output.



-- OUTPUT FILES --

*** .delta OUTPUT ***

This output file is a representation of the all-vs-all alignment between
the sequences contained in the multi-FASTA input files. It catalogs the
coordinates of aligned regions and the distance between insertions and deletions
contained in these alignment regions. The first two lines of the file are
identical to the .cluster output. The first line lists the two original input
files separated by a space, and the second line specifies the alignment data
type, either "NUCMER" or "PROMER". Every grouping of alignment regions has
a header, just like the cluster's header in the .cluster file. This is a FASTA
style header that lists, after a '>', the two sequences that produced the
following alignments, separated by a space. After the two sequence names are
the lengths of those sequences in the same order. An example header might look
like:

>tagA1 tagB1 500 2000000

Following this sequence header is the alignment data. Each alignment region
has a header that describes the start and end coordinates of the alignment in
each sequence. These coordinates are inclusive and reference the forward strand
of the current sequence. Thus, if the start coordinate is greater than the end
coordinate, the alignment is on the reverse strand. The first four numbers are
the start and end in the reference sequence followed by the start and end in
the query sequence. These coordinates are ALWAYS measured in DNA bases
regardless of the alignment data type. The three numbers after the starts
and stops are the number of errors (non-identities), similarity errors
(non-positive match scores) and stop codons.
An example header might look like:

2631 3401 2464 3234 15 15 2

Notice that the start coordinate points to the first base in the first codon,
and the end coordinate points to the last base in the last codon, so that
(end - start + 1) % 3 = 0.
Each of these headers is followed by a string of signed integers, one per line,
with the final line before the next header equaling 0 (zero). Each integer
represents the distance to the next insertion in the reference (positive int)
or deletion in the reference (negative int), as measured in DNA bases OR amino
acids depending on the alignment data type. For example, with 'promer' the
delta sequence (1, -3, 4, 0) would represent an insertion at positions 1 and 7
in the translated reference sequence and an insertion at position 3 in the
translated query sequence.
Or with letters:

A = VBPWVPBWPVP$
B = BPPWVPWPVP$
Delta = (1, -3, 4, 0)
A = VBP.WVPBWPVP$
B = .BPPWVP.WPVP$

Using this delta information, it is possible to re-generate the alignment
calculated by 'nucmer' or 'promer' as is done in the 'show-coords' program. This
allows various utilities to be crafted to process and analyze the alignment
data using a universal format.
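
As an illustration of how a delta sequence can be turned back into a gapped
alignment, here is a minimal Python sketch. It is not the code used by the
'show-*' programs; the function name apply_delta and the '.' gap character
are choices made for this example, and the two sequences are assumed to be
already translated (or plain DNA for 'nucmer' data).

```python
def apply_delta(ref, qry, deltas, gap="."):
    """Rebuild the gapped alignment of ref and qry from a delta list.

    A positive value d means: after d - 1 aligned columns, the reference
    has an extra residue, so the query receives a gap. A negative value
    means the query has the extra residue, so the reference receives a gap.
    A zero terminates the list.
    """
    ref_out, qry_out = [], []
    i = j = 0  # current positions in ref and qry
    for d in deltas:
        if d == 0:
            break
        run = abs(d) - 1  # aligned columns before the indel
        ref_out.append(ref[i:i + run])
        qry_out.append(qry[j:j + run])
        i += run
        j += run
        if d > 0:  # insertion in the reference
            ref_out.append(ref[i])
            qry_out.append(gap)
            i += 1
        else:      # deletion in the reference
            ref_out.append(gap)
            qry_out.append(qry[j])
            j += 1
    # everything after the last indel aligns column by column
    ref_out.append(ref[i:])
    qry_out.append(qry[j:])
    return "".join(ref_out), "".join(qry_out)


a, b = apply_delta("VBPWVPBWPVP$", "BPPWVPWPVP$", [1, -3, 4, 0])
print(a)  # VBP.WVPBWPVP$
print(b)  # .BPPWVP.WPVP$
```

The call at the bottom reproduces the lettered example given above.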
Below is what a .delta file might look like:

/home/username/reference.fasta /home/username/query.fasta
PROMER
>tagA1 tagB1 3000000 2000000
1667803 1667078 1641506 1640769 14 7 2
-145
-3
-1
-40
0
1667804 1667079 1641507 1640770 10 5 3
-146
-1
-1
-34
0
>tagA2 tagB4 4000 3000
2631 3401 2464 3234 4 0 0
0
2608 3402 2456 3235 10 5 0
7
1
1
1
1
0



*** .cluster OUTPUT ***

This output format is for debugging purposes and is now only available by
using the -d switch for the 'postnuc' program.

This output file is a list of the match clusters that were generated by the
'mgaps' MUMmer3.0 program. It is primarily a 5-column list, with the exception
of the headers described later. Two example rows could read:

    1788     1622   59    -    -
    1857     1691   23   10   10

Where the first column is the start coordinate of the match in the reference
sequence, the second column is the start coordinate of the match in the query
sequence, the third column is the length of the match, and the two final
columns are the distance between the previous match's end and the current
match's start (the gap distance). All coordinates reference the forward strand
of each sequence, regardless of match direction, and are ALWAYS measured in
DNA bases regardless of alignment data type (DNA or amino acid). Therefore,
when running 'promer', all the numbers in the length column must be multiples
of three.
Each individual cluster is preceded by two numbers, each one of
(-3, -2, -1, 1, 2, 3).
These two numbers represent the reading frame of the cluster, either forward
or reverse with offsets of 1, 2 or 3. A " 3 -1" would represent a match on the
forward 3rd reading frame in the reference and on the reverse 1st reading frame
in the query sequence. Take note that since the match coordinates reference the
forward DNA strand, forward matches will have ascending coordinates and reverse
matches will have descending coordinates. The reference may also be reversed in
this file, so expect the first number to sometimes be negative.
There are also 3 other types of headers. The first line of each .cluster
file lists the two original input files separated by a space. The second line
of each .cluster file lists the type of alignment data, either "NUCMER" or
"PROMER". The third type of header resembles a FASTA header, and lists the
two sequences that produced the following clusters after a '>' and their
respective lengths separated by whitespace. Note that each of these headers
is unique, so all clusters/matches between any two sequences will appear under
a single header identifying those two sequences. Below is a short example of
what a .cluster file might look like:

/home/username/reference.fasta /home/username/query.fasta
PROMER
>tagA1 tagB1 1000 2000000
 1  3
     184       18   21    -    -
     223       57  123   18   18
 3  2
     168        2   30    -    -
     288      122   51   90   90
     354      188   84   15   15
     483      317   24   45   45
     558      392   81   51   51
     642      476  144    3    3
>tagA2 tagB1 2000000 2000000
-3 -2
 1665663  1641799   18    -    -
 1665585  1641712   21   60   69
 1665546  1641673   39   18   18
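
The .cluster layout described above is simple enough to read with a few lines
of script. The following Python sketch is only an illustration under stated
assumptions: the function name parse_cluster and the dictionary keys are
invented for this example, input file paths are assumed to contain no spaces,
and a line with exactly two fields is taken to be a reading frame header.

```python
EXAMPLE = """\
/home/username/reference.fasta /home/username/query.fasta
PROMER
>tagA1 tagB1 1000 2000000
 1  3
     184       18   21    -    -
     223       57  123   18   18
>tagA2 tagB1 2000000 2000000
-3 -2
 1665663  1641799   18    -    -
 1665585  1641712   21   60   69
"""


def parse_cluster(text):
    """Parse .cluster text into (input_files, data_type, sequence_pairs)."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    input_files = tuple(rows[0])   # reference and query file paths
    data_type = rows[1][0]         # "NUCMER" or "PROMER"
    pairs = []
    for tok in rows[2:]:
        if tok[0].startswith(">"):
            # FASTA-style header: >refTag qryTag refLen qryLen
            pairs.append({"ref": tok[0][1:], "qry": tok[1],
                          "ref_len": int(tok[2]), "qry_len": int(tok[3]),
                          "clusters": []})
        elif len(tok) == 2:
            # reading frame header that opens a new cluster
            pairs[-1]["clusters"].append(
                {"frames": (int(tok[0]), int(tok[1])), "matches": []})
        else:
            # match row: refStart qryStart length refGap qryGap
            gaps = tuple(None if g == "-" else int(g) for g in tok[3:5])
            pairs[-1]["clusters"][-1]["matches"].append(
                (int(tok[0]), int(tok[1]), int(tok[2]), gaps))
    return input_files, data_type, pairs


files, data_type, pairs = parse_cluster(EXAMPLE)
for pair in pairs:
    for cluster in pair["clusters"]:
        print(pair["ref"], pair["qry"], cluster["frames"], len(cluster["matches"]))
```

With 'promer' data, every match length returned by such a parser should be a
multiple of three, which makes a convenient sanity check.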