
--------------------------------------------------------------------------------
PROmer3.0:
        An extension of the MUMmer package that calculates alignments
        between two DNA multi-fasta files using all 6 translated amino acid
        reading frames.

Use Cases:
        + comparing two fairly divergent genomes that have large rearrangements
          and may only be similar on the protein level
        + comparative genome annotation, i.e. using an already annotated genome
          to help in the annotation of a newly sequenced genome
        + identifying syntenic regions between highly divergent genomes

If any of this code is used in any publication, please cite the following:

  Versatile and open software for comparing large genomes.
  S. Kurtz, A. Phillippy, A.L. Delcher,
  M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg.
  Genome Biology (2004), 5:R12.

--------------------------------------------------------------------------------

** NOTE **
This manual is outdated. Please refer to the HTML documentation included in
this distribution or at:

   http://mummer.sourceforge.net
   http://mummer.sourceforge.net/manual
   http://mummer.sourceforge.net/examples

-- DESCRIPTION --
   PROmer3.0 (PROtein MUMmer) is a suite of programs to modify and refine
the basic output of the MUMmer3.0 matching program 'mummer'. PROmer
preprocesses the DNA multi-FASTA input files and translates them in all 6
amino acid reading frames so that they can be examined by the match finding
routine. The matches are then clustered, and the matches within each
cluster are extended via Smith-Waterman techniques in order to expand
the total alignment coverage and close the gaps between clustered MUMs. The
"out.delta" file contains the final alignment data, encoded
with a style called delta encoding. Any of the 'show-*' programs can
parse this file and present its information in a human readable format.
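
The six-frame translation step described above can be illustrated with a
short Python sketch. This is not PROmer's own code: the function names, the
use of the standard genetic code, and the choice of 'X' for unrecognized
codons and '*' for stop codons are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, offset):
    """Translate dna starting at a 0-based offset; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(offset, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Map frame number (1, 2, 3, -1, -2, -3) to its translation."""
    frames = {}
    rc = reverse_complement(dna)
    for offset in range(3):
        frames[offset + 1] = translate(dna, offset)
        frames[-(offset + 1)] = translate(rc, offset)
    return frames
```

In this sketch a masked codon such as "NNN" translates to 'X', so it cannot
take part in an exact amino acid match against real sequence.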


-- PROmer3.0 EXAMPLE --
   To compare two eukaryotic genomes, genome1.fasta and genome2.fasta,
(all chromosomes vs all chromosomes) type:

"promer -o -p output genome1.fasta genome2.fasta"

Output will be...
   output.delta       // alignment data encoded with delta encoding
   output.coords      // list of alignments, % identity, etc...

To generate more output, investigate the options of any of the 'show-*'
programs. These programs can interpret the .delta output of PROmer and provide
useful information regarding the alignment. In addition, dotplots can be
generated (if you have gnuplot installed) via the 'mummerplot' script. Also,
the 'delta-filter' utility is very useful for removing chance and repeat-induced
alignments. It can significantly reduce the number of alignments in the promer
output, making it easier to interpret (see html manual for more information).
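
For scripted runs, the example command above can be wrapped in Python. The
sketch below is only a convenience wrapper: build_promer_command and
run_promer are hypothetical helper names, and the code assumes that 'promer'
is installed and on the PATH. It uses only the -o and -p options documented
in this manual.

```python
import shutil
import subprocess


def build_promer_command(reference, query, prefix="out", coords=True):
    """Build the argument list for: promer [-o] -p <prefix> <ref> <qry>."""
    cmd = ["promer"]
    if coords:
        cmd.append("-o")  # also write <prefix>.coords
    cmd += ["-p", prefix, reference, query]
    return cmd


def run_promer(reference, query, prefix="out"):
    """Run promer and return the expected output file names."""
    if shutil.which("promer") is None:
        raise FileNotFoundError("promer was not found on PATH")
    subprocess.run(build_promer_command(reference, query, prefix), check=True)
    return prefix + ".delta", prefix + ".coords"
```

Calling run_promer("genome1.fasta", "genome2.fasta", "output") would then
correspond to the command shown in the example.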


-- RUNNING 'promer' --

  USAGE: promer  [options]  <Reference>  <Query>


  MANDATORY:
    Reference       Set the input reference multi-FASTA DNA file to "Reference"
    Query           Set the input query multi-FASTA DNA file to "Query"


  OPTIONS:
    --mum           Use only maximal exact matches that are unique in both the
                    query and reference sequences as the alignment anchors.

    --mumreference  Use only maximal exact matches that are unique in the
                    reference sequences as the alignment anchors.

    --maxmatch      Use all maximal exact matches as the alignment anchors.

    -b breakLen     Set the distance an alignment extension will attempt to
                    extend poor scoring regions before giving up. The default
                    distance is 60. This distance should be measured in amino
                    acids, and it affects the tolerance to error of the
                    alignment extensions. A higher value will result in greater
                    tolerance to error in hopes of finding good alignments on
                    the other side of a poorly scoring region.

    -c|mincluster   Sets the minimum length of a cluster. The default value is
                    20. This length should be measured in amino acids, and the
                    length of a match cluster is determined by the sum of the
                    lengths of the matches within. A higher value will decrease
                    the sensitivity of the alignment, but will also result in
                    more confident results.

    --[no]delta     Toggles the creation of the delta file. The default
                    behavior is --delta, but disabling the delta file will
                    speed up the finishing stage by not creating alignments.
                    Specifying --nodelta implies --noextend.

    --depend        Print the dependency information and exit.

    -d|diagfactor   Set the clustering fraction of separation for diagonal
                    difference. The default value is .11. A higher value will
                    increase the tolerance of the clustering algorithm and
                    allow for more indels in a cluster.

    --[no]extend    Toggles the outward extension of alignments from their
                    anchoring clusters. The default behavior is --extend, but
                    disabling the extensions will speed up the finishing stage
                    by not extending alignments. Clusters will still be fused
                    into alignments, but they will not be expanded outward.

    -g|maxgap       Set the maximum gap between two adjacent matches in a
                    cluster. The default value is 30. This gap distance should
                    be measured in amino acids. A smaller value will result in
                    smaller (but more) clusters, a larger value will result in
                    larger (but fewer) clusters.

    -h
    --help          Display help information and exit.

    -l|minmatch     Set the minimum length of a single match. The default value
                    is 6. This value should be measured in amino acids.
                    Reducing this value will possibly increase the sensitivity
                    of the alignment, but it will also allow for chance or
                    "noise" matches. Take note that lowering this value will
                    significantly increase runtime.

    -o
    --coords        Automatically generate the "out.coords" file using the
                    'show-coords' program. This file lists all the alignments
                    sorted by their reference coordinate in a user friendly
                    format, without requiring the user to run 'show-coords'
                    independently of promer.

    --[no]optimize  Toggle alignment score optimization, i.e. if an alignment
                    extension reaches the end of a sequence, it will backtrack
                    to optimize the alignment score instead of terminating the
                    alignment at the end of the sequence. By turning this
                    option off, alignments within -b AAs of the sequence end
                    will be forced to extend to the end. Default behavior is
                    --optimize, --nooptimize will result in longer alignments
                    but may lead to lower alignment scores.

    -p|prefix       Set the prefix of the output files. The default prefix is
                    "out". Take note that promer will allow the user to
                    overwrite existing files, so a unique prefix should be used
                    for each subsequent run of promer to avoid data loss.

    -V
    --version       Display the version information and exit.

    -x|matrix       Set the BLOSUM matrix number. The default
                    value is "2" (BLOSUM 62), other available choices include
                    "1" (BLOSUM 45) and "3" (BLOSUM 80).


-- NOTES --
   When comparing two entire genomes, it is very helpful to mask the
"uninteresting" regions of input using a utility such as "nseg" or "dust".
This will allow the program to focus solely on aligning the regions of
interest. Unrecognized codons are never matched, so almost any masking
character is appropriate; we recommend 'N' or 'X'.
   Since 'promer' runs so quickly, it can be useful to run it numerous times
with different parameters to fine-tune the resulting alignment and include or
exclude missed or chance matches. It is also helpful to try the different
uniqueness switches to attain the appropriate level of detail in the resulting
output.
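
To get a feel for how the -g (maxgap) and -c (mincluster) values interact
when tuning a run, the following Python sketch groups matches into clusters.
It is a deliberately simplified model, not the 'mgaps' algorithm: it ignores
the -d diagonal factor, walks matches in sorted order only, and the function
name and tuple layout are assumptions. All values are in amino acids.

```python
def cluster_matches(matches, maxgap=30, mincluster=20):
    """Group (ref_start, qry_start, length) matches into clusters.

    Consecutive matches are joined when the gap between them is
    non-negative and at most maxgap in both sequences. A cluster is kept
    only if the sum of its match lengths is at least mincluster.
    """
    clusters = []
    current = []
    for match in sorted(matches):
        if current:
            prev_ref, prev_qry, prev_len = current[-1]
            ref_gap = match[0] - (prev_ref + prev_len)
            qry_gap = match[1] - (prev_qry + prev_len)
            joinable = 0 <= ref_gap <= maxgap and 0 <= qry_gap <= maxgap
            if not joinable:
                clusters.append(current)
                current = []
        current.append(match)
    if current:
        clusters.append(current)
    return [c for c in clusters
            if sum(length for _, _, length in c) >= mincluster]
```

With this model, lowering maxgap splits the matches into more, smaller
clusters, and raising mincluster discards the short ones.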


-- OUTPUT FILES --

 *** .delta OUTPUT ***

   This output file is a representation of the all-vs-all alignment between
the sequences contained in the multi-FASTA input files. It catalogs the
coordinates of aligned regions and the distance between insertions and deletions
contained in these alignment regions. The first two lines of the file are
identical to the .cluster output. The first line lists the two original input
files separated by a space, and the second line specifies the alignment data
type, either "NUCMER" or "PROMER". Every grouping of alignment regions has
a header, just like the cluster's header in the .cluster file. This is a FASTA
style header that lists, after a '>', the two sequences that produced the
following alignments, separated by a space. After the two sequence names come
the lengths of those sequences in the same order. An example header might look like:

>tagA1 tagB1 500 2000000

   Following this sequence header is the alignment data. Each alignment region
has a header that describes the start and end coordinates of the alignment in
each sequence. These coordinates are inclusive and reference the forward strand
of the current sequence. Thus, if the start coordinate is greater than the end
coordinate, the alignment is on the reverse strand. The first four numbers are
the start and end in the reference sequence, followed by the start and end in
the query sequence. These coordinates are ALWAYS measured in DNA
bases regardless of the alignment data type. The three numbers after the starts
and stops are the number of errors (non-identities), similarity errors (non-
positive match scores) and stop codons. An example header might look like:

2631 3401 2464 3234 15 15 2

Notice that the start coordinate points to the first base in the first codon,
and the end coordinate points to the last base in the last codon. Therefore
the aligned span is always a multiple of three bases, i.e.
(|end - start| + 1) % 3 = 0.
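
The header layout and the multiple-of-three property can be checked with a
few lines of Python. The field names below are invented for this sketch; the
file format itself only defines the order of the seven numbers.

```python
from typing import NamedTuple


class AlignHeader(NamedTuple):
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int
    errors: int        # non-identities
    sim_errors: int    # non-positive match scores
    stop_codons: int


def parse_align_header(line):
    """Parse a seven-number PROMER alignment header line."""
    fields = [int(tok) for tok in line.split()]
    if len(fields) != 7:
        raise ValueError("expected 7 fields, got %d" % len(fields))
    return AlignHeader(*fields)


def span(start, end):
    """Number of DNA bases covered, for either strand."""
    return abs(end - start) + 1


def is_reverse(start, end):
    """True when the coordinates describe a reverse-strand alignment."""
    return start > end


header = parse_align_header("2631 3401 2464 3234 15 15 2")
assert span(header.ref_start, header.ref_end) % 3 == 0
assert span(header.qry_start, header.qry_end) % 3 == 0
```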
   Each of these headers is followed by a string of signed integers, one per
line, with the final line before the next header equaling 0 (zero). Each integer
represents the distance to the next insertion in the reference (positive int)
or deletion in the reference (negative int), as measured in DNA bases OR amino
acids depending on the alignment data type. For example, with 'promer' the
delta sequence (1, -3, 4, 0) would represent an insertion at positions 1 and 7
in the translated reference sequence and an insertion at position 3 in the
translated query sequence.
Or with letters:

A = VBPWVPBWPVP$
B = BPPWVPWPVP$
Delta = (1, -3, 4, 0)
A = VBP.WVPBWPVP$
B = .BPPWVP.WPVP$
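
The lettered example above can be reproduced with a small Python function.
This is a minimal sketch of decoding a delta list, written for this manual;
the function name apply_delta is an assumption, and the sequences are passed
without the trailing '$' marker.

```python
def apply_delta(ref, qry, delta, gap="."):
    """Rebuild the gapped alignment of ref and qry from a delta list.

    Each non-zero delta d means: copy abs(d) - 1 aligned columns, then
    emit one indel column. A positive d puts the gap in the query
    (insertion in the reference); a negative d puts the gap in the
    reference. A zero ends the list, and the remaining characters are
    copied as aligned columns.
    """
    ref_out, qry_out = [], []
    i = j = 0
    for d in delta:
        if d == 0:
            break
        run = abs(d) - 1
        ref_out.append(ref[i:i + run])
        qry_out.append(qry[j:j + run])
        i += run
        j += run
        if d > 0:
            ref_out.append(ref[i])
            qry_out.append(gap)
            i += 1
        else:
            ref_out.append(gap)
            qry_out.append(qry[j])
            j += 1
    ref_out.append(ref[i:])
    qry_out.append(qry[j:])
    return "".join(ref_out), "".join(qry_out)


a, b = apply_delta("VBPWVPBWPVP", "BPPWVPWPVP", [1, -3, 4, 0])
print(a)  # VBP.WVPBWPVP
print(b)  # .BPPWVP.WPVP
```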

   Using this delta information, it is possible to re-generate the alignment
calculated by 'nucmer' or 'promer' as is done in the 'show-aligns' program. This
allows various utilities to be crafted to process and analyze the alignment
data using a universal format. Below is what a .delta file might look like:

/home/username/reference.fasta /home/username/query.fasta
PROMER
>tagA1 tagB1 3000000 2000000
1667803 1667078 1641506 1640769 14 7 2
-145
-3
-1
-40
0
1667804 1667079 1641507 1640770 10 5 3
-146
-1
-1
-34
0
>tagA2 tagB4 4000 3000
2631 3401 2464 3234 4 0 0
0
2608 3402 2456 3235 10 5 0
7
1
1
1
1
0
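
A file laid out like the example above can be read with a short Python
parser. This is a sketch under assumptions, not the parser used by the
'show-*' programs: the dictionary keys are invented, and input paths are
assumed to contain no spaces.

```python
def parse_delta(text):
    """Parse the text of a .delta file into nested dicts and lists."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    ref_path, qry_path = lines[0].split()
    result = {"files": (ref_path, qry_path), "type": lines[1], "pairs": []}
    pair = None    # current sequence pair (">" header)
    align = None   # current alignment, or None when a header is expected
    for ln in lines[2:]:
        if ln.startswith(">"):
            ref_id, qry_id, ref_len, qry_len = ln[1:].split()
            pair = {"ref": ref_id, "qry": qry_id,
                    "ref_len": int(ref_len), "qry_len": int(qry_len),
                    "alignments": []}
            result["pairs"].append(pair)
            align = None
        elif align is None:
            # Seven-number alignment header.
            align = {"header": [int(tok) for tok in ln.split()],
                     "deltas": []}
            pair["alignments"].append(align)
        else:
            d = int(ln)
            if d == 0:
                align = None   # a zero terminates the delta list
            else:
                align["deltas"].append(d)
    return result
```

Each entry in "deltas" omits the terminating zero, so an alignment without
indels has an empty list.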


 *** .cluster OUTPUT ***

This output format is for debugging purposes and is now only available by
using the -d switch for the 'postnuc' program.

   This output file is a list of the match clusters that were generated by the
'mgaps' MUMmer3.0 program. It is primarily a 5 column list, with the exception
of the headers to be described later. Two example rows could read:

    1788     1622     59     -      -
    1857     1691     23    10     10

   Where the first column is the start coordinate of the match in the reference
sequence, the second column is the start coordinate of the match in the query
sequence, the third column is the length of the match, and the two final
columns are the distances between the previous match's end and the current
match's start (the gap distances). All coordinates reference the forward strand
of each sequence, regardless of match direction, and are ALWAYS measured in
DNA bases regardless of alignment data type (DNA or amino acid). Therefore,
when running 'promer', all the numbers in the length column must be multiples
of three.
   Each individual cluster is preceded by two numbers, each one of
(-1, -2, -3, 1, 2, 3). These two numbers represent the reading frame of the
cluster in each sequence, either forward or reverse with offsets of 1, 2 or 3.
A " 3 -1" would represent a match on the forward 3rd reading frame in the
reference and on the reverse 1st reading frame in the query sequence. Take note
that since the match coordinates reference the forward DNA strand, forward
matches will have ascending coordinates and reverse matches will have
descending coordinates. The reference may also be reversed in this
file, so expect the first number to sometimes be negative.
   There are also 3 other types of headers. The first line of each .cluster
file lists the two original input files separated by a space. The second line
of each .cluster file lists the type of alignment data, either "NUCMER" or
"PROMER". The third type of header resembles a FASTA header, and lists the
two sequences that produced the following clusters after a '>', followed by
their respective lengths, all separated by whitespace. Note that each of these headers
is unique, so all clusters/matches between any two sequences will appear under
a single header identifying those two sequences. Below is a short example of
what a .cluster file might look like:

/home/username/reference.fasta /home/username/query.fasta
PROMER
>tagA1 tagB1 1000 2000000
 1  3
     184       18     21     -      -
     223       57    123    18     18
 3  2
     168        2     30     -      -
     288      122     51    90     90
     354      188     84    15     15
     483      317     24    45     45
     558      392     81    51     51
     642      476    144     3      3
>tagA2 tagB1 2000000 2000000
-3 -2
 1665663  1641799     18     -      -
 1665585  1641712     21    60     69
 1665546  1641673     39    18     18
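
The same kind of reader can be sketched for the .cluster layout shown
above. Again this is an illustration, not MUMmer code: the dictionary keys
are invented, frame lines are recognized by having two fields and match rows
by having five, and a '-' gap value is stored as None.

```python
def parse_cluster(text):
    """Parse the text of a .cluster file into nested dicts and lists."""
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    result = {"files": tuple(rows[0]), "type": rows[1][0], "pairs": []}
    pair = None
    cluster = None
    for tok in rows[2:]:
        if tok[0].startswith(">"):
            pair = {"ref": tok[0][1:], "qry": tok[1],
                    "ref_len": int(tok[2]), "qry_len": int(tok[3]),
                    "clusters": []}
            result["pairs"].append(pair)
        elif len(tok) == 2:
            # Reading frames of the cluster: reference, then query.
            cluster = {"frames": (int(tok[0]), int(tok[1])), "matches": []}
            pair["clusters"].append(cluster)
        elif len(tok) == 5:
            gaps = tuple(None if g == "-" else int(g) for g in tok[3:])
            start_len = (int(tok[0]), int(tok[1]), int(tok[2]))
            cluster["matches"].append(start_len + gaps)
        else:
            raise ValueError("unexpected line: " + " ".join(tok))
    return result
```

For PROMER data, every match length returned by this sketch should be a
multiple of three, which makes a convenient sanity check.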
author	jpayne
date	Tue, 18 Mar 2025 17:55:14 -0400