annotate CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/mummer-3.23/docs/nucmer.README @ 69:33d812a61356

planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author jpayne
date Tue, 18 Mar 2025 17:55:14 -0400
rev   line source
jpayne@69 1 --------------------------------------------------------------------------------
jpayne@69 2 NUCmer3.0:
jpayne@69 3 An extension of the MUMmer package that calculates alignments
jpayne@69 4 between two DNA multi-fasta files using the raw DNA sequence.
jpayne@69 5
jpayne@69 6 Use Cases:
jpayne@69 7 + aligning two unfinished shotgun sequencing assemblies
jpayne@69 8 + aligning an unfinished sequencing assembly to a finished genome
jpayne@69 9 + comparing two fairly similar genomes that have large rearrangements
jpayne@69 10
jpayne@69 11 If any of this code is used in any publication, please cite the following:
jpayne@69 12
jpayne@69 13 Versatile and open software for comparing large genomes.
jpayne@69 14 S. Kurtz, A. Phillippy, A.L. Delcher,
jpayne@69 15 M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg.
jpayne@69 16 Genome Biology (2004), 5:R12.
jpayne@69 17
jpayne@69 18 --------------------------------------------------------------------------------
jpayne@69 19
jpayne@69 20 ** NOTE **
jpayne@69 21 This manual is outdated; please refer to the HTML documentation included in
jpayne@69 22 this distribution or at:
jpayne@69 23
jpayne@69 24 http://mummer.sourceforge.net
jpayne@69 25 http://mummer.sourceforge.net/manual
jpayne@69 26 http://mummer.sourceforge.net/examples
jpayne@69 27
jpayne@69 28 -- DESCRIPTION --
jpayne@69 29 NUCmer3.0 (NUCleotide MUMmer) is a suite of programs to modify and refine
jpayne@69 30 the basic output of the MUMmer3.0 matching program 'mummer'. NUCmer
jpayne@69 31 preprocesses the DNA multi-FASTA input files so that they can be examined by
jpayne@69 32 the match-finding routine. The matches are then clustered, and the matches
jpayne@69 33 within clusters are extended via Smith-Waterman techniques in order to expand
jpayne@69 34 the total alignment coverage and close the gaps between clustered MUMs. The
jpayne@69 35 "out.delta" output file contains the final alignment data, encoded
jpayne@69 36 with a style called delta encoding. Any of the 'show-*' programs are able to
jpayne@69 37 parse this file and present its information in a human-readable format.
jpayne@69 38
jpayne@69 39
jpayne@69 40 -- NUCmer3.0 EXAMPLE --
jpayne@69 41 To compare a set of assembly contigs "asmbl.fasta" to an already completed,
jpayne@69 42 related genome "genome.fasta" type:
jpayne@69 43
jpayne@69 44 "nucmer -o -p output genome.fasta asmbl.fasta"
jpayne@69 45
jpayne@69 46 Output will be...
jpayne@69 47 output.delta // alignment data encoded with delta encoding
jpayne@69 48 output.coords // list of alignments, % identity, etc...
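
The command above can also be assembled from a script. The following is a
minimal Python sketch, not part of MUMmer: the helper name
build_nucmer_command is hypothetical, and it only uses the -o and -p switches
documented in this README. It builds the argument list without running
anything.

```python
import shlex


def build_nucmer_command(reference, query, prefix="out", coords=False):
    """Assemble a nucmer argument list (hypothetical helper, not part of MUMmer)."""
    cmd = ["nucmer"]
    if coords:
        cmd.append("-o")       # also write <prefix>.coords via show-coords
    cmd += ["-p", prefix]      # output prefix; "out" is the documented default
    cmd += [reference, query]  # reference file first, then query file
    return cmd


cmd = build_nucmer_command("genome.fasta", "asmbl.fasta",
                           prefix="output", coords=True)
print(shlex.join(cmd))
# To run it for real (requires nucmer on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names
contain spaces.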
jpayne@69 49
jpayne@69 50 To generate more output, investigate the options of any of the 'show-*'
jpayne@69 51 programs. These programs can interpret the .delta output of NUCmer and provide
jpayne@69 52 useful information regarding the alignment. In addition, dotplots can be
jpayne@69 53 generated (if you have gnuplot installed) via the 'mummerplot' script. Also,
jpayne@69 54 the 'delta-filter' utility is very useful for removing chance and repeat-induced
jpayne@69 55 alignments. It can significantly reduce the number of alignments in the nucmer
jpayne@69 56 output, making it easier to interpret (see the HTML manual for more information).
jpayne@69 57
jpayne@69 58
jpayne@69 59 -- RUNNING 'nucmer' --
jpayne@69 60
jpayne@69 61 USAGE: nucmer [options] <Reference> <Query>
jpayne@69 62
jpayne@69 63
jpayne@69 64 MANDATORY:
jpayne@69 65 Reference Set the input reference multi-FASTA file to "Reference"
jpayne@69 66 Query Set the input query multi-FASTA file to "Query"
jpayne@69 67
jpayne@69 68
jpayne@69 69 OPTIONS:
jpayne@69 70 --mum Use only maximal exact matches that are unique in both the
jpayne@69 71 query and reference sequences as the alignment anchors.
jpayne@69 72
jpayne@69 73 --mumreference Use only maximal exact matches that are unique in the
jpayne@69 74 reference sequences as the alignment anchors.
jpayne@69 75
jpayne@69 76 --maxmatch Use all maximal exact matches as the alignment anchors.
jpayne@69 77
jpayne@69 78 -b|breakLen Set the distance an alignment extension will attempt to
jpayne@69 79 extend poor scoring regions before giving up. The default
jpayne@69 80 distance is 200. This distance should be measured in DNA
jpayne@69 81 bases, and it affects the tolerance to error of the
jpayne@69 82 alignment extensions. A higher value will result in greater
jpayne@69 83 tolerance to error in hopes of finding good alignments on
jpayne@69 84 the other side of a poorly scoring region.
jpayne@69 85
jpayne@69 86 -c|mincluster Sets the minimum length of a cluster. The default value is
jpayne@69 87 65. The length of a match cluster is determined by the sum
jpayne@69 88 of the lengths of the matches within. A higher value will
jpayne@69 89 decrease the sensitivity of the alignment, but will also
jpayne@69 90 result in more confident results.
jpayne@69 91
jpayne@69 92 --[no]delta Toggles the creation of the delta file. The default
jpayne@69 93 behavior is --delta, but disabling the delta file will
jpayne@69 94 speed up the finishing stage by not creating alignments.
jpayne@69 95 This option implies --noextend.
jpayne@69 96
jpayne@69 97 --depend Print the dependency information and exit.
jpayne@69 98
jpayne@69 99 -d|diagfactor Set the clustering fraction of separation for diagonal
jpayne@69 100 difference. The default value is .12. A higher value will
jpayne@69 101 increase the tolerance of the clustering algorithm and
jpayne@69 102 allow for more indels in a cluster.
jpayne@69 103
jpayne@69 104 --[no]extend Toggles the outward extension of alignments from their
jpayne@69 105 anchoring clusters. The default behavior is --extend, but
jpayne@69 106 disabling the extensions will speed up the finishing stage
jpayne@69 107 by not extending alignments. Clusters will still be fused
jpayne@69 108 into alignments, but they will not be expanded outward.
jpayne@69 109
jpayne@69 110 -f
jpayne@69 111 --forward Use only the forward strand of the Query sequences. The
jpayne@69 112 default behavior is to use both the forward and reverse
jpayne@69 113 strands.
jpayne@69 114
jpayne@69 115 -g|maxgap Set the maximum gap between two adjacent matches in a
jpayne@69 116 cluster. The default value is 90. A smaller value will
jpayne@69 117 result in smaller (but more) clusters; a larger value will
jpayne@69 118 result in larger (but fewer) clusters.
jpayne@69 119
jpayne@69 120 -h
jpayne@69 121 --help Display help information and exit.
jpayne@69 122
jpayne@69 123 -l|minmatch Set the minimum length of a single match. The default value
jpayne@69 124 is 20. Reducing this value will possibly increase the
jpayne@69 125 sensitivity of the alignment, but it will also allow for
jpayne@69 126 chance or "noise" matches. Take note that lowering this
jpayne@69 127 value will significantly increase runtime.
jpayne@69 128
jpayne@69 129 -o
jpayne@69 130 --coords Automatically generate the "out.coords" file using the
jpayne@69 131 'show-coords' program. This file lists all the alignments
jpayne@69 132 sorted by their reference coordinate in a user-friendly
jpayne@69 133 format, without requiring the user to run 'show-coords'
jpayne@69 134 independently of nucmer.
jpayne@69 135
jpayne@69 136 --[no]optimize Toggle alignment score optimization, i.e. if an alignment
jpayne@69 137 extension reaches the end of a sequence, it will backtrack
jpayne@69 138 to optimize the alignment score instead of terminating the
jpayne@69 139 alignment at the end of the sequence. By turning this
jpayne@69 140 option off, alignments within -b bases of the sequence end
jpayne@69 141 will be forced to extend to the end. Default behavior is
jpayne@69 142 --optimize; --nooptimize will result in longer alignments
jpayne@69 143 but may lead to lower alignment scores.
jpayne@69 144
jpayne@69 145 -p|prefix Set the prefix of the output files. The default prefix is
jpayne@69 146 "out". Take note that nucmer will allow the user to
jpayne@69 147 overwrite existing files, so a unique prefix should be used
jpayne@69 148 for each subsequent run of nucmer to avoid data loss.
jpayne@69 149
jpayne@69 150 -r
jpayne@69 151 --reverse Use only the reverse complement of the Query sequences. The
jpayne@69 152 default behavior is to use both the forward and reverse
jpayne@69 153 strands.
jpayne@69 154
jpayne@69 155 --[no]simplify Simplify alignments by removing shadowed clusters. This
jpayne@69 156 is the default behavior; however, it can be turned off if a
jpayne@69 157 sequence is being aligned to itself in order to find inexact
jpayne@69 158 repeats.
jpayne@69 159
jpayne@69 160 -V
jpayne@69 161 --version Display the version information and exit.
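
The -c|mincluster description says that a cluster's length is the sum of the
lengths of its matches. A minimal Python sketch of that rule follows. The
function names are hypothetical, and treating the threshold as "greater than
or equal to" is an assumption; this README does not say how nucmer handles a
cluster whose length equals the threshold.

```python
def cluster_length(match_lengths):
    """Length of a cluster: the sum of the lengths of the matches within it."""
    return sum(match_lengths)


def passes_mincluster(match_lengths, mincluster=65):
    """True if the cluster is at least mincluster long (65 is the documented default).

    Assumption: a cluster exactly mincluster long is kept.
    """
    return cluster_length(match_lengths) >= mincluster


# Match lengths taken from the .cluster example at the end of this README.
print(cluster_length([22, 108, 92, 50, 35]))
print(passes_mincluster([23]))
```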
jpayne@69 162
jpayne@69 163
jpayne@69 164
jpayne@69 165 -- NOTES --
jpayne@69 166 When comparing two entire genomes, it is very helpful to mask the
jpayne@69 167 "uninteresting" regions of input using a utility such as "nseg" or "dust".
jpayne@69 168 This will allow the program to focus solely on aligning the regions of
jpayne@69 169 interest. Since only ACGT's will be matched, any other alpha character used
jpayne@69 170 to mask the sequence will not be matched.
jpayne@69 171 Since NUCmer runs so quickly, it can be useful to run it numerous times
jpayne@69 172 with different parameters to fine-tune the resulting alignment and include or
jpayne@69 173 exclude missed or chance matches. It is also helpful to try the different
jpayne@69 174 uniqueness switches to attain the appropriate level of detail in the resulting
jpayne@69 175 output.
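
The masking idea from the first note can be illustrated with a short Python
sketch. This is not how "nseg" or "dust" work; the helper mask_regions is
hypothetical, and the 1-based inclusive coordinates and the "X" mask character
are assumptions chosen for the example. It only shows that masked bases are
replaced with a non-ACGT letter so that they cannot take part in a match.

```python
def mask_regions(sequence, regions, mask_char="X"):
    """Replace each 1-based inclusive (start, end) region with mask_char."""
    bases = list(sequence)
    for start, end in regions:
        for i in range(start - 1, end):  # convert to 0-based, end-exclusive
            bases[i] = mask_char
    return "".join(bases)


masked = mask_regions("ACGTACGTACGT", [(5, 8)])
print(masked)
```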
jpayne@69 176
jpayne@69 177
jpayne@69 178
jpayne@69 179 -- OUTPUT FILES --
jpayne@69 180
jpayne@69 181 *** .delta OUTPUT ***
jpayne@69 182
jpayne@69 183 This output file is a representation of the all-vs-all alignment between
jpayne@69 184 the sequences contained in the multi-FASTA input files. It catalogs the
jpayne@69 185 coordinates of aligned regions and the distance between insertions and deletions
jpayne@69 186 contained in these alignment regions. The first two lines of the file are
jpayne@69 187 identical to the .cluster output. The first line lists the two original input
jpayne@69 188 files separated by a space, and the second line specifies the alignment data
jpayne@69 189 type, either "NUCMER" or "PROMER". Every grouping of alignment regions has
jpayne@69 190 a header, just like the cluster's header in the .cluster file. This is a
jpayne@69 191 FASTA-style header that lists the two sequences that produced the following
jpayne@69 192 alignments after a '>' and separated by a space. After the two sequences are
jpayne@69 193 the lengths of those sequences in the same order. An example header might look like:
jpayne@69 194
jpayne@69 195 >tagA1 tagB1 500 2000000
jpayne@69 196
jpayne@69 197 Following this sequence header is the alignment data. Each alignment region
jpayne@69 198 has a header that describes the start and end coordinates of the alignment in
jpayne@69 199 each sequence. These coordinates are inclusive and reference the forward strand
jpayne@69 200 of the current sequence. Thus, if the start coordinate is greater than the end
jpayne@69 201 coordinate, the alignment is on the reverse strand. The four numbers are the
jpayne@69 202 start and end in the reference sequence respectively and the start and end in
jpayne@69 203 the query sequence respectively. These coordinates are always measured in DNA
jpayne@69 204 bases regardless of the alignment data type. The three numbers after the starts
jpayne@69 205 and stops are the number of errors (non-identities), similarity errors
jpayne@69 206 (non-positive match scores) and non-alpha characters in the sequence (used to
jpayne@69 207 count stop codons in promer data). An example header might look like:
jpayne@69 208
jpayne@69 209 5198 22885 5389 23089 20 20 0
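
The seven fields of an alignment header can be read with a few lines of
Python. This is a minimal sketch, not MUMmer code: the class and function
names are hypothetical, and the field names are only labels for the columns
described above.

```python
from typing import NamedTuple


class AlignmentHeader(NamedTuple):
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int
    errors: int             # non-identities
    similarity_errors: int  # non-positive match scores
    non_alpha: int          # non-alpha characters (stop codons in promer data)


def parse_alignment_header(line):
    """Split one alignment header line into its seven integer fields."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError("expected 7 fields, got %d" % len(fields))
    return AlignmentHeader(*(int(f) for f in fields))


def is_reverse(start, end):
    """Coordinates reference the forward strand, so start > end means reverse."""
    return start > end


hdr = parse_alignment_header("5198 22885 5389 23089 20 20 0")
print(hdr.ref_start, hdr.ref_end, is_reverse(hdr.qry_start, hdr.qry_end))
```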
jpayne@69 210
jpayne@69 211 Each of these headers is followed by a list of signed integers, one per line,
jpayne@69 212 with the final line before the next header equaling 0 (zero). Each integer
jpayne@69 213 represents the distance to the next insertion in the reference (positive int)
jpayne@69 214 or deletion in the reference (negative int), as measured in DNA bases or amino
jpayne@69 215 acids depending on the alignment data type. For example, with 'nucmer' the
jpayne@69 216 delta sequence (1, -3, 4, 0) would represent an insertion at positions 1 and 7
jpayne@69 217 in the reference sequence and an insertion at position 3 in the query sequence.
jpayne@69 218 Or with letters:
jpayne@69 219
jpayne@69 220 A = acgtagctgag$
jpayne@69 221 B = cggtagtgag$
jpayne@69 222 Delta = (1, -3, 4, 0)
jpayne@69 223 A = acg.tagctgag$
jpayne@69 224 B = .cggtag.tgag$
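
The letter example above can be reproduced with a short Python sketch. This is
an illustration of the encoding as described in this section, not the code
used by the 'show-*' programs; the function name expand_delta is hypothetical,
and the trailing '$' terminators are left out of the input strings.

```python
def expand_delta(ref, qry, deltas, gap="."):
    """Rebuild a gapped alignment from two ungapped aligned regions.

    A positive delta n means: n - 1 aligned columns, then a reference base
    with no partner in the query. A negative delta -n means: n - 1 aligned
    columns, then a query base with no partner in the reference.
    """
    ref_out, qry_out = [], []
    i = j = 0  # next unread position in ref and qry
    for d in deltas:
        if d == 0:  # terminator of the delta list
            break
        n = abs(d)
        ref_out.append(ref[i:i + n - 1])
        qry_out.append(qry[j:j + n - 1])
        i += n - 1
        j += n - 1
        if d > 0:
            ref_out.append(ref[i])
            qry_out.append(gap)
            i += 1
        else:
            ref_out.append(gap)
            qry_out.append(qry[j])
            j += 1
    # Everything after the last indel aligns column for column.
    ref_out.append(ref[i:])
    qry_out.append(qry[j:])
    return "".join(ref_out), "".join(qry_out)


a, b = expand_delta("acgtagctgag", "cggtagtgag", [1, -3, 4, 0])
print(a)  # acg.tagctgag
print(b)  # .cggtag.tgag
```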
jpayne@69 225
jpayne@69 226 Using this delta information, it is possible to re-generate the alignment
jpayne@69 227 calculated by 'nucmer' or 'promer' as is done in the 'show-coords' program. This
jpayne@69 228 allows various utilities to be crafted to process and analyze the alignment
jpayne@69 229 data using a universal format. Below is what a .delta file might look like:
jpayne@69 230
jpayne@69 231 /home/username/reference.fasta /home/username/query.fasta
jpayne@69 232 NUCMER
jpayne@69 233 >tagA1 tagB1 500 2000000
jpayne@69 234 88 198 1641558 1641668 0 0 0
jpayne@69 235 0
jpayne@69 236 167 4877 1 4714 15 15 0
jpayne@69 237 2456
jpayne@69 238 1
jpayne@69 239 -11
jpayne@69 240 769
jpayne@69 241 950
jpayne@69 242 1
jpayne@69 243 1
jpayne@69 244 -142
jpayne@69 245 -1
jpayne@69 246 0
jpayne@69 247 >tagA2 tagB4 50000 30000
jpayne@69 248 5198 22885 5389 23089 18 18 0
jpayne@69 249 -6
jpayne@69 250 -32
jpayne@69 251 -1
jpayne@69 252 -1
jpayne@69 253 -1
jpayne@69 254 7
jpayne@69 255 1130
jpayne@69 256 0
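
A file laid out like the example above can be read with a small parser. The
following Python sketch is not part of MUMmer; the function name and the
dictionary keys are hypothetical. It assumes that the two input paths contain
no spaces and that alignment headers are the only lines with seven fields.

```python
def parse_delta(text):
    """Parse .delta text into (ref_path, qry_path, data_type, records)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    ref_path, qry_path = lines[0].split()  # assumes no spaces in the paths
    data_type = lines[1]                   # "NUCMER" or "PROMER"
    records = []
    for line in lines[2:]:
        fields = line.split()
        if line.startswith(">"):
            # Sequence header: >refId qryId refLen qryLen
            records.append({"ref": fields[0][1:], "qry": fields[1],
                            "ref_len": int(fields[2]),
                            "qry_len": int(fields[3]),
                            "alignments": []})
        elif len(fields) == 7:
            # Alignment header: four coordinates, then three error counts
            nums = [int(f) for f in fields]
            records[-1]["alignments"].append(
                {"coords": tuple(nums[:4]), "errors": tuple(nums[4:]),
                 "deltas": []})
        elif fields[0] != "0":
            # Delta value; the terminating 0 is not stored
            records[-1]["alignments"][-1]["deltas"].append(int(fields[0]))
    return ref_path, qry_path, data_type, records


SAMPLE = """/home/username/reference.fasta /home/username/query.fasta
NUCMER
>tagA1 tagB1 500 2000000
88 198 1641558 1641668 0 0 0
0
>tagA2 tagB4 50000 30000
5198 22885 5389 23089 18 18 0
-6
-32
7
1130
0
"""

ref_path, qry_path, data_type, records = parse_delta(SAMPLE)
print(data_type, len(records), records[1]["alignments"][0]["deltas"])
```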
jpayne@69 257
jpayne@69 258
jpayne@69 259
jpayne@69 260 *** .cluster OUTPUT ***
jpayne@69 261
jpayne@69 262 This output format is for debugging purposes and is now only available by
jpayne@69 263 using the -d switch for the 'postnuc' program.
jpayne@69 264
jpayne@69 265 This output file is a list of the match clusters that were generated by the
jpayne@69 266 'mgaps' MUMmer3.0 program. It is primarily a 5 column list, with the exception
jpayne@69 267 of the headers to be described later. Two example rows could read:
jpayne@69 268
jpayne@69 269 1788 1622 59 - -
jpayne@69 270 1857 1691 23 10 10
jpayne@69 271
jpayne@69 272 Where the first column is the start coordinate of the match in the reference
jpayne@69 273 sequence, the second column is the start coordinate of the match in the query
jpayne@69 274 sequence, the third column is the length of the match, and the two final
jpayne@69 275 columns are the distance between the previous match's end and the current
jpayne@69 276 match's start (the gap distance). All coordinates reference the forward strand
jpayne@69 277 of each sequence, regardless of match direction, and are ALWAYS measured in
jpayne@69 278 DNA bases regardless of alignment data type (DNA or amino acid).
jpayne@69 279 Each individual cluster is preceded by two numbers (1 or -1). These two
jpayne@69 280 numbers represent the direction of the cluster, either forward or reverse
jpayne@69 281 complement, in each sequence. A " 1 -1" would represent a match on the forward
jpayne@69 282 strand of the reference and the reverse strand of the query, while a " 1 1"
jpayne@69 283 would represent a forward match on each strand. Take note that since the
jpayne@69 284 match coordinates reference the forward strand, forward matches will have
jpayne@69 285 ascending coordinates and reverse matches will have descending coordinates. Also,
jpayne@69 286 since the query is the only sequence ever reverse complemented, expect the
jpayne@69 287 first number on the cluster header to always be 1.
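
The gap columns can be recomputed from the start coordinates and match
lengths. The Python sketch below is not MUMmer code: the function name is
hypothetical, and the formula is inferred from the example rows in this
README rather than taken from the 'mgaps' source.

```python
def gap_between(prev, curr, query_direction=1):
    """Return (reference gap, query gap) between two adjacent matches.

    prev and curr are (ref_start, qry_start, length) rows. query_direction
    is 1 for a forward cluster and -1 for a reverse-complement cluster, in
    which case query coordinates descend.
    """
    prev_ref, prev_qry, prev_len = prev
    ref, qry, _ = curr
    # Bases strictly between the end of prev and the start of curr.
    ref_gap = ref - (prev_ref + prev_len)
    if query_direction == 1:
        qry_gap = qry - (prev_qry + prev_len)
    else:
        qry_gap = (prev_qry - prev_len) - qry
    return ref_gap, qry_gap


print(gap_between((1788, 1622, 59), (1857, 1691, 23)))
print(gap_between((86855, 102105, 23), (86882, 102078, 77), query_direction=-1))
```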
jpayne@69 288 There are also 3 other types of headers. The first line of each .cluster
jpayne@69 289 file lists the two original input files separated by a space. The second line
jpayne@69 290 of each .cluster file lists the type of alignment data, either "NUCMER" or
jpayne@69 291 "PROMER". The third type of header resembles a FASTA header, and lists the
jpayne@69 292 two sequences that produced the following clusters after a '>' and their
jpayne@69 293 respective lengths separated by whitespace. Note that each of these headers
jpayne@69 294 is unique, so all clusters/matches between any two sequences will appear under
jpayne@69 295 a single header identifying those two sequences. Below is a short example of
jpayne@69 296 what a .cluster file might look like:
jpayne@69 297
jpayne@69 298 /home/username/reference.fasta /home/username/query.fasta
jpayne@69 299 NUCMER
jpayne@69 300 >tagA1 tagB1 1000 2000000
jpayne@69 301 1 1
jpayne@69 302 88 1641558 111 - -
jpayne@69 303 1 1
jpayne@69 304 183 17 22 - -
jpayne@69 305 238 72 108 33 33
jpayne@69 306 347 181 92 1 1
jpayne@69 307 458 292 50 19 19
jpayne@69 308 509 343 35 1 1
jpayne@69 309 >tagA2 tagB1 100000 2000000
jpayne@69 310 1 -1
jpayne@69 311 86855 102105 23 - -
jpayne@69 312 86882 102078 77 4 4
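
The layout above can be read into nested Python structures. This is a minimal
sketch, not part of MUMmer; the function name and dictionary keys are
hypothetical. It assumes that the input paths contain no spaces, that cluster
direction lines have exactly two fields, and that match rows have five.

```python
def parse_cluster(text):
    """Parse .cluster text into (ref_path, qry_path, data_type, groups)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    ref_path, qry_path = lines[0].split()  # assumes no spaces in the paths
    data_type = lines[1]                   # "NUCMER" or "PROMER"
    groups = []
    for line in lines[2:]:
        fields = line.split()
        if line.startswith(">"):
            # Sequence header: >refId qryId refLen qryLen
            groups.append({"ref": fields[0][1:], "qry": fields[1],
                           "ref_len": int(fields[2]),
                           "qry_len": int(fields[3]),
                           "clusters": []})
        elif len(fields) == 2:
            # Cluster header: direction in reference and in query
            groups[-1]["clusters"].append(
                {"directions": (int(fields[0]), int(fields[1])),
                 "matches": []})
        else:
            # Match row: ref start, query start, length (gap columns ignored)
            ref_start, qry_start, length = (int(f) for f in fields[:3])
            groups[-1]["clusters"][-1]["matches"].append(
                (ref_start, qry_start, length))
    return ref_path, qry_path, data_type, groups


SAMPLE = """/home/username/reference.fasta /home/username/query.fasta
NUCMER
>tagA1 tagB1 1000 2000000
 1  1
88 1641558 111 - -
 1  1
183 17 22 - -
238 72 108 33 33
>tagA2 tagB1 100000 2000000
 1 -1
86855 102105 23 - -
86882 102078 77 4 4
"""

_, _, data_type, groups = parse_cluster(SAMPLE)
print(data_type, len(groups), groups[1]["clusters"][0]["directions"])
```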