comparison CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/mummer-3.23/README @ 69:33d812a61356

planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author jpayne
date Tue, 18 Mar 2025 17:55:14 -0400
67:0e9998148a16 69:33d812a61356
1 -=- MUMmer3.x README -=-
2
3 ** NOTE **
4 A comprehensive HTML user manual is available in the docs/web/manual
5 subdirectory or at http://mummer.sourceforge.net/manual
6
7 MUMmer is now an open source package! Please contact us if you would like
8 to contribute to the MUMmer project. For more information or the latest
9 release please visit the MUMmer homepage at http://mummer.sourceforge.net
10
11 Please refer to the INSTALL file for installation instructions. This file
12 contains brief descriptions of all executables in the base directory and
13 general information about the MUMmer package.
14
15
16
17 -- DESCRIPTION --
18 MUMmer is a system for rapidly aligning entire genomes. The current
19 version (release 3.0) can find all 20 base pair maximal exact matches between
20 two bacterial genomes of ~5 million base pairs each in 20 seconds, using 90 MB
21 of memory, on a typical 1.8 GHz Linux desktop computer. MUMmer can also align
22 incomplete genomes; it handles the 100s or 1000s of contigs from a shotgun
23 sequencing project with ease, and will align them to another set of contigs or
24 a genome, using the nucmer utility included with the system. The promer
25 utility takes this a step further by generating alignments based upon the
26 six-frame translations of both input sequences. promer permits the alignment
27 of genomes for which the proteins are similar but the DNA sequence is too
28 divergent to detect similarity. See the nucmer and promer readme files in the
29 "docs/" subdirectory for more details. MUMmer is open source, so all we ask
30 is that you cite our most recent paper in any publications that use this
31 system:
32
33 (Version 3.0 described)
34 Versatile and open software for comparing large genomes.
35 S. Kurtz, A. Phillippy, A.L. Delcher,
36 M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg.
37 Genome Biology (2004), 5:R12.
38
39 (Version 2.1 described)
40 Fast algorithms for large-scale genome alignment and comparison.
41 A.L. Delcher, A. Phillippy, J. Carlton, and S.L. Salzberg.
42 Nucleic Acids Research 30:11 (2002), 2478-2483.
43
44 (Version 1.0 described)
45 Alignment of Whole Genomes.
46 A.L. Delcher, S. Kasif,
47 R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg.
48 Nucleic Acids Research, 27:11 (1999), 2369-2376.
49
50
51 -- RUNNING MUMmer3.0 --
52 MUMmer3.0 comprises many utilities and scripts. For general
53 purposes, the scripts "run-mummer1", "run-mummer3", "nucmer", and "promer"
54 will be all that is needed. See their descriptions in the "RUNNING THE MUMmer
55 SCRIPTS" section, or refer to their individual documentation in the "docs/"
56 subdirectory. Refer to the "RUNNING THE MUMmer UTILITIES" section for a brief
57 description of all of the utilities in this directory.
58
59 Simple use case:
60 Given a file containing a single reference sequence (ref.seq) in
61 FastA format and another file containing multiple sequences in FastA
62 format (qry.seq), type the following at the command line:
63
64 './nucmer -p <prefix> ref.seq qry.seq'
65
66 To produce the following files:
67 <prefix>.delta
68
69 or
70
71 './run-mummer3.csh ref.seq qry.seq <prefix>'
72
73 To produce the following files:
74 <prefix>.out
75 <prefix>.gaps
76 <prefix>.align
77 <prefix>.errorsgaps
78
79 Please read the utility-specific documentation in the "docs/" subdirectory
80 for descriptions of these files and information on how to change the
81 alignment parameters for the scripts (minimum match length, etc.), or see
82 the notes below in the "RUNNING THE MUMmer SCRIPTS" section for a brief
83 explanation.
84
85 To see a simple gnuplot output, if you have gnuplot installed, run
86 the perl script 'mummerplot' on the output files. This script can be run
87 on mummer output (.out), or nucmer/promer output (.delta). Edit the
88 <prefix>.gp file that is created to change colors, line thicknesses, etc. or
89 explore the <prefix>.[fr]plot file to see the data collection.
90
91 './mummerplot -p <prefix> <prefix>.out'
92
93 Or you can use the web viewer for completed microbial genomes:
94 http://www.tigr.org/CMR
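
The command lines in the simple use case above can be assembled
programmatically. The following is a minimal Python sketch, not part of
MUMmer: the function names are hypothetical, and the only flags and file
suffixes used are the ones quoted in this section. It builds argument
lists and the expected output file names without running anything.

```python
from typing import List, Tuple


def nucmer_command(prefix: str, reference: str, query: str) -> Tuple[List[str], List[str]]:
    """Return (argv, expected output files) for './nucmer -p <prefix> ref qry'."""
    argv = ["./nucmer", "-p", prefix, reference, query]
    return argv, [f"{prefix}.delta"]


def run_mummer3_command(prefix: str, reference: str, query: str) -> Tuple[List[str], List[str]]:
    """Return (argv, expected output files) for './run-mummer3.csh ref qry <prefix>'."""
    argv = ["./run-mummer3.csh", reference, query, prefix]
    suffixes = ("out", "gaps", "align", "errorsgaps")
    return argv, [f"{prefix}.{suffix}" for suffix in suffixes]


def mummerplot_command(prefix: str, match_file: str) -> List[str]:
    """Return argv for './mummerplot -p <prefix> <matchfile>'."""
    return ["./mummerplot", "-p", prefix, match_file]
```

The returned lists can be handed to subprocess.run() once the MUMmer
executables are present in the working directory.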
95
96
97
98 -- RUNNING THE MUMmer SCRIPTS --
99 Because of MUMmer's modular design, it may be necessary to use a number
100 of separate programs to produce the desired output. The MUMmer scripts
101 attempt to simplify this process by wrapping various utilities into packages
102 that can perform standard alignment requests. Listed below are brief
103 descriptions and usage definitions for these scripts. Please refer to the
104 "docs/" subdirectory for a more detailed description of each script.
105
106
107 ** nucmer **
108
109 DESCRIPTION:
110 nucmer is for the all-vs-all comparison of nucleotide sequences
111 contained in multi-FastA data files. It is best used for highly
112 similar sequences that may have large rearrangements. Common use
113 cases are: comparing two unfinished shotgun sequencing assemblies,
114 mapping an unfinished sequencing assembly to a finished genome, and
115 comparing two fairly similar genomes that may have large
116 rearrangements and duplications. Please refer to "docs/nucmer.README"
117 for more information regarding this script and its output, or type
118 'nucmer -h' for a list of its options.
119
120 USAGE:
121 nucmer [options] <reference> <query>
122
123 [options] type 'nucmer -h' for a list of options.
124 <reference> specifies the multi-FastA sequence file that contains
125 the reference sequences, to be aligned with the queries.
126 <query> specifies the multi-FastA sequence file that contains
127 the query sequences, to be aligned with the references.
128
129 OUTPUT:
130 out.delta the delta encoded alignments between the reference and
131 query sequences. This file can be parsed with any of
132 the show-* programs which are described in the "RUNNING
133 THE MUMmer UTILITIES" section.
134
135 NOTES:
136 All output coordinates reference the forward strand of the involved
137 sequence, regardless of the match direction. Also, nucmer now uses
138 only matches that are unique in the reference sequence by default;
139 use the '--mum' or '--maxmatch' options to change this behavior.
140
141
142 ** promer **
143
144 DESCRIPTION:
145 promer is for the protein level, all-vs-all comparison of nucleotide
146 sequences contained in multi-FastA data files. The nucleotide input
147 files are translated in all 6 reading frames and then aligned to one
148 another via the same methods as nucmer. It is best used for highly
149 divergent sequences that may have moderate to high similarity on the
150 protein level. Common use cases are: identifying syntenic regions
151 between highly divergent genomes, comparative genome annotation, i.e.
152 using an already annotated genome to help in the annotation of a
153 newly sequenced genome, and the general comparison of two fairly
154 divergent genomes that have large rearrangements and may only be
155 similar on the protein level. Please refer to "docs/promer.README"
156 for more information regarding this script and its output, or type
157 'promer -h' for a list of its options.
158
159 USAGE:
160 promer [options] <reference> <query>
161
162 [options] type 'promer -h' for a list of options.
163 <reference> specifies the multi-FastA sequence file that contains
164 the reference sequences, to be aligned with the queries.
165 <query> specifies the multi-FastA sequence file that contains
166 the query sequences, to be aligned with the references.
167
168 OUTPUT:
169 out.delta the delta encoded alignments between the reference and
170 query sequences. This file can be parsed with any of
171 the show-* programs which are described in the "RUNNING
172 THE MUMmer UTILITIES" section.
173
174 NOTES:
175 All output coordinates reference the forward strand of the involved
176 sequence, regardless of the match direction, and are measured in
177 nucleotides with the exception of the delta integers which are
178 measured in amino acids (1 delta int = 3 nucleotides). Also, promer
179 now uses only matches that are unique in the reference sequence by
180 default; use the '--mum' or '--maxmatch' options to change this
181 behavior.
182
183
184 ** run-mummer1 **
185
186 DESCRIPTION:
187 This script is taken directly from MUMmer1.0 and is best used to
188 align two sequences in which there is high similarity and no
189 rearrangements. A common use case is aligning two finished bacterial
190 chromosomes. Please refer to "docs/run-mummer1.README" for the
191 original documentation for this script and its output.
192
193 USAGE:
194 run-mummer1 <seq1> <seq2> <tag> [-r]
195
196 <seq1> specifies the file with the first sequence in FastA format.
197 No more than one sequence is allowed.
198 <seq2> specifies the file with the second sequence in FastA format.
199 No more than one sequence is allowed.
200 <tag> specifies the prefix to be used for the output files.
201 [-r] is an optional parameter that will reverse complement the
202 second sequence.
203
204 OUTPUT:
205 out.align the out.gaps file interspersed with the alignments
206 of the gaps.
207 out.errorsgaps the out.gaps file with an extra column stating the
208 number of errors contained in each gap.
209 out.gaps an ordered (clustered) list of matches with position
210 information, and gap distances between each match.
211 out.out a list of all maximal unique matches between the two
212 input sequences ordered by their start position in the
213 second sequence.
214
215 NOTES:
216 All output coordinates reference their respective strand. This means
217 that if the -r switch is active, coordinates that reference the
218 second sequence will be relative to the reverse complement of the
219 second sequence. Please use nucmer or promer if this coordinate
220 system is confusing.
221 Eventually, this script's components will be rewritten to work
222 with the new MUMmer format standards and phased out in favor of the
223 new components and wrapping script.
224
225
226 ** run-mummer3 **
227
228 DESCRIPTION:
229 This script is the improved version of the MUMmer1.0 run-mummer1
230 script. It uses a new clustering algorithm that appropriately
231 handles multiple sequence rearrangements and inversions. Because
232 of this, it can handle more divergent sequences better than
233 run-mummer1. In addition, it allows a multi-FastA query file for
234 1-vs-many sequence comparisons. Please refer to
235 "docs/run-mummer3.README" for more detailed documentation of this
236 script and its output.
237
238 USAGE:
239 run-mummer3 <reference> <query> <prefix>
240
241 <reference> specifies the file with the reference sequence in FastA
242 format. No more than one sequence is allowed.
243 <query> specifies the multi-FastA sequence file that contains
244 the query sequences.
245 <prefix> specifies the file prefix for the output files.
246
247 OUTPUT:
248 out.align the out.gaps file interspersed with the alignments
249 of the gaps.
250 out.errorsgaps the out.gaps file with an extra column stating the
251 number of errors contained in each gap.
252 out.gaps an ordered (clustered) list of matches with position
253 information, and gap distances between each match.
254 out.out a list of all maximal unique matches between the two
255 input sequences ordered by their start position in the
256 second sequence.
257
258 NOTES:
259 All output coordinates reference their respective strand. This means
260 that for all reverse matches, the coordinates that reference the
261 query sequence will be relative to the reverse complement of the
262 query sequence. Please use nucmer or promer if this coordinate
263 system is confusing.
264
265
266 ** dnadiff **
267
268 DESCRIPTION:
269 This script is a wrapper around nucmer that builds an
270 alignment using default parameters, and runs many of nucmer's
271 helper scripts to process the output and report alignment
272 statistics, SNPs, breakpoints, etc. It is designed for
273 evaluating the sequence and structural similarity of two
274 highly similar sequence sets. E.g. comparing two different
275 assemblies of the same organism, or comparing two strains of
276 the same species. Please refer to "docs/dnadiff.README" for
277 more information regarding this script and its output, or type
278 'dnadiff -h' for a list of its options.
279
280 USAGE: dnadiff [options] <reference> <query>
281 or dnadiff [options] -d <delta file>
282
283 <reference> Set the input reference multi-FASTA filename
284 <query> Set the input query multi-FASTA filename
285 or
286 <delta file> Unfiltered .delta alignment file from nucmer
287
288 OUTPUT:
289 .report - Summary of alignments, differences and SNPs
290 .delta - Standard nucmer alignment output
291 .1delta - 1-to-1 alignment from delta-filter -1
292 .mdelta - M-to-M alignment from delta-filter -m
293 .1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
294 .mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
295 .snps - SNPs from show-snps -rlTHC .1delta
296 .rdiff - Classified ref breakpoints from show-diff -rH .mdelta
297 .qdiff - Classified qry breakpoints from show-diff -qH .mdelta
298 .unref - Unaligned reference IDs and lengths (if applicable)
299 .unqry - Unaligned query IDs and lengths (if applicable)
300
301 NOTES:
302 The report file generated by this script can be useful for
303 comparing the differences between two similar genomes or
304 assemblies. The other outputs generated by this script are in
305 unlabeled tabular format, so please refer to the utility
306 specific documentation for interpreting them. A full
307 description of the report file is given in "docs/dnadiff.README".
308
309
310 -- RUNNING THE MUMmer UTILITIES --
311 The MUMmer package consists of various utilities that can interact with
312 the 'mummer' program. 'mummer' performs all maximal and maximal unique
313 matching, and all other utilities were designed to process the input and
314 output of this program and its related scripts, in order to extract
315 additional information from the output. Listed below are the descriptions
316 and usage definitions for these utilities.
317
318
319 ** annotate **
320
321 DESCRIPTION:
322 This program reads the output of the 'gaps' program and adds alignment
323 information to it. It is part of the original MUMmer1.0 pipeline and can
324 only be used on the output of the 'gaps' program.
325
326 USAGE:
327 annotate <gapsfile> <seq2>
328
329 <gapsfile> the output of the 'gaps' program.
330 <seq2> the file containing the second sequence in the comparison.
331
332 OUTPUT:
333 stdout the 'gaps' output interspersed with the alignments of
334 the gaps between adjacent MUMs. An alignment of a
335 gap comes after the second MUM defining the gap, and
336 alignment errors are marked with a '^' character.
337 witherrors.gaps the 'gaps' output with an appended column that lists
338 the number of alignment errors for each gap.
339
340 NOTES:
341 This program will eventually be dropped in favor of the combineMUMs
342 or nucmer match extenders, but persists for the time being.
343
344
345 ** combineMUMs **
346
347 DESCRIPTION:
348 This program reads the output of the 'mgaps' program and adds alignment
349 information to it. It is part of the MUMmer3.0 pipeline and can only be
350 used on the output of the 'mgaps' program. The -D option alters this
351 behavior and only outputs the positions of difference, e.g. SNPs.
352
353 USAGE:
354 combineMUMs [options] <reference> <query> <mgapsfile>
355
356 [options] type 'combineMUMs -h' for a list of options.
357 <reference> the FastA reference file used in the comparison.
358 <query> the multi-FastA query file used in the comparison.
359 <mgapsfile> the output of the 'mgaps' program run on the match
360 list produced by 'mummer' for the reference and query
361 files.
362
363 OUTPUT:
364 stdout the 'mgaps' output interspersed with the alignments
365 of the gaps between adjacent MUMs. An alignment of a
366 gap comes after the second MUM defining the gap, and
367 alignment errors are marked with a '^' character. At
368 the end of each cluster is a summary line (keyword
369 "Region") noting the bounds of the cluster in the
370 reference and query sequences, the total number of
371 errors for the region, the length of the region and
372 the percent error of the region.
373 witherrors.gaps the 'mgaps' output with an appended column that lists
374 the number of alignment errors for each gap.
375
376
377 ** delta-filter **
378
379 DESCRIPTION:
380
381 This program filters a delta alignment file produced by either
382 nucmer or promer, leaving only the desired alignments which
383 are output to stdout in the same delta format as the
384 input. Its primary function is the LIS algorithm which
385 calculates the longest increasing subset of alignments. This
386 allows for the calculation of a global set of alignments
387 (i.e. 1-to-1 and mutually consistent order) with the -g option
388 or locally consistent with -1 or -m. Reference sequences can
389 be mapped to query sequences with -r, or queries to references
390 with -q. This allows the user to exclude chance and repeat
391 induced alignments, leaving only the "best" alignments between
392 the two data sets. Filtering can also be performed on length,
393 identity, and uniqueness.
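
To illustrate the idea of a "longest increasing subset" of alignments,
here is a minimal Python sketch. It is not the delta-filter
implementation: it assumes forward-strand alignments only, weights each
alignment by its reference length, and uses a simple quadratic dynamic
program. The Alignment type and function name are hypothetical.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int

    @property
    def length(self) -> int:
        # Inclusive length of the aligned region in the reference.
        return self.ref_end - self.ref_start + 1


def longest_increasing_subset(alignments: List[Alignment]) -> List[Alignment]:
    """Pick the heaviest chain of alignments that increases in both sequences."""
    ordered = sorted(alignments, key=lambda a: (a.ref_start, a.qry_start))
    if not ordered:
        return []
    best = [a.length for a in ordered]   # best chain weight ending at i
    prev = [-1] * len(ordered)           # back-pointer to the previous chain member
    for i, current in enumerate(ordered):
        for j in range(i):
            earlier = ordered[j]
            # A chain is consistent only if it moves forward in both sequences.
            if earlier.ref_end < current.ref_start and earlier.qry_end < current.qry_start:
                candidate = best[j] + current.length
                if candidate > best[i]:
                    best[i] = candidate
                    prev[i] = j
    i = max(range(len(ordered)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(ordered[i])
        i = prev[i]
    chain.reverse()
    return chain
```

Alignments that are left out of the chain correspond to the chance or
repeat-induced hits that a global filter would discard.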
394
395 USAGE:
396 delta-filter [options] <deltafile>
397
398 [options] type 'delta-filter -h' for a list of options.
399 <deltafile> the .delta output file from either nucmer or promer.
400
401 OUTPUT:
402 stdout The same delta alignment format as output by nucmer and promer.
403
404 NOTES:
405 For most cases the -m option is recommended; however, -1 is
406 useful for applications that require a 1-to-1 mapping, such as
407 SNP finding. Use the -q option for mapping query contigs to
408 their best reference location.
409
410
411 ** exact-tandems **
412
413 DESCRIPTION:
414 This script finds exact tandem repeats in a specified FastA sequence
415 file. It is a post-processor for 'repeat-match' and provides a simple
416 interface and output for tandem repeat detection.
417
418 USAGE:
419 exact-tandems <file> <min match>
420
421 <file> the single sequence in FastA format to search for repeats.
422 <min match> the minimum match length for the tandems.
423
424 OUTPUT:
425 stdout 4 columns, the start of the tandem repeat, the total extent
426 of the repeat region, the length of each repetitive unit, and
427 the total copies of the repetitive unit involved.
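
The four-column output described above can be read with a small parser.
This is a sketch under assumptions: columns are taken to be whitespace
separated, the copy count is read as a floating-point number, and any
line that does not parse as four numbers (such as a header) is skipped.
The TandemRepeat type and function name are hypothetical.

```python
from typing import List, NamedTuple


class TandemRepeat(NamedTuple):
    start: int          # start of the tandem repeat
    extent: int         # total extent of the repeat region
    unit_length: int    # length of each repetitive unit
    copies: float       # total copies of the repetitive unit


def parse_exact_tandems(text: str) -> List[TandemRepeat]:
    """Parse four-column exact-tandems style output, skipping other lines."""
    repeats = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue
        try:
            repeats.append(
                TandemRepeat(int(fields[0]), int(fields[1]), int(fields[2]), float(fields[3]))
            )
        except ValueError:
            # Not a data row (for example a column header); ignore it.
            continue
    return repeats
```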
428
429
430 ** gaps **
431
432 DESCRIPTION:
433 This program reads a list of unique matches between two strings and
434 outputs the longest consistent set of matches, followed by all the
435 other matches. It is part of the MUMmer1.0 pipeline, and the output of the
436 'mummer' program needs to be processed (to strip all non-match lines)
437 before it can be passed to this program.
438
439 USAGE:
440 gaps <seq1> [-r] < <matchlist>
441
442 <seq1> The first sequence file that the match list represents.
443 <matchlist> A simple list of matches and NO header lines or other
444 mumbo jumbo. The columns of the match list should be
445 start in the reference, start in the query, and length
446 of the match.
447 [-r] Simply puts the string "reverse" on the header of the
448 output so 'annotate' knows to reverse the second
449 sequence.
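
Stripping the non-match lines from 'mummer' output, as required for the
<matchlist> above, might look like the following Python sketch. It
assumes a 1-vs-1 comparison in which every match line consists of
exactly three integers (start in the reference, start in the query,
length); everything else is treated as a header and dropped. The
function names are hypothetical.

```python
from typing import List, Tuple

Match = Tuple[int, int, int]  # (start in reference, start in query, length)


def extract_match_lines(mummer_output: str) -> List[Match]:
    """Keep only lines made of exactly three integers; drop everything else."""
    matches = []
    for line in mummer_output.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        try:
            ref_start, qry_start, length = (int(field) for field in fields)
        except ValueError:
            continue  # e.g. a '>' header line that happens to have three fields
        matches.append((ref_start, qry_start, length))
    return matches


def format_match_list(matches: List[Match]) -> str:
    """Render matches as a plain three-column list with no header lines."""
    return "".join(f"{r} {q} {n}\n" for r, q, n in matches)
```

The text returned by format_match_list() is the kind of header-free list
that can be redirected into 'gaps' on standard input.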
450
451 OUTPUT:
452 stdout an ordered set of the input matches, separated by headers.
453 The first set is the longest consistent set of matches and
454 the second set is all other matches.
455
456 NOTES:
457 This program will eventually be rewritten to be interchangeable with
458 'mgaps', so that it may be plugged into the nucmer or promer
459 pipelines.
460
461
462 ** mapview **
463
464 DESCRIPTION:
465 mapview is a utility program for displaying sequence alignments as
466 provided by MUMmer, nucmer or promer. This program takes the output
467 from these alignment routines and converts it to a FIG, PDF or PS
468 file for visual analysis. It can also break the output into multiple
469 files for easier viewing and printing. Please refer to
470 "docs/mapview.README" for a more detailed description and explanation.
471
472 USAGE:
473 mapview [options] <coords file> [UTR coords] [CDS coords]
474
475 [options] type 'mapview -h' for a list of options.
476 <coords file> show-coords output file
477 [UTR coords] UTR coordinate file in GFF format
478 [CDS coords] CDS coordinate file in GFF format
479
480 OUTPUT:
481 Default output format is an xfig file; however, this can be changed to
482 a PostScript or PDF file with the -f option. See 'mapview -h' for a
483 list of available formatting options.
484
485 NOTES:
486 To produce the coords file input, 'show-coords' must be run with the
487 -r -l options. To reduce redundant matches in promer output, run
488 show-coords with the -k option. To generate output formats other than
489 xfig, the fig2dev utility must be available from the system path. For
490 very large reference genomes, FIG format may be the only option that
491 will allow the entire display to be stored in one file, as fig2dev has
492 problems if the output is too large.
493
494
495 ** mgaps **
496
497 DESCRIPTION:
498 This program reads a list of matches between a single-FastA reference
499 and a multi-FastA query file and outputs clusters of matches that lie
500 on similar diagonals and within a reasonable distance. It is part of the
501 MUMmer3.0 pipeline, and the output of 'mummer' need not be processed
502 before passing it to this program, so long as 'mummer' was run on a
503 1-vs-many or 1-vs-1 dataset.
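
The notion of matches lying "on similar diagonals and within a
reasonable distance" can be sketched as follows. This is an
illustration only and not the mgaps algorithm: it uses a greedy pass
over matches sorted by reference position, and the two thresholds
(max_diag_diff and max_gap) are hypothetical parameters chosen for the
example. See 'mgaps -h' for the real clustering options.

```python
from typing import List, Tuple

Match = Tuple[int, int, int]  # (start in reference, start in query, length)


def cluster_by_diagonal(
    matches: List[Match], max_diag_diff: int = 5, max_gap: int = 1000
) -> List[List[Match]]:
    """Greedily group matches whose diagonals and positions are close."""
    clusters: List[List[Match]] = []
    for match in sorted(matches):
        ref_start, qry_start, _ = match
        diagonal = ref_start - qry_start
        for cluster in clusters:
            last_ref, last_qry, last_len = cluster[-1]
            last_diagonal = last_ref - last_qry
            gap = ref_start - (last_ref + last_len)  # reference bases between matches
            if abs(diagonal - last_diagonal) <= max_diag_diff and 0 <= gap <= max_gap:
                cluster.append(match)
                break
        else:
            # No existing cluster is close enough; start a new one.
            clusters.append([match])
    return clusters
```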
504
505 USAGE:
506 mgaps [options] < <matchlist>
507
508 [options] type 'mgaps -h' for a list of options.
509 <matchlist> A list of matches separated by their sequence FastA tags.
510 The columns of the match list should be start in
511 reference, start in query, and length of the match.
512
513 OUTPUT:
514 stdout An ordered set of the input matches, separated by headers.
515 Individual clusters are separated by a '#' character and
516 sets of clusters from different sequences are separated by
517 the FastA header tag for the query sequence.
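
The layout described above (query header tags, '#' cluster separators,
then match lines) can be read back with a short parser. The sketch
below makes assumptions: header lines start with '>', only the first
three columns of a match line are used, and any further columns are
ignored. The function name is hypothetical.

```python
from typing import Dict, List, Tuple

Match = Tuple[int, int, int]  # (start in reference, start in query, length)


def parse_mgaps(text: str) -> Dict[str, List[List[Match]]]:
    """Group match lines into clusters, keyed by the query header tag."""
    clusters_by_query: Dict[str, List[List[Match]]] = {}
    current_query = ""
    current_cluster = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current_query = line[1:].strip()
            clusters_by_query.setdefault(current_query, [])
            current_cluster = None
        elif line.startswith("#"):
            current_cluster = None  # the next match line opens a new cluster
        else:
            fields = line.split()
            try:
                match = (int(fields[0]), int(fields[1]), int(fields[2]))
            except (IndexError, ValueError):
                continue  # not a match line
            if current_cluster is None:
                current_cluster = []
                clusters_by_query.setdefault(current_query, []).append(current_cluster)
            current_cluster.append(match)
    return clusters_by_query
```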
518
519 NOTES:
520 It is often very helpful to adjust the clustering parameters. Check
521 'mgaps -h' for the list of parameters and check the source for a
522 better idea of how each parameter affects the result. Often, it is
523 helpful to run this program a number of times with different
524 parameters until the desired result is achieved.
525
526
527 ** mummer **
528
529 DESCRIPTION:
530 This is the core program of the MUMmer package. It is the suffix-tree
531 based match finding routine, and the main part of every MUMmer script.
532 For a detailed manual describing how to use this program, please refer
533 to "docs/maxmat3man.pdf" or in LaTeX format "docs/maxmat3man.tex". By
534 default, 'mummer' now finds maximal matches regardless of their
535 uniqueness. Limiting the output to only unique matches can be specified
536 as a command line switch.
537
538 USAGE:
539 mummer [options] <reference> <query> ...
540
541 [options] type 'mummer -help' for a list of options.
542 <reference> specifies the single or multi-FastA sequence file that
543 contains the reference sequence(s), to be aligned with
544 the queries.
545 <query> specifies the multi-FastA sequence file that contains
546 the query sequences, to be aligned with the references.
547 Multiple query files are allowed, up to 32.
548
549 OUTPUT:
550 stdout a list of exact matches. Varies depending on input, refer to
551 the manual specified in the description above.
552
553 NOTES:
554 Many thanks to Stefan Kurtz for the latest mummer version. 'mummer'
555 now behaves like the old 'mummer2' program by default. The -mum switch
556 forces it to behave like 'mummer1', the -mumreference switch forces it
557 to behave like 'mummer2' while the -maxmatch switch forces it to behave
558 like the old 'max-match' program.
559
560
561 ** mummerplot **
562
563 DESCRIPTION:
564 mummerplot is a perl script that generates gnuplot scripts and data
565 collections for plotting with the gnuplot utility. It can generate
566 2-d dotplots and 1-d coverage plots for the output of mummer, nucmer,
567 promer or show-tiling. It can also color dotplots with an identity
568 color gradient.
569
570 USAGE:
571 mummerplot [options] <matchfile>
572
573 [options] type 'mummerplot -h' for a list of options.
574 <matchfile> the output of 'mummer', 'nucmer', 'promer', or
575 'show-tiling'. 'mummerplot' will automatically determine
576 the format of the data it was given and produce the plot
577 accordingly.
578
579 OUTPUT:
580 out.gp The gnuplot script; type 'gnuplot out.gp' to evaluate the
581 gnuplot script.
582 out.fplot
583 out.rplot
584 out.hplot The forward, reverse and highlighted match information for
585 plotting with gnuplot.
586
587 out.ps
588 out.png The plotted image file, postscript or png depending on the
589 selected terminal type.
590
591 NOTES:
592 For alignments with multiple reference or query sequences, be sure to
593 use the -r -q or -R -Q options to avoid overlaying multiple plots in
594 the same space. For better looking color gradient plots, try the
595 postscript terminal and avoid the png terminal.
596
597
598 ** nucmer2xfig **
599
600 DESCRIPTION:
601 Script for plotting nucmer hits against a reference sequence. See top
602 of script for more information, or see if 'mummerplot' or 'mapview'
603 has the functionality required as they are properly maintained.
604
605
606 ** repeat-match **
607
608 DESCRIPTION:
609 Finds exact repeats within a single sequence.
610
611 USAGE:
612 repeat-match [options] <seq>
613
614 [options] type 'repeat-match -h' for a list of options.
615 <seq> the single sequence in FastA format to search for repeats.
616
617 OUTPUT:
618 stdout 3 columns, the start of the first copy of the repeat, the
619 start of the second copy of the repeat, and the length of the
620 repeat respectively.
621
622 NOTES:
623 REPuter (freely available for universities) may be better suited for
624 most repeat matching, but 'repeat-match' is open-source and has some
625 functionality that REPuter does not so we include it along with the
626 MUMmer package.
627
628
629 ** show-aligns **
630
631 DESCRIPTION:
632 This program parses the delta alignment output of nucmer and promer
633 and displays all of the pairwise alignments from the two sequences
634 specified on the command line.
635
636 USAGE:
637 show-aligns [options] <deltafile> <IdR> <IdQ>
638
639 [options] type 'show-aligns -h' for a list of options.
640 <deltafile> the .delta output file from either nucmer or promer.
641 <IdR> the FastA header tag of the desired reference sequence.
642 <IdQ> the FastA header tag of the desired query sequence.
643
644 OUTPUT:
645 stdout each alignment header and footer describes the frame of the
646 alignment in each sequence, and the start and finish
647 (inclusive) of the alignment in each sequence. At the
648 beginning of each line of aligned sequence are two numbers, the
649 top is the coordinate of the first reference base on that line
650 and the bottom is the coordinate of the first query base on
651 that line. ALL coordinates reference the forward strand of the
652 DNA sequence, even if it is a protein alignment. A gap caused
653 by an insertion or deletion is filled with a '.' character.
654 Errors in a DNA alignment are marked with a '^' below the
655 error. Errors in an amino acid alignment are marked with a
656 whitespace in the middle consensus line, while matches are
657 marked with the consensus base and similarities are marked with
658 a '+' in the consensus line.
659
660
661 ** show-coords **
662
663 DESCRIPTION:
664 This program parses the delta alignment output of nucmer and promer
665 and displays the coordinates, and other useful information about the
666 alignments.
667
668 USAGE:
669 show-coords [options] <deltafile>
670
671 [options] type 'show-coords -h' for a list of options.
672 <deltafile> the .delta output file from either nucmer or promer.
673
674 OUTPUT:
675 stdout run 'show-coords' without the -H option to see the column
676 header tags. Here is a description of each tag. Note that
677 some of the below tags do not apply to nucmer data, and that
678 all coordinates are inclusive and relative to the forward DNA
679 strand.
680
681 [S1] Start of the alignment region in the reference sequence.
682
683 [E1] End of the alignment region in the reference sequence.
684
685 [S2] Start of the alignment region in the query sequence.
686
687 [E2] End of the alignment region in the query sequence.
688
689 [LEN 1] Length of the alignment region in the reference sequence,
690 measured in nucleotides.
691
692 [LEN 2] Length of the alignment region in the query sequence, measured
693 in nucleotides.
694
695 [% IDY] Percent identity of the alignment, calculated as the
696 (number of exact matches) / ([LEN 1] + insertions in the query).
697
698 [% SIM] Percent similarity of the alignment, calculated like the above
699 value, but counting positive BLOSUM matrix scores instead of exact
700 matches.
701
702 [% STP] Percent of stop codons of the alignment, calculated as
703 (number of stop codons) / (([LEN 1] + insertions in the query) * 2).
704
705 [LEN R] Length of the reference sequence.
706
707 [LEN Q] Length of the query sequence.
708
709 [COV R] Percent coverage of the alignment on the reference sequence,
710 calculated as [LEN 1] / [LEN R].
711
712 [COV Q] Percent coverage of the alignment on the query sequence,
713 calculated as [LEN 2] / [LEN Q].
714
715 [FRM] Reading frame for the reference sequence and the reading frame
716 for the query sequence respectively. This is one of the columns
717 absent from the nucmer data, however, match direction can easily be
718 determined by the start and end coordinates.
719
720 [TAGS] The reference FastA ID and the query FastA ID.
721
722 There is also an optional final column (turned on with the -w
723 or -o option) that will contain some 'annotations'. The -o option will
724 annotate alignments that represent overlaps between two sequences,
725 while the -w option is antiquated and should no longer be used.
726 Sometimes, nucmer or promer will extend adjacent clusters past one
727 another, thus causing a somewhat redundant output; this option will
728 notify users of such rare occurrences.
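
The column formulas above can be written out as small functions. This
Python sketch follows the definitions given for [% IDY], [% STP],
[COV R] and [COV Q], and the remark under [FRM] about reading match
direction from the coordinates. Multiplying by 100 to obtain a percent
is an assumption, as are the function names.

```python
def region_length(start: int, end: int) -> int:
    """Inclusive length of a region; works for forward and reverse coordinates."""
    return abs(end - start) + 1


def percent_identity(exact_matches: int, len1: int, query_insertions: int) -> float:
    """[% IDY]: exact matches / ([LEN 1] + insertions in the query), as a percent."""
    return 100.0 * exact_matches / (len1 + query_insertions)


def percent_stop(stop_codons: int, len1: int, query_insertions: int) -> float:
    """[% STP]: stop codons / (([LEN 1] + insertions in the query) * 2), as a percent."""
    return 100.0 * stop_codons / ((len1 + query_insertions) * 2)


def percent_coverage(aligned_length: int, sequence_length: int) -> float:
    """[COV R] or [COV Q]: aligned length / full sequence length, as a percent."""
    return 100.0 * aligned_length / sequence_length


def is_reverse_match(s1: int, e1: int, s2: int, e2: int) -> bool:
    """A match is reverse when the two regions run in opposite directions."""
    return (e1 >= s1) != (e2 >= s2)
```

For example, an alignment with [S2] greater than [E2] and [S1] less than
[E1] is a reverse-strand match in the query.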
729
730 NOTES:
731 The -c and -l options are useful when comparing two sets of assembly
732 contigs, in that these options help determine if an alignment spans an
733 entire contig, or is just a partial hit to a different read. The -b
734 option is useful when the user wishes to identify syntenic regions
735 between two genomes, but is not particularly interested in the actual
736 alignment similarity or appearance. This option also disregards match
737 orientation, so should not be used if this information is needed.
738
739
740 ** show-diff **
741
742 DESCRIPTION:
743 This program classifies alignment breakpoints for the
744 quantification of macroscopic differences between two
745 genomes. It takes a standard, unfiltered delta file as input,
746 determines the best mapping between the two sequence sets, and
747 reports on the breaks in that mapping.
748
749 USAGE:
750 show-diff [options] <deltafile>
751
752 [options] type 'show-diff -h' for a list of options.
753 <deltafile> the .delta output file from nucmer
754
755 OUTPUT:
756 stdout Classified breakpoints are output one per line with
757 the following types and column definitions. The first
758 five columns of every row are seq ID, feature type,
759 feature start, feature end, and feature length.
760
761 Feature Columns
762
763 IDR GAP gap-start gap-end gap-length-R gap-length-Q gap-diff
764 IDR DUP dup-start dup-end dup-length
765 IDR BRK gap-start gap-end gap-length
766 IDR JMP gap-start gap-end gap-length
767 IDR INV gap-start gap-end gap-length
768 IDR SEQ gap-start gap-end gap-length prev-sequence next-sequence
769
770 Feature Types
771
772 [GAP] A gap between two mutually consistent ordered and
773 oriented alignments. gap-length-R is the length of the
774 alignment gap in the reference, gap-length-Q is the length of
775 the alignment gap in the query, and gap-diff is the difference
776 between the two gap lengths. If gap-diff is positive, sequence
777 has been inserted in the reference. If gap-diff is negative,
778 sequence has been deleted from the reference. If both
779 gap-length-R and gap-length-Q are negative, the indel is a
780 tandem duplication copy difference.
781
782 [DUP] A duplicated sequence in the reference that occurs more
783 times in the reference than in the query. The coordinate
784 columns specify the bounds and length of the
785 duplication. These features are often bookended by BRK
786 features if there is unique sequence bounding the duplication.
787
788 [BRK] An insertion in the reference of unknown origin, that
789 indicates no query sequence aligns to the sequence bounded by
790 gap-start and gap-end. Often found around DUP elements or at
791 the beginning or end of sequences.
792
793 [JMP] A relocation event, where the consistent ordering of
794 alignments is disrupted. The coordinate columns specify the
795 breakpoints of the relocation in the reference, and the
796 gap-length between them. A negative gap-length indicates the
797 relocation occurred around a repetitive sequence, and a
798 positive length indicates unique sequence between the
799 alignments.
800
801 [INV] The same as a relocation event, except that both the
802 ordering and orientation of the alignments are disrupted. Note
803 that for JMP and INV, generally two features will be output,
804 one for the beginning of the inverted region, and another for
805 the end of the inverted region.
806
807 [SEQ] A translocation event that requires jumping to a new
808 query sequence in order to continue aligning to the
809 reference. If each input sequence is a chromosome, these
810 features correspond to inter-chromosomal translocations.
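The column layout and the [GAP] sign rules above can be illustrated with a
short parsing sketch. It is not part of MUMmer. It assumes the show-diff
output is whitespace-delimited with the columns in the order listed under
"Feature Columns"; the names Feature, parse_diff_line and describe_gap are
hypothetical. Checking the "both lengths negative" case first and the label
for a zero gap-diff are choices made here, not something the text specifies.

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    seq_id: str
    kind: str      # GAP, DUP, BRK, JMP, INV or SEQ
    start: int
    end: int
    length: int    # fifth column (gap-length-R for GAP features)
    extra: tuple   # any remaining columns, kept as strings


def parse_diff_line(line: str) -> Optional[Feature]:
    """Parse one show-diff row; return None for rows that do not fit."""
    fields = line.split()
    if len(fields) < 5:
        return None
    try:
        start, end, length = (int(x) for x in fields[2:5])
    except ValueError:
        return None
    return Feature(fields[0], fields[1], start, end, length, tuple(fields[5:]))


def describe_gap(feature: Feature) -> str:
    """Apply the sign rules from the [GAP] description to a GAP feature."""
    gap_r = feature.length
    gap_q = int(feature.extra[0])
    gap_diff = int(feature.extra[1])
    if gap_r < 0 and gap_q < 0:
        return "tandem duplication copy difference"
    if gap_diff > 0:
        return "insertion in reference"
    if gap_diff < 0:
        return "deletion from reference"
    return "no length difference"  # case not covered by the text
```

A GAP row can be passed through parse_diff_line and then describe_gap; other
feature types only need the first five fields.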
811
812 NOTES:
813 The estimated number of features of a given type (inversions, for
814 example) represents the number of breakpoints classified as bordering
815 an inversion. Therefore, since there will be a breakpoint at
816 both the beginning and the end of an inversion, the feature
817 counts are roughly double the number of inversion events. In
818 addition, all counts are estimates and do not represent the
819 exact number of each evolutionary event.
820
821 Summing the fifth column (ignoring negative values) yields an
822 estimate of the total inserted sequence in the
823 reference. Summing the fifth column after removing DUP
824 features yields an estimate of the total amount of unique
825 (unaligned) sequence in the reference. Note that unaligned
826 sequences are not counted, and could represent additional
827 "unique" sequences. Use the 'dnadiff' script if you must
828 recover this information. Finally, the -q option switches
829 references for queries, and uses the query coordinates for the
830 analysis.
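The two column sums described in these notes can be sketched as follows. This
is an illustration only. It assumes whitespace-delimited show-diff rows, and
it assumes that negative values are ignored in both sums; the function name
is hypothetical.

```python
def inserted_sequence_estimates(lines):
    """Return (total_inserted, unique) estimates from show-diff rows.

    total_inserted: sum of the positive fifth-column values.
    unique: the same sum with DUP features left out.
    """
    total_inserted = 0
    unique = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed rows
        try:
            length = int(fields[4])
        except ValueError:
            continue  # skip rows whose fifth column is not a number
        if length <= 0:
            continue  # negative values are ignored
        total_inserted += length
        if fields[1] != "DUP":
            unique += length
    return total_inserted, unique
```

The function accepts any iterable of lines, so an open file holding the
show-diff output can be passed in directly.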
831
832
833 ** show-snps **
834
835 DESCRIPTION:
836 This program reports the polymorphisms contained in a delta-encoded
837 alignment file output by either nucmer or promer. It catalogs
838 all of the single nucleotide polymorphisms (SNPs) and
839 insertions/deletions within the delta file
840 alignments. Polymorphisms are reported one per line, in a
841 delimited fashion similar to show-coords. Pairing this program
842 with the appropriate MUMmer tools can create an easy-to-use
843 SNP pipeline for the rapid identification of putative SNPs
844 between any two sequence sets.
845
846 USAGE:
847 show-snps [options] <deltafile>
848
849 [options] type 'show-snps -h' for a list of options.
850 <deltafile> the .delta output file from either nucmer or promer.
851
852 OUTPUT:
853 stdout Standard output has column headers with the following
854 meanings. Not all columns will be output by default;
855 see 'show-snps -h' for the switches that control the output.
856
857 [P1] SNP position in the reference.
858
859 [SUB] Character in the reference.
860
861 [SUB] Character in the query.
862
863 [P2] SNP position in the query.
864
865 [BUFF] Distance from this SNP to the nearest mismatch (end of
866 alignment, indel, SNP, etc) in the same alignment.
867
868 [DIST] Distance from this SNP to the nearest sequence end.
869
870 [R] Number of repeat alignments which cover this reference
871 position, >0 means repetitive sequence.
872
873 [Q] Number of repeat alignments which cover this query
874 position, >0 means repetitive sequence.
875
876 [LEN R] Length of the reference sequence.
877
878 [LEN Q] Length of the query sequence.
879
880 [CTX R] Surrounding context sequence in the reference.
881
882 [CTX Q] Surrounding context sequence in the query.
883
884 [FRM] Reading frame for the reference sequence and the
885 reading frame for the query sequence respectively. Simply
886 'forward' 1, or 'reverse' -1 for nucmer data.
887
888 [TAGS] The reference FastA ID and the query FastA ID.
889
890 NOTES:
891 It is often helpful to run this program with the -C option to ensure
892 that SNPs are reported only from uniquely aligned regions.
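Since the output mixes SNPs and insertions/deletions, a downstream script
usually needs to tell them apart. The sketch below is not part of MUMmer and
rests on assumptions: the rows are tab-delimited with no header (check
'show-snps -h' for the switches that produce this), the first four columns
are [P1] [SUB] [SUB] [P2] as listed above, and a gap is written as '.' in the
affected [SUB] column. The function name is hypothetical.

```python
def classify_snp_row(row: str):
    """Return (p1, p2, ref_char, qry_char, kind) for one show-snps row.

    Assumption: '.' in a [SUB] column marks a gap in that sequence.
    """
    fields = row.rstrip("\n").split("\t")
    p1 = int(fields[0])
    ref_char = fields[1]
    qry_char = fields[2]
    p2 = int(fields[3])
    if ref_char == ".":
        kind = "insertion in query"
    elif qry_char == ".":
        kind = "deletion in query"
    else:
        kind = "substitution"
    return p1, p2, ref_char, qry_char, kind
```

Only the first four columns are read, so the sketch does not depend on which
optional columns were switched on.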
893
894
895 ** show-tiling **
896
897 DESCRIPTION:
898 This program attempts to construct a tiling path out of the query
899 contigs as mapped to the reference sequences. Given the delta
900 alignment information of a few long reference sequences and many small
901 query contigs, 'show-tiling' will determine the best location on a
902 reference for each contig. Note that each contig may only be tiled
903 once, so repetitive regions may cause this program some difficulty.
904 This program is useful for aiding in the scaffolding and closure of an
905 unfinished set of contigs, if a suitable, high similarity, reference
906 genome is available. When used with promer, 'show-tiling' will also
907 help in the identification of syntenic regions and the mapping of
908 their contigs to the references.
909
910 USAGE:
911 show-tiling [options] <deltafile>
912
913 [options] type 'show-tiling -h' for a list of options.
914 <deltafile> the .delta output file from either nucmer or promer.
915
916 OUTPUT:
917 stdout Standard output has 8 columns: start in reference, end in
918 reference, gap between this contig and the next, length of this
919 contig, alignment coverage of this contig, average percent
920 identity of the alignments for this contig, orientation of this
921 contig, contig ID. All matches to a reference are headed by the
922 FASTA tag of that reference. Output with the -a option is the
923 same as 'show-coords -cl' when run on nucmer data.
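The 8-column layout described above can be read into a simple structure as
sketched below. This is an illustration, not part of MUMmer. It assumes
whitespace-delimited rows and assumes that each reference header line starts
with '>' followed by the reference ID; the function name and dictionary keys
are hypothetical.

```python
def parse_tiling(lines):
    """Group show-tiling rows by reference ID.

    Returns a dict mapping each reference ID to a list of contig records.
    """
    tiling = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):  # assumed header marker
            parts = line[1:].split()
            current = parts[0] if parts else ""
            tiling[current] = []
            continue
        fields = line.split()
        if current is None or len(fields) != 8:
            continue  # not a standard 8-column row
        tiling[current].append({
            "start": int(fields[0]),        # start in reference
            "end": int(fields[1]),          # end in reference
            "gap": int(fields[2]),          # gap to the next contig
            "length": int(fields[3]),       # contig length
            "coverage": float(fields[4]),   # alignment coverage
            "identity": float(fields[5]),   # average percent identity
            "orientation": fields[6],
            "contig": fields[7],
        })
    return tiling
```

The records for each reference stay in file order, so the tiling path can be
walked directly from the returned lists.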
924
925 NOTES:
926 When run with the -x option, 'show-tiling' will produce an XML output
927 format that can be accepted by TIGR's open source scaffolding software
928 'Bambus' as contig linking information.
929
930
931 -- CONTACT INFORMATION --
932
933 Please address questions and bug reports to: <mummer-help@lists.sourceforge.net>
934
935 Last Revised May 12, 2005