annotate CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/mummer-3.23/docs/dnadiff.README @ 69:33d812a61356

planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author jpayne
date Tue, 18 Mar 2025 17:55:14 -0400
rev   line source
jpayne@69 1 --------------------------------------------------------------------------------
jpayne@69 2 dnadiff is a wrapper for nucmer and analysis utilities that provides
jpayne@69 3 detailed information on the differences between two genomes, and also
jpayne@69 4 provides a high level report file that quantifies the differences
jpayne@69 5 between the two inputs.
jpayne@69 6
jpayne@69 7 Use Cases:
jpayne@69 8 + diff'ing two strains of the same species
jpayne@69 9 + diff'ing two assemblies of the same organism
jpayne@69 10 + diff'ing a draft assembly and a closely related finished genome
jpayne@69 11
jpayne@69 12 If any of this code is used in any publication, please cite the following:
jpayne@69 13
jpayne@69 14 Versatile and open software for comparing large genomes.
jpayne@69 15 S. Kurtz, A. Phillippy, A.L. Delcher,
jpayne@69 16 M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg.
jpayne@69 17 Genome Biology (2004), 5:R12.
jpayne@69 18
jpayne@69 19 --------------------------------------------------------------------------------
jpayne@69 20
jpayne@69 21 This manual is also available as HTML documentation included in this
jpayne@69 22 distribution, or at:
jpayne@69 23
jpayne@69 24 http://mummer.sourceforge.net
jpayne@69 25 http://mummer.sourceforge.net/manual
jpayne@69 26 http://mummer.sourceforge.net/examples
jpayne@69 27
jpayne@69 28
jpayne@69 29 -- DESCRIPTION --
jpayne@69 30 dnadiff is a wrapper around nucmer that builds an alignment using
jpayne@69 31 default parameters, and runs many of nucmer's helper scripts to
jpayne@69 32 process the output and report alignment statistics, SNPs, breakpoints,
jpayne@69 33 etc. It is designed for evaluating the sequence and structural
jpayne@69 34 similarity of two highly similar sequence sets, for example two
jpayne@69 35 different assemblies of the same organism, or two strains of
jpayne@69 36 the same species.
jpayne@69 37
jpayne@69 38
jpayne@69 39 -- dnadiff EXAMPLE --
jpayne@69 40 To compare two strains of the same species, type:
jpayne@69 41
jpayne@69 42 "dnadiff genome1.fna genome2.fna"
jpayne@69 43
jpayne@69 44 Output will be...
jpayne@69 45 out.report - Summary of alignments, differences and SNPs
jpayne@69 46 out.delta - Standard nucmer alignment output
jpayne@69 47 out.1delta - 1-to-1 alignment from delta-filter -1
jpayne@69 48 out.mdelta - M-to-M alignment from delta-filter -m
jpayne@69 49 out.1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
jpayne@69 50 out.mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
jpayne@69 51 out.snps - SNPs from show-snps -rlTHC .1delta
jpayne@69 52 out.rdiff - Classified ref breakpoints from show-diff -rH .mdelta
jpayne@69 53 out.qdiff - Classified qry breakpoints from show-diff -qH .mdelta
jpayne@69 54 out.unref - Unaligned reference sequence IDs and lengths
jpayne@69 55 out.unqry - Unaligned query sequence IDs and lengths
jpayne@69 56
jpayne@69 57 For more information on the formats and meanings of all the files
jpayne@69 58 produced, please see the documentation for the corresponding
jpayne@69 59 utility. This document serves to describe running the dnadiff script
jpayne@69 60 and interpreting the produced .report file.
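
As a convenience when scripting around dnadiff, the list of output files above can be turned into a quick completeness check. The following is only a minimal sketch: the helper names (expected_outputs, missing_outputs) are hypothetical and not part of MUMmer, and the suffix list is copied from the table above.

```python
from pathlib import Path

# Suffixes taken from the README's list of output files.
DNADIFF_SUFFIXES = (
    "report", "delta", "1delta", "mdelta", "1coords", "mcoords",
    "snps", "rdiff", "qdiff", "unref", "unqry",
)


def expected_outputs(prefix="out"):
    """Return the file names dnadiff is documented to write for a prefix."""
    return [Path(f"{prefix}.{suffix}") for suffix in DNADIFF_SUFFIXES]


def missing_outputs(prefix="out", directory="."):
    """List documented outputs that are not present in `directory`."""
    base = Path(directory)
    return [p for p in expected_outputs(prefix) if not (base / p).exists()]
```

After a run, an empty list from missing_outputs("out") would suggest that every documented file was written to the current directory.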
jpayne@69 61
jpayne@69 62
jpayne@69 63 -- RUNNING 'dnadiff' --
jpayne@69 64
jpayne@69 65 USAGE: dnadiff [options] <Reference> <Query>
jpayne@69 66 or dnadiff [options] -d <Delta File>
jpayne@69 67
jpayne@69 68 DESCRIPTION:
jpayne@69 69 Run comparative analysis of two sequence sets using nucmer and its
jpayne@69 70 associated utilities with recommended parameters. See MUMmer
jpayne@69 71 documentation for a more detailed description of the
jpayne@69 72 output. Produces the following output files:
jpayne@69 73
jpayne@69 74 .delta - Standard nucmer alignment output
jpayne@69 75 .1delta - 1-to-1 alignment from delta-filter -1
jpayne@69 76 .mdelta - M-to-M alignment from delta-filter -m
jpayne@69 77 .1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
jpayne@69 78 .mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
jpayne@69 79 .snps - SNPs from show-snps -rlTHC .1delta
jpayne@69 80 .rdiff - Classified alignment breakpoints from show-diff -rH .mdelta
jpayne@69 81 .qdiff - Classified alignment breakpoints from show-diff -qH .mdelta
jpayne@69 82 .report - Summary of alignments, differences and SNPs
jpayne@69 83 .unref - Unaligned reference sequence IDs and lengths
jpayne@69 84 .unqry - Unaligned query sequence IDs and lengths
jpayne@69 85
jpayne@69 86 MANDATORY:
jpayne@69 87 Reference Set the input reference multi-FASTA filename
jpayne@69 88 Query Set the input query multi-FASTA filename
jpayne@69 89 or
jpayne@69 90 Delta File Unfiltered .delta alignment file from nucmer
jpayne@69 91
jpayne@69 92 OPTIONS:
jpayne@69 93 -d|delta Provide precomputed delta file for analysis
jpayne@69 94 -h
jpayne@69 95 --help Display help information and exit
jpayne@69 96 -p|prefix Set the prefix of the output files (default "out")
jpayne@69 97 -V
jpayne@69 98 --version Display the version information and exit
jpayne@69 99
jpayne@69 100
jpayne@69 101 -- NOTES --
jpayne@69 102 The -p option is recommended to avoid overwriting previous
jpayne@69 103 output. A simple naming convention: for input files A.fna and B.fna,
jpayne@69 104 set "-p A_B". It is safest to let dnadiff run nucmer automatically, so
jpayne@69 105 avoid using the -d option unless the delta file was already generated
jpayne@69 106 with "nucmer --maxmatch" and has not been filtered.
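
The naming convention above is easy to automate. The sketch below builds the argument list for both usage forms shown in the USAGE section; the function names are hypothetical, and only the -p and -d options documented above are used. It builds the command but does not run it.

```python
from pathlib import Path


def dnadiff_command(reference, query, prefix=None):
    """Build an argv list for `dnadiff -p <prefix> <Reference> <Query>`."""
    if prefix is None:
        # README convention: A.fna and B.fna give the prefix "A_B".
        prefix = f"{Path(reference).stem}_{Path(query).stem}"
    return ["dnadiff", "-p", prefix, str(reference), str(query)]


def dnadiff_delta_command(delta_file, prefix):
    """Build an argv list for `dnadiff -p <prefix> -d <Delta File>`.

    Per the notes above, the delta file should be unfiltered output
    of "nucmer --maxmatch".
    """
    return ["dnadiff", "-p", prefix, "-d", str(delta_file)]
```

The resulting list can be passed to subprocess.run(cmd, check=True), assuming dnadiff is installed and on the PATH.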
jpayne@69 107
jpayne@69 108
jpayne@69 109 -- OUTPUT FILES --
jpayne@69 110 dnadiff produces many outputs, but all except one are produced by
jpayne@69 111 other utilities in the MUMmer package. Please see their corresponding
jpayne@69 112 documentation for more information. This section will only describe
jpayne@69 113 the .report file generated by dnadiff and tips on interpreting it.
jpayne@69 114
jpayne@69 115
jpayne@69 116 *** .report OUTPUT ***
jpayne@69 117
jpayne@69 118 Report statistics are broken into two columns - reference and
jpayne@69 119 query. Rows are grouped by themed alignment metrics and are described
jpayne@69 120 here. Summary counts are estimates and do not represent the exact
jpayne@69 121 number of occurrences of a particular evolutionary event. When reading
jpayne@69 122 a reference column, think "number of XYZ in the reference with regard
jpayne@69 123 to the query". When reading a query column, think "number of XYZ in
jpayne@69 124 the query with regard to the reference".
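
For downstream scripting, the two-column layout can be read into a simple structure. The sketch below assumes a layout this README does not spell out: bracketed section names such as [Sequences] alone on a line, and each statistic on a line of three whitespace-separated fields (name, reference value, query value). The function name parse_report is hypothetical. Values are kept as strings because some may carry extra formatting.

```python
def parse_report(text):
    """Parse .report text into {section: [(row, ref_value, qry_value), ...]}.

    Rows are kept in a list because a row name such as TotalLength can
    appear more than once in a section (under 1-to-1 and under M-to-M).
    """
    sections = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        is_header = (
            len(fields) == 1
            and fields[0].startswith("[")
            and fields[0].endswith("]")
        )
        if is_header:
            current = fields[0].strip("[]")
            sections[current] = []
        elif len(fields) == 3 and current is not None:
            name, ref_value, qry_value = fields
            sections[current].append((name, ref_value, qry_value))
        # Anything else (blank lines, column headings) is skipped.
    return sections
```

With this structure, sections["Bases"] would hold the [Bases] rows in file order, reference value first and query value second.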
jpayne@69 125
jpayne@69 126 [Sequences] - Sequence-centric stats.
jpayne@69 127 TotalSeqs - Total number of input sequences.
jpayne@69 128 AlignedSeqs - Number of input sequences with at least one alignment.
jpayne@69 129 UnalignedSeqs - Number of input sequences with no alignment.
jpayne@69 130
jpayne@69 131 [Bases] - Base-pair-centric stats.
jpayne@69 132 TotalBases - Total number of bases in the input sequences.
jpayne@69 133 AlignedBases - Total number of bases contained within an alignment.
jpayne@69 134 UnalignedBases - Total number of unaligned bases. This is a rough
jpayne@69 135 measure for the amount of "unique" sequence in the
jpayne@69 136 reference and query.
jpayne@69 137
jpayne@69 138 [Alignments] - Alignment-centric stats.
jpayne@69 139 1-to-1 - Number of alignment blocks comprising the 1-to-1
jpayne@69 140 mapping of reference to query. This is a subset of
jpayne@69 141 the M-to-M mapping, with repeats removed.
jpayne@69 142 TotalLength - Total length of 1-to-1 alignment blocks.
jpayne@69 143 AvgLength - Average length of 1-to-1 alignment blocks.
jpayne@69 144 AvgIdentity - Average identity of 1-to-1 alignment blocks.
jpayne@69 145
jpayne@69 146 M-to-M - Number of alignment blocks comprising the
jpayne@69 147 many-to-many mapping of reference to query. The
jpayne@69 148 M-to-M mapping represents the smallest set of
jpayne@69 149 alignments that maximize the coverage of both
jpayne@69 150 reference and query. This is a superset of the 1-to-1
jpayne@69 151 mapping.
jpayne@69 152 TotalLength - Total length of M-to-M alignment blocks.
jpayne@69 153 AvgLength - Average length of M-to-M alignment blocks.
jpayne@69 154 AvgIdentity - Average identity of M-to-M alignment blocks.
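
The alignment statistics above can be approximated directly from a .1coords or .mcoords file. This is a sketch under two assumptions that this README does not confirm: that the tab-delimited rows begin with the columns S1, E1, S2, E2, LEN1, LEN2, %IDY, and that a length-weighted mean is an acceptable definition of average identity. dnadiff's own calculation may differ, and summarize_coords is a hypothetical name.

```python
def summarize_coords(lines):
    """Summarize alignment blocks from show-coords -THrcl style rows.

    Assumed column order (tab-separated): S1, E1, S2, E2, LEN1, LEN2, %IDY, ...
    Returns None when no usable rows are found.
    """
    count = 0
    ref_len = 0
    qry_len = 0
    weighted_identity = 0.0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or malformed rows
        len1, len2 = int(fields[4]), int(fields[5])
        identity = float(fields[6])
        count += 1
        ref_len += len1
        qry_len += len2
        # Weight each block's identity by its combined aligned length.
        weighted_identity += identity * (len1 + len2)
    if count == 0 or ref_len + qry_len == 0:
        return None
    return {
        "blocks": count,
        "ref_total_length": ref_len,
        "qry_total_length": qry_len,
        "ref_avg_length": ref_len / count,
        "qry_avg_length": qry_len / count,
        "avg_identity": weighted_identity / (ref_len + qry_len),
    }
```

Passing an open file handle works as well as a list of lines, for example summarize_coords(open("out.1coords")).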
jpayne@69 155
jpayne@69 156 [Features] - Structural alignment features, such as
jpayne@69 157 rearrangements. These counts are rough estimates
jpayne@69 158 based on an automated analysis of the
jpayne@69 159 alignments. Features are identified by scanning the
jpayne@69 160 reference (or query) from low to high, and noting the
jpayne@69 161 positions where the query alignments are
jpayne@69 162 inconsistently ordered or oriented with respect to
jpayne@69 163 the reference.
jpayne@69 164 Breakpoints - Number of non-maximal alignment endpoints,
jpayne@69 165 i.e. endpoints that do not occur at the beginning or
jpayne@69 166 end of a sequence.
jpayne@69 167 Relocations - Number of breaks in the alignment where adjacent
jpayne@69 168 1-to-1 alignment blocks are in the same sequence, but
jpayne@69 169 not consistently ordered. A separate feature is
jpayne@69 170 recorded for each end of a relocation, so this is
jpayne@69 171 really a count of relocation endpoints.
jpayne@69 172 Translocations - Number of breaks in the alignment where adjacent
jpayne@69 173 1-to-1 alignment blocks are in different sequences. A
jpayne@69 174 separate feature is recorded for each end of a
jpayne@69 175 translocation, so this is really a count of
jpayne@69 176 translocation endpoints.
jpayne@69 177 Inversions - Number of breaks in the alignment where adjacent
jpayne@69 178 1-to-1 alignment blocks are inverted with respect to
jpayne@69 179 one another. A separate feature is recorded for each
jpayne@69 180 end of an inversion, so this is really a count of
jpayne@69 181 inversion endpoints.
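
To make the three junction types concrete, here is a toy classifier that follows the definitions above. It is not dnadiff's or show-diff's algorithm: the Block record, the function names, and the order in which the tests are applied are all assumptions made for illustration.

```python
from collections import Counter, namedtuple

# Hypothetical record for one 1-to-1 block as seen from the reference:
# the query sequence it aligns to, its query start, and its strand (+1 or -1).
Block = namedtuple("Block", "qry_seq qry_start strand")


def classify_break(prev, curr):
    """Classify the junction between two blocks adjacent on the reference."""
    if prev.qry_seq != curr.qry_seq:
        return "translocation"  # adjacent blocks lie in different sequences
    if prev.strand != curr.strand:
        return "inversion"      # adjacent blocks are inverted relative to each other
    # Same sequence and strand: query positions should advance with the strand.
    if (curr.qry_start - prev.qry_start) * prev.strand < 0:
        return "relocation"     # same sequence, but not consistently ordered
    return None                 # consistently ordered and oriented


def count_features(blocks):
    """Count junction types over blocks sorted by reference position."""
    kinds = (classify_break(a, b) for a, b in zip(blocks, blocks[1:]))
    return Counter(kind for kind in kinds if kind is not None)
```

Because each junction is counted separately, a single rearranged segment flanked by consistent blocks usually contributes two counts, which matches the "count of endpoints" reading above.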
jpayne@69 182
jpayne@69 183 Insertions - Rough count of insertion events. Note that this is
jpayne@69 184 slightly different from "UnalignedBases" because it
jpayne@69 185 counts duplications as insertions, whereas
jpayne@69 186 UnalignedBases does not. Also, this count does not
jpayne@69 187 include sequences that have no alignments as
jpayne@69 188 insertions, whereas UnalignedBases does. Note that
jpayne@69 189 insertions in R can be viewed as deletions from Q.
jpayne@69 190 This number reports only "major" insertions defined
jpayne@69 191 as insertions large enough to break an alignment.
jpayne@69 192 Nucmer will align through smaller insertions of less
jpayne@69 193 than ~60 bases. These smaller insertions are
jpayne@69 194 reported in the "Indels" count below.
jpayne@69 195 InsertionSum - Rough sum of inserted sequence.
jpayne@69 196 InsertionAvg - Average length of insertion.
jpayne@69 197
jpayne@69 198 TandemIns - Rough count of tandem duplication insertion
jpayne@69 199 events. Note that expansions in R can be viewed as
jpayne@69 200 collapses in Q.
jpayne@69 201 TandemInsSum - Rough sum of tandem duplication insertions.
jpayne@69 202 TandemInsAvg - Average length of tandem duplications.
jpayne@69 203
jpayne@69 204 [SNPs] - Single Nucleotide Polymorphism counts.
jpayne@69 205 TotalSNPs - Total number of SNPs, same for both sequences.
jpayne@69 206 XY - X-to-Y SNP. For reference column, this means
jpayne@69 207 reference 'X' to query 'Y'. For query column, this
jpayne@69 208 means query 'X' to reference 'Y'. The same
jpayne@69 209 convention applies below.
jpayne@69 210
jpayne@69 211 TotalGSNPs - Single Nucleotide Polymorphisms bounded by 20 exact,
jpayne@69 212 base-pair matches on both sides.
jpayne@69 213
jpayne@69 214 TotalIndels - Single Nucleotide Insertions/Deletions.
jpayne@69 215 X. - X insertion. For reference column, 'X.' means
jpayne@69 216 insertion of 'X' in the reference. For query column,
jpayne@69 217 'X.' means insertion of 'X' in the query. Nucmer will
jpayne@69 218 align through group insertions of up to ~60 bases.
jpayne@69 219 Each base of these group insertions will be reported
jpayne@69 220 in this count. Large insertions will be reported in
jpayne@69 221 the "Insertions" count above.
jpayne@69 222
jpayne@69 223 TotalGIndels - Single Nucleotide Insertions/Deletions bounded by 20
jpayne@69 224 exact, base-pair matches on both sides.
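
The XY and 'X.' naming conventions in the [SNPs] section can be illustrated by tallying rows of a .snps file. This sketch assumes tab-delimited rows beginning with P1, reference base, query base, P2, BUFF, with '.' marking the missing base of an indel. It also treats BUFF >= 20 as the test for the "G" counts. The column layout and the BUFF test are both assumptions rather than statements about dnadiff's implementation, and tally_snps is a hypothetical name.

```python
from collections import Counter


def tally_snps(lines, min_buffer=20):
    """Tally substitutions and single-base indels from show-snps style rows.

    Assumed column order (tab-separated): P1, ref base, qry base, P2, BUFF, ...
    Returns (totals, ref_types, qry_types).
    """
    totals = Counter()
    ref_types = Counter()  # keys as read in the reference column
    qry_types = Counter()  # keys as read in the query column
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed rows
        ref_base, qry_base, buff = fields[1], fields[2], int(fields[4])
        kind = "Indels" if "." in (ref_base, qry_base) else "SNPs"
        totals["Total" + kind] += 1
        if buff >= min_buffer:
            totals["TotalG" + kind] += 1
        # Reference column reads ref-to-qry; query column reads qry-to-ref.
        ref_types[ref_base + qry_base] += 1
        qry_types[qry_base + ref_base] += 1
    return totals, ref_types, qry_types
```

Under these assumptions, a row with reference base 'C' and query base '.' is counted as 'C.' in the reference column (an insertion of C in the reference) and as '.C' in the query column.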