comparison CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/mummer-3.23/docs/maxmat3man.tex @ 69:33d812a61356

planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author jpayne
date Tue, 18 Mar 2025 17:55:14 -0400
parents
children
comparison
equal deleted inserted replaced
67:0e9998148a16 69:33d812a61356
1 %
2 % Copyright (c) 2003 by Stefan Kurtz and The Institute for
3 % Genomic Research. This is OSI Certified Open Source Software.
4 % Please see the file LICENSE for licensing information and
5 % the file ACKNOWLEDGEMENTS for names of contributors to the
6 % code base.
7 %
8
9 \documentclass[12pt]{article}
10 \usepackage{a4wide,alltt,xspace,times}
11 \usepackage{skaff}
12 \usepackage{optionman}
13 \newcommand{\MMthree}{\texttt{maxmat3}\xspace}
14 \newcommand{\MUM}[0]{\textit{MUM}\xspace}
15 \newcommand{\Ignore}[1]{}
16 \newcommand{\Subs}[3]{#1[#2\ldots#3]}
17 \newcommand{\Match}[3]{(#3,#1,#2)}
18
19 \author{Stefan Kurtz\thanks{\SKaffiliation} \thanks{Minor alterations
20 made by Adam Phillippy, The Institute for Genomic Research}}
21
22 \title{\textbf{A Program to find Maximal Matches}\\
23 \textbf{in Large Sequences}\\[2mm]
24 \textbf{Manual}\footnote{\copyright Stefan Kurtz and The
25 Institute for Genomic Research}}
26
27 \begin{document}
28 \maketitle
29
30 \MMthree is a program to find maximal matches of some minimum length
31 between a reference-sequence and a query-sequence. It also allows for
32 the computation of \MUM-candidates as well as \MUM{s}. This document
33 describes the options of \MMthree and the output format. In Section
34 \ref{SecBasicNotions}, we formally define some basic notions to clarify
35 the semantics of \MMthree. However, the reader should be able to understand
36 the manual without reading that section. In the following, the notion
37 \emph{match} always refers to \emph{maximal matches} if not stated
38 otherwise.
39
40 \section{The Program and its Options}
41
42 The program is called as follows:
43
44 $\MMthree~~[\mathit{options}]~~\mathit{referencefile}~~
45 \mathit{queryfile}_{1}~~\cdots~~\mathit{queryfile}_{n}$
46
47 And here is a description of the options:
48
49 \begin{list}{}{}
50
51 \Option{mum}{}{
52 Compute \MUM{s}, i.e.\ matches that are unique in both sequences.
53 }
54
55 \Option{mumreference}{}{
56 Compute \MUM-candidates, i.e.\ matches that are unique
57 in the reference-sequence but not necessarily in the query-sequence.
58 }
59
60 \Option{maxmatch}{}{
61 Compute all maximal matches, regardless of their uniqueness.
62 }
63
64 \Option{n}{}{
65 Only match the characters $a$, $c$, $g$, or $t$. They can be either
66 in upper or in lower case.
67 }
68
69 \Option{l}{$q$}{
70 Set the minimum length $q$ of a match to be reported. $q$ must be a
71 positive integer. Only matches of length at least $q$ are reported.
72 If this option is not used, then the default value for $q$ is 20.
73 }
74
75 \Option{b}{}{
76 Compute direct matches and reverse complement matches.
77 }
78
79 \Option{r}{}{
80 Only compute reverse complement matches.
81 }
82
83 \Option{s}{}{
84 Show the matching substring.
85 }
86
87 \Option{c}{}{
88 Report the query-position of a reverse complement match
89 relative to the original query-sequence. Suppose \(x\) is the
90 reference-sequence and \(y\) is the query-sequence. By definition, a
91 reverse complement match \(\Match{i}{j}{l}\) of \(x\) and \(y\) is a
92 match of \(x\) and \(\overline{y}\), where \(\overline{y}\) is the
93 reverse complement of \(y\), see Section \ref{SecBasicNotions}. By
94 default, a match is reported as a triple
95 \begin{alltt}
96 i j l
97 \end{alltt}
98 Position \(j\) is relative to \(\overline{y}\). With this option,
99 not \(j\) but \(m-j+1\) is reported, where \(m\) is the length of the
100 query-sequence. Now note that position \(m-j+1\) in \(y\) corresponds
101 to position \(j\) in \(\overline{y}\).
102 }
103
104 \Option{F}{}{
105 Forces 4 column output format regardless of the number of reference
106 sequence inputs, i.e. prefixes every output match with its reference
107 sequence identifier.
108 }
109
110 \Option{L}{}{
111 Show the length of the query-sequence on the header line.
112 }
113
114 \Option{h}{}{
115 Show the possible options and exit.
116 }
117 \Option{help}{}{
118 Show the possible options and exit.
119 }
120
121 \end{list}
122
123 Options \Showoption{mum}, \Showoption{mumreference} and
124 \Showoption{maxmatch} cannot be combined. If none of these options are
125 used, then the program will default to \Showoption{mumreference}.
126
127 Option \Showoption{b} and \Showoption{r} exclude each other. If none
128 of these two options is used, then only direct matches are reported.
129 Option \Showoption{c} can only be used in combination with
130 option \Showoption{b} or option \Showoption{r}.
131
132 There must be exactly one reference-file given and at least one
133 query-file. The maximum number of query-files is 32.
134 The reference-file and the query-files must be in multiple fasta format.
135 The query-files are processed one after the other. The uniqueness condition
136 for the \MUM{s} refers to the entire set of reference-sequences but
137 only to one query-sequence. That is, if a \MUM is reported, the
138 matching substring is unique in all reference-sequences and in the currently
139 processed query-sequence. The uniqueness does not necessarily extend to
140 all query-sequences.
141
142 Input sequences can be over any set of lower or upper case
143 characters. Thus DNA sequences as well as protein sequences are allowed.
144 The characters are processed case insensitive. That is, a
145 lower case character is identified with the corresponding upper case
146 character. The sequence output via option \Showoption{s} is always
147 in lower case. If option \Showoption{n} is used,
148 then all characters except for $a$, $c$, $g$, and $t$ are replaced by
149 a unique character which is different for the reference-sequences and
150 the query-sequences. This prevents false matches involving, for
151 example, the wildcard symbols $s$, $w$, $r$, $y$, $m$, $k$, $b$,
152 $d$, $h$, $v$, and $n$ often occurring in DNA sequences.
153 We therefore recommend to use the option \Showoption{n}.
154
155 \section{Output format}
156 Suppose we have two fasta files \texttt{Data/U89959} and
157 \texttt{Data/at.est}. The first file contains a complete BAC-sequence from
158 Arabidopsis thaliana, while the second is a collection of
159 ESTs from the same organism. We want to use the first file as our
160 reference-file and the second as out query-file.
161
162 Now let us look for direct matches and reverse complement
163 matches in the reference and in the query sequences.
164 The length of the matches should be at least 18 (options \Showoption{b}
165 and \Showoption{l} 18).
166 We want to report the following:
167 \begin{itemize}
168 \item
169 the length of the query-sequences (option \Showoption{L})
170 \item
171 the matching sequence (option \Showoption{s})
172 \item
173 the query-positions relative to the original sequence
174 (option \Showoption{c}).
175 \end{itemize}
176 We also do not want to report matches involving wildcards
177 (option \Showoption{n}). The corresponding program call is as follows:
178
179 \begin{verbatim}
180 maxmat3.x -b -l 18 -L -s -c -n Data/U89959 Data/at.est
181 \end{verbatim}
182
183 Here is a part of the output:
184
185 \begin{small}
186 \begin{verbatim}
187 # reading input file "Data/U89959" of length 106973
188 # construct suffix tree for sequence of length 106973
189 # (maximal input length is 536870908)
190 # process 1725 characters per dot
191 # ..............................................................
192 # CONSTRUCTIONTIME maxmat3.x Data/U89959 0.11
193 # reading input file "Data/at.est" of length 772376
194 # matching query-file "Data/at.est"
195 # against reference-file "Data/U89959"
196 > gi|5587835|gb|AF078689.1|AF078689 Len = 275
197 90201 258 18
198 taaaaaaaaaaaaaaaaa
199 52836 258 18
200 taaaaaaaaaaaaaaaaa
201 > gi|5587835|gb|AF078689.1|AF078689 Reverse Len = 275
202 > gi|4714033|dbj|C99914.1|C99914 Len = 628
203 > gi|4714033|dbj|C99914.1|C99914 Reverse Len = 628
204 > gi|4714032|dbj|C99913.1|C99913 Len = 497
205 > gi|4714032|dbj|C99913.1|C99913 Reverse Len = 497
206 > gi|4714031|dbj|C99911.1|C99911 Len = 661
207 > gi|4714031|dbj|C99911.1|C99911 Reverse Len = 661
208 > gi|4714030|dbj|C99910.1|C99910 Len = 241
209 5066 23 19
210 agaagaagaagaagaagaa
211 5069 23 19
212 agaagaagaagaagaagaa
213 5072 23 19
214 agaagaagaagaagaagaa
215 5075 23 19
216
217 .....
218
219 > gi|2763999|gb|R30040.1|R30040 Len = 475
220 > gi|2763999|gb|R30040.1|R30040 Reverse Len = 475
221 # COMPLETETIME maxmat3.x Data/U89959 1.64
222 # SPACE maxmat3.x Data/U89959 2.71
223 \end{verbatim}
224 \end{small}
225
226 The lines starting with the symbol \texttt{\symbol{35}} report some useful
227 information about the matching process. They tell which files are input
228 and the length of the scanned sequences. They also report
229 the used computational resources, i.e.\ the time required for
230 constructing the suffix tree, the total time of the matching process
231 (in seconds) and the space requirement (in megabytes).
232 The line starting with
233 \texttt{\symbol{35}} and containing only dots shows the progress of the
234 suffix tree construction. This is useful to estimate the remaining
235 running time for very long sequences.
236
237 For each query-sequence the corresponding header line is (up to the
238 first white-space character) reported in a line beginning with the
239 symbol \texttt{\symbol{62}}. The length of the
240 query-sequence appears at the end of such a line. Below the header line
241 all matches against the corresponding query sequence
242 are reported as a triple of three numbers.
243 \begin{itemize}
244 \item
245 The first number is the position of the match in the reference-sequence.
246 \item
247 The second number is the position of the match in the query-sequence.
248 \item
249 The third number is the length of the match.
250 \end{itemize}
251 Reverse complement matches are reported after a header line containing the
252 keyword \texttt{Reverse}. A matching sequence is reported in
253 a separate line following the line containing the positions and the
254 length of the match.
255
256 If the reference-file is a multiple fasta file with more than one
257 sequence, then it is necessary to also report the header line of the
258 reference-sequence and the position of the match relative to the start
259 of the reference-sequence. This leads to a slightly different format
260 where each line containing the positions is prepended by the header of
261 the reference-sequence (also available with the \Showoption{F}
262 option). This format is shown in the following example, where we use
263 two files containing protein sequences: The file \texttt{Data/prot1}
264 is a file containing one protein sequence, while \texttt{Data/prot2}
265 contains many protein sequences: We use \texttt{Data/prot2} as our
266 reference-file and \texttt{Data/prot1} as our query-file:
267
268 \begin{verbatim}
269 maxmat3.x -L -l 7 Data/prot2 Data/prot1
270 \end{verbatim}
271
272 The output is as follows:
273
274 \begin{small}
275 \begin{verbatim}
276 # reading input file "Data/prot2" of length 40839
277 # construct suffix tree for sequence of length 40839
278 # (maximum input length is 536870908)
279 # CONSTRUCTIONTIME maxmat3.x Data/prot2 0.02
280 # reading input file "Data/prot1" of length 45543
281 # matching query-file "Data/prot1"
282 # against reference-file "Data/prot2"
283 > some_collection Len = 45543
284 sp|ALL2_BOVIN|ALLERGEN 139 1155 7
285 sp|ALG3_YEAST|PUTATIVE 425 2904 7
286 sp|ALF_TRYBB|FRUCTOSE-BISPHOSPHATE 177 4794 7
287 sp|ALGB_YEAST|ALG11 114 12599 7
288 sp|ALR_SYNY3|ALANINE 354 33261 7
289 sp|ALGB_PSEAE|ALGINATE 212 40991 7
290 sp|ALGB_PSEAE|ALGINATE 314 41093 7
291 sp|ALGB_PSEAE|ALGINATE 360 41138 7
292 sp|ALR_SYNY3|ALANINE 79 44186 7
293 sp|ALP_CEPAC|ALKALINE 366 45304 7
294 # COMPLETETIME maxmat3.x Data/prot2 0.06
295 # SPACE maxmat3.x Data/prot2 0.56
296 \end{verbatim}
297 \end{small}
298
299 \section{Basic Notions}\label{SecBasicNotions}
300 We assume that \(x\) is a sequence of length \(n\geq 1\) over some
301 alphabet \(\Sigma\).
302 $x[i]$ denotes the character at position $i$ in $x$,
303 for $i\in[1,n]$. For $i\leq j$, $\Subs{x}{i}{j}$ denotes the
304 substring of $x$ starting with the character at position $i$
305 and ending with the character at position $j$ in \(x\). If \(i>j\), then
306 \(\Subs{x}{i}{j}\) is the empty sequence.
307
308 Let \(y\) be a sequence of length \(m\). In this document we refer to
309 \(x\) as the \emph{reference-sequence} and to \(y\) as the
310 \emph{query-sequence}.
311 Let \(\ell>0\), \(i\in[1,n-\ell+1]\), and \(j\in[1,m-\ell+1]\).
312 \(\Match{i}{j}{\ell}\) is a \emph{match} of \(x\) and \(y\) if and only if
313 \(\Subs{x}{i}{i+\ell-1}=\Subs{y}{j}{j+\ell-1}\).
314 \(\ell\) is the \emph{length of the match} and \(\Subs{x}{i}{i+\ell-1}\)
315 is the \emph{matching substring}. Note that a match is contained in a
316 longer match if the characters to the left or to the right of the
317 occurrences in \(x\) and \(y\) are identical. To reduce redundancy,
318 we restrict attention to maximal matches.
319 A match \(\Match{i}{j}{\ell}\) of \(x\) and \(y\) is \emph{maximal} if
320 and only if the following holds:
321 \begin{itemize}
322 \item
323 \(i=1\) or \(j=1\) or \(x[i-1]\neq y[j-1]\)\hfill(left maximality)
324 \item
325 \(i+\ell=n+1\) or \(j+\ell=m+1\) or \(x[i+\ell]\neq
326 y[j+\ell]\)\hfill(right maximality)
327 \end{itemize}
328 A \emph{maximal unique match} of \(x\) and \(y\) (\MUM for short) is a
329 maximal match \(\Match{i}{j}{\ell}\) of \(x\) and \(y\) such that the
330 matching substring \(\Subs{x}{i}{i+\ell-1}\)
331 occurs exactly once in \(x\) and once in \(y\). A \emph{\MUM-candidate}
332 is a maximal match \(\Match{i}{j}{l}\) of \(x\) and \(y\)
333 such that the matching substring
334 \(\Subs{x}{i}{i+\ell-1}\) occurs exactly once in \(x\). Note that
335 any \MUM is also a \MUM-candidate, but not vice versa, since for a
336 \MUM-candidate the matching substring may occur more than once in \(y\).
337
338 If \(\Sigma\) is the DNA alphabet with the characters \(a,c,g,t\), then we
339 define a function \(wcc\) over \(\Sigma\) by the following equations:
340 \begin{eqnarray*}
341 wcc(a)&=&t\\
342 wcc(g)&=&c\\
343 wcc(c)&=&g\\
344 wcc(t)&=&a
345 \end{eqnarray*}
346 The reverse complement of
347 a DNA-sequence \(u=\Subs{u}{1}{k}\) is the sequence
348 \[wcc(u[k])wcc(u[k-1])\ldots wcc(u[2])wcc(u[1])\]
349 denoted by \(\overline{u}\).
350 A \emph{reverse complement match} of \(x\) and \(y\) is a
351 match of \(x\) and \(\overline{y}\). The notion of maximality
352 naturally extends to reverse complement matches. To distinguish reverse
353 complement matches from matches we often call the latter direct matches.
354 \end{document}