Mercurial > repos > rliterman > csp2
comparison CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/mummer-3.23/docs/maxmat3man.tex @ 69:33d812a61356
planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author | jpayne |
---|---|
date | Tue, 18 Mar 2025 17:55:14 -0400 |
parents | |
children |
comparison
equal
deleted
inserted
replaced
67:0e9998148a16 | 69:33d812a61356 |
---|---|
1 % | |
2 % Copyright (c) 2003 by Stefan Kurtz and The Institute for | |
3 % Genomic Research. This is OSI Certified Open Source Software. | |
4 % Please see the file LICENSE for licensing information and | |
5 % the file ACKNOWLEDGEMENTS for names of contributors to the | |
6 % code base. | |
7 % | |
8 | |
9 \documentclass[12pt]{article} | |
10 \usepackage{a4wide,alltt,xspace,times} | |
11 \usepackage{skaff} | |
12 \usepackage{optionman} | |
13 \newcommand{\MMthree}{\texttt{maxmat3}\xspace} | |
14 \newcommand{\MUM}[0]{\textit{MUM}\xspace} | |
15 \newcommand{\Ignore}[1]{} | |
16 \newcommand{\Subs}[3]{#1[#2\ldots#3]} | |
17 \newcommand{\Match}[3]{(#3,#1,#2)} | |
18 | |
19 \author{Stefan Kurtz\thanks{\SKaffiliation} \thanks{Minor alterations | |
20 made by Adam Phillippy, The Institute for Genomic Research}} | |
21 | |
22 \title{\textbf{A Program to find Maximal Matches}\\ | |
23 \textbf{in Large Sequences}\\[2mm] | |
24 \textbf{Manual}\footnote{\copyright Stefan Kurtz and The | |
25 Institute for Genomic Research}} | |
26 | |
27 \begin{document} | |
28 \maketitle | |
29 | |
30 \MMthree is a program to find maximal matches of some minimum length | |
31 between a reference-sequence and a query-sequence. It also allows for | |
32 the computation of \MUM-candidates as well as \MUM{s}. This document | |
33 describes the options of \MMthree and the output format. In Section | |
34 \ref{SecBasicNotions}, we formally define some basic notions to clarify | |
35 the semantics of \MMthree. However, the reader should be able to understand | |
36 the manual without reading that section. In the following, the notion | |
37 \emph{match} always refers to \emph{maximal matches} if not stated | |
38 otherwise. | |
39 | |
40 \section{The Program and its Options} | |
41 | |
42 The program is called as follows: | |
43 | |
44 $\MMthree~~[\mathit{options}]~~\mathit{referencefile}~~ | |
45 \mathit{queryfile}_{1}~~\cdots~~\mathit{queryfile}_{n}$ | |
46 | |
47 And here is a description of the options: | |
48 | |
49 \begin{list}{}{} | |
50 | |
51 \Option{mum}{}{ | |
52 Compute \MUM{s}, i.e.\ matches that are unique in both sequences. | |
53 } | |
54 | |
55 \Option{mumreference}{}{ | |
56 Compute \MUM-candidates, i.e.\ matches that are unique | |
57 in the reference-sequence but not necessarily in the query-sequence. | |
58 } | |
59 | |
60 \Option{maxmatch}{}{ | |
61 Compute all maximal matches, regardless of their uniqueness. | |
62 } | |
63 | |
64 \Option{n}{}{ | |
65 Only match the characters $a$, $c$, $g$, or $t$. They can be either | |
66 in upper or in lower case. | |
67 } | |
68 | |
69 \Option{l}{$q$}{ | |
70 Set the minimum length $q$ of a match to be reported. $q$ must be a | |
71 positive integer. Only matches of length at least $q$ are reported. | |
72 If this option is not used, then the default value for $q$ is 20. | |
73 } | |
74 | |
75 \Option{b}{}{ | |
76 Compute direct matches and reverse complement matches. | |
77 } | |
78 | |
79 \Option{r}{}{ | |
80 Only compute reverse complement matches. | |
81 } | |
82 | |
83 \Option{s}{}{ | |
84 Show the matching substring. | |
85 } | |
86 | |
87 \Option{c}{}{ | |
88 Report the query-position of a reverse complement match | |
89 relative to the original query-sequence. Suppose \(x\) is the | |
90 reference-sequence and \(y\) is the query-sequence. By definition, a | |
91 reverse complement match \(\Match{i}{j}{l}\) of \(x\) and \(y\) is a | |
92 match of \(x\) and \(\overline{y}\), where \(\overline{y}\) is the | |
93 reverse complement of \(y\), see Section \ref{SecBasicNotions}. By | |
94 default, a match is reported as a triple | |
95 \begin{alltt} | |
96 i j l | |
97 \end{alltt} | |
98 Position \(j\) is relative to \(\overline{y}\). With this option, | |
99 not \(j\) but \(m-j+1\) is reported, where \(m\) is the length of the | |
100 query-sequence. Now note that position \(m-j+1\) in \(y\) corresponds | |
101 to position \(j\) in \(\overline{y}\). | |
102 } | |
103 | |
104 \Option{F}{}{ | |
105 Forces 4 column output format regardless of the number of reference | |
106 sequence inputs, i.e. prefixes every output match with its reference | |
107 sequence identifier. | |
108 } | |
109 | |
110 \Option{L}{}{ | |
111 Show the length of the query-sequence on the header line. | |
112 } | |
113 | |
114 \Option{h}{}{ | |
115 Show the possible options and exit. | |
116 } | |
117 \Option{help}{}{ | |
118 Show the possible options and exit. | |
119 } | |
120 | |
121 \end{list} | |
122 | |
123 Options \Showoption{mum}, \Showoption{mumreference} and | |
124 \Showoption{maxmatch} cannot be combined. If none of these options are | |
125 used, then the program will default to \Showoption{mumreference}. | |
126 | |
127 Option \Showoption{b} and \Showoption{r} exclude each other. If none | |
128 of these two options is used, then only direct matches are reported. | |
129 Option \Showoption{c} can only be used in combination with | |
130 option \Showoption{b} or option \Showoption{r}. | |
131 | |
132 There must be exactly one reference-file given and at least one | |
133 query-file. The maximum number of query-files is 32. | |
134 The reference-file and the query-files must be in multiple fasta format. | |
135 The query-files are processed one after the other. The uniqueness condition | |
136 for the \MUM{s} refers to the entire set of reference-sequences but | |
137 only to one query-sequence. That is, if a \MUM is reported, the | |
138 matching substring is unique in all reference-sequences and in the currently | |
139 processed query-sequence. The uniqueness does not necessarily extend to | |
140 all query-sequences. | |
141 | |
142 Input sequences can be over any set of lower or upper case | |
143 characters. Thus DNA sequences as well as protein sequences are allowed. | |
144 The characters are processed case insensitive. That is, a | |
145 lower case character is identified with the corresponding upper case | |
146 character. The sequence output via option \Showoption{s} is always | |
147 in lower case. If option \Showoption{n} is used, | |
148 then all characters except for $a$, $c$, $g$, and $t$ are replaced by | |
149 a unique character which is different for the reference-sequences and | |
150 the query-sequences. This prevents false matches involving, for | |
151 example, the wildcard symbols $s$, $w$, $r$, $y$, $m$, $k$, $b$, | |
152 $d$, $h$, $v$, and $n$ often occurring in DNA sequences. | |
153 We therefore recommend to use the option \Showoption{n}. | |
154 | |
155 \section{Output format} | |
156 Suppose we have two fasta files \texttt{Data/U89959} and | |
157 \texttt{Data/at.est}. The first file contains a complete BAC-sequence from | |
158 Arabidopsis thaliana, while the second is a collection of | |
159 ESTs from the same organism. We want to use the first file as our | |
160 reference-file and the second as out query-file. | |
161 | |
162 Now let us look for direct matches and reverse complement | |
163 matches in the reference and in the query sequences. | |
164 The length of the matches should be at least 18 (options \Showoption{b} | |
165 and \Showoption{l} 18). | |
166 We want to report the following: | |
167 \begin{itemize} | |
168 \item | |
169 the length of the query-sequences (option \Showoption{L}) | |
170 \item | |
171 the matching sequence (option \Showoption{s}) | |
172 \item | |
173 the query-positions relative to the original sequence | |
174 (option \Showoption{c}). | |
175 \end{itemize} | |
176 We also do not want to report matches involving wildcards | |
177 (option \Showoption{n}). The corresponding program call is as follows: | |
178 | |
179 \begin{verbatim} | |
180 maxmat3.x -b -l 18 -L -s -c -n Data/U89959 Data/at.est | |
181 \end{verbatim} | |
182 | |
183 Here is a part of the output: | |
184 | |
185 \begin{small} | |
186 \begin{verbatim} | |
187 # reading input file "Data/U89959" of length 106973 | |
188 # construct suffix tree for sequence of length 106973 | |
189 # (maximal input length is 536870908) | |
190 # process 1725 characters per dot | |
191 # .............................................................. | |
192 # CONSTRUCTIONTIME maxmat3.x Data/U89959 0.11 | |
193 # reading input file "Data/at.est" of length 772376 | |
194 # matching query-file "Data/at.est" | |
195 # against reference-file "Data/U89959" | |
196 > gi|5587835|gb|AF078689.1|AF078689 Len = 275 | |
197 90201 258 18 | |
198 taaaaaaaaaaaaaaaaa | |
199 52836 258 18 | |
200 taaaaaaaaaaaaaaaaa | |
201 > gi|5587835|gb|AF078689.1|AF078689 Reverse Len = 275 | |
202 > gi|4714033|dbj|C99914.1|C99914 Len = 628 | |
203 > gi|4714033|dbj|C99914.1|C99914 Reverse Len = 628 | |
204 > gi|4714032|dbj|C99913.1|C99913 Len = 497 | |
205 > gi|4714032|dbj|C99913.1|C99913 Reverse Len = 497 | |
206 > gi|4714031|dbj|C99911.1|C99911 Len = 661 | |
207 > gi|4714031|dbj|C99911.1|C99911 Reverse Len = 661 | |
208 > gi|4714030|dbj|C99910.1|C99910 Len = 241 | |
209 5066 23 19 | |
210 agaagaagaagaagaagaa | |
211 5069 23 19 | |
212 agaagaagaagaagaagaa | |
213 5072 23 19 | |
214 agaagaagaagaagaagaa | |
215 5075 23 19 | |
216 | |
217 ..... | |
218 | |
219 > gi|2763999|gb|R30040.1|R30040 Len = 475 | |
220 > gi|2763999|gb|R30040.1|R30040 Reverse Len = 475 | |
221 # COMPLETETIME maxmat3.x Data/U89959 1.64 | |
222 # SPACE maxmat3.x Data/U89959 2.71 | |
223 \end{verbatim} | |
224 \end{small} | |
225 | |
226 The lines starting with the symbol \texttt{\symbol{35}} report some useful | |
227 information about the matching process. They tell which files are input | |
228 and the length of the scanned sequences. They also report | |
229 the used computational resources, i.e.\ the time required for | |
230 constructing the suffix tree, the total time of the matching process | |
231 (in seconds) and the space requirement (in megabytes). | |
232 The line starting with | |
233 \texttt{\symbol{35}} and containing only dots shows the progress of the | |
234 suffix tree construction. This is useful to estimate the remaining | |
235 running time for very long sequences. | |
236 | |
237 For each query-sequence the corresponding header line is (up to the | |
238 first white-space character) reported in a line beginning with the | |
239 symbol \texttt{\symbol{62}}. The length of the | |
240 query-sequence appears at the end of such a line. Below the header line | |
241 all matches against the corresponding query sequence | |
242 are reported as a triple of three numbers. | |
243 \begin{itemize} | |
244 \item | |
245 The first number is the position of the match in the reference-sequence. | |
246 \item | |
247 The second number is the position of the match in the query-sequence. | |
248 \item | |
249 The third number is the length of the match. | |
250 \end{itemize} | |
251 Reverse complement matches are reported after a header line containing the | |
252 keyword \texttt{Reverse}. A matching sequence is reported in | |
253 a separate line following the line containing the positions and the | |
254 length of the match. | |
255 | |
256 If the reference-file is a multiple fasta file with more than one | |
257 sequence, then it is necessary to also report the header line of the | |
258 reference-sequence and the position of the match relative to the start | |
259 of the reference-sequence. This leads to a slightly different format | |
260 where each line containing the positions is prepended by the header of | |
261 the reference-sequence (also available with the \Showoption{F} | |
262 option). This format is shown in the following example, where we use | |
263 two files containing protein sequences: The file \texttt{Data/prot1} | |
264 is a file containing one protein sequence, while \texttt{Data/prot2} | |
265 contains many protein sequences: We use \texttt{Data/prot2} as our | |
266 reference-file and \texttt{Data/prot1} as our query-file: | |
267 | |
268 \begin{verbatim} | |
269 maxmat3.x -L -l 7 Data/prot2 Data/prot1 | |
270 \end{verbatim} | |
271 | |
272 The output is as follows: | |
273 | |
274 \begin{small} | |
275 \begin{verbatim} | |
276 # reading input file "Data/prot2" of length 40839 | |
277 # construct suffix tree for sequence of length 40839 | |
278 # (maximum input length is 536870908) | |
279 # CONSTRUCTIONTIME maxmat3.x Data/prot2 0.02 | |
280 # reading input file "Data/prot1" of length 45543 | |
281 # matching query-file "Data/prot1" | |
282 # against reference-file "Data/prot2" | |
283 > some_collection Len = 45543 | |
284 sp|ALL2_BOVIN|ALLERGEN 139 1155 7 | |
285 sp|ALG3_YEAST|PUTATIVE 425 2904 7 | |
286 sp|ALF_TRYBB|FRUCTOSE-BISPHOSPHATE 177 4794 7 | |
287 sp|ALGB_YEAST|ALG11 114 12599 7 | |
288 sp|ALR_SYNY3|ALANINE 354 33261 7 | |
289 sp|ALGB_PSEAE|ALGINATE 212 40991 7 | |
290 sp|ALGB_PSEAE|ALGINATE 314 41093 7 | |
291 sp|ALGB_PSEAE|ALGINATE 360 41138 7 | |
292 sp|ALR_SYNY3|ALANINE 79 44186 7 | |
293 sp|ALP_CEPAC|ALKALINE 366 45304 7 | |
294 # COMPLETETIME maxmat3.x Data/prot2 0.06 | |
295 # SPACE maxmat3.x Data/prot2 0.56 | |
296 \end{verbatim} | |
297 \end{small} | |
298 | |
299 \section{Basic Notions}\label{SecBasicNotions} | |
300 We assume that \(x\) is a sequence of length \(n\geq 1\) over some | |
301 alphabet \(\Sigma\). | |
302 $x[i]$ denotes the character at position $i$ in $x$, | |
303 for $i\in[1,n]$. For $i\leq j$, $\Subs{x}{i}{j}$ denotes the | |
304 substring of $x$ starting with the character at position $i$ | |
305 and ending with the character at position $j$ in \(x\). If \(i>j\), then | |
306 \(\Subs{x}{i}{j}\) is the empty sequence. | |
307 | |
308 Let \(y\) be a sequence of length \(m\). In this document we refer to | |
309 \(x\) as the \emph{reference-sequence} and to \(y\) as the | |
310 \emph{query-sequence}. | |
311 Let \(\ell>0\), \(i\in[1,n-\ell+1]\), and \(j\in[1,m-\ell+1]\). | |
312 \(\Match{i}{j}{\ell}\) is a \emph{match} of \(x\) and \(y\) if and only if | |
313 \(\Subs{x}{i}{i+\ell-1}=\Subs{y}{j}{j+\ell-1}\). | |
314 \(\ell\) is the \emph{length of the match} and \(\Subs{x}{i}{i+\ell-1}\) | |
315 is the \emph{matching substring}. Note that a match is contained in a | |
316 longer match if the characters to the left or to the right of the | |
317 occurrences in \(x\) and \(y\) are identical. To reduce redundancy, | |
318 we restrict attention to maximal matches. | |
319 A match \(\Match{i}{j}{\ell}\) of \(x\) and \(y\) is \emph{maximal} if | |
320 and only if the following holds: | |
321 \begin{itemize} | |
322 \item | |
323 \(i=1\) or \(j=1\) or \(x[i-1]\neq y[j-1]\)\hfill(left maximality) | |
324 \item | |
325 \(i+\ell=n+1\) or \(j+\ell=m+1\) or \(x[i+\ell]\neq | |
326 y[j+\ell]\)\hfill(right maximality) | |
327 \end{itemize} | |
328 A \emph{maximal unique match} of \(x\) and \(y\) (\MUM for short) is a | |
329 maximal match \(\Match{i}{j}{\ell}\) of \(x\) and \(y\) such that the | |
330 matching substring \(\Subs{x}{i}{i+\ell-1}\) | |
331 occurs exactly once in \(x\) and once in \(y\). A \emph{\MUM-candidate} | |
332 is a maximal match \(\Match{i}{j}{l}\) of \(x\) and \(y\) | |
333 such that the matching substring | |
334 \(\Subs{x}{i}{i+\ell-1}\) occurs exactly once in \(x\). Note that | |
335 any \MUM is also a \MUM-candidate, but not vice versa, since for a | |
336 \MUM-candidate the matching substring may occur more than once in \(y\). | |
337 | |
338 If \(\Sigma\) is the DNA alphabet with the characters \(a,c,g,t\), then we | |
339 define a function \(wcc\) over \(\Sigma\) by the following equations: | |
340 \begin{eqnarray*} | |
341 wcc(a)&=&t\\ | |
342 wcc(g)&=&c\\ | |
343 wcc(c)&=&g\\ | |
344 wcc(t)&=&a | |
345 \end{eqnarray*} | |
346 The reverse complement of | |
347 a DNA-sequence \(u=\Subs{u}{1}{k}\) is the sequence | |
348 \[wcc(u[k])wcc(u[k-1])\ldots wcc(u[2])wcc(u[1])\] | |
349 denoted by \(\overline{u}\). | |
350 A \emph{reverse complement match} of \(x\) and \(y\) is a | |
351 match of \(x\) and \(\overline{y}\). The notion of maximality | |
352 naturally extends to reverse complement matches. To distinguish reverse | |
353 complement matches from matches we often call the latter direct matches. | |
354 \end{document} |