Mercurial > repos > rliterman > csp2
view CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/mummer-3.23/docs/maxmat3man.tex @ 69:33d812a61356
planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author | jpayne |
---|---|
date | Tue, 18 Mar 2025 17:55:14 -0400 |
parents | |
children |
line wrap: on
line source
% % Copyright (c) 2003 by Stefan Kurtz and The Institute for % Genomic Research. This is OSI Certified Open Source Software. % Please see the file LICENSE for licensing information and % the file ACKNOWLEDGEMENTS for names of contributors to the % code base. % \documentclass[12pt]{article} \usepackage{a4wide,alltt,xspace,times} \usepackage{skaff} \usepackage{optionman} \newcommand{\MMthree}{\texttt{maxmat3}\xspace} \newcommand{\MUM}[0]{\textit{MUM}\xspace} \newcommand{\Ignore}[1]{} \newcommand{\Subs}[3]{#1[#2\ldots#3]} \newcommand{\Match}[3]{(#3,#1,#2)} \author{Stefan Kurtz\thanks{\SKaffiliation} \thanks{Minor alterations made by Adam Phillippy, The Institute for Genomic Research}} \title{\textbf{A Program to find Maximal Matches}\\ \textbf{in Large Sequences}\\[2mm] \textbf{Manual}\footnote{\copyright Stefan Kurtz and The Institute for Genomic Research}} \begin{document} \maketitle \MMthree is a program to find maximal matches of some minimum length between a reference-sequence and a query-sequence. It also allows for the computation of \MUM-candidates as well as \MUM{s}. This document describes the options of \MMthree and the output format. In Section \ref{SecBasicNotions}, we formally define some basic notions to clarify the semantics of \MMthree. However, the reader should be able to understand the manual without reading that section. In the following, the notion \emph{match} always refers to \emph{maximal matches} if not stated otherwise. \section{The Program and its Options} The program is called as follows: $\MMthree~~[\mathit{options}]~~\mathit{referencefile}~~ \mathit{queryfile}_{1}~~\cdots~~\mathit{queryfile}_{n}$ And here is a description of the options: \begin{list}{}{} \Option{mum}{}{ Compute \MUM{s}, i.e.\ matches that are unique in both sequences. } \Option{mumreference}{}{ Compute \MUM-candidates, i.e.\ matches that are unique in the reference-sequence but not necessarily in the query-sequence. } \Option{maxmatch}{}{ Compute all maximal matches, regardless of their uniqueness. } \Option{n}{}{ Only match the characters $a$, $c$, $g$, or $t$. They can be either in upper or in lower case. } \Option{l}{$q$}{ Set the minimum length $q$ of a match to be reported. $q$ must be a positive integer. Only matches of length at least $q$ are reported. If this option is not used, then the default value for $q$ is 20. } \Option{b}{}{ Compute direct matches and reverse complement matches. } \Option{r}{}{ Only compute reverse complement matches. } \Option{s}{}{ Show the matching substring. } \Option{c}{}{ Report the query-position of a reverse complement match relative to the original query-sequence. Suppose \(x\) is the reference-sequence and \(y\) is the query-sequence. By definition, a reverse complement match \(\Match{i}{j}{l}\) of \(x\) and \(y\) is a match of \(x\) and \(\overline{y}\), where \(\overline{y}\) is the reverse complement of \(y\), see Section \ref{SecBasicNotions}. By default, a match is reported as a triple \begin{alltt} i j l \end{alltt} Position \(j\) is relative to \(\overline{y}\). With this option, not \(j\) but \(m-j+1\) is reported, where \(m\) is the length of the query-sequence. Now note that position \(m-j+1\) in \(y\) corresponds to position \(j\) in \(\overline{y}\). } \Option{F}{}{ Forces 4 column output format regardless of the number of reference sequence inputs, i.e. prefixes every output match with its reference sequence identifier. } \Option{L}{}{ Show the length of the query-sequence on the header line. } \Option{h}{}{ Show the possible options and exit. } \Option{help}{}{ Show the possible options and exit. } \end{list} Options \Showoption{mum}, \Showoption{mumreference} and \Showoption{maxmatch} cannot be combined. If none of these options are used, then the program will default to \Showoption{mumreference}. Option \Showoption{b} and \Showoption{r} exclude each other. If none of these two options is used, then only direct matches are reported. Option \Showoption{c} can only be used in combination with option \Showoption{b} or option \Showoption{r}. There must be exactly one reference-file given and at least one query-file. The maximum number of query-files is 32. The reference-file and the query-files must be in multiple fasta format. The query-files are processed one after the other. The uniqueness condition for the \MUM{s} refers to the entire set of reference-sequences but only to one query-sequence. That is, if a \MUM is reported, the matching substring is unique in all reference-sequences and in the currently processed query-sequence. The uniqueness does not necessarily extend to all query-sequences. Input sequences can be over any set of lower or upper case characters. Thus DNA sequences as well as protein sequences are allowed. The characters are processed case insensitive. That is, a lower case character is identified with the corresponding upper case character. The sequence output via option \Showoption{s} is always in lower case. If option \Showoption{n} is used, then all characters except for $a$, $c$, $g$, and $t$ are replaced by a unique character which is different for the reference-sequences and the query-sequences. This prevents false matches involving, for example, the wildcard symbols $s$, $w$, $r$, $y$, $m$, $k$, $b$, $d$, $h$, $v$, and $n$ often occurring in DNA sequences. We therefore recommend to use the option \Showoption{n}. \section{Output format} Suppose we have two fasta files \texttt{Data/U89959} and \texttt{Data/at.est}. The first file contains a complete BAC-sequence from Arabidopsis thaliana, while the second is a collection of ESTs from the same organism. We want to use the first file as our reference-file and the second as out query-file. Now let us look for direct matches and reverse complement matches in the reference and in the query sequences. The length of the matches should be at least 18 (options \Showoption{b} and \Showoption{l} 18). We want to report the following: \begin{itemize} \item the length of the query-sequences (option \Showoption{L}) \item the matching sequence (option \Showoption{s}) \item the query-positions relative to the original sequence (option \Showoption{c}). \end{itemize} We also do not want to report matches involving wildcards (option \Showoption{n}). The corresponding program call is as follows: \begin{verbatim} maxmat3.x -b -l 18 -L -s -c -n Data/U89959 Data/at.est \end{verbatim} Here is a part of the output: \begin{small} \begin{verbatim} # reading input file "Data/U89959" of length 106973 # construct suffix tree for sequence of length 106973 # (maximal input length is 536870908) # process 1725 characters per dot # .............................................................. # CONSTRUCTIONTIME maxmat3.x Data/U89959 0.11 # reading input file "Data/at.est" of length 772376 # matching query-file "Data/at.est" # against reference-file "Data/U89959" > gi|5587835|gb|AF078689.1|AF078689 Len = 275 90201 258 18 taaaaaaaaaaaaaaaaa 52836 258 18 taaaaaaaaaaaaaaaaa > gi|5587835|gb|AF078689.1|AF078689 Reverse Len = 275 > gi|4714033|dbj|C99914.1|C99914 Len = 628 > gi|4714033|dbj|C99914.1|C99914 Reverse Len = 628 > gi|4714032|dbj|C99913.1|C99913 Len = 497 > gi|4714032|dbj|C99913.1|C99913 Reverse Len = 497 > gi|4714031|dbj|C99911.1|C99911 Len = 661 > gi|4714031|dbj|C99911.1|C99911 Reverse Len = 661 > gi|4714030|dbj|C99910.1|C99910 Len = 241 5066 23 19 agaagaagaagaagaagaa 5069 23 19 agaagaagaagaagaagaa 5072 23 19 agaagaagaagaagaagaa 5075 23 19 ..... > gi|2763999|gb|R30040.1|R30040 Len = 475 > gi|2763999|gb|R30040.1|R30040 Reverse Len = 475 # COMPLETETIME maxmat3.x Data/U89959 1.64 # SPACE maxmat3.x Data/U89959 2.71 \end{verbatim} \end{small} The lines starting with the symbol \texttt{\symbol{35}} report some useful information about the matching process. They tell which files are input and the length of the scanned sequences. They also report the used computational resources, i.e.\ the time required for constructing the suffix tree, the total time of the matching process (in seconds) and the space requirement (in megabytes). The line starting with \texttt{\symbol{35}} and containing only dots shows the progress of the suffix tree construction. This is useful to estimate the remaining running time for very long sequences. For each query-sequence the corresponding header line is (up to the first white-space character) reported in a line beginning with the symbol \texttt{\symbol{62}}. The length of the query-sequence appears at the end of such a line. Below the header line all matches against the corresponding query sequence are reported as a triple of three numbers. \begin{itemize} \item The first number is the position of the match in the reference-sequence. \item The second number is the position of the match in the query-sequence. \item The third number is the length of the match. \end{itemize} Reverse complement matches are reported after a header line containing the keyword \texttt{Reverse}. A matching sequence is reported in a separate line following the line containing the positions and the length of the match. If the reference-file is a multiple fasta file with more than one sequence, then it is necessary to also report the header line of the reference-sequence and the position of the match relative to the start of the reference-sequence. This leads to a slightly different format where each line containing the positions is prepended by the header of the reference-sequence (also available with the \Showoption{F} option). This format is shown in the following example, where we use two files containing protein sequences: The file \texttt{Data/prot1} is a file containing one protein sequence, while \texttt{Data/prot2} contains many protein sequences: We use \texttt{Data/prot2} as our reference-file and \texttt{Data/prot1} as our query-file: \begin{verbatim} maxmat3.x -L -l 7 Data/prot2 Data/prot1 \end{verbatim} The output is as follows: \begin{small} \begin{verbatim} # reading input file "Data/prot2" of length 40839 # construct suffix tree for sequence of length 40839 # (maximum input length is 536870908) # CONSTRUCTIONTIME maxmat3.x Data/prot2 0.02 # reading input file "Data/prot1" of length 45543 # matching query-file "Data/prot1" # against reference-file "Data/prot2" > some_collection Len = 45543 sp|ALL2_BOVIN|ALLERGEN 139 1155 7 sp|ALG3_YEAST|PUTATIVE 425 2904 7 sp|ALF_TRYBB|FRUCTOSE-BISPHOSPHATE 177 4794 7 sp|ALGB_YEAST|ALG11 114 12599 7 sp|ALR_SYNY3|ALANINE 354 33261 7 sp|ALGB_PSEAE|ALGINATE 212 40991 7 sp|ALGB_PSEAE|ALGINATE 314 41093 7 sp|ALGB_PSEAE|ALGINATE 360 41138 7 sp|ALR_SYNY3|ALANINE 79 44186 7 sp|ALP_CEPAC|ALKALINE 366 45304 7 # COMPLETETIME maxmat3.x Data/prot2 0.06 # SPACE maxmat3.x Data/prot2 0.56 \end{verbatim} \end{small} \section{Basic Notions}\label{SecBasicNotions} We assume that \(x\) is a sequence of length \(n\geq 1\) over some alphabet \(\Sigma\). $x[i]$ denotes the character at position $i$ in $x$, for $i\in[1,n]$. For $i\leq j$, $\Subs{x}{i}{j}$ denotes the substring of $x$ starting with the character at position $i$ and ending with the character at position $j$ in \(x\). If \(i>j\), then \(\Subs{x}{i}{j}\) is the empty sequence. Let \(y\) be a sequence of length \(m\). In this document we refer to \(x\) as the \emph{reference-sequence} and to \(y\) as the \emph{query-sequence}. Let \(\ell>0\), \(i\in[1,n-\ell+1]\), and \(j\in[1,m-\ell+1]\). \(\Match{i}{j}{\ell}\) is a \emph{match} of \(x\) and \(y\) if and only if \(\Subs{x}{i}{i+\ell-1}=\Subs{y}{j}{j+\ell-1}\). \(\ell\) is the \emph{length of the match} and \(\Subs{x}{i}{i+\ell-1}\) is the \emph{matching substring}. Note that a match is contained in a longer match if the characters to the left or to the right of the occurrences in \(x\) and \(y\) are identical. To reduce redundancy, we restrict attention to maximal matches. A match \(\Match{i}{j}{\ell}\) of \(x\) and \(y\) is \emph{maximal} if and only if the following holds: \begin{itemize} \item \(i=1\) or \(j=1\) or \(x[i-1]\neq y[j-1]\)\hfill(left maximality) \item \(i+\ell=n+1\) or \(j+\ell=m+1\) or \(x[i+\ell]\neq y[j+\ell]\)\hfill(right maximality) \end{itemize} A \emph{maximal unique match} of \(x\) and \(y\) (\MUM for short) is a maximal match \(\Match{i}{j}{\ell}\) of \(x\) and \(y\) such that the matching substring \(\Subs{x}{i}{i+\ell-1}\) occurs exactly once in \(x\) and once in \(y\). A \emph{\MUM-candidate} is a maximal match \(\Match{i}{j}{l}\) of \(x\) and \(y\) such that the matching substring \(\Subs{x}{i}{i+\ell-1}\) occurs exactly once in \(x\). Note that any \MUM is also a \MUM-candidate, but not vice versa, since for a \MUM-candidate the matching substring may occur more than once in \(y\). If \(\Sigma\) is the DNA alphabet with the characters \(a,c,g,t\), then we define a function \(wcc\) over \(\Sigma\) by the following equations: \begin{eqnarray*} wcc(a)&=&t\\ wcc(g)&=&c\\ wcc(c)&=&g\\ wcc(t)&=&a \end{eqnarray*} The reverse complement of a DNA-sequence \(u=\Subs{u}{1}{k}\) is the sequence \[wcc(u[k])wcc(u[k-1])\ldots wcc(u[2])wcc(u[1])\] denoted by \(\overline{u}\). A \emph{reverse complement match} of \(x\) and \(y\) is a match of \(x\) and \(\overline{y}\). The notion of maximality naturally extends to reverse complement matches. To distinguish reverse complement matches from matches we often call the latter direct matches. \end{document}