jpayne@68
|
1 # Copyright 2002 by Jeffrey Chang.
|
jpayne@68
|
2 # Copyright 2016, 2019, 2020 by Markus Piotrowski.
|
jpayne@68
|
3 # All rights reserved.
|
jpayne@68
|
4 #
|
jpayne@68
|
5 # This file is part of the Biopython distribution and governed by your
|
jpayne@68
|
6 # choice of the "Biopython License Agreement" or the "BSD 3-Clause License".
|
jpayne@68
|
7 # Please see the LICENSE file that should have been included as part of this
|
jpayne@68
|
8 # package.
|
jpayne@68
|
9
|
jpayne@68
|
10 """Pairwise sequence alignment using a dynamic programming algorithm.
|
jpayne@68
|
11
|
jpayne@68
|
12 This provides functions to get global and local alignments between two
|
jpayne@68
|
13 sequences. A global alignment finds the best concordance between all
|
jpayne@68
|
14 characters in two sequences. A local alignment finds just the
|
jpayne@68
|
15 subsequences that align the best. Local alignments must have a positive
|
jpayne@68
|
16 score to be reported and they will not be extended for 'zero counting'
|
jpayne@68
|
17 matches. This means a local alignment will always start and end with
|
jpayne@68
|
18 a positive counting match.
|
jpayne@68
|
19
|
jpayne@68
|
20 When doing alignments, you can specify the match score and gap
|
jpayne@68
|
21 penalties. The match score indicates the compatibility between an
|
jpayne@68
|
22 alignment of two characters in the sequences. Highly compatible
|
jpayne@68
|
23 characters should be given positive scores, and incompatible ones
|
jpayne@68
|
24 should be given negative scores or 0. The gap penalties should be
|
jpayne@68
|
25 negative.
|
jpayne@68
|
26
|
jpayne@68
|
27 The names of the alignment functions in this module follow the
|
jpayne@68
|
28 convention
|
jpayne@68
|
29 <alignment type>XX
|
jpayne@68
|
30 where <alignment type> is either "global" or "local" and XX is a 2
|
jpayne@68
|
31 character code indicating the parameters it takes. The first
|
jpayne@68
|
32 character indicates the parameters for matches (and mismatches), and
|
jpayne@68
|
33 the second indicates the parameters for gap penalties.
|
jpayne@68
|
34
|
jpayne@68
|
35 The match parameters are::
|
jpayne@68
|
36
|
jpayne@68
|
37 CODE DESCRIPTION & OPTIONAL KEYWORDS
|
jpayne@68
|
38 x No parameters. Identical characters have score of 1, otherwise 0.
|
jpayne@68
|
39 m A match score is the score of identical chars, otherwise mismatch
|
jpayne@68
|
40 score. Keywords ``match``, ``mismatch``.
|
jpayne@68
|
41 d A dictionary returns the score of any pair of characters.
|
jpayne@68
|
42 Keyword ``match_dict``.
|
jpayne@68
|
43 c A callback function returns scores. Keyword ``match_fn``.
|
jpayne@68
|
44
|
jpayne@68
|
45 The gap penalty parameters are::
|
jpayne@68
|
46
|
jpayne@68
|
47 CODE DESCRIPTION & OPTIONAL KEYWORDS
|
jpayne@68
|
48 x No gap penalties.
|
jpayne@68
|
49 s Same open and extend gap penalties for both sequences.
|
jpayne@68
|
50 Keywords ``open``, ``extend``.
|
jpayne@68
|
51 d The sequences have different open and extend gap penalties.
|
jpayne@68
|
52 Keywords ``openA``, ``extendA``, ``openB``, ``extendB``.
|
jpayne@68
|
53 c A callback function returns the gap penalties.
|
jpayne@68
|
54 Keywords ``gap_A_fn``, ``gap_B_fn``.
|
jpayne@68
|
55
|
jpayne@68
|
56 All the different alignment functions are contained in an object
|
jpayne@68
|
57 ``align``. For example:
|
jpayne@68
|
58
|
jpayne@68
|
59 >>> from Bio import pairwise2
|
jpayne@68
|
60 >>> alignments = pairwise2.align.globalxx("ACCGT", "ACG")
|
jpayne@68
|
61
|
jpayne@68
|
62 For better readability, the required arguments can be used with optional keywords:
|
jpayne@68
|
63
|
jpayne@68
|
64 >>> alignments = pairwise2.align.globalxx(sequenceA="ACCGT", sequenceB="ACG")
|
jpayne@68
|
65
|
jpayne@68
|
66 The result is a list of the alignments between the two strings. Each alignment
|
jpayne@68
|
67 is a named tuple consisting of the two aligned sequences, the score and the
|
jpayne@68
|
68 start and end positions of the alignment:
|
jpayne@68
|
69
|
jpayne@68
|
70 >>> print(alignments)
|
jpayne@68
|
71 [Alignment(seqA='ACCGT', seqB='A-CG-', score=3.0, start=0, end=5), ...
|
jpayne@68
|
72
|
jpayne@68
|
73 You can access each element of an alignment by index or name:
|
jpayne@68
|
74
|
jpayne@68
|
75 >>> alignments[0][2]
|
jpayne@68
|
76 3.0
|
jpayne@68
|
77 >>> alignments[0].score
|
jpayne@68
|
78 3.0
|
jpayne@68
|
79
|
jpayne@68
|
80 For a nice printout of an alignment, use the ``format_alignment`` method of
|
jpayne@68
|
81 the module:
|
jpayne@68
|
82
|
jpayne@68
|
83 >>> from Bio.pairwise2 import format_alignment
|
jpayne@68
|
84 >>> print(format_alignment(*alignments[0]))
|
jpayne@68
|
85 ACCGT
|
jpayne@68
|
86 | ||
|
jpayne@68
|
87 A-CG-
|
jpayne@68
|
88 Score=3
|
jpayne@68
|
89 <BLANKLINE>
|
jpayne@68
|
90
|
jpayne@68
|
91 All alignment functions have the following arguments:
|
jpayne@68
|
92
|
jpayne@68
|
93 - Two sequences: strings, Biopython sequence objects or lists.
|
jpayne@68
|
94 Lists are useful for supplying sequences which contain residues that are
|
jpayne@68
|
95 encoded by more than one letter.
|
jpayne@68
|
96
|
jpayne@68
|
97 - ``penalize_extend_when_opening``: boolean (default: False).
|
jpayne@68
|
98 Whether to count an extension penalty when opening a gap. If false, a gap of
|
jpayne@68
|
99 1 is only penalized an "open" penalty, otherwise it is penalized
|
jpayne@68
|
100 "open+extend".
|
jpayne@68
|
101
|
jpayne@68
|
102 - ``penalize_end_gaps``: boolean.
|
jpayne@68
|
103 Whether to count the gaps at the ends of an alignment. By default, they are
|
jpayne@68
|
104 counted for global alignments but not for local ones. Setting
|
jpayne@68
|
105 ``penalize_end_gaps`` to (boolean, boolean) allows you to specify for the
|
jpayne@68
|
106 two sequences separately whether gaps at the end of the alignment should be
|
jpayne@68
|
107 counted.
|
jpayne@68
|
108
|
jpayne@68
|
109 - ``gap_char``: string (default: ``'-'``).
|
jpayne@68
|
110 Which character to use as a gap character in the alignment returned. If your
|
jpayne@68
|
111 input sequences are lists, you must change this to ``['-']``.
|
jpayne@68
|
112
|
jpayne@68
|
113 - ``force_generic``: boolean (default: False).
|
jpayne@68
|
114 Always use the generic, non-cached, dynamic programming function (slow!).
|
jpayne@68
|
115 For debugging.
|
jpayne@68
|
116
|
jpayne@68
|
117 - ``score_only``: boolean (default: False).
|
jpayne@68
|
118 Only get the best score, don't recover any alignments. The return value of
|
jpayne@68
|
119 the function is the score. Faster and uses less memory.
|
jpayne@68
|
120
|
jpayne@68
|
121 - ``one_alignment_only``: boolean (default: False).
|
jpayne@68
|
122 Only recover one alignment.
|
jpayne@68
|
123
|
jpayne@68
|
124 The other parameters of the alignment function depend on the function called.
|
jpayne@68
|
125 Some examples:
|
jpayne@68
|
126
|
jpayne@68
|
127 - Find the best global alignment between the two sequences. Identical
|
jpayne@68
|
128 characters are given 1 point. No points are deducted for mismatches or gaps.
|
jpayne@68
|
129
|
jpayne@68
|
130 >>> for a in pairwise2.align.globalxx("ACCGT", "ACG"):
|
jpayne@68
|
131 ... print(format_alignment(*a))
|
jpayne@68
|
132 ACCGT
|
jpayne@68
|
133 | ||
|
jpayne@68
|
134 A-CG-
|
jpayne@68
|
135 Score=3
|
jpayne@68
|
136 <BLANKLINE>
|
jpayne@68
|
137 ACCGT
|
jpayne@68
|
138 || |
|
jpayne@68
|
139 AC-G-
|
jpayne@68
|
140 Score=3
|
jpayne@68
|
141 <BLANKLINE>
|
jpayne@68
|
142
|
jpayne@68
|
143 - Same thing as before, but with a local alignment. Note that
|
jpayne@68
|
144 ``format_alignment`` will only show the aligned parts of the sequences,
|
jpayne@68
|
145 together with the starting positions.
|
jpayne@68
|
146
|
jpayne@68
|
147 >>> for a in pairwise2.align.localxx("ACCGT", "ACG"):
|
jpayne@68
|
148 ... print(format_alignment(*a))
|
jpayne@68
|
149 1 ACCG
|
jpayne@68
|
150 | ||
|
jpayne@68
|
151 1 A-CG
|
jpayne@68
|
152 Score=3
|
jpayne@68
|
153 <BLANKLINE>
|
jpayne@68
|
154 1 ACCG
|
jpayne@68
|
155 || |
|
jpayne@68
|
156 1 AC-G
|
jpayne@68
|
157 Score=3
|
jpayne@68
|
158 <BLANKLINE>
|
jpayne@68
|
159
|
jpayne@68
|
160 To restore the 'historic' behaviour of ``format_alignemt``, i.e., showing
|
jpayne@68
|
161 also the un-aligned parts of both sequences, use the new keyword parameter
|
jpayne@68
|
162 ``full_sequences``:
|
jpayne@68
|
163
|
jpayne@68
|
164 >>> for a in pairwise2.align.localxx("ACCGT", "ACG"):
|
jpayne@68
|
165 ... print(format_alignment(*a, full_sequences=True))
|
jpayne@68
|
166 ACCGT
|
jpayne@68
|
167 | ||
|
jpayne@68
|
168 A-CG-
|
jpayne@68
|
169 Score=3
|
jpayne@68
|
170 <BLANKLINE>
|
jpayne@68
|
171 ACCGT
|
jpayne@68
|
172 || |
|
jpayne@68
|
173 AC-G-
|
jpayne@68
|
174 Score=3
|
jpayne@68
|
175 <BLANKLINE>
|
jpayne@68
|
176
|
jpayne@68
|
177
|
jpayne@68
|
178 - Do a global alignment. Identical characters are given 2 points, 1 point is
|
jpayne@68
|
179 deducted for each non-identical character. Don't penalize gaps.
|
jpayne@68
|
180
|
jpayne@68
|
181 >>> for a in pairwise2.align.globalmx("ACCGT", "ACG", 2, -1):
|
jpayne@68
|
182 ... print(format_alignment(*a))
|
jpayne@68
|
183 ACCGT
|
jpayne@68
|
184 | ||
|
jpayne@68
|
185 A-CG-
|
jpayne@68
|
186 Score=6
|
jpayne@68
|
187 <BLANKLINE>
|
jpayne@68
|
188 ACCGT
|
jpayne@68
|
189 || |
|
jpayne@68
|
190 AC-G-
|
jpayne@68
|
191 Score=6
|
jpayne@68
|
192 <BLANKLINE>
|
jpayne@68
|
193
|
jpayne@68
|
194 - Same as above, except now 0.5 points are deducted when opening a gap, and
|
jpayne@68
|
195 0.1 points are deducted when extending it.
|
jpayne@68
|
196
|
jpayne@68
|
197 >>> for a in pairwise2.align.globalms("ACCGT", "ACG", 2, -1, -.5, -.1):
|
jpayne@68
|
198 ... print(format_alignment(*a))
|
jpayne@68
|
199 ACCGT
|
jpayne@68
|
200 | ||
|
jpayne@68
|
201 A-CG-
|
jpayne@68
|
202 Score=5
|
jpayne@68
|
203 <BLANKLINE>
|
jpayne@68
|
204 ACCGT
|
jpayne@68
|
205 || |
|
jpayne@68
|
206 AC-G-
|
jpayne@68
|
207 Score=5
|
jpayne@68
|
208 <BLANKLINE>
|
jpayne@68
|
209
|
jpayne@68
|
210 - Note that you can use keywords to increase the readability, e.g.:
|
jpayne@68
|
211
|
jpayne@68
|
212 >>> a = pairwise2.align.globalms("ACGT", "ACG", match=2, mismatch=-1, open=-.5,
|
jpayne@68
|
213 ... extend=-.1)
|
jpayne@68
|
214
|
jpayne@68
|
215 - Depending on the penalties, a gap in one sequence may be followed by a gap in
|
jpayne@68
|
216 the other sequence.If you don't like this behaviour, increase the gap-open
|
jpayne@68
|
217 penalty:
|
jpayne@68
|
218
|
jpayne@68
|
219 >>> for a in pairwise2.align.globalms("A", "T", 5, -4, -1, -.1):
|
jpayne@68
|
220 ... print(format_alignment(*a))
|
jpayne@68
|
221 A-
|
jpayne@68
|
222 <BLANKLINE>
|
jpayne@68
|
223 -T
|
jpayne@68
|
224 Score=-2
|
jpayne@68
|
225 <BLANKLINE>
|
jpayne@68
|
226 >>> for a in pairwise2.align.globalms("A", "T", 5, -4, -3, -.1):
|
jpayne@68
|
227 ... print(format_alignment(*a))
|
jpayne@68
|
228 A
|
jpayne@68
|
229 .
|
jpayne@68
|
230 T
|
jpayne@68
|
231 Score=-4
|
jpayne@68
|
232 <BLANKLINE>
|
jpayne@68
|
233
|
jpayne@68
|
234 - The alignment function can also use known matrices already included in
|
jpayne@68
|
235 Biopython (in ``Bio.Align.substitution_matrices``):
|
jpayne@68
|
236
|
jpayne@68
|
237 >>> from Bio.Align import substitution_matrices
|
jpayne@68
|
238 >>> matrix = substitution_matrices.load("BLOSUM62")
|
jpayne@68
|
239 >>> for a in pairwise2.align.globaldx("KEVLA", "EVL", matrix):
|
jpayne@68
|
240 ... print(format_alignment(*a))
|
jpayne@68
|
241 KEVLA
|
jpayne@68
|
242 |||
|
jpayne@68
|
243 -EVL-
|
jpayne@68
|
244 Score=13
|
jpayne@68
|
245 <BLANKLINE>
|
jpayne@68
|
246
|
jpayne@68
|
247 - With the parameter ``c`` you can define your own match- and gap functions.
|
jpayne@68
|
248 E.g. to define an affine logarithmic gap function and using it:
|
jpayne@68
|
249
|
jpayne@68
|
250 >>> from math import log
|
jpayne@68
|
251 >>> def gap_function(x, y): # x is gap position in seq, y is gap length
|
jpayne@68
|
252 ... if y == 0: # No gap
|
jpayne@68
|
253 ... return 0
|
jpayne@68
|
254 ... elif y == 1: # Gap open penalty
|
jpayne@68
|
255 ... return -2
|
jpayne@68
|
256 ... return - (2 + y/4.0 + log(y)/2.0)
|
jpayne@68
|
257 ...
|
jpayne@68
|
258 >>> alignment = pairwise2.align.globalmc("ACCCCCGT", "ACG", 5, -4,
|
jpayne@68
|
259 ... gap_function, gap_function)
|
jpayne@68
|
260
|
jpayne@68
|
261 You can define different gap functions for each sequence.
|
jpayne@68
|
262 Self-defined match functions must take the two residues to be compared and
|
jpayne@68
|
263 return a score.
|
jpayne@68
|
264
|
jpayne@68
|
265 To see a description of the parameters for a function, please look at
|
jpayne@68
|
266 the docstring for the function via the help function, e.g.
|
jpayne@68
|
267 type ``help(pairwise2.align.localds)`` at the Python prompt.
|
jpayne@68
|
268
|
jpayne@68
|
269 """ # noqa: W291
|
jpayne@68
|
270
|
jpayne@68
|
271 import warnings
|
jpayne@68
|
272 from collections import namedtuple
|
jpayne@68
|
273
|
jpayne@68
|
274 from Bio import BiopythonWarning
|
jpayne@68
|
275 from Bio import BiopythonDeprecationWarning
|
jpayne@68
|
276 from Bio.Align import substitution_matrices
|
jpayne@68
|
277
|
jpayne@68
|
278 warnings.warn(
|
jpayne@68
|
279 "Bio.pairwise2 has been deprecated, and we intend to remove it in a "
|
jpayne@68
|
280 "future release of Biopython. As an alternative, please consider using "
|
jpayne@68
|
281 "Bio.Align.PairwiseAligner as a replacement, and contact the "
|
jpayne@68
|
282 "Biopython developers if you still need the Bio.pairwise2 module.",
|
jpayne@68
|
283 BiopythonDeprecationWarning,
|
jpayne@68
|
284 )
|
jpayne@68
|
285
|
jpayne@68
|
286
|
jpayne@68
|
287 MAX_ALIGNMENTS = 1000 # maximum alignments recovered in traceback
|
jpayne@68
|
288
|
jpayne@68
|
289 Alignment = namedtuple("Alignment", ("seqA, seqB, score, start, end"))
|
jpayne@68
|
290
|
jpayne@68
|
291
|
jpayne@68
|
292 class align:
|
jpayne@68
|
293 """Provide functions that do alignments.
|
jpayne@68
|
294
|
jpayne@68
|
295 Alignment functions are called as:
|
jpayne@68
|
296
|
jpayne@68
|
297 pairwise2.align.globalXX
|
jpayne@68
|
298
|
jpayne@68
|
299 or
|
jpayne@68
|
300
|
jpayne@68
|
301 pairwise2.align.localXX
|
jpayne@68
|
302
|
jpayne@68
|
303 Where XX is a 2 character code indicating the match/mismatch parameters
|
jpayne@68
|
304 (first character, either x, m, d or c) and the gap penalty parameters
|
jpayne@68
|
305 (second character, either x, s, d, or c).
|
jpayne@68
|
306
|
jpayne@68
|
307 For a detailed description read the main module's docstring (e.g.,
|
jpayne@68
|
308 type ``help(pairwise2)``).
|
jpayne@68
|
309 To see a description of the parameters for a function, please
|
jpayne@68
|
310 look at the docstring for the function, e.g. type
|
jpayne@68
|
311 ``help(pairwise2.align.localds)`` at the Python prompt.
|
jpayne@68
|
312 """
|
jpayne@68
|
313
|
jpayne@68
|
314 class alignment_function:
|
jpayne@68
|
315 """Callable class which impersonates an alignment function.
|
jpayne@68
|
316
|
jpayne@68
|
317 The constructor takes the name of the function. This class
|
jpayne@68
|
318 will decode the name of the function to figure out how to
|
jpayne@68
|
319 interpret the parameters.
|
jpayne@68
|
320 """
|
jpayne@68
|
321
|
jpayne@68
|
322 # match code -> tuple of (parameters, docstring)
|
jpayne@68
|
323 match2args = {
|
jpayne@68
|
324 "x": ([], ""),
|
jpayne@68
|
325 "m": (
|
jpayne@68
|
326 ["match", "mismatch"],
|
jpayne@68
|
327 "match is the score to given to identical characters.\n"
|
jpayne@68
|
328 "mismatch is the score given to non-identical ones.",
|
jpayne@68
|
329 ),
|
jpayne@68
|
330 "d": (
|
jpayne@68
|
331 ["match_dict"],
|
jpayne@68
|
332 "match_dict is a dictionary where the keys are tuples\n"
|
jpayne@68
|
333 "of pairs of characters and the values are the scores,\n"
|
jpayne@68
|
334 "e.g. ('A', 'C') : 2.5.",
|
jpayne@68
|
335 ),
|
jpayne@68
|
336 "c": (
|
jpayne@68
|
337 ["match_fn"],
|
jpayne@68
|
338 "match_fn is a callback function that takes two "
|
jpayne@68
|
339 "characters and returns the score between them.",
|
jpayne@68
|
340 ),
|
jpayne@68
|
341 }
|
jpayne@68
|
342 # penalty code -> tuple of (parameters, docstring)
|
jpayne@68
|
343 penalty2args = {
|
jpayne@68
|
344 "x": ([], ""),
|
jpayne@68
|
345 "s": (
|
jpayne@68
|
346 ["open", "extend"],
|
jpayne@68
|
347 "open and extend are the gap penalties when a gap is\n"
|
jpayne@68
|
348 "opened and extended. They should be negative.",
|
jpayne@68
|
349 ),
|
jpayne@68
|
350 "d": (
|
jpayne@68
|
351 ["openA", "extendA", "openB", "extendB"],
|
jpayne@68
|
352 "openA and extendA are the gap penalties for sequenceA,\n"
|
jpayne@68
|
353 "and openB and extendB for sequenceB. The penalties\n"
|
jpayne@68
|
354 "should be negative.",
|
jpayne@68
|
355 ),
|
jpayne@68
|
356 "c": (
|
jpayne@68
|
357 ["gap_A_fn", "gap_B_fn"],
|
jpayne@68
|
358 "gap_A_fn and gap_B_fn are callback functions that takes\n"
|
jpayne@68
|
359 "(1) the index where the gap is opened, and (2) the length\n"
|
jpayne@68
|
360 "of the gap. They should return a gap penalty.",
|
jpayne@68
|
361 ),
|
jpayne@68
|
362 }
|
jpayne@68
|
363
|
jpayne@68
|
364 def __init__(self, name):
|
jpayne@68
|
365 """Check to make sure the name of the function is reasonable."""
|
jpayne@68
|
366 if name.startswith("global"):
|
jpayne@68
|
367 if len(name) != 8:
|
jpayne@68
|
368 raise AttributeError("function should be globalXX")
|
jpayne@68
|
369 elif name.startswith("local"):
|
jpayne@68
|
370 if len(name) != 7:
|
jpayne@68
|
371 raise AttributeError("function should be localXX")
|
jpayne@68
|
372 else:
|
jpayne@68
|
373 raise AttributeError(name)
|
jpayne@68
|
374 align_type, match_type, penalty_type = name[:-2], name[-2], name[-1]
|
jpayne@68
|
375 try:
|
jpayne@68
|
376 match_args, match_doc = self.match2args[match_type]
|
jpayne@68
|
377 except KeyError:
|
jpayne@68
|
378 raise AttributeError(f"unknown match type {match_type!r}")
|
jpayne@68
|
379 try:
|
jpayne@68
|
380 penalty_args, penalty_doc = self.penalty2args[penalty_type]
|
jpayne@68
|
381 except KeyError:
|
jpayne@68
|
382 raise AttributeError(f"unknown penalty type {penalty_type!r}")
|
jpayne@68
|
383
|
jpayne@68
|
384 # Now get the names of the parameters to this function.
|
jpayne@68
|
385 param_names = ["sequenceA", "sequenceB"]
|
jpayne@68
|
386 param_names.extend(match_args)
|
jpayne@68
|
387 param_names.extend(penalty_args)
|
jpayne@68
|
388 self.function_name = name
|
jpayne@68
|
389 self.align_type = align_type
|
jpayne@68
|
390 self.param_names = param_names
|
jpayne@68
|
391
|
jpayne@68
|
392 self.__name__ = self.function_name
|
jpayne@68
|
393 # Set the doc string.
|
jpayne@68
|
394 doc = f"{self.__name__}({', '.join(self.param_names)}) -> alignments\n"
|
jpayne@68
|
395 doc += """\
|
jpayne@68
|
396 \nThe following parameters can also be used with optional
|
jpayne@68
|
397 keywords of the same name.\n\n
|
jpayne@68
|
398 sequenceA and sequenceB must be of the same type, either
|
jpayne@68
|
399 strings, lists or Biopython sequence objects.\n
|
jpayne@68
|
400 """
|
jpayne@68
|
401 if match_doc:
|
jpayne@68
|
402 doc += f"\n{match_doc}\n"
|
jpayne@68
|
403 if penalty_doc:
|
jpayne@68
|
404 doc += f"\n{penalty_doc}\n"
|
jpayne@68
|
405 doc += """\
|
jpayne@68
|
406 \nalignments is a list of named tuples (seqA, seqB, score,
|
jpayne@68
|
407 begin, end). seqA and seqB are strings showing the alignment
|
jpayne@68
|
408 between the sequences. score is the score of the alignment.
|
jpayne@68
|
409 begin and end are indexes of seqA and seqB that indicate
|
jpayne@68
|
410 where the alignment occurs.
|
jpayne@68
|
411 """
|
jpayne@68
|
412 self.__doc__ = doc
|
jpayne@68
|
413
|
jpayne@68
|
414 def decode(self, *args, **keywds):
|
jpayne@68
|
415 """Decode the arguments for the _align function.
|
jpayne@68
|
416
|
jpayne@68
|
417 keywds will get passed to it, so translate the arguments
|
jpayne@68
|
418 to this function into forms appropriate for _align.
|
jpayne@68
|
419 """
|
jpayne@68
|
420 keywds = keywds.copy()
|
jpayne@68
|
421
|
jpayne@68
|
422 # Replace possible "keywords" with arguments:
|
jpayne@68
|
423 args += (len(self.param_names) - len(args)) * (None,)
|
jpayne@68
|
424 for key in keywds.copy():
|
jpayne@68
|
425 if key in self.param_names:
|
jpayne@68
|
426 _index = self.param_names.index(key)
|
jpayne@68
|
427 args = args[:_index] + (keywds[key],) + args[_index:]
|
jpayne@68
|
428 del keywds[key]
|
jpayne@68
|
429 args = tuple(arg for arg in args if arg is not None)
|
jpayne@68
|
430
|
jpayne@68
|
431 if len(args) != len(self.param_names):
|
jpayne@68
|
432 raise TypeError(
|
jpayne@68
|
433 "%s takes exactly %d argument (%d given)"
|
jpayne@68
|
434 % (self.function_name, len(self.param_names), len(args))
|
jpayne@68
|
435 )
|
jpayne@68
|
436
|
jpayne@68
|
437 i = 0
|
jpayne@68
|
438 while i < len(self.param_names):
|
jpayne@68
|
439 if self.param_names[i] in [
|
jpayne@68
|
440 "sequenceA",
|
jpayne@68
|
441 "sequenceB",
|
jpayne@68
|
442 "gap_A_fn",
|
jpayne@68
|
443 "gap_B_fn",
|
jpayne@68
|
444 "match_fn",
|
jpayne@68
|
445 ]:
|
jpayne@68
|
446 keywds[self.param_names[i]] = args[i]
|
jpayne@68
|
447 i += 1
|
jpayne@68
|
448 elif self.param_names[i] == "match":
|
jpayne@68
|
449 assert self.param_names[i + 1] == "mismatch"
|
jpayne@68
|
450 match, mismatch = args[i], args[i + 1]
|
jpayne@68
|
451 keywds["match_fn"] = identity_match(match, mismatch)
|
jpayne@68
|
452 i += 2
|
jpayne@68
|
453 elif self.param_names[i] == "match_dict":
|
jpayne@68
|
454 keywds["match_fn"] = dictionary_match(args[i])
|
jpayne@68
|
455 i += 1
|
jpayne@68
|
456 elif self.param_names[i] == "open":
|
jpayne@68
|
457 assert self.param_names[i + 1] == "extend"
|
jpayne@68
|
458 open, extend = args[i], args[i + 1]
|
jpayne@68
|
459 pe = keywds.get("penalize_extend_when_opening", 0)
|
jpayne@68
|
460 keywds["gap_A_fn"] = affine_penalty(open, extend, pe)
|
jpayne@68
|
461 keywds["gap_B_fn"] = affine_penalty(open, extend, pe)
|
jpayne@68
|
462 i += 2
|
jpayne@68
|
463 elif self.param_names[i] == "openA":
|
jpayne@68
|
464 assert self.param_names[i + 3] == "extendB"
|
jpayne@68
|
465 openA, extendA, openB, extendB = args[i : i + 4]
|
jpayne@68
|
466 pe = keywds.get("penalize_extend_when_opening", 0)
|
jpayne@68
|
467 keywds["gap_A_fn"] = affine_penalty(openA, extendA, pe)
|
jpayne@68
|
468 keywds["gap_B_fn"] = affine_penalty(openB, extendB, pe)
|
jpayne@68
|
469 i += 4
|
jpayne@68
|
470 else:
|
jpayne@68
|
471 raise ValueError(f"unknown parameter {self.param_names[i]!r}")
|
jpayne@68
|
472
|
jpayne@68
|
473 # Here are the default parameters for _align. Assign
|
jpayne@68
|
474 # these to keywds, unless already specified.
|
jpayne@68
|
475 pe = keywds.get("penalize_extend_when_opening", 0)
|
jpayne@68
|
476 default_params = [
|
jpayne@68
|
477 ("match_fn", identity_match(1, 0)),
|
jpayne@68
|
478 ("gap_A_fn", affine_penalty(0, 0, pe)),
|
jpayne@68
|
479 ("gap_B_fn", affine_penalty(0, 0, pe)),
|
jpayne@68
|
480 ("penalize_extend_when_opening", 0),
|
jpayne@68
|
481 ("penalize_end_gaps", self.align_type == "global"),
|
jpayne@68
|
482 ("align_globally", self.align_type == "global"),
|
jpayne@68
|
483 ("gap_char", "-"),
|
jpayne@68
|
484 ("force_generic", 0),
|
jpayne@68
|
485 ("score_only", 0),
|
jpayne@68
|
486 ("one_alignment_only", 0),
|
jpayne@68
|
487 ]
|
jpayne@68
|
488 for name, default in default_params:
|
jpayne@68
|
489 keywds[name] = keywds.get(name, default)
|
jpayne@68
|
490 value = keywds["penalize_end_gaps"]
|
jpayne@68
|
491 try:
|
jpayne@68
|
492 n = len(value)
|
jpayne@68
|
493 except TypeError:
|
jpayne@68
|
494 keywds["penalize_end_gaps"] = tuple([value] * 2)
|
jpayne@68
|
495 else:
|
jpayne@68
|
496 assert n == 2
|
jpayne@68
|
497 return keywds
|
jpayne@68
|
498
|
jpayne@68
|
499 def __call__(self, *args, **keywds):
|
jpayne@68
|
500 """Call the alignment instance already created."""
|
jpayne@68
|
501 keywds = self.decode(*args, **keywds)
|
jpayne@68
|
502 return _align(**keywds)
|
jpayne@68
|
503
|
jpayne@68
|
504 def __getattr__(self, attr):
|
jpayne@68
|
505 """Call alignment_function() to check and decode the attributes."""
|
jpayne@68
|
506 # The following 'magic' is needed to rewrite the class docstring
|
jpayne@68
|
507 # dynamically:
|
jpayne@68
|
508 wrapper = self.alignment_function(attr)
|
jpayne@68
|
509 wrapper_type = type(wrapper)
|
jpayne@68
|
510 wrapper_dict = wrapper_type.__dict__.copy()
|
jpayne@68
|
511 wrapper_dict["__doc__"] = wrapper.__doc__
|
jpayne@68
|
512 new_alignment_function = type("alignment_function", (object,), wrapper_dict)
|
jpayne@68
|
513
|
jpayne@68
|
514 return new_alignment_function(attr)
|
jpayne@68
|
515
|
jpayne@68
|
516
|
jpayne@68
|
517 align = align()
|
jpayne@68
|
518
|
jpayne@68
|
519
|
jpayne@68
|
520 def _align(
|
jpayne@68
|
521 sequenceA,
|
jpayne@68
|
522 sequenceB,
|
jpayne@68
|
523 match_fn,
|
jpayne@68
|
524 gap_A_fn,
|
jpayne@68
|
525 gap_B_fn,
|
jpayne@68
|
526 penalize_extend_when_opening,
|
jpayne@68
|
527 penalize_end_gaps,
|
jpayne@68
|
528 align_globally,
|
jpayne@68
|
529 gap_char,
|
jpayne@68
|
530 force_generic,
|
jpayne@68
|
531 score_only,
|
jpayne@68
|
532 one_alignment_only,
|
jpayne@68
|
533 ):
|
jpayne@68
|
534 """Return optimal alignments between two sequences (PRIVATE).
|
jpayne@68
|
535
|
jpayne@68
|
536 This method either returns a list of optimal alignments (with the same
|
jpayne@68
|
537 score) or just the optimal score.
|
jpayne@68
|
538 """
|
jpayne@68
|
539 if not sequenceA or not sequenceB:
|
jpayne@68
|
540 return []
|
jpayne@68
|
541 try:
|
jpayne@68
|
542 sequenceA + gap_char
|
jpayne@68
|
543 sequenceB + gap_char
|
jpayne@68
|
544 except TypeError:
|
jpayne@68
|
545 raise TypeError(
|
jpayne@68
|
546 "both sequences must be of the same type, either "
|
jpayne@68
|
547 "string/sequence object or list. Gap character must "
|
jpayne@68
|
548 "fit the sequence type (string or list)"
|
jpayne@68
|
549 )
|
jpayne@68
|
550
|
jpayne@68
|
551 if not isinstance(sequenceA, list):
|
jpayne@68
|
552 sequenceA = str(sequenceA)
|
jpayne@68
|
553 if not isinstance(sequenceB, list):
|
jpayne@68
|
554 sequenceB = str(sequenceB)
|
jpayne@68
|
555 if not align_globally and (penalize_end_gaps[0] or penalize_end_gaps[1]):
|
jpayne@68
|
556 warnings.warn(
|
jpayne@68
|
557 '"penalize_end_gaps" should not be used in local '
|
jpayne@68
|
558 "alignments. The resulting score may be wrong.",
|
jpayne@68
|
559 BiopythonWarning,
|
jpayne@68
|
560 )
|
jpayne@68
|
561
|
jpayne@68
|
562 if (
|
jpayne@68
|
563 (not force_generic)
|
jpayne@68
|
564 and isinstance(gap_A_fn, affine_penalty)
|
jpayne@68
|
565 and isinstance(gap_B_fn, affine_penalty)
|
jpayne@68
|
566 ):
|
jpayne@68
|
567 open_A, extend_A = gap_A_fn.open, gap_A_fn.extend
|
jpayne@68
|
568 open_B, extend_B = gap_B_fn.open, gap_B_fn.extend
|
jpayne@68
|
569 matrices = _make_score_matrix_fast(
|
jpayne@68
|
570 sequenceA,
|
jpayne@68
|
571 sequenceB,
|
jpayne@68
|
572 match_fn,
|
jpayne@68
|
573 open_A,
|
jpayne@68
|
574 extend_A,
|
jpayne@68
|
575 open_B,
|
jpayne@68
|
576 extend_B,
|
jpayne@68
|
577 penalize_extend_when_opening,
|
jpayne@68
|
578 penalize_end_gaps,
|
jpayne@68
|
579 align_globally,
|
jpayne@68
|
580 score_only,
|
jpayne@68
|
581 )
|
jpayne@68
|
582 else:
|
jpayne@68
|
583 matrices = _make_score_matrix_generic(
|
jpayne@68
|
584 sequenceA,
|
jpayne@68
|
585 sequenceB,
|
jpayne@68
|
586 match_fn,
|
jpayne@68
|
587 gap_A_fn,
|
jpayne@68
|
588 gap_B_fn,
|
jpayne@68
|
589 penalize_end_gaps,
|
jpayne@68
|
590 align_globally,
|
jpayne@68
|
591 score_only,
|
jpayne@68
|
592 )
|
jpayne@68
|
593
|
jpayne@68
|
594 score_matrix, trace_matrix, best_score = matrices
|
jpayne@68
|
595
|
jpayne@68
|
596 # print("SCORE %s" % print_matrix(score_matrix))
|
jpayne@68
|
597 # print("TRACEBACK %s" % print_matrix(trace_matrix))
|
jpayne@68
|
598
|
jpayne@68
|
599 # If they only want the score, then return it.
|
jpayne@68
|
600 if score_only:
|
jpayne@68
|
601 return best_score
|
jpayne@68
|
602
|
jpayne@68
|
603 starts = _find_start(score_matrix, best_score, align_globally)
|
jpayne@68
|
604
|
jpayne@68
|
605 # Recover the alignments and return them.
|
jpayne@68
|
606 alignments = _recover_alignments(
|
jpayne@68
|
607 sequenceA,
|
jpayne@68
|
608 sequenceB,
|
jpayne@68
|
609 starts,
|
jpayne@68
|
610 best_score,
|
jpayne@68
|
611 score_matrix,
|
jpayne@68
|
612 trace_matrix,
|
jpayne@68
|
613 align_globally,
|
jpayne@68
|
614 gap_char,
|
jpayne@68
|
615 one_alignment_only,
|
jpayne@68
|
616 gap_A_fn,
|
jpayne@68
|
617 gap_B_fn,
|
jpayne@68
|
618 )
|
jpayne@68
|
619 if not alignments:
|
jpayne@68
|
620 # This may happen, see recover_alignments for explanation
|
jpayne@68
|
621 score_matrix, trace_matrix = _reverse_matrices(score_matrix, trace_matrix)
|
jpayne@68
|
622 starts = [(z, (y, x)) for z, (x, y) in starts]
|
jpayne@68
|
623 alignments = _recover_alignments(
|
jpayne@68
|
624 sequenceB,
|
jpayne@68
|
625 sequenceA,
|
jpayne@68
|
626 starts,
|
jpayne@68
|
627 best_score,
|
jpayne@68
|
628 score_matrix,
|
jpayne@68
|
629 trace_matrix,
|
jpayne@68
|
630 align_globally,
|
jpayne@68
|
631 gap_char,
|
jpayne@68
|
632 one_alignment_only,
|
jpayne@68
|
633 gap_B_fn,
|
jpayne@68
|
634 gap_A_fn,
|
jpayne@68
|
635 reverse=True,
|
jpayne@68
|
636 )
|
jpayne@68
|
637 return alignments
|
jpayne@68
|
638
|
jpayne@68
|
639
|
jpayne@68
|
640 def _make_score_matrix_generic(
|
jpayne@68
|
641 sequenceA,
|
jpayne@68
|
642 sequenceB,
|
jpayne@68
|
643 match_fn,
|
jpayne@68
|
644 gap_A_fn,
|
jpayne@68
|
645 gap_B_fn,
|
jpayne@68
|
646 penalize_end_gaps,
|
jpayne@68
|
647 align_globally,
|
jpayne@68
|
648 score_only,
|
jpayne@68
|
649 ):
|
jpayne@68
|
650 """Generate a score and traceback matrix (PRIVATE).
|
jpayne@68
|
651
|
jpayne@68
|
652 This implementation according to Needleman-Wunsch allows the usage of
|
jpayne@68
|
653 general gap functions and is rather slow. It is automatically called if
|
jpayne@68
|
654 you define your own gap functions. You can force the usage of this method
|
jpayne@68
|
655 with ``force_generic=True``.
|
jpayne@68
|
656 """
|
jpayne@68
|
657 local_max_score = 0
|
jpayne@68
|
658 # Create the score and traceback matrices. These should be in the
|
jpayne@68
|
659 # shape:
|
jpayne@68
|
660 # sequenceA (down) x sequenceB (across)
|
jpayne@68
|
661 lenA, lenB = len(sequenceA), len(sequenceB)
|
jpayne@68
|
662 score_matrix, trace_matrix = [], []
|
jpayne@68
|
663 for i in range(lenA + 1):
|
jpayne@68
|
664 score_matrix.append([None] * (lenB + 1))
|
jpayne@68
|
665 if not score_only:
|
jpayne@68
|
666 trace_matrix.append([None] * (lenB + 1))
|
jpayne@68
|
667
|
jpayne@68
|
668 # Initialize first row and column with gap scores. This is like opening up
|
jpayne@68
|
669 # i gaps at the beginning of sequence A or B.
|
jpayne@68
|
670 for i in range(lenA + 1):
|
jpayne@68
|
671 if penalize_end_gaps[1]: # [1]:gap in sequence B
|
jpayne@68
|
672 score = gap_B_fn(0, i)
|
jpayne@68
|
673 else:
|
jpayne@68
|
674 score = 0.0
|
jpayne@68
|
675 score_matrix[i][0] = score
|
jpayne@68
|
676 for i in range(lenB + 1):
|
jpayne@68
|
677 if penalize_end_gaps[0]: # [0]:gap in sequence A
|
jpayne@68
|
678 score = gap_A_fn(0, i)
|
jpayne@68
|
679 else:
|
jpayne@68
|
680 score = 0.0
|
jpayne@68
|
681 score_matrix[0][i] = score
|
jpayne@68
|
682
|
jpayne@68
|
683 # Fill in the score matrix. Each position in the matrix
|
jpayne@68
|
684 # represents an alignment between a character from sequence A to
|
jpayne@68
|
685 # one in sequence B. As I iterate through the matrix, find the
|
jpayne@68
|
686 # alignment by choose the best of:
|
jpayne@68
|
687 # 1) extending a previous alignment without gaps
|
jpayne@68
|
688 # 2) adding a gap in sequenceA
|
jpayne@68
|
689 # 3) adding a gap in sequenceB
|
jpayne@68
|
690 for row in range(1, lenA + 1):
|
jpayne@68
|
691 for col in range(1, lenB + 1):
|
jpayne@68
|
692 # First, calculate the score that would occur by extending
|
jpayne@68
|
693 # the alignment without gaps.
|
jpayne@68
|
694 # fmt: off
|
jpayne@68
|
695 nogap_score = (
|
jpayne@68
|
696 score_matrix[row - 1][col - 1]
|
jpayne@68
|
697 + match_fn(sequenceA[row - 1], sequenceB[col - 1])
|
jpayne@68
|
698 )
|
jpayne@68
|
699
|
jpayne@68
|
700 # fmt: on
|
jpayne@68
|
701 # Try to find a better score by opening gaps in sequenceA.
|
jpayne@68
|
702 # Do this by checking alignments from each column in the row.
|
jpayne@68
|
703 # Each column represents a different character to align from,
|
jpayne@68
|
704 # and thus a different length gap.
|
jpayne@68
|
705 # Although the gap function does not distinguish between opening
|
jpayne@68
|
706 # and extending a gap, we distinguish them for the backtrace.
|
jpayne@68
|
707 if not penalize_end_gaps[0] and row == lenA:
|
jpayne@68
|
708 row_open = score_matrix[row][col - 1]
|
jpayne@68
|
709 row_extend = max(score_matrix[row][x] for x in range(col))
|
jpayne@68
|
710 else:
|
jpayne@68
|
711 row_open = score_matrix[row][col - 1] + gap_A_fn(row, 1)
|
jpayne@68
|
712 row_extend = max(
|
jpayne@68
|
713 score_matrix[row][x] + gap_A_fn(row, col - x) for x in range(col)
|
jpayne@68
|
714 )
|
jpayne@68
|
715
|
jpayne@68
|
716 # Try to find a better score by opening gaps in sequenceB.
|
jpayne@68
|
717 if not penalize_end_gaps[1] and col == lenB:
|
jpayne@68
|
718 col_open = score_matrix[row - 1][col]
|
jpayne@68
|
719 col_extend = max(score_matrix[x][col] for x in range(row))
|
jpayne@68
|
720 else:
|
jpayne@68
|
721 col_open = score_matrix[row - 1][col] + gap_B_fn(col, 1)
|
jpayne@68
|
722 col_extend = max(
|
jpayne@68
|
723 score_matrix[x][col] + gap_B_fn(col, row - x) for x in range(row)
|
jpayne@68
|
724 )
|
jpayne@68
|
725
|
jpayne@68
|
726 best_score = max(nogap_score, row_open, row_extend, col_open, col_extend)
|
jpayne@68
|
727 local_max_score = max(local_max_score, best_score)
|
jpayne@68
|
728 if not align_globally and best_score < 0:
|
jpayne@68
|
729 score_matrix[row][col] = 0.0
|
jpayne@68
|
730 else:
|
jpayne@68
|
731 score_matrix[row][col] = best_score
|
jpayne@68
|
732
|
jpayne@68
|
733 # The backtrace is encoded binary. See _make_score_matrix_fast
|
jpayne@68
|
734 # for details.
|
jpayne@68
|
735 if not score_only:
|
jpayne@68
|
736 trace_score = 0
|
jpayne@68
|
737 if rint(nogap_score) == rint(best_score):
|
jpayne@68
|
738 trace_score += 2
|
jpayne@68
|
739 if rint(row_open) == rint(best_score):
|
jpayne@68
|
740 trace_score += 1
|
jpayne@68
|
741 if rint(row_extend) == rint(best_score):
|
jpayne@68
|
742 trace_score += 8
|
jpayne@68
|
743 if rint(col_open) == rint(best_score):
|
jpayne@68
|
744 trace_score += 4
|
jpayne@68
|
745 if rint(col_extend) == rint(best_score):
|
jpayne@68
|
746 trace_score += 16
|
jpayne@68
|
747 trace_matrix[row][col] = trace_score
|
jpayne@68
|
748
|
jpayne@68
|
749 if not align_globally:
|
jpayne@68
|
750 best_score = local_max_score
|
jpayne@68
|
751
|
jpayne@68
|
752 return score_matrix, trace_matrix, best_score
|
jpayne@68
|
753
|
jpayne@68
|
754
|
jpayne@68
|
755 def _make_score_matrix_fast(
|
jpayne@68
|
756 sequenceA,
|
jpayne@68
|
757 sequenceB,
|
jpayne@68
|
758 match_fn,
|
jpayne@68
|
759 open_A,
|
jpayne@68
|
760 extend_A,
|
jpayne@68
|
761 open_B,
|
jpayne@68
|
762 extend_B,
|
jpayne@68
|
763 penalize_extend_when_opening,
|
jpayne@68
|
764 penalize_end_gaps,
|
jpayne@68
|
765 align_globally,
|
jpayne@68
|
766 score_only,
|
jpayne@68
|
767 ):
|
jpayne@68
|
768 """Generate a score and traceback matrix according to Gotoh (PRIVATE).
|
jpayne@68
|
769
|
jpayne@68
|
770 This is an implementation of the Needleman-Wunsch dynamic programming
|
jpayne@68
|
771 algorithm as modified by Gotoh, implementing affine gap penalties.
|
jpayne@68
|
772 In short, we have three matrices, holding scores for alignments ending
|
jpayne@68
|
773 in (1) a match/mismatch, (2) a gap in sequence A, and (3) a gap in
|
jpayne@68
|
774 sequence B, respectively. However, we can combine them in one matrix,
|
jpayne@68
|
775 which holds the best scores, and store only those values from the
|
jpayne@68
|
776 other matrices that are actually used for the next step of calculation.
|
jpayne@68
|
777 The traceback matrix holds the positions for backtracing the alignment.
|
jpayne@68
|
778 """
|
jpayne@68
|
779 first_A_gap = calc_affine_penalty(1, open_A, extend_A, penalize_extend_when_opening)
|
jpayne@68
|
780 first_B_gap = calc_affine_penalty(1, open_B, extend_B, penalize_extend_when_opening)
|
jpayne@68
|
781 local_max_score = 0
|
jpayne@68
|
782
|
jpayne@68
|
783 # Create the score and traceback matrices. These should be in the
|
jpayne@68
|
784 # shape:
|
jpayne@68
|
785 # sequenceA (down) x sequenceB (across)
|
jpayne@68
|
786 lenA, lenB = len(sequenceA), len(sequenceB)
|
jpayne@68
|
787 score_matrix, trace_matrix = [], []
|
jpayne@68
|
788 for i in range(lenA + 1):
|
jpayne@68
|
789 score_matrix.append([None] * (lenB + 1))
|
jpayne@68
|
790 if not score_only:
|
jpayne@68
|
791 trace_matrix.append([None] * (lenB + 1))
|
jpayne@68
|
792
|
jpayne@68
|
793 # Initialize first row and column with gap scores. This is like opening up
|
jpayne@68
|
794 # i gaps at the beginning of sequence A or B.
|
jpayne@68
|
795 for i in range(lenA + 1):
|
jpayne@68
|
796 if penalize_end_gaps[1]: # [1]:gap in sequence B
|
jpayne@68
|
797 score = calc_affine_penalty(
|
jpayne@68
|
798 i, open_B, extend_B, penalize_extend_when_opening
|
jpayne@68
|
799 )
|
jpayne@68
|
800 else:
|
jpayne@68
|
801 score = 0
|
jpayne@68
|
802 score_matrix[i][0] = score
|
jpayne@68
|
803 for i in range(lenB + 1):
|
jpayne@68
|
804 if penalize_end_gaps[0]: # [0]:gap in sequence A
|
jpayne@68
|
805 score = calc_affine_penalty(
|
jpayne@68
|
806 i, open_A, extend_A, penalize_extend_when_opening
|
jpayne@68
|
807 )
|
jpayne@68
|
808 else:
|
jpayne@68
|
809 score = 0
|
jpayne@68
|
810 score_matrix[0][i] = score
|
jpayne@68
|
811
|
jpayne@68
|
812 # Now initialize the col 'matrix'. Actually this is only a one dimensional
|
jpayne@68
|
813 # list, since we only need the col scores from the last row.
|
jpayne@68
|
814 col_score = [0] # Best score, if actual alignment ends with gap in seqB
|
jpayne@68
|
815 for i in range(1, lenB + 1):
|
jpayne@68
|
816 col_score.append(
|
jpayne@68
|
817 calc_affine_penalty(i, 2 * open_B, extend_B, penalize_extend_when_opening)
|
jpayne@68
|
818 )
|
jpayne@68
|
819
|
jpayne@68
|
820 # The row 'matrix' is calculated on the fly. Here we only need the actual
|
jpayne@68
|
821 # score.
|
jpayne@68
|
822 # Now, filling up the score and traceback matrices:
|
jpayne@68
|
823 for row in range(1, lenA + 1):
|
jpayne@68
|
824 row_score = calc_affine_penalty(
|
jpayne@68
|
825 row, 2 * open_A, extend_A, penalize_extend_when_opening
|
jpayne@68
|
826 )
|
jpayne@68
|
827 for col in range(1, lenB + 1):
|
jpayne@68
|
828 # Calculate the score that would occur by extending the
|
jpayne@68
|
829 # alignment without gaps.
|
jpayne@68
|
830 # fmt: off
|
jpayne@68
|
831 nogap_score = (
|
jpayne@68
|
832 score_matrix[row - 1][col - 1]
|
jpayne@68
|
833 + match_fn(sequenceA[row - 1], sequenceB[col - 1])
|
jpayne@68
|
834 )
|
jpayne@68
|
835 # fmt: on
|
jpayne@68
|
836 # Check the score that would occur if there were a gap in
|
jpayne@68
|
837 # sequence A. This could come from opening a new gap or
|
jpayne@68
|
838 # extending an existing one.
|
jpayne@68
|
839 # A gap in sequence A can also be opened if it follows a gap in
|
jpayne@68
|
840 # sequence B: A-
|
jpayne@68
|
841 # -B
|
jpayne@68
|
842 if not penalize_end_gaps[0] and row == lenA:
|
jpayne@68
|
843 row_open = score_matrix[row][col - 1]
|
jpayne@68
|
844 row_extend = row_score
|
jpayne@68
|
845 else:
|
jpayne@68
|
846 row_open = score_matrix[row][col - 1] + first_A_gap
|
jpayne@68
|
847 row_extend = row_score + extend_A
|
jpayne@68
|
848 row_score = max(row_open, row_extend)
|
jpayne@68
|
849
|
jpayne@68
|
850 # The same for sequence B:
|
jpayne@68
|
851 if not penalize_end_gaps[1] and col == lenB:
|
jpayne@68
|
852 col_open = score_matrix[row - 1][col]
|
jpayne@68
|
853 col_extend = col_score[col]
|
jpayne@68
|
854 else:
|
jpayne@68
|
855 col_open = score_matrix[row - 1][col] + first_B_gap
|
jpayne@68
|
856 col_extend = col_score[col] + extend_B
|
jpayne@68
|
857 col_score[col] = max(col_open, col_extend)
|
jpayne@68
|
858
|
jpayne@68
|
859 best_score = max(nogap_score, col_score[col], row_score)
|
jpayne@68
|
860 local_max_score = max(local_max_score, best_score)
|
jpayne@68
|
861 if not align_globally and best_score < 0:
|
jpayne@68
|
862 score_matrix[row][col] = 0
|
jpayne@68
|
863 else:
|
jpayne@68
|
864 score_matrix[row][col] = best_score
|
jpayne@68
|
865
|
jpayne@68
|
866 # Now the trace_matrix. The edges of the backtrace are encoded
|
jpayne@68
|
867 # binary: 1 = open gap in seqA, 2 = match/mismatch of seqA and
|
jpayne@68
|
868 # seqB, 4 = open gap in seqB, 8 = extend gap in seqA, and
|
jpayne@68
|
869 # 16 = extend gap in seqB. This values can be summed up.
|
jpayne@68
|
870 # Thus, the trace score 7 means that the best score can either
|
jpayne@68
|
871 # come from opening a gap in seqA (=1), pairing two characters
|
jpayne@68
|
872 # of seqA and seqB (+2=3) or opening a gap in seqB (+4=7).
|
jpayne@68
|
873 # However, if we only want the score we don't care about the trace.
|
jpayne@68
|
874 if not score_only:
|
jpayne@68
|
875 row_score_rint = rint(row_score)
|
jpayne@68
|
876 col_score_rint = rint(col_score[col])
|
jpayne@68
|
877 row_trace_score = 0
|
jpayne@68
|
878 col_trace_score = 0
|
jpayne@68
|
879 if rint(row_open) == row_score_rint:
|
jpayne@68
|
880 row_trace_score += 1 # Open gap in seqA
|
jpayne@68
|
881 if rint(row_extend) == row_score_rint:
|
jpayne@68
|
882 row_trace_score += 8 # Extend gap in seqA
|
jpayne@68
|
883 if rint(col_open) == col_score_rint:
|
jpayne@68
|
884 col_trace_score += 4 # Open gap in seqB
|
jpayne@68
|
885 if rint(col_extend) == col_score_rint:
|
jpayne@68
|
886 col_trace_score += 16 # Extend gap in seqB
|
jpayne@68
|
887
|
jpayne@68
|
888 trace_score = 0
|
jpayne@68
|
889 best_score_rint = rint(best_score)
|
jpayne@68
|
890 if rint(nogap_score) == best_score_rint:
|
jpayne@68
|
891 trace_score += 2 # Align seqA with seqB
|
jpayne@68
|
892 if row_score_rint == best_score_rint:
|
jpayne@68
|
893 trace_score += row_trace_score
|
jpayne@68
|
894 if col_score_rint == best_score_rint:
|
jpayne@68
|
895 trace_score += col_trace_score
|
jpayne@68
|
896 trace_matrix[row][col] = trace_score
|
jpayne@68
|
897
|
jpayne@68
|
898 if not align_globally:
|
jpayne@68
|
899 best_score = local_max_score
|
jpayne@68
|
900
|
jpayne@68
|
901 return score_matrix, trace_matrix, best_score
|
jpayne@68
|
902
|
jpayne@68
|
903
|
jpayne@68
|
904 def _recover_alignments(
|
jpayne@68
|
905 sequenceA,
|
jpayne@68
|
906 sequenceB,
|
jpayne@68
|
907 starts,
|
jpayne@68
|
908 best_score,
|
jpayne@68
|
909 score_matrix,
|
jpayne@68
|
910 trace_matrix,
|
jpayne@68
|
911 align_globally,
|
jpayne@68
|
912 gap_char,
|
jpayne@68
|
913 one_alignment_only,
|
jpayne@68
|
914 gap_A_fn,
|
jpayne@68
|
915 gap_B_fn,
|
jpayne@68
|
916 reverse=False,
|
jpayne@68
|
917 ):
|
jpayne@68
|
918 """Do the backtracing and return a list of alignments (PRIVATE).
|
jpayne@68
|
919
|
jpayne@68
|
920 Recover the alignments by following the traceback matrix. This
|
jpayne@68
|
921 is a recursive procedure, but it's implemented here iteratively
|
jpayne@68
|
922 with a stack.
|
jpayne@68
|
923
|
jpayne@68
|
924 sequenceA and sequenceB may be sequences, including strings,
|
jpayne@68
|
925 lists, or list-like objects. In order to preserve the type of
|
jpayne@68
|
926 the object, we need to use slices on the sequences instead of
|
jpayne@68
|
927 indexes. For example, sequenceA[row] may return a type that's
|
jpayne@68
|
928 not compatible with sequenceA, e.g. if sequenceA is a list and
|
jpayne@68
|
929 sequenceA[row] is a string. Thus, avoid using indexes and use
|
jpayne@68
|
930 slices, e.g. sequenceA[row:row+1]. Assume that client-defined
|
jpayne@68
|
931 sequence classes preserve these semantics.
|
jpayne@68
|
932 """
|
jpayne@68
|
933 lenA, lenB = len(sequenceA), len(sequenceB)
|
jpayne@68
|
934 ali_seqA, ali_seqB = sequenceA[0:0], sequenceB[0:0]
|
jpayne@68
|
935 tracebacks = []
|
jpayne@68
|
936 in_process = []
|
jpayne@68
|
937
|
jpayne@68
|
938 for start in starts:
|
jpayne@68
|
939 score, (row, col) = start
|
jpayne@68
|
940 begin = 0
|
jpayne@68
|
941 if align_globally:
|
jpayne@68
|
942 end = None
|
jpayne@68
|
943 else:
|
jpayne@68
|
944 # If this start is a zero-extension: don't start here!
|
jpayne@68
|
945 if (score, (row - 1, col - 1)) in starts:
|
jpayne@68
|
946 continue
|
jpayne@68
|
947 # Local alignments should start with a positive score!
|
jpayne@68
|
948 if score <= 0:
|
jpayne@68
|
949 continue
|
jpayne@68
|
950 # Local alignments should not end with a gap!:
|
jpayne@68
|
951 trace = trace_matrix[row][col]
|
jpayne@68
|
952 if (trace - trace % 2) % 4 == 2: # Trace contains 'nogap', fine!
|
jpayne@68
|
953 trace_matrix[row][col] = 2
|
jpayne@68
|
954 # If not, don't start here!
|
jpayne@68
|
955 else:
|
jpayne@68
|
956 continue
|
jpayne@68
|
957 end = -max(lenA - row, lenB - col)
|
jpayne@68
|
958 if not end:
|
jpayne@68
|
959 end = None
|
jpayne@68
|
960 col_distance = lenB - col
|
jpayne@68
|
961 row_distance = lenA - row
|
jpayne@68
|
962
|
jpayne@68
|
963 # fmt: off
|
jpayne@68
|
964 ali_seqA = (
|
jpayne@68
|
965 (col_distance - row_distance) * gap_char
|
jpayne@68
|
966 + sequenceA[lenA - 1 : row - 1 : -1]
|
jpayne@68
|
967 )
|
jpayne@68
|
968 ali_seqB = (
|
jpayne@68
|
969 (row_distance - col_distance) * gap_char
|
jpayne@68
|
970 + sequenceB[lenB - 1 : col - 1 : -1]
|
jpayne@68
|
971 )
|
jpayne@68
|
972 # fmt: on
|
jpayne@68
|
973 in_process += [
|
jpayne@68
|
974 (ali_seqA, ali_seqB, end, row, col, False, trace_matrix[row][col])
|
jpayne@68
|
975 ]
|
jpayne@68
|
976 while in_process and len(tracebacks) < MAX_ALIGNMENTS:
|
jpayne@68
|
977 # Although we allow a gap in seqB to be followed by a gap in seqA,
|
jpayne@68
|
978 # we don't want to allow it the other way round, since this would
|
jpayne@68
|
979 # give redundant alignments of type: A- vs. -A
|
jpayne@68
|
980 # -B B-
|
jpayne@68
|
981 # Thus we need to keep track if a gap in seqA was opened (col_gap)
|
jpayne@68
|
982 # and stop the backtrace (dead_end) if a gap in seqB follows.
|
jpayne@68
|
983 #
|
jpayne@68
|
984 # Attention: This may fail, if the gap-penalties for both strands are
|
jpayne@68
|
985 # different. In this case the second alignment may be the only optimal
|
jpayne@68
|
986 # alignment. Thus it can happen that no alignment is returned. For
|
jpayne@68
|
987 # this case a workaround was implemented, which reverses the input and
|
jpayne@68
|
988 # the matrices (this happens in _reverse_matrices) and repeats the
|
jpayne@68
|
989 # backtrace. The variable 'reverse' keeps track of this.
|
jpayne@68
|
990 dead_end = False
|
jpayne@68
|
991 ali_seqA, ali_seqB, end, row, col, col_gap, trace = in_process.pop()
|
jpayne@68
|
992
|
jpayne@68
|
993 while (row > 0 or col > 0) and not dead_end:
|
jpayne@68
|
994 cache = (ali_seqA[:], ali_seqB[:], end, row, col, col_gap)
|
jpayne@68
|
995
|
jpayne@68
|
996 # If trace is empty we have reached at least one border of the
|
jpayne@68
|
997 # matrix or the end of a local alignment. Just add the rest of
|
jpayne@68
|
998 # the sequence(s) and fill with gaps if necessary.
|
jpayne@68
|
999 if not trace:
|
jpayne@68
|
1000 if col and col_gap:
|
jpayne@68
|
1001 dead_end = True
|
jpayne@68
|
1002 else:
|
jpayne@68
|
1003 ali_seqA, ali_seqB = _finish_backtrace(
|
jpayne@68
|
1004 sequenceA, sequenceB, ali_seqA, ali_seqB, row, col, gap_char
|
jpayne@68
|
1005 )
|
jpayne@68
|
1006 break
|
jpayne@68
|
1007 elif trace % 2 == 1: # = row open = open gap in seqA
|
jpayne@68
|
1008 trace -= 1
|
jpayne@68
|
1009 if col_gap:
|
jpayne@68
|
1010 dead_end = True
|
jpayne@68
|
1011 else:
|
jpayne@68
|
1012 col -= 1
|
jpayne@68
|
1013 ali_seqA += gap_char
|
jpayne@68
|
1014 ali_seqB += sequenceB[col : col + 1]
|
jpayne@68
|
1015 col_gap = False
|
jpayne@68
|
1016 elif trace % 4 == 2: # = match/mismatch of seqA with seqB
|
jpayne@68
|
1017 trace -= 2
|
jpayne@68
|
1018 row -= 1
|
jpayne@68
|
1019 col -= 1
|
jpayne@68
|
1020 ali_seqA += sequenceA[row : row + 1]
|
jpayne@68
|
1021 ali_seqB += sequenceB[col : col + 1]
|
jpayne@68
|
1022 col_gap = False
|
jpayne@68
|
1023 elif trace % 8 == 4: # = col open = open gap in seqB
|
jpayne@68
|
1024 trace -= 4
|
jpayne@68
|
1025 row -= 1
|
jpayne@68
|
1026 ali_seqA += sequenceA[row : row + 1]
|
jpayne@68
|
1027 ali_seqB += gap_char
|
jpayne@68
|
1028 col_gap = True
|
jpayne@68
|
1029 elif trace in (8, 24): # = row extend = extend gap in seqA
|
jpayne@68
|
1030 trace -= 8
|
jpayne@68
|
1031 if col_gap:
|
jpayne@68
|
1032 dead_end = True
|
jpayne@68
|
1033 else:
|
jpayne@68
|
1034 col_gap = False
|
jpayne@68
|
1035 # We need to find the starting point of the extended gap
|
jpayne@68
|
1036 x = _find_gap_open(
|
jpayne@68
|
1037 sequenceA,
|
jpayne@68
|
1038 sequenceB,
|
jpayne@68
|
1039 ali_seqA,
|
jpayne@68
|
1040 ali_seqB,
|
jpayne@68
|
1041 end,
|
jpayne@68
|
1042 row,
|
jpayne@68
|
1043 col,
|
jpayne@68
|
1044 col_gap,
|
jpayne@68
|
1045 gap_char,
|
jpayne@68
|
1046 score_matrix,
|
jpayne@68
|
1047 trace_matrix,
|
jpayne@68
|
1048 in_process,
|
jpayne@68
|
1049 gap_A_fn,
|
jpayne@68
|
1050 col,
|
jpayne@68
|
1051 row,
|
jpayne@68
|
1052 "col",
|
jpayne@68
|
1053 best_score,
|
jpayne@68
|
1054 align_globally,
|
jpayne@68
|
1055 )
|
jpayne@68
|
1056 ali_seqA, ali_seqB, row, col, in_process, dead_end = x
|
jpayne@68
|
1057 elif trace == 16: # = col extend = extend gap in seqB
|
jpayne@68
|
1058 trace -= 16
|
jpayne@68
|
1059 col_gap = True
|
jpayne@68
|
1060 x = _find_gap_open(
|
jpayne@68
|
1061 sequenceA,
|
jpayne@68
|
1062 sequenceB,
|
jpayne@68
|
1063 ali_seqA,
|
jpayne@68
|
1064 ali_seqB,
|
jpayne@68
|
1065 end,
|
jpayne@68
|
1066 row,
|
jpayne@68
|
1067 col,
|
jpayne@68
|
1068 col_gap,
|
jpayne@68
|
1069 gap_char,
|
jpayne@68
|
1070 score_matrix,
|
jpayne@68
|
1071 trace_matrix,
|
jpayne@68
|
1072 in_process,
|
jpayne@68
|
1073 gap_B_fn,
|
jpayne@68
|
1074 row,
|
jpayne@68
|
1075 col,
|
jpayne@68
|
1076 "row",
|
jpayne@68
|
1077 best_score,
|
jpayne@68
|
1078 align_globally,
|
jpayne@68
|
1079 )
|
jpayne@68
|
1080 ali_seqA, ali_seqB, row, col, in_process, dead_end = x
|
jpayne@68
|
1081
|
jpayne@68
|
1082 if trace: # There is another path to follow...
|
jpayne@68
|
1083 cache += (trace,)
|
jpayne@68
|
1084 in_process.append(cache)
|
jpayne@68
|
1085 trace = trace_matrix[row][col]
|
jpayne@68
|
1086 if not align_globally:
|
jpayne@68
|
1087 if score_matrix[row][col] == best_score:
|
jpayne@68
|
1088 # We have gone through a 'zero-score' extension, discard it
|
jpayne@68
|
1089 dead_end = True
|
jpayne@68
|
1090 elif score_matrix[row][col] <= 0:
|
jpayne@68
|
1091 # We have reached the end of the backtrace
|
jpayne@68
|
1092 begin = max(row, col)
|
jpayne@68
|
1093 trace = 0
|
jpayne@68
|
1094 if not dead_end:
|
jpayne@68
|
1095 if not reverse:
|
jpayne@68
|
1096 tracebacks.append((ali_seqA[::-1], ali_seqB[::-1], score, begin, end))
|
jpayne@68
|
1097 else:
|
jpayne@68
|
1098 tracebacks.append((ali_seqB[::-1], ali_seqA[::-1], score, begin, end))
|
jpayne@68
|
1099 if one_alignment_only:
|
jpayne@68
|
1100 break
|
jpayne@68
|
1101 return _clean_alignments(tracebacks)
|
jpayne@68
|
1102
|
jpayne@68
|
1103
|
jpayne@68
|
1104 def _find_start(score_matrix, best_score, align_globally):
|
jpayne@68
|
1105 """Return a list of starting points (score, (row, col)) (PRIVATE).
|
jpayne@68
|
1106
|
jpayne@68
|
1107 Indicating every possible place to start the tracebacks.
|
jpayne@68
|
1108 """
|
jpayne@68
|
1109 nrows, ncols = len(score_matrix), len(score_matrix[0])
|
jpayne@68
|
1110 # In this implementation of the global algorithm, the start will always be
|
jpayne@68
|
1111 # the bottom right corner of the matrix.
|
jpayne@68
|
1112 if align_globally:
|
jpayne@68
|
1113 starts = [(best_score, (nrows - 1, ncols - 1))]
|
jpayne@68
|
1114 else:
|
jpayne@68
|
1115 # For local alignments, there may be many different start points.
|
jpayne@68
|
1116 starts = []
|
jpayne@68
|
1117 tolerance = 0 # XXX do anything with this?
|
jpayne@68
|
1118 # Now find all the positions within some tolerance of the best
|
jpayne@68
|
1119 # score.
|
jpayne@68
|
1120 for row in range(nrows):
|
jpayne@68
|
1121 for col in range(ncols):
|
jpayne@68
|
1122 score = score_matrix[row][col]
|
jpayne@68
|
1123 if rint(abs(score - best_score)) <= rint(tolerance):
|
jpayne@68
|
1124 starts.append((score, (row, col)))
|
jpayne@68
|
1125 return starts
|
jpayne@68
|
1126
|
jpayne@68
|
1127
|
jpayne@68
|
1128 def _reverse_matrices(score_matrix, trace_matrix):
|
jpayne@68
|
1129 """Reverse score and trace matrices (PRIVATE)."""
|
jpayne@68
|
1130 reverse_score_matrix = []
|
jpayne@68
|
1131 reverse_trace_matrix = []
|
jpayne@68
|
1132 # fmt: off
|
jpayne@68
|
1133 reverse_trace = {
|
jpayne@68
|
1134 1: 4, 2: 2, 3: 6, 4: 1, 5: 5, 6: 3, 7: 7, 8: 16, 9: 20, 10: 18, 11: 22, 12: 17,
|
jpayne@68
|
1135 13: 21, 14: 19, 15: 23, 16: 8, 17: 12, 18: 10, 19: 14, 20: 9, 21: 13, 22: 11,
|
jpayne@68
|
1136 23: 15, 24: 24, 25: 28, 26: 26, 27: 30, 28: 25, 29: 29, 30: 27, 31: 31,
|
jpayne@68
|
1137 None: None,
|
jpayne@68
|
1138 }
|
jpayne@68
|
1139 # fmt: on
|
jpayne@68
|
1140 for col in range(len(score_matrix[0])):
|
jpayne@68
|
1141 new_score_row = []
|
jpayne@68
|
1142 new_trace_row = []
|
jpayne@68
|
1143 for row in range(len(score_matrix)):
|
jpayne@68
|
1144 new_score_row.append(score_matrix[row][col])
|
jpayne@68
|
1145 new_trace_row.append(reverse_trace[trace_matrix[row][col]])
|
jpayne@68
|
1146 reverse_score_matrix.append(new_score_row)
|
jpayne@68
|
1147 reverse_trace_matrix.append(new_trace_row)
|
jpayne@68
|
1148 return reverse_score_matrix, reverse_trace_matrix
|
jpayne@68
|
1149
|
jpayne@68
|
1150
|
jpayne@68
|
1151 def _clean_alignments(alignments):
|
jpayne@68
|
1152 """Take a list of alignments and return a cleaned version (PRIVATE).
|
jpayne@68
|
1153
|
jpayne@68
|
1154 Remove duplicates, make sure begin and end are set correctly, remove
|
jpayne@68
|
1155 empty alignments.
|
jpayne@68
|
1156 """
|
jpayne@68
|
1157 unique_alignments = []
|
jpayne@68
|
1158 for align in alignments:
|
jpayne@68
|
1159 if align not in unique_alignments:
|
jpayne@68
|
1160 unique_alignments.append(align)
|
jpayne@68
|
1161 i = 0
|
jpayne@68
|
1162 while i < len(unique_alignments):
|
jpayne@68
|
1163 seqA, seqB, score, begin, end = unique_alignments[i]
|
jpayne@68
|
1164 # Make sure end is set reasonably.
|
jpayne@68
|
1165 if end is None: # global alignment
|
jpayne@68
|
1166 end = len(seqA)
|
jpayne@68
|
1167 elif end < 0:
|
jpayne@68
|
1168 end = end + len(seqA)
|
jpayne@68
|
1169 # If there's no alignment here, get rid of it.
|
jpayne@68
|
1170 if begin >= end:
|
jpayne@68
|
1171 del unique_alignments[i]
|
jpayne@68
|
1172 continue
|
jpayne@68
|
1173 unique_alignments[i] = Alignment(seqA, seqB, score, begin, end)
|
jpayne@68
|
1174 i += 1
|
jpayne@68
|
1175 return unique_alignments
|
jpayne@68
|
1176
|
jpayne@68
|
1177
|
jpayne@68
|
1178 def _finish_backtrace(sequenceA, sequenceB, ali_seqA, ali_seqB, row, col, gap_char):
|
jpayne@68
|
1179 """Add remaining sequences and fill with gaps if necessary (PRIVATE)."""
|
jpayne@68
|
1180 if row:
|
jpayne@68
|
1181 ali_seqA += sequenceA[row - 1 :: -1]
|
jpayne@68
|
1182 if col:
|
jpayne@68
|
1183 ali_seqB += sequenceB[col - 1 :: -1]
|
jpayne@68
|
1184 if row > col:
|
jpayne@68
|
1185 ali_seqB += gap_char * (len(ali_seqA) - len(ali_seqB))
|
jpayne@68
|
1186 elif col > row:
|
jpayne@68
|
1187 ali_seqA += gap_char * (len(ali_seqB) - len(ali_seqA))
|
jpayne@68
|
1188 return ali_seqA, ali_seqB
|
jpayne@68
|
1189
|
jpayne@68
|
1190
|
jpayne@68
|
1191 def _find_gap_open(
|
jpayne@68
|
1192 sequenceA,
|
jpayne@68
|
1193 sequenceB,
|
jpayne@68
|
1194 ali_seqA,
|
jpayne@68
|
1195 ali_seqB,
|
jpayne@68
|
1196 end,
|
jpayne@68
|
1197 row,
|
jpayne@68
|
1198 col,
|
jpayne@68
|
1199 col_gap,
|
jpayne@68
|
1200 gap_char,
|
jpayne@68
|
1201 score_matrix,
|
jpayne@68
|
1202 trace_matrix,
|
jpayne@68
|
1203 in_process,
|
jpayne@68
|
1204 gap_fn,
|
jpayne@68
|
1205 target,
|
jpayne@68
|
1206 index,
|
jpayne@68
|
1207 direction,
|
jpayne@68
|
1208 best_score,
|
jpayne@68
|
1209 align_globally,
|
jpayne@68
|
1210 ):
|
jpayne@68
|
1211 """Find the starting point(s) of the extended gap (PRIVATE)."""
|
jpayne@68
|
1212 dead_end = False
|
jpayne@68
|
1213 target_score = score_matrix[row][col]
|
jpayne@68
|
1214 for n in range(target):
|
jpayne@68
|
1215 if direction == "col":
|
jpayne@68
|
1216 col -= 1
|
jpayne@68
|
1217 ali_seqA += gap_char
|
jpayne@68
|
1218 ali_seqB += sequenceB[col : col + 1]
|
jpayne@68
|
1219 else:
|
jpayne@68
|
1220 row -= 1
|
jpayne@68
|
1221 ali_seqA += sequenceA[row : row + 1]
|
jpayne@68
|
1222 ali_seqB += gap_char
|
jpayne@68
|
1223 actual_score = score_matrix[row][col] + gap_fn(index, n + 1)
|
jpayne@68
|
1224 if not align_globally and score_matrix[row][col] == best_score:
|
jpayne@68
|
1225 # We have run through a 'zero-score' extension and discard it
|
jpayne@68
|
1226 dead_end = True
|
jpayne@68
|
1227 break
|
jpayne@68
|
1228 if rint(actual_score) == rint(target_score) and n > 0:
|
jpayne@68
|
1229 if not trace_matrix[row][col]:
|
jpayne@68
|
1230 break
|
jpayne@68
|
1231 else:
|
jpayne@68
|
1232 in_process.append(
|
jpayne@68
|
1233 (
|
jpayne@68
|
1234 ali_seqA[:],
|
jpayne@68
|
1235 ali_seqB[:],
|
jpayne@68
|
1236 end,
|
jpayne@68
|
1237 row,
|
jpayne@68
|
1238 col,
|
jpayne@68
|
1239 col_gap,
|
jpayne@68
|
1240 trace_matrix[row][col],
|
jpayne@68
|
1241 )
|
jpayne@68
|
1242 )
|
jpayne@68
|
1243 if not trace_matrix[row][col]:
|
jpayne@68
|
1244 dead_end = True
|
jpayne@68
|
1245 return ali_seqA, ali_seqB, row, col, in_process, dead_end
|
jpayne@68
|
1246
|
jpayne@68
|
1247
|
jpayne@68
|
1248 _PRECISION = 1000
|
jpayne@68
|
1249
|
jpayne@68
|
1250
|
jpayne@68
|
1251 def rint(x, precision=_PRECISION):
|
jpayne@68
|
1252 """Print number with declared precision."""
|
jpayne@68
|
1253 return int(x * precision + 0.5)
|
jpayne@68
|
1254
|
jpayne@68
|
1255
|
jpayne@68
|
1256 class identity_match:
|
jpayne@68
|
1257 """Create a match function for use in an alignment.
|
jpayne@68
|
1258
|
jpayne@68
|
1259 match and mismatch are the scores to give when two residues are equal
|
jpayne@68
|
1260 or unequal. By default, match is 1 and mismatch is 0.
|
jpayne@68
|
1261 """
|
jpayne@68
|
1262
|
jpayne@68
|
1263 def __init__(self, match=1, mismatch=0):
|
jpayne@68
|
1264 """Initialize the class."""
|
jpayne@68
|
1265 self.match = match
|
jpayne@68
|
1266 self.mismatch = mismatch
|
jpayne@68
|
1267
|
jpayne@68
|
1268 def __call__(self, charA, charB):
|
jpayne@68
|
1269 """Call a match function instance already created."""
|
jpayne@68
|
1270 if charA == charB:
|
jpayne@68
|
1271 return self.match
|
jpayne@68
|
1272 return self.mismatch
|
jpayne@68
|
1273
|
jpayne@68
|
1274
|
jpayne@68
|
1275 class dictionary_match:
|
jpayne@68
|
1276 """Create a match function for use in an alignment.
|
jpayne@68
|
1277
|
jpayne@68
|
1278 Attributes:
|
jpayne@68
|
1279 - score_dict - A dictionary where the keys are tuples (residue 1,
|
jpayne@68
|
1280 residue 2) and the values are the match scores between those residues.
|
jpayne@68
|
1281 - symmetric - A flag that indicates whether the scores are symmetric.
|
jpayne@68
|
1282
|
jpayne@68
|
1283 """
|
jpayne@68
|
1284
|
jpayne@68
|
1285 def __init__(self, score_dict, symmetric=1):
|
jpayne@68
|
1286 """Initialize the class."""
|
jpayne@68
|
1287 if isinstance(score_dict, substitution_matrices.Array):
|
jpayne@68
|
1288 score_dict = dict(score_dict) # Access to dict is much faster
|
jpayne@68
|
1289 self.score_dict = score_dict
|
jpayne@68
|
1290 self.symmetric = symmetric
|
jpayne@68
|
1291
|
jpayne@68
|
1292 def __call__(self, charA, charB):
|
jpayne@68
|
1293 """Call a dictionary match instance already created."""
|
jpayne@68
|
1294 if self.symmetric and (charA, charB) not in self.score_dict:
|
jpayne@68
|
1295 # If the score dictionary is symmetric, then look up the
|
jpayne@68
|
1296 # score both ways.
|
jpayne@68
|
1297 charB, charA = charA, charB
|
jpayne@68
|
1298 return self.score_dict[(charA, charB)]
|
jpayne@68
|
1299
|
jpayne@68
|
1300
|
jpayne@68
|
1301 class affine_penalty:
|
jpayne@68
|
1302 """Create a gap function for use in an alignment."""
|
jpayne@68
|
1303
|
jpayne@68
|
1304 def __init__(self, open, extend, penalize_extend_when_opening=0):
|
jpayne@68
|
1305 """Initialize the class."""
|
jpayne@68
|
1306 if open > 0 or extend > 0:
|
jpayne@68
|
1307 raise ValueError("Gap penalties should be non-positive.")
|
jpayne@68
|
1308 if not penalize_extend_when_opening and (extend < open):
|
jpayne@68
|
1309 raise ValueError(
|
jpayne@68
|
1310 "Gap opening penalty should be higher than "
|
jpayne@68
|
1311 "gap extension penalty (or equal)"
|
jpayne@68
|
1312 )
|
jpayne@68
|
1313 self.open, self.extend = open, extend
|
jpayne@68
|
1314 self.penalize_extend_when_opening = penalize_extend_when_opening
|
jpayne@68
|
1315
|
jpayne@68
|
1316 def __call__(self, index, length):
|
jpayne@68
|
1317 """Call a gap function instance already created."""
|
jpayne@68
|
1318 return calc_affine_penalty(
|
jpayne@68
|
1319 length, self.open, self.extend, self.penalize_extend_when_opening
|
jpayne@68
|
1320 )
|
jpayne@68
|
1321
|
jpayne@68
|
1322
|
jpayne@68
|
1323 def calc_affine_penalty(length, open, extend, penalize_extend_when_opening):
|
jpayne@68
|
1324 """Calculate a penalty score for the gap function."""
|
jpayne@68
|
1325 if length <= 0:
|
jpayne@68
|
1326 return 0.0
|
jpayne@68
|
1327 penalty = open + extend * length
|
jpayne@68
|
1328 if not penalize_extend_when_opening:
|
jpayne@68
|
1329 penalty -= extend
|
jpayne@68
|
1330 return penalty
|
jpayne@68
|
1331
|
jpayne@68
|
1332
|
jpayne@68
|
1333 def print_matrix(matrix):
|
jpayne@68
|
1334 """Print out a matrix for debugging purposes."""
|
jpayne@68
|
1335 # Transpose the matrix and get the length of the values in each column.
|
jpayne@68
|
1336 matrixT = [[] for x in range(len(matrix[0]))]
|
jpayne@68
|
1337 for i in range(len(matrix)):
|
jpayne@68
|
1338 for j in range(len(matrix[i])):
|
jpayne@68
|
1339 matrixT[j].append(len(str(matrix[i][j])))
|
jpayne@68
|
1340 ndigits = [max(x) for x in matrixT]
|
jpayne@68
|
1341 for i in range(len(matrix)):
|
jpayne@68
|
1342 # Using string formatting trick to add leading spaces,
|
jpayne@68
|
1343 print(
|
jpayne@68
|
1344 " ".join("%*s " % (ndigits[j], matrix[i][j]) for j in range(len(matrix[i])))
|
jpayne@68
|
1345 )
|
jpayne@68
|
1346
|
jpayne@68
|
1347
|
jpayne@68
|
1348 def format_alignment(align1, align2, score, begin, end, full_sequences=False):
|
jpayne@68
|
1349 """Format the alignment prettily into a string.
|
jpayne@68
|
1350
|
jpayne@68
|
1351 IMPORTANT: Gap symbol must be "-" (or ['-'] for lists)!
|
jpayne@68
|
1352
|
jpayne@68
|
1353 Since Biopython 1.71 identical matches are shown with a pipe
|
jpayne@68
|
1354 character, mismatches as a dot, and gaps as a space.
|
jpayne@68
|
1355
|
jpayne@68
|
1356 Prior releases just used the pipe character to indicate the
|
jpayne@68
|
1357 aligned region (matches, mismatches and gaps).
|
jpayne@68
|
1358
|
jpayne@68
|
1359 Also, in local alignments, if the alignment does not include
|
jpayne@68
|
1360 the whole sequences, now only the aligned part is shown,
|
jpayne@68
|
1361 together with the start positions of the aligned subsequences.
|
jpayne@68
|
1362 The start positions are 1-based; so start position n is the
|
jpayne@68
|
1363 n-th base/amino acid in the *un-aligned* sequence.
|
jpayne@68
|
1364
|
jpayne@68
|
1365 NOTE: This is different to the alignment's begin/end values,
|
jpayne@68
|
1366 which give the Python indices (0-based) of the bases/amino acids
|
jpayne@68
|
1367 in the *aligned* sequences.
|
jpayne@68
|
1368
|
jpayne@68
|
1369 If you want to restore the 'historic' behaviour, that means
|
jpayne@68
|
1370 displaying the whole sequences (including the non-aligned parts),
|
jpayne@68
|
1371 use ``full_sequences=True``. In this case, the non-aligned leading
|
jpayne@68
|
1372 and trailing parts are also indicated by spaces in the match-line.
|
jpayne@68
|
1373 """
|
jpayne@68
|
1374 align_begin = begin
|
jpayne@68
|
1375 align_end = end
|
jpayne@68
|
1376 start1 = start2 = ""
|
jpayne@68
|
1377 start_m = begin # Begin of match line (how many spaces to include)
|
jpayne@68
|
1378 # For local alignments:
|
jpayne@68
|
1379 if not full_sequences and (begin != 0 or end != len(align1)):
|
jpayne@68
|
1380 # Calculate the actual start positions in the un-aligned sequences
|
jpayne@68
|
1381 # This will only work if the gap symbol is '-' or ['-']!
|
jpayne@68
|
1382 start1 = str(len(align1[:begin]) - align1[:begin].count("-") + 1) + " "
|
jpayne@68
|
1383 start2 = str(len(align2[:begin]) - align2[:begin].count("-") + 1) + " "
|
jpayne@68
|
1384 start_m = max(len(start1), len(start2))
|
jpayne@68
|
1385 elif full_sequences:
|
jpayne@68
|
1386 start_m = 0
|
jpayne@68
|
1387 begin = 0
|
jpayne@68
|
1388 end = len(align1)
|
jpayne@68
|
1389
|
jpayne@68
|
1390 if isinstance(align1, list):
|
jpayne@68
|
1391 # List elements will be separated by spaces, since they can be
|
jpayne@68
|
1392 # of different lengths
|
jpayne@68
|
1393 align1 = [a + " " for a in align1]
|
jpayne@68
|
1394 align2 = [a + " " for a in align2]
|
jpayne@68
|
1395
|
jpayne@68
|
1396 s1_line = ["{:>{width}}".format(start1, width=start_m)] # seq1 line
|
jpayne@68
|
1397 m_line = [" " * start_m] # match line
|
jpayne@68
|
1398 s2_line = ["{:>{width}}".format(start2, width=start_m)] # seq2 line
|
jpayne@68
|
1399
|
jpayne@68
|
1400 for n, (a, b) in enumerate(zip(align1[begin:end], align2[begin:end])):
|
jpayne@68
|
1401 # Since list elements can be of different length, we center them,
|
jpayne@68
|
1402 # using the maximum length of the two compared elements as width
|
jpayne@68
|
1403 m_len = max(len(a), len(b))
|
jpayne@68
|
1404 s1_line.append("{:^{width}}".format(a, width=m_len))
|
jpayne@68
|
1405 s2_line.append("{:^{width}}".format(b, width=m_len))
|
jpayne@68
|
1406 if full_sequences and (n < align_begin or n >= align_end):
|
jpayne@68
|
1407 m_line.append("{:^{width}}".format(" ", width=m_len)) # space
|
jpayne@68
|
1408 continue
|
jpayne@68
|
1409 if a == b:
|
jpayne@68
|
1410 m_line.append("{:^{width}}".format("|", width=m_len)) # match
|
jpayne@68
|
1411 elif a.strip() == "-" or b.strip() == "-":
|
jpayne@68
|
1412 m_line.append("{:^{width}}".format(" ", width=m_len)) # gap
|
jpayne@68
|
1413 else:
|
jpayne@68
|
1414 m_line.append("{:^{width}}".format(".", width=m_len)) # mismatch
|
jpayne@68
|
1415
|
jpayne@68
|
1416 s2_line.append(f"\n Score={score:g}\n")
|
jpayne@68
|
1417 return "\n".join(["".join(s1_line), "".join(m_line), "".join(s2_line)])
|
jpayne@68
|
1418
|
jpayne@68
|
1419
|
jpayne@68
|
1420 # Try and load C implementations of functions. If I can't,
|
jpayne@68
|
1421 # then throw a warning and use the pure Python implementations.
|
jpayne@68
|
1422 # The redefinition is deliberate, thus the no quality assurance
|
jpayne@68
|
1423 # flag for when using flake8.
|
jpayne@68
|
1424 # Before, we secure access to the pure Python functions (for testing purposes):
|
jpayne@68
|
1425
|
jpayne@68
|
1426 _python_make_score_matrix_fast = _make_score_matrix_fast
|
jpayne@68
|
1427 _python_rint = rint
|
jpayne@68
|
1428
|
jpayne@68
|
1429 try:
|
jpayne@68
|
1430 from .cpairwise2 import rint, _make_score_matrix_fast # noqa
|
jpayne@68
|
1431 except ImportError:
|
jpayne@68
|
1432 warnings.warn(
|
jpayne@68
|
1433 "Import of C module failed. Falling back to pure Python "
|
jpayne@68
|
1434 "implementation. This may be slooow...",
|
jpayne@68
|
1435 BiopythonWarning,
|
jpayne@68
|
1436 )
|
jpayne@68
|
1437
|
jpayne@68
|
1438 if __name__ == "__main__":
|
jpayne@68
|
1439 from Bio._utils import run_doctest
|
jpayne@68
|
1440
|
jpayne@68
|
1441 run_doctest()
|