jpayne@68: # Copyright 2002 by Jeffrey Chang. jpayne@68: # Copyright 2016, 2019, 2020 by Markus Piotrowski. jpayne@68: # All rights reserved. jpayne@68: # jpayne@68: # This file is part of the Biopython distribution and governed by your jpayne@68: # choice of the "Biopython License Agreement" or the "BSD 3-Clause License". jpayne@68: # Please see the LICENSE file that should have been included as part of this jpayne@68: # package. jpayne@68: jpayne@68: """Pairwise sequence alignment using a dynamic programming algorithm. jpayne@68: jpayne@68: This provides functions to get global and local alignments between two jpayne@68: sequences. A global alignment finds the best concordance between all jpayne@68: characters in two sequences. A local alignment finds just the jpayne@68: subsequences that align the best. Local alignments must have a positive jpayne@68: score to be reported and they will not be extended for 'zero counting' jpayne@68: matches. This means a local alignment will always start and end with jpayne@68: a positive counting match. jpayne@68: jpayne@68: When doing alignments, you can specify the match score and gap jpayne@68: penalties. The match score indicates the compatibility between an jpayne@68: alignment of two characters in the sequences. Highly compatible jpayne@68: characters should be given positive scores, and incompatible ones jpayne@68: should be given negative scores or 0. The gap penalties should be jpayne@68: negative. jpayne@68: jpayne@68: The names of the alignment functions in this module follow the jpayne@68: convention jpayne@68: XX jpayne@68: where is either "global" or "local" and XX is a 2 jpayne@68: character code indicating the parameters it takes. The first jpayne@68: character indicates the parameters for matches (and mismatches), and jpayne@68: the second indicates the parameters for gap penalties. jpayne@68: jpayne@68: The match parameters are:: jpayne@68: jpayne@68: CODE DESCRIPTION & OPTIONAL KEYWORDS jpayne@68: x No parameters. Identical characters have score of 1, otherwise 0. jpayne@68: m A match score is the score of identical chars, otherwise mismatch jpayne@68: score. Keywords ``match``, ``mismatch``. jpayne@68: d A dictionary returns the score of any pair of characters. jpayne@68: Keyword ``match_dict``. jpayne@68: c A callback function returns scores. Keyword ``match_fn``. jpayne@68: jpayne@68: The gap penalty parameters are:: jpayne@68: jpayne@68: CODE DESCRIPTION & OPTIONAL KEYWORDS jpayne@68: x No gap penalties. jpayne@68: s Same open and extend gap penalties for both sequences. jpayne@68: Keywords ``open``, ``extend``. jpayne@68: d The sequences have different open and extend gap penalties. jpayne@68: Keywords ``openA``, ``extendA``, ``openB``, ``extendB``. jpayne@68: c A callback function returns the gap penalties. jpayne@68: Keywords ``gap_A_fn``, ``gap_B_fn``. jpayne@68: jpayne@68: All the different alignment functions are contained in an object jpayne@68: ``align``. For example: jpayne@68: jpayne@68: >>> from Bio import pairwise2 jpayne@68: >>> alignments = pairwise2.align.globalxx("ACCGT", "ACG") jpayne@68: jpayne@68: For better readability, the required arguments can be used with optional keywords: jpayne@68: jpayne@68: >>> alignments = pairwise2.align.globalxx(sequenceA="ACCGT", sequenceB="ACG") jpayne@68: jpayne@68: The result is a list of the alignments between the two strings. Each alignment jpayne@68: is a named tuple consisting of the two aligned sequences, the score and the jpayne@68: start and end positions of the alignment: jpayne@68: jpayne@68: >>> print(alignments) jpayne@68: [Alignment(seqA='ACCGT', seqB='A-CG-', score=3.0, start=0, end=5), ... jpayne@68: jpayne@68: You can access each element of an alignment by index or name: jpayne@68: jpayne@68: >>> alignments[0][2] jpayne@68: 3.0 jpayne@68: >>> alignments[0].score jpayne@68: 3.0 jpayne@68: jpayne@68: For a nice printout of an alignment, use the ``format_alignment`` method of jpayne@68: the module: jpayne@68: jpayne@68: >>> from Bio.pairwise2 import format_alignment jpayne@68: >>> print(format_alignment(*alignments[0])) jpayne@68: ACCGT jpayne@68: | || jpayne@68: A-CG- jpayne@68: Score=3 jpayne@68: jpayne@68: jpayne@68: All alignment functions have the following arguments: jpayne@68: jpayne@68: - Two sequences: strings, Biopython sequence objects or lists. jpayne@68: Lists are useful for supplying sequences which contain residues that are jpayne@68: encoded by more than one letter. jpayne@68: jpayne@68: - ``penalize_extend_when_opening``: boolean (default: False). jpayne@68: Whether to count an extension penalty when opening a gap. If false, a gap of jpayne@68: 1 is only penalized an "open" penalty, otherwise it is penalized jpayne@68: "open+extend". jpayne@68: jpayne@68: - ``penalize_end_gaps``: boolean. jpayne@68: Whether to count the gaps at the ends of an alignment. By default, they are jpayne@68: counted for global alignments but not for local ones. Setting jpayne@68: ``penalize_end_gaps`` to (boolean, boolean) allows you to specify for the jpayne@68: two sequences separately whether gaps at the end of the alignment should be jpayne@68: counted. jpayne@68: jpayne@68: - ``gap_char``: string (default: ``'-'``). jpayne@68: Which character to use as a gap character in the alignment returned. If your jpayne@68: input sequences are lists, you must change this to ``['-']``. jpayne@68: jpayne@68: - ``force_generic``: boolean (default: False). jpayne@68: Always use the generic, non-cached, dynamic programming function (slow!). jpayne@68: For debugging. jpayne@68: jpayne@68: - ``score_only``: boolean (default: False). jpayne@68: Only get the best score, don't recover any alignments. The return value of jpayne@68: the function is the score. Faster and uses less memory. jpayne@68: jpayne@68: - ``one_alignment_only``: boolean (default: False). jpayne@68: Only recover one alignment. jpayne@68: jpayne@68: The other parameters of the alignment function depend on the function called. jpayne@68: Some examples: jpayne@68: jpayne@68: - Find the best global alignment between the two sequences. Identical jpayne@68: characters are given 1 point. No points are deducted for mismatches or gaps. jpayne@68: jpayne@68: >>> for a in pairwise2.align.globalxx("ACCGT", "ACG"): jpayne@68: ... print(format_alignment(*a)) jpayne@68: ACCGT jpayne@68: | || jpayne@68: A-CG- jpayne@68: Score=3 jpayne@68: jpayne@68: ACCGT jpayne@68: || | jpayne@68: AC-G- jpayne@68: Score=3 jpayne@68: jpayne@68: jpayne@68: - Same thing as before, but with a local alignment. Note that jpayne@68: ``format_alignment`` will only show the aligned parts of the sequences, jpayne@68: together with the starting positions. jpayne@68: jpayne@68: >>> for a in pairwise2.align.localxx("ACCGT", "ACG"): jpayne@68: ... print(format_alignment(*a)) jpayne@68: 1 ACCG jpayne@68: | || jpayne@68: 1 A-CG jpayne@68: Score=3 jpayne@68: jpayne@68: 1 ACCG jpayne@68: || | jpayne@68: 1 AC-G jpayne@68: Score=3 jpayne@68: jpayne@68: jpayne@68: To restore the 'historic' behaviour of ``format_alignemt``, i.e., showing jpayne@68: also the un-aligned parts of both sequences, use the new keyword parameter jpayne@68: ``full_sequences``: jpayne@68: jpayne@68: >>> for a in pairwise2.align.localxx("ACCGT", "ACG"): jpayne@68: ... print(format_alignment(*a, full_sequences=True)) jpayne@68: ACCGT jpayne@68: | || jpayne@68: A-CG- jpayne@68: Score=3 jpayne@68: jpayne@68: ACCGT jpayne@68: || | jpayne@68: AC-G- jpayne@68: Score=3 jpayne@68: jpayne@68: jpayne@68: jpayne@68: - Do a global alignment. Identical characters are given 2 points, 1 point is jpayne@68: deducted for each non-identical character. Don't penalize gaps. jpayne@68: jpayne@68: >>> for a in pairwise2.align.globalmx("ACCGT", "ACG", 2, -1): jpayne@68: ... print(format_alignment(*a)) jpayne@68: ACCGT jpayne@68: | || jpayne@68: A-CG- jpayne@68: Score=6 jpayne@68: jpayne@68: ACCGT jpayne@68: || | jpayne@68: AC-G- jpayne@68: Score=6 jpayne@68: jpayne@68: jpayne@68: - Same as above, except now 0.5 points are deducted when opening a gap, and jpayne@68: 0.1 points are deducted when extending it. jpayne@68: jpayne@68: >>> for a in pairwise2.align.globalms("ACCGT", "ACG", 2, -1, -.5, -.1): jpayne@68: ... print(format_alignment(*a)) jpayne@68: ACCGT jpayne@68: | || jpayne@68: A-CG- jpayne@68: Score=5 jpayne@68: jpayne@68: ACCGT jpayne@68: || | jpayne@68: AC-G- jpayne@68: Score=5 jpayne@68: jpayne@68: jpayne@68: - Note that you can use keywords to increase the readability, e.g.: jpayne@68: jpayne@68: >>> a = pairwise2.align.globalms("ACGT", "ACG", match=2, mismatch=-1, open=-.5, jpayne@68: ... extend=-.1) jpayne@68: jpayne@68: - Depending on the penalties, a gap in one sequence may be followed by a gap in jpayne@68: the other sequence.If you don't like this behaviour, increase the gap-open jpayne@68: penalty: jpayne@68: jpayne@68: >>> for a in pairwise2.align.globalms("A", "T", 5, -4, -1, -.1): jpayne@68: ... print(format_alignment(*a)) jpayne@68: A- jpayne@68: jpayne@68: -T jpayne@68: Score=-2 jpayne@68: jpayne@68: >>> for a in pairwise2.align.globalms("A", "T", 5, -4, -3, -.1): jpayne@68: ... print(format_alignment(*a)) jpayne@68: A jpayne@68: . jpayne@68: T jpayne@68: Score=-4 jpayne@68: jpayne@68: jpayne@68: - The alignment function can also use known matrices already included in jpayne@68: Biopython (in ``Bio.Align.substitution_matrices``): jpayne@68: jpayne@68: >>> from Bio.Align import substitution_matrices jpayne@68: >>> matrix = substitution_matrices.load("BLOSUM62") jpayne@68: >>> for a in pairwise2.align.globaldx("KEVLA", "EVL", matrix): jpayne@68: ... print(format_alignment(*a)) jpayne@68: KEVLA jpayne@68: ||| jpayne@68: -EVL- jpayne@68: Score=13 jpayne@68: jpayne@68: jpayne@68: - With the parameter ``c`` you can define your own match- and gap functions. jpayne@68: E.g. to define an affine logarithmic gap function and using it: jpayne@68: jpayne@68: >>> from math import log jpayne@68: >>> def gap_function(x, y): # x is gap position in seq, y is gap length jpayne@68: ... if y == 0: # No gap jpayne@68: ... return 0 jpayne@68: ... elif y == 1: # Gap open penalty jpayne@68: ... return -2 jpayne@68: ... return - (2 + y/4.0 + log(y)/2.0) jpayne@68: ... jpayne@68: >>> alignment = pairwise2.align.globalmc("ACCCCCGT", "ACG", 5, -4, jpayne@68: ... gap_function, gap_function) jpayne@68: jpayne@68: You can define different gap functions for each sequence. jpayne@68: Self-defined match functions must take the two residues to be compared and jpayne@68: return a score. jpayne@68: jpayne@68: To see a description of the parameters for a function, please look at jpayne@68: the docstring for the function via the help function, e.g. jpayne@68: type ``help(pairwise2.align.localds)`` at the Python prompt. jpayne@68: jpayne@68: """ # noqa: W291 jpayne@68: jpayne@68: import warnings jpayne@68: from collections import namedtuple jpayne@68: jpayne@68: from Bio import BiopythonWarning jpayne@68: from Bio import BiopythonDeprecationWarning jpayne@68: from Bio.Align import substitution_matrices jpayne@68: jpayne@68: warnings.warn( jpayne@68: "Bio.pairwise2 has been deprecated, and we intend to remove it in a " jpayne@68: "future release of Biopython. As an alternative, please consider using " jpayne@68: "Bio.Align.PairwiseAligner as a replacement, and contact the " jpayne@68: "Biopython developers if you still need the Bio.pairwise2 module.", jpayne@68: BiopythonDeprecationWarning, jpayne@68: ) jpayne@68: jpayne@68: jpayne@68: MAX_ALIGNMENTS = 1000 # maximum alignments recovered in traceback jpayne@68: jpayne@68: Alignment = namedtuple("Alignment", ("seqA, seqB, score, start, end")) jpayne@68: jpayne@68: jpayne@68: class align: jpayne@68: """Provide functions that do alignments. jpayne@68: jpayne@68: Alignment functions are called as: jpayne@68: jpayne@68: pairwise2.align.globalXX jpayne@68: jpayne@68: or jpayne@68: jpayne@68: pairwise2.align.localXX jpayne@68: jpayne@68: Where XX is a 2 character code indicating the match/mismatch parameters jpayne@68: (first character, either x, m, d or c) and the gap penalty parameters jpayne@68: (second character, either x, s, d, or c). jpayne@68: jpayne@68: For a detailed description read the main module's docstring (e.g., jpayne@68: type ``help(pairwise2)``). jpayne@68: To see a description of the parameters for a function, please jpayne@68: look at the docstring for the function, e.g. type jpayne@68: ``help(pairwise2.align.localds)`` at the Python prompt. jpayne@68: """ jpayne@68: jpayne@68: class alignment_function: jpayne@68: """Callable class which impersonates an alignment function. jpayne@68: jpayne@68: The constructor takes the name of the function. This class jpayne@68: will decode the name of the function to figure out how to jpayne@68: interpret the parameters. jpayne@68: """ jpayne@68: jpayne@68: # match code -> tuple of (parameters, docstring) jpayne@68: match2args = { jpayne@68: "x": ([], ""), jpayne@68: "m": ( jpayne@68: ["match", "mismatch"], jpayne@68: "match is the score to given to identical characters.\n" jpayne@68: "mismatch is the score given to non-identical ones.", jpayne@68: ), jpayne@68: "d": ( jpayne@68: ["match_dict"], jpayne@68: "match_dict is a dictionary where the keys are tuples\n" jpayne@68: "of pairs of characters and the values are the scores,\n" jpayne@68: "e.g. ('A', 'C') : 2.5.", jpayne@68: ), jpayne@68: "c": ( jpayne@68: ["match_fn"], jpayne@68: "match_fn is a callback function that takes two " jpayne@68: "characters and returns the score between them.", jpayne@68: ), jpayne@68: } jpayne@68: # penalty code -> tuple of (parameters, docstring) jpayne@68: penalty2args = { jpayne@68: "x": ([], ""), jpayne@68: "s": ( jpayne@68: ["open", "extend"], jpayne@68: "open and extend are the gap penalties when a gap is\n" jpayne@68: "opened and extended. They should be negative.", jpayne@68: ), jpayne@68: "d": ( jpayne@68: ["openA", "extendA", "openB", "extendB"], jpayne@68: "openA and extendA are the gap penalties for sequenceA,\n" jpayne@68: "and openB and extendB for sequenceB. The penalties\n" jpayne@68: "should be negative.", jpayne@68: ), jpayne@68: "c": ( jpayne@68: ["gap_A_fn", "gap_B_fn"], jpayne@68: "gap_A_fn and gap_B_fn are callback functions that takes\n" jpayne@68: "(1) the index where the gap is opened, and (2) the length\n" jpayne@68: "of the gap. They should return a gap penalty.", jpayne@68: ), jpayne@68: } jpayne@68: jpayne@68: def __init__(self, name): jpayne@68: """Check to make sure the name of the function is reasonable.""" jpayne@68: if name.startswith("global"): jpayne@68: if len(name) != 8: jpayne@68: raise AttributeError("function should be globalXX") jpayne@68: elif name.startswith("local"): jpayne@68: if len(name) != 7: jpayne@68: raise AttributeError("function should be localXX") jpayne@68: else: jpayne@68: raise AttributeError(name) jpayne@68: align_type, match_type, penalty_type = name[:-2], name[-2], name[-1] jpayne@68: try: jpayne@68: match_args, match_doc = self.match2args[match_type] jpayne@68: except KeyError: jpayne@68: raise AttributeError(f"unknown match type {match_type!r}") jpayne@68: try: jpayne@68: penalty_args, penalty_doc = self.penalty2args[penalty_type] jpayne@68: except KeyError: jpayne@68: raise AttributeError(f"unknown penalty type {penalty_type!r}") jpayne@68: jpayne@68: # Now get the names of the parameters to this function. jpayne@68: param_names = ["sequenceA", "sequenceB"] jpayne@68: param_names.extend(match_args) jpayne@68: param_names.extend(penalty_args) jpayne@68: self.function_name = name jpayne@68: self.align_type = align_type jpayne@68: self.param_names = param_names jpayne@68: jpayne@68: self.__name__ = self.function_name jpayne@68: # Set the doc string. jpayne@68: doc = f"{self.__name__}({', '.join(self.param_names)}) -> alignments\n" jpayne@68: doc += """\ jpayne@68: \nThe following parameters can also be used with optional jpayne@68: keywords of the same name.\n\n jpayne@68: sequenceA and sequenceB must be of the same type, either jpayne@68: strings, lists or Biopython sequence objects.\n jpayne@68: """ jpayne@68: if match_doc: jpayne@68: doc += f"\n{match_doc}\n" jpayne@68: if penalty_doc: jpayne@68: doc += f"\n{penalty_doc}\n" jpayne@68: doc += """\ jpayne@68: \nalignments is a list of named tuples (seqA, seqB, score, jpayne@68: begin, end). seqA and seqB are strings showing the alignment jpayne@68: between the sequences. score is the score of the alignment. jpayne@68: begin and end are indexes of seqA and seqB that indicate jpayne@68: where the alignment occurs. jpayne@68: """ jpayne@68: self.__doc__ = doc jpayne@68: jpayne@68: def decode(self, *args, **keywds): jpayne@68: """Decode the arguments for the _align function. jpayne@68: jpayne@68: keywds will get passed to it, so translate the arguments jpayne@68: to this function into forms appropriate for _align. jpayne@68: """ jpayne@68: keywds = keywds.copy() jpayne@68: jpayne@68: # Replace possible "keywords" with arguments: jpayne@68: args += (len(self.param_names) - len(args)) * (None,) jpayne@68: for key in keywds.copy(): jpayne@68: if key in self.param_names: jpayne@68: _index = self.param_names.index(key) jpayne@68: args = args[:_index] + (keywds[key],) + args[_index:] jpayne@68: del keywds[key] jpayne@68: args = tuple(arg for arg in args if arg is not None) jpayne@68: jpayne@68: if len(args) != len(self.param_names): jpayne@68: raise TypeError( jpayne@68: "%s takes exactly %d argument (%d given)" jpayne@68: % (self.function_name, len(self.param_names), len(args)) jpayne@68: ) jpayne@68: jpayne@68: i = 0 jpayne@68: while i < len(self.param_names): jpayne@68: if self.param_names[i] in [ jpayne@68: "sequenceA", jpayne@68: "sequenceB", jpayne@68: "gap_A_fn", jpayne@68: "gap_B_fn", jpayne@68: "match_fn", jpayne@68: ]: jpayne@68: keywds[self.param_names[i]] = args[i] jpayne@68: i += 1 jpayne@68: elif self.param_names[i] == "match": jpayne@68: assert self.param_names[i + 1] == "mismatch" jpayne@68: match, mismatch = args[i], args[i + 1] jpayne@68: keywds["match_fn"] = identity_match(match, mismatch) jpayne@68: i += 2 jpayne@68: elif self.param_names[i] == "match_dict": jpayne@68: keywds["match_fn"] = dictionary_match(args[i]) jpayne@68: i += 1 jpayne@68: elif self.param_names[i] == "open": jpayne@68: assert self.param_names[i + 1] == "extend" jpayne@68: open, extend = args[i], args[i + 1] jpayne@68: pe = keywds.get("penalize_extend_when_opening", 0) jpayne@68: keywds["gap_A_fn"] = affine_penalty(open, extend, pe) jpayne@68: keywds["gap_B_fn"] = affine_penalty(open, extend, pe) jpayne@68: i += 2 jpayne@68: elif self.param_names[i] == "openA": jpayne@68: assert self.param_names[i + 3] == "extendB" jpayne@68: openA, extendA, openB, extendB = args[i : i + 4] jpayne@68: pe = keywds.get("penalize_extend_when_opening", 0) jpayne@68: keywds["gap_A_fn"] = affine_penalty(openA, extendA, pe) jpayne@68: keywds["gap_B_fn"] = affine_penalty(openB, extendB, pe) jpayne@68: i += 4 jpayne@68: else: jpayne@68: raise ValueError(f"unknown parameter {self.param_names[i]!r}") jpayne@68: jpayne@68: # Here are the default parameters for _align. Assign jpayne@68: # these to keywds, unless already specified. jpayne@68: pe = keywds.get("penalize_extend_when_opening", 0) jpayne@68: default_params = [ jpayne@68: ("match_fn", identity_match(1, 0)), jpayne@68: ("gap_A_fn", affine_penalty(0, 0, pe)), jpayne@68: ("gap_B_fn", affine_penalty(0, 0, pe)), jpayne@68: ("penalize_extend_when_opening", 0), jpayne@68: ("penalize_end_gaps", self.align_type == "global"), jpayne@68: ("align_globally", self.align_type == "global"), jpayne@68: ("gap_char", "-"), jpayne@68: ("force_generic", 0), jpayne@68: ("score_only", 0), jpayne@68: ("one_alignment_only", 0), jpayne@68: ] jpayne@68: for name, default in default_params: jpayne@68: keywds[name] = keywds.get(name, default) jpayne@68: value = keywds["penalize_end_gaps"] jpayne@68: try: jpayne@68: n = len(value) jpayne@68: except TypeError: jpayne@68: keywds["penalize_end_gaps"] = tuple([value] * 2) jpayne@68: else: jpayne@68: assert n == 2 jpayne@68: return keywds jpayne@68: jpayne@68: def __call__(self, *args, **keywds): jpayne@68: """Call the alignment instance already created.""" jpayne@68: keywds = self.decode(*args, **keywds) jpayne@68: return _align(**keywds) jpayne@68: jpayne@68: def __getattr__(self, attr): jpayne@68: """Call alignment_function() to check and decode the attributes.""" jpayne@68: # The following 'magic' is needed to rewrite the class docstring jpayne@68: # dynamically: jpayne@68: wrapper = self.alignment_function(attr) jpayne@68: wrapper_type = type(wrapper) jpayne@68: wrapper_dict = wrapper_type.__dict__.copy() jpayne@68: wrapper_dict["__doc__"] = wrapper.__doc__ jpayne@68: new_alignment_function = type("alignment_function", (object,), wrapper_dict) jpayne@68: jpayne@68: return new_alignment_function(attr) jpayne@68: jpayne@68: jpayne@68: align = align() jpayne@68: jpayne@68: jpayne@68: def _align( jpayne@68: sequenceA, jpayne@68: sequenceB, jpayne@68: match_fn, jpayne@68: gap_A_fn, jpayne@68: gap_B_fn, jpayne@68: penalize_extend_when_opening, jpayne@68: penalize_end_gaps, jpayne@68: align_globally, jpayne@68: gap_char, jpayne@68: force_generic, jpayne@68: score_only, jpayne@68: one_alignment_only, jpayne@68: ): jpayne@68: """Return optimal alignments between two sequences (PRIVATE). jpayne@68: jpayne@68: This method either returns a list of optimal alignments (with the same jpayne@68: score) or just the optimal score. jpayne@68: """ jpayne@68: if not sequenceA or not sequenceB: jpayne@68: return [] jpayne@68: try: jpayne@68: sequenceA + gap_char jpayne@68: sequenceB + gap_char jpayne@68: except TypeError: jpayne@68: raise TypeError( jpayne@68: "both sequences must be of the same type, either " jpayne@68: "string/sequence object or list. Gap character must " jpayne@68: "fit the sequence type (string or list)" jpayne@68: ) jpayne@68: jpayne@68: if not isinstance(sequenceA, list): jpayne@68: sequenceA = str(sequenceA) jpayne@68: if not isinstance(sequenceB, list): jpayne@68: sequenceB = str(sequenceB) jpayne@68: if not align_globally and (penalize_end_gaps[0] or penalize_end_gaps[1]): jpayne@68: warnings.warn( jpayne@68: '"penalize_end_gaps" should not be used in local ' jpayne@68: "alignments. The resulting score may be wrong.", jpayne@68: BiopythonWarning, jpayne@68: ) jpayne@68: jpayne@68: if ( jpayne@68: (not force_generic) jpayne@68: and isinstance(gap_A_fn, affine_penalty) jpayne@68: and isinstance(gap_B_fn, affine_penalty) jpayne@68: ): jpayne@68: open_A, extend_A = gap_A_fn.open, gap_A_fn.extend jpayne@68: open_B, extend_B = gap_B_fn.open, gap_B_fn.extend jpayne@68: matrices = _make_score_matrix_fast( jpayne@68: sequenceA, jpayne@68: sequenceB, jpayne@68: match_fn, jpayne@68: open_A, jpayne@68: extend_A, jpayne@68: open_B, jpayne@68: extend_B, jpayne@68: penalize_extend_when_opening, jpayne@68: penalize_end_gaps, jpayne@68: align_globally, jpayne@68: score_only, jpayne@68: ) jpayne@68: else: jpayne@68: matrices = _make_score_matrix_generic( jpayne@68: sequenceA, jpayne@68: sequenceB, jpayne@68: match_fn, jpayne@68: gap_A_fn, jpayne@68: gap_B_fn, jpayne@68: penalize_end_gaps, jpayne@68: align_globally, jpayne@68: score_only, jpayne@68: ) jpayne@68: jpayne@68: score_matrix, trace_matrix, best_score = matrices jpayne@68: jpayne@68: # print("SCORE %s" % print_matrix(score_matrix)) jpayne@68: # print("TRACEBACK %s" % print_matrix(trace_matrix)) jpayne@68: jpayne@68: # If they only want the score, then return it. jpayne@68: if score_only: jpayne@68: return best_score jpayne@68: jpayne@68: starts = _find_start(score_matrix, best_score, align_globally) jpayne@68: jpayne@68: # Recover the alignments and return them. jpayne@68: alignments = _recover_alignments( jpayne@68: sequenceA, jpayne@68: sequenceB, jpayne@68: starts, jpayne@68: best_score, jpayne@68: score_matrix, jpayne@68: trace_matrix, jpayne@68: align_globally, jpayne@68: gap_char, jpayne@68: one_alignment_only, jpayne@68: gap_A_fn, jpayne@68: gap_B_fn, jpayne@68: ) jpayne@68: if not alignments: jpayne@68: # This may happen, see recover_alignments for explanation jpayne@68: score_matrix, trace_matrix = _reverse_matrices(score_matrix, trace_matrix) jpayne@68: starts = [(z, (y, x)) for z, (x, y) in starts] jpayne@68: alignments = _recover_alignments( jpayne@68: sequenceB, jpayne@68: sequenceA, jpayne@68: starts, jpayne@68: best_score, jpayne@68: score_matrix, jpayne@68: trace_matrix, jpayne@68: align_globally, jpayne@68: gap_char, jpayne@68: one_alignment_only, jpayne@68: gap_B_fn, jpayne@68: gap_A_fn, jpayne@68: reverse=True, jpayne@68: ) jpayne@68: return alignments jpayne@68: jpayne@68: jpayne@68: def _make_score_matrix_generic( jpayne@68: sequenceA, jpayne@68: sequenceB, jpayne@68: match_fn, jpayne@68: gap_A_fn, jpayne@68: gap_B_fn, jpayne@68: penalize_end_gaps, jpayne@68: align_globally, jpayne@68: score_only, jpayne@68: ): jpayne@68: """Generate a score and traceback matrix (PRIVATE). jpayne@68: jpayne@68: This implementation according to Needleman-Wunsch allows the usage of jpayne@68: general gap functions and is rather slow. It is automatically called if jpayne@68: you define your own gap functions. You can force the usage of this method jpayne@68: with ``force_generic=True``. jpayne@68: """ jpayne@68: local_max_score = 0 jpayne@68: # Create the score and traceback matrices. These should be in the jpayne@68: # shape: jpayne@68: # sequenceA (down) x sequenceB (across) jpayne@68: lenA, lenB = len(sequenceA), len(sequenceB) jpayne@68: score_matrix, trace_matrix = [], [] jpayne@68: for i in range(lenA + 1): jpayne@68: score_matrix.append([None] * (lenB + 1)) jpayne@68: if not score_only: jpayne@68: trace_matrix.append([None] * (lenB + 1)) jpayne@68: jpayne@68: # Initialize first row and column with gap scores. This is like opening up jpayne@68: # i gaps at the beginning of sequence A or B. jpayne@68: for i in range(lenA + 1): jpayne@68: if penalize_end_gaps[1]: # [1]:gap in sequence B jpayne@68: score = gap_B_fn(0, i) jpayne@68: else: jpayne@68: score = 0.0 jpayne@68: score_matrix[i][0] = score jpayne@68: for i in range(lenB + 1): jpayne@68: if penalize_end_gaps[0]: # [0]:gap in sequence A jpayne@68: score = gap_A_fn(0, i) jpayne@68: else: jpayne@68: score = 0.0 jpayne@68: score_matrix[0][i] = score jpayne@68: jpayne@68: # Fill in the score matrix. Each position in the matrix jpayne@68: # represents an alignment between a character from sequence A to jpayne@68: # one in sequence B. As I iterate through the matrix, find the jpayne@68: # alignment by choose the best of: jpayne@68: # 1) extending a previous alignment without gaps jpayne@68: # 2) adding a gap in sequenceA jpayne@68: # 3) adding a gap in sequenceB jpayne@68: for row in range(1, lenA + 1): jpayne@68: for col in range(1, lenB + 1): jpayne@68: # First, calculate the score that would occur by extending jpayne@68: # the alignment without gaps. jpayne@68: # fmt: off jpayne@68: nogap_score = ( jpayne@68: score_matrix[row - 1][col - 1] jpayne@68: + match_fn(sequenceA[row - 1], sequenceB[col - 1]) jpayne@68: ) jpayne@68: jpayne@68: # fmt: on jpayne@68: # Try to find a better score by opening gaps in sequenceA. jpayne@68: # Do this by checking alignments from each column in the row. jpayne@68: # Each column represents a different character to align from, jpayne@68: # and thus a different length gap. jpayne@68: # Although the gap function does not distinguish between opening jpayne@68: # and extending a gap, we distinguish them for the backtrace. jpayne@68: if not penalize_end_gaps[0] and row == lenA: jpayne@68: row_open = score_matrix[row][col - 1] jpayne@68: row_extend = max(score_matrix[row][x] for x in range(col)) jpayne@68: else: jpayne@68: row_open = score_matrix[row][col - 1] + gap_A_fn(row, 1) jpayne@68: row_extend = max( jpayne@68: score_matrix[row][x] + gap_A_fn(row, col - x) for x in range(col) jpayne@68: ) jpayne@68: jpayne@68: # Try to find a better score by opening gaps in sequenceB. jpayne@68: if not penalize_end_gaps[1] and col == lenB: jpayne@68: col_open = score_matrix[row - 1][col] jpayne@68: col_extend = max(score_matrix[x][col] for x in range(row)) jpayne@68: else: jpayne@68: col_open = score_matrix[row - 1][col] + gap_B_fn(col, 1) jpayne@68: col_extend = max( jpayne@68: score_matrix[x][col] + gap_B_fn(col, row - x) for x in range(row) jpayne@68: ) jpayne@68: jpayne@68: best_score = max(nogap_score, row_open, row_extend, col_open, col_extend) jpayne@68: local_max_score = max(local_max_score, best_score) jpayne@68: if not align_globally and best_score < 0: jpayne@68: score_matrix[row][col] = 0.0 jpayne@68: else: jpayne@68: score_matrix[row][col] = best_score jpayne@68: jpayne@68: # The backtrace is encoded binary. See _make_score_matrix_fast jpayne@68: # for details. jpayne@68: if not score_only: jpayne@68: trace_score = 0 jpayne@68: if rint(nogap_score) == rint(best_score): jpayne@68: trace_score += 2 jpayne@68: if rint(row_open) == rint(best_score): jpayne@68: trace_score += 1 jpayne@68: if rint(row_extend) == rint(best_score): jpayne@68: trace_score += 8 jpayne@68: if rint(col_open) == rint(best_score): jpayne@68: trace_score += 4 jpayne@68: if rint(col_extend) == rint(best_score): jpayne@68: trace_score += 16 jpayne@68: trace_matrix[row][col] = trace_score jpayne@68: jpayne@68: if not align_globally: jpayne@68: best_score = local_max_score jpayne@68: jpayne@68: return score_matrix, trace_matrix, best_score jpayne@68: jpayne@68: jpayne@68: def _make_score_matrix_fast( jpayne@68: sequenceA, jpayne@68: sequenceB, jpayne@68: match_fn, jpayne@68: open_A, jpayne@68: extend_A, jpayne@68: open_B, jpayne@68: extend_B, jpayne@68: penalize_extend_when_opening, jpayne@68: penalize_end_gaps, jpayne@68: align_globally, jpayne@68: score_only, jpayne@68: ): jpayne@68: """Generate a score and traceback matrix according to Gotoh (PRIVATE). jpayne@68: jpayne@68: This is an implementation of the Needleman-Wunsch dynamic programming jpayne@68: algorithm as modified by Gotoh, implementing affine gap penalties. jpayne@68: In short, we have three matrices, holding scores for alignments ending jpayne@68: in (1) a match/mismatch, (2) a gap in sequence A, and (3) a gap in jpayne@68: sequence B, respectively. However, we can combine them in one matrix, jpayne@68: which holds the best scores, and store only those values from the jpayne@68: other matrices that are actually used for the next step of calculation. jpayne@68: The traceback matrix holds the positions for backtracing the alignment. jpayne@68: """ jpayne@68: first_A_gap = calc_affine_penalty(1, open_A, extend_A, penalize_extend_when_opening) jpayne@68: first_B_gap = calc_affine_penalty(1, open_B, extend_B, penalize_extend_when_opening) jpayne@68: local_max_score = 0 jpayne@68: jpayne@68: # Create the score and traceback matrices. These should be in the jpayne@68: # shape: jpayne@68: # sequenceA (down) x sequenceB (across) jpayne@68: lenA, lenB = len(sequenceA), len(sequenceB) jpayne@68: score_matrix, trace_matrix = [], [] jpayne@68: for i in range(lenA + 1): jpayne@68: score_matrix.append([None] * (lenB + 1)) jpayne@68: if not score_only: jpayne@68: trace_matrix.append([None] * (lenB + 1)) jpayne@68: jpayne@68: # Initialize first row and column with gap scores. This is like opening up jpayne@68: # i gaps at the beginning of sequence A or B. jpayne@68: for i in range(lenA + 1): jpayne@68: if penalize_end_gaps[1]: # [1]:gap in sequence B jpayne@68: score = calc_affine_penalty( jpayne@68: i, open_B, extend_B, penalize_extend_when_opening jpayne@68: ) jpayne@68: else: jpayne@68: score = 0 jpayne@68: score_matrix[i][0] = score jpayne@68: for i in range(lenB + 1): jpayne@68: if penalize_end_gaps[0]: # [0]:gap in sequence A jpayne@68: score = calc_affine_penalty( jpayne@68: i, open_A, extend_A, penalize_extend_when_opening jpayne@68: ) jpayne@68: else: jpayne@68: score = 0 jpayne@68: score_matrix[0][i] = score jpayne@68: jpayne@68: # Now initialize the col 'matrix'. Actually this is only a one dimensional jpayne@68: # list, since we only need the col scores from the last row. jpayne@68: col_score = [0] # Best score, if actual alignment ends with gap in seqB jpayne@68: for i in range(1, lenB + 1): jpayne@68: col_score.append( jpayne@68: calc_affine_penalty(i, 2 * open_B, extend_B, penalize_extend_when_opening) jpayne@68: ) jpayne@68: jpayne@68: # The row 'matrix' is calculated on the fly. Here we only need the actual jpayne@68: # score. jpayne@68: # Now, filling up the score and traceback matrices: jpayne@68: for row in range(1, lenA + 1): jpayne@68: row_score = calc_affine_penalty( jpayne@68: row, 2 * open_A, extend_A, penalize_extend_when_opening jpayne@68: ) jpayne@68: for col in range(1, lenB + 1): jpayne@68: # Calculate the score that would occur by extending the jpayne@68: # alignment without gaps. jpayne@68: # fmt: off jpayne@68: nogap_score = ( jpayne@68: score_matrix[row - 1][col - 1] jpayne@68: + match_fn(sequenceA[row - 1], sequenceB[col - 1]) jpayne@68: ) jpayne@68: # fmt: on jpayne@68: # Check the score that would occur if there were a gap in jpayne@68: # sequence A. This could come from opening a new gap or jpayne@68: # extending an existing one. jpayne@68: # A gap in sequence A can also be opened if it follows a gap in jpayne@68: # sequence B: A- jpayne@68: # -B jpayne@68: if not penalize_end_gaps[0] and row == lenA: jpayne@68: row_open = score_matrix[row][col - 1] jpayne@68: row_extend = row_score jpayne@68: else: jpayne@68: row_open = score_matrix[row][col - 1] + first_A_gap jpayne@68: row_extend = row_score + extend_A jpayne@68: row_score = max(row_open, row_extend) jpayne@68: jpayne@68: # The same for sequence B: jpayne@68: if not penalize_end_gaps[1] and col == lenB: jpayne@68: col_open = score_matrix[row - 1][col] jpayne@68: col_extend = col_score[col] jpayne@68: else: jpayne@68: col_open = score_matrix[row - 1][col] + first_B_gap jpayne@68: col_extend = col_score[col] + extend_B jpayne@68: col_score[col] = max(col_open, col_extend) jpayne@68: jpayne@68: best_score = max(nogap_score, col_score[col], row_score) jpayne@68: local_max_score = max(local_max_score, best_score) jpayne@68: if not align_globally and best_score < 0: jpayne@68: score_matrix[row][col] = 0 jpayne@68: else: jpayne@68: score_matrix[row][col] = best_score jpayne@68: jpayne@68: # Now the trace_matrix. The edges of the backtrace are encoded jpayne@68: # binary: 1 = open gap in seqA, 2 = match/mismatch of seqA and jpayne@68: # seqB, 4 = open gap in seqB, 8 = extend gap in seqA, and jpayne@68: # 16 = extend gap in seqB. This values can be summed up. jpayne@68: # Thus, the trace score 7 means that the best score can either jpayne@68: # come from opening a gap in seqA (=1), pairing two characters jpayne@68: # of seqA and seqB (+2=3) or opening a gap in seqB (+4=7). jpayne@68: # However, if we only want the score we don't care about the trace. jpayne@68: if not score_only: jpayne@68: row_score_rint = rint(row_score) jpayne@68: col_score_rint = rint(col_score[col]) jpayne@68: row_trace_score = 0 jpayne@68: col_trace_score = 0 jpayne@68: if rint(row_open) == row_score_rint: jpayne@68: row_trace_score += 1 # Open gap in seqA jpayne@68: if rint(row_extend) == row_score_rint: jpayne@68: row_trace_score += 8 # Extend gap in seqA jpayne@68: if rint(col_open) == col_score_rint: jpayne@68: col_trace_score += 4 # Open gap in seqB jpayne@68: if rint(col_extend) == col_score_rint: jpayne@68: col_trace_score += 16 # Extend gap in seqB jpayne@68: jpayne@68: trace_score = 0 jpayne@68: best_score_rint = rint(best_score) jpayne@68: if rint(nogap_score) == best_score_rint: jpayne@68: trace_score += 2 # Align seqA with seqB jpayne@68: if row_score_rint == best_score_rint: jpayne@68: trace_score += row_trace_score jpayne@68: if col_score_rint == best_score_rint: jpayne@68: trace_score += col_trace_score jpayne@68: trace_matrix[row][col] = trace_score jpayne@68: jpayne@68: if not align_globally: jpayne@68: best_score = local_max_score jpayne@68: jpayne@68: return score_matrix, trace_matrix, best_score jpayne@68: jpayne@68: jpayne@68: def _recover_alignments( jpayne@68: sequenceA, jpayne@68: sequenceB, jpayne@68: starts, jpayne@68: best_score, jpayne@68: score_matrix, jpayne@68: trace_matrix, jpayne@68: align_globally, jpayne@68: gap_char, jpayne@68: one_alignment_only, jpayne@68: gap_A_fn, jpayne@68: gap_B_fn, jpayne@68: reverse=False, jpayne@68: ): jpayne@68: """Do the backtracing and return a list of alignments (PRIVATE). jpayne@68: jpayne@68: Recover the alignments by following the traceback matrix. This jpayne@68: is a recursive procedure, but it's implemented here iteratively jpayne@68: with a stack. jpayne@68: jpayne@68: sequenceA and sequenceB may be sequences, including strings, jpayne@68: lists, or list-like objects. In order to preserve the type of jpayne@68: the object, we need to use slices on the sequences instead of jpayne@68: indexes. For example, sequenceA[row] may return a type that's jpayne@68: not compatible with sequenceA, e.g. if sequenceA is a list and jpayne@68: sequenceA[row] is a string. Thus, avoid using indexes and use jpayne@68: slices, e.g. sequenceA[row:row+1]. Assume that client-defined jpayne@68: sequence classes preserve these semantics. jpayne@68: """ jpayne@68: lenA, lenB = len(sequenceA), len(sequenceB) jpayne@68: ali_seqA, ali_seqB = sequenceA[0:0], sequenceB[0:0] jpayne@68: tracebacks = [] jpayne@68: in_process = [] jpayne@68: jpayne@68: for start in starts: jpayne@68: score, (row, col) = start jpayne@68: begin = 0 jpayne@68: if align_globally: jpayne@68: end = None jpayne@68: else: jpayne@68: # If this start is a zero-extension: don't start here! jpayne@68: if (score, (row - 1, col - 1)) in starts: jpayne@68: continue jpayne@68: # Local alignments should start with a positive score! jpayne@68: if score <= 0: jpayne@68: continue jpayne@68: # Local alignments should not end with a gap!: jpayne@68: trace = trace_matrix[row][col] jpayne@68: if (trace - trace % 2) % 4 == 2: # Trace contains 'nogap', fine! jpayne@68: trace_matrix[row][col] = 2 jpayne@68: # If not, don't start here! jpayne@68: else: jpayne@68: continue jpayne@68: end = -max(lenA - row, lenB - col) jpayne@68: if not end: jpayne@68: end = None jpayne@68: col_distance = lenB - col jpayne@68: row_distance = lenA - row jpayne@68: jpayne@68: # fmt: off jpayne@68: ali_seqA = ( jpayne@68: (col_distance - row_distance) * gap_char jpayne@68: + sequenceA[lenA - 1 : row - 1 : -1] jpayne@68: ) jpayne@68: ali_seqB = ( jpayne@68: (row_distance - col_distance) * gap_char jpayne@68: + sequenceB[lenB - 1 : col - 1 : -1] jpayne@68: ) jpayne@68: # fmt: on jpayne@68: in_process += [ jpayne@68: (ali_seqA, ali_seqB, end, row, col, False, trace_matrix[row][col]) jpayne@68: ] jpayne@68: while in_process and len(tracebacks) < MAX_ALIGNMENTS: jpayne@68: # Although we allow a gap in seqB to be followed by a gap in seqA, jpayne@68: # we don't want to allow it the other way round, since this would jpayne@68: # give redundant alignments of type: A- vs. -A jpayne@68: # -B B- jpayne@68: # Thus we need to keep track if a gap in seqA was opened (col_gap) jpayne@68: # and stop the backtrace (dead_end) if a gap in seqB follows. jpayne@68: # jpayne@68: # Attention: This may fail, if the gap-penalties for both strands are jpayne@68: # different. In this case the second alignment may be the only optimal jpayne@68: # alignment. Thus it can happen that no alignment is returned. For jpayne@68: # this case a workaround was implemented, which reverses the input and jpayne@68: # the matrices (this happens in _reverse_matrices) and repeats the jpayne@68: # backtrace. The variable 'reverse' keeps track of this. jpayne@68: dead_end = False jpayne@68: ali_seqA, ali_seqB, end, row, col, col_gap, trace = in_process.pop() jpayne@68: jpayne@68: while (row > 0 or col > 0) and not dead_end: jpayne@68: cache = (ali_seqA[:], ali_seqB[:], end, row, col, col_gap) jpayne@68: jpayne@68: # If trace is empty we have reached at least one border of the jpayne@68: # matrix or the end of a local alignment. Just add the rest of jpayne@68: # the sequence(s) and fill with gaps if necessary. jpayne@68: if not trace: jpayne@68: if col and col_gap: jpayne@68: dead_end = True jpayne@68: else: jpayne@68: ali_seqA, ali_seqB = _finish_backtrace( jpayne@68: sequenceA, sequenceB, ali_seqA, ali_seqB, row, col, gap_char jpayne@68: ) jpayne@68: break jpayne@68: elif trace % 2 == 1: # = row open = open gap in seqA jpayne@68: trace -= 1 jpayne@68: if col_gap: jpayne@68: dead_end = True jpayne@68: else: jpayne@68: col -= 1 jpayne@68: ali_seqA += gap_char jpayne@68: ali_seqB += sequenceB[col : col + 1] jpayne@68: col_gap = False jpayne@68: elif trace % 4 == 2: # = match/mismatch of seqA with seqB jpayne@68: trace -= 2 jpayne@68: row -= 1 jpayne@68: col -= 1 jpayne@68: ali_seqA += sequenceA[row : row + 1] jpayne@68: ali_seqB += sequenceB[col : col + 1] jpayne@68: col_gap = False jpayne@68: elif trace % 8 == 4: # = col open = open gap in seqB jpayne@68: trace -= 4 jpayne@68: row -= 1 jpayne@68: ali_seqA += sequenceA[row : row + 1] jpayne@68: ali_seqB += gap_char jpayne@68: col_gap = True jpayne@68: elif trace in (8, 24): # = row extend = extend gap in seqA jpayne@68: trace -= 8 jpayne@68: if col_gap: jpayne@68: dead_end = True jpayne@68: else: jpayne@68: col_gap = False jpayne@68: # We need to find the starting point of the extended gap jpayne@68: x = _find_gap_open( jpayne@68: sequenceA, jpayne@68: sequenceB, jpayne@68: ali_seqA, jpayne@68: ali_seqB, jpayne@68: end, jpayne@68: row, jpayne@68: col, jpayne@68: col_gap, jpayne@68: gap_char, jpayne@68: score_matrix, jpayne@68: trace_matrix, jpayne@68: in_process, jpayne@68: gap_A_fn, jpayne@68: col, jpayne@68: row, jpayne@68: "col", jpayne@68: best_score, jpayne@68: align_globally, jpayne@68: ) jpayne@68: ali_seqA, ali_seqB, row, col, in_process, dead_end = x jpayne@68: elif trace == 16: # = col extend = extend gap in seqB jpayne@68: trace -= 16 jpayne@68: col_gap = True jpayne@68: x = _find_gap_open( jpayne@68: sequenceA, jpayne@68: sequenceB, jpayne@68: ali_seqA, jpayne@68: ali_seqB, jpayne@68: end, jpayne@68: row, jpayne@68: col, jpayne@68: col_gap, jpayne@68: gap_char, jpayne@68: score_matrix, jpayne@68: trace_matrix, jpayne@68: in_process, jpayne@68: gap_B_fn, jpayne@68: row, jpayne@68: col, jpayne@68: "row", jpayne@68: best_score, jpayne@68: align_globally, jpayne@68: ) jpayne@68: ali_seqA, ali_seqB, row, col, in_process, dead_end = x jpayne@68: jpayne@68: if trace: # There is another path to follow... jpayne@68: cache += (trace,) jpayne@68: in_process.append(cache) jpayne@68: trace = trace_matrix[row][col] jpayne@68: if not align_globally: jpayne@68: if score_matrix[row][col] == best_score: jpayne@68: # We have gone through a 'zero-score' extension, discard it jpayne@68: dead_end = True jpayne@68: elif score_matrix[row][col] <= 0: jpayne@68: # We have reached the end of the backtrace jpayne@68: begin = max(row, col) jpayne@68: trace = 0 jpayne@68: if not dead_end: jpayne@68: if not reverse: jpayne@68: tracebacks.append((ali_seqA[::-1], ali_seqB[::-1], score, begin, end)) jpayne@68: else: jpayne@68: tracebacks.append((ali_seqB[::-1], ali_seqA[::-1], score, begin, end)) jpayne@68: if one_alignment_only: jpayne@68: break jpayne@68: return _clean_alignments(tracebacks) jpayne@68: jpayne@68: jpayne@68: def _find_start(score_matrix, best_score, align_globally): jpayne@68: """Return a list of starting points (score, (row, col)) (PRIVATE). jpayne@68: jpayne@68: Indicating every possible place to start the tracebacks. jpayne@68: """ jpayne@68: nrows, ncols = len(score_matrix), len(score_matrix[0]) jpayne@68: # In this implementation of the global algorithm, the start will always be jpayne@68: # the bottom right corner of the matrix. jpayne@68: if align_globally: jpayne@68: starts = [(best_score, (nrows - 1, ncols - 1))] jpayne@68: else: jpayne@68: # For local alignments, there may be many different start points. jpayne@68: starts = [] jpayne@68: tolerance = 0 # XXX do anything with this? jpayne@68: # Now find all the positions within some tolerance of the best jpayne@68: # score. jpayne@68: for row in range(nrows): jpayne@68: for col in range(ncols): jpayne@68: score = score_matrix[row][col] jpayne@68: if rint(abs(score - best_score)) <= rint(tolerance): jpayne@68: starts.append((score, (row, col))) jpayne@68: return starts jpayne@68: jpayne@68: jpayne@68: def _reverse_matrices(score_matrix, trace_matrix): jpayne@68: """Reverse score and trace matrices (PRIVATE).""" jpayne@68: reverse_score_matrix = [] jpayne@68: reverse_trace_matrix = [] jpayne@68: # fmt: off jpayne@68: reverse_trace = { jpayne@68: 1: 4, 2: 2, 3: 6, 4: 1, 5: 5, 6: 3, 7: 7, 8: 16, 9: 20, 10: 18, 11: 22, 12: 17, jpayne@68: 13: 21, 14: 19, 15: 23, 16: 8, 17: 12, 18: 10, 19: 14, 20: 9, 21: 13, 22: 11, jpayne@68: 23: 15, 24: 24, 25: 28, 26: 26, 27: 30, 28: 25, 29: 29, 30: 27, 31: 31, jpayne@68: None: None, jpayne@68: } jpayne@68: # fmt: on jpayne@68: for col in range(len(score_matrix[0])): jpayne@68: new_score_row = [] jpayne@68: new_trace_row = [] jpayne@68: for row in range(len(score_matrix)): jpayne@68: new_score_row.append(score_matrix[row][col]) jpayne@68: new_trace_row.append(reverse_trace[trace_matrix[row][col]]) jpayne@68: reverse_score_matrix.append(new_score_row) jpayne@68: reverse_trace_matrix.append(new_trace_row) jpayne@68: return reverse_score_matrix, reverse_trace_matrix jpayne@68: jpayne@68: jpayne@68: def _clean_alignments(alignments): jpayne@68: """Take a list of alignments and return a cleaned version (PRIVATE). jpayne@68: jpayne@68: Remove duplicates, make sure begin and end are set correctly, remove jpayne@68: empty alignments. jpayne@68: """ jpayne@68: unique_alignments = [] jpayne@68: for align in alignments: jpayne@68: if align not in unique_alignments: jpayne@68: unique_alignments.append(align) jpayne@68: i = 0 jpayne@68: while i < len(unique_alignments): jpayne@68: seqA, seqB, score, begin, end = unique_alignments[i] jpayne@68: # Make sure end is set reasonably. jpayne@68: if end is None: # global alignment jpayne@68: end = len(seqA) jpayne@68: elif end < 0: jpayne@68: end = end + len(seqA) jpayne@68: # If there's no alignment here, get rid of it. jpayne@68: if begin >= end: jpayne@68: del unique_alignments[i] jpayne@68: continue jpayne@68: unique_alignments[i] = Alignment(seqA, seqB, score, begin, end) jpayne@68: i += 1 jpayne@68: return unique_alignments jpayne@68: jpayne@68: jpayne@68: def _finish_backtrace(sequenceA, sequenceB, ali_seqA, ali_seqB, row, col, gap_char): jpayne@68: """Add remaining sequences and fill with gaps if necessary (PRIVATE).""" jpayne@68: if row: jpayne@68: ali_seqA += sequenceA[row - 1 :: -1] jpayne@68: if col: jpayne@68: ali_seqB += sequenceB[col - 1 :: -1] jpayne@68: if row > col: jpayne@68: ali_seqB += gap_char * (len(ali_seqA) - len(ali_seqB)) jpayne@68: elif col > row: jpayne@68: ali_seqA += gap_char * (len(ali_seqB) - len(ali_seqA)) jpayne@68: return ali_seqA, ali_seqB jpayne@68: jpayne@68: jpayne@68: def _find_gap_open( jpayne@68: sequenceA, jpayne@68: sequenceB, jpayne@68: ali_seqA, jpayne@68: ali_seqB, jpayne@68: end, jpayne@68: row, jpayne@68: col, jpayne@68: col_gap, jpayne@68: gap_char, jpayne@68: score_matrix, jpayne@68: trace_matrix, jpayne@68: in_process, jpayne@68: gap_fn, jpayne@68: target, jpayne@68: index, jpayne@68: direction, jpayne@68: best_score, jpayne@68: align_globally, jpayne@68: ): jpayne@68: """Find the starting point(s) of the extended gap (PRIVATE).""" jpayne@68: dead_end = False jpayne@68: target_score = score_matrix[row][col] jpayne@68: for n in range(target): jpayne@68: if direction == "col": jpayne@68: col -= 1 jpayne@68: ali_seqA += gap_char jpayne@68: ali_seqB += sequenceB[col : col + 1] jpayne@68: else: jpayne@68: row -= 1 jpayne@68: ali_seqA += sequenceA[row : row + 1] jpayne@68: ali_seqB += gap_char jpayne@68: actual_score = score_matrix[row][col] + gap_fn(index, n + 1) jpayne@68: if not align_globally and score_matrix[row][col] == best_score: jpayne@68: # We have run through a 'zero-score' extension and discard it jpayne@68: dead_end = True jpayne@68: break jpayne@68: if rint(actual_score) == rint(target_score) and n > 0: jpayne@68: if not trace_matrix[row][col]: jpayne@68: break jpayne@68: else: jpayne@68: in_process.append( jpayne@68: ( jpayne@68: ali_seqA[:], jpayne@68: ali_seqB[:], jpayne@68: end, jpayne@68: row, jpayne@68: col, jpayne@68: col_gap, jpayne@68: trace_matrix[row][col], jpayne@68: ) jpayne@68: ) jpayne@68: if not trace_matrix[row][col]: jpayne@68: dead_end = True jpayne@68: return ali_seqA, ali_seqB, row, col, in_process, dead_end jpayne@68: jpayne@68: jpayne@68: _PRECISION = 1000 jpayne@68: jpayne@68: jpayne@68: def rint(x, precision=_PRECISION): jpayne@68: """Print number with declared precision.""" jpayne@68: return int(x * precision + 0.5) jpayne@68: jpayne@68: jpayne@68: class identity_match: jpayne@68: """Create a match function for use in an alignment. jpayne@68: jpayne@68: match and mismatch are the scores to give when two residues are equal jpayne@68: or unequal. By default, match is 1 and mismatch is 0. jpayne@68: """ jpayne@68: jpayne@68: def __init__(self, match=1, mismatch=0): jpayne@68: """Initialize the class.""" jpayne@68: self.match = match jpayne@68: self.mismatch = mismatch jpayne@68: jpayne@68: def __call__(self, charA, charB): jpayne@68: """Call a match function instance already created.""" jpayne@68: if charA == charB: jpayne@68: return self.match jpayne@68: return self.mismatch jpayne@68: jpayne@68: jpayne@68: class dictionary_match: jpayne@68: """Create a match function for use in an alignment. jpayne@68: jpayne@68: Attributes: jpayne@68: - score_dict - A dictionary where the keys are tuples (residue 1, jpayne@68: residue 2) and the values are the match scores between those residues. jpayne@68: - symmetric - A flag that indicates whether the scores are symmetric. jpayne@68: jpayne@68: """ jpayne@68: jpayne@68: def __init__(self, score_dict, symmetric=1): jpayne@68: """Initialize the class.""" jpayne@68: if isinstance(score_dict, substitution_matrices.Array): jpayne@68: score_dict = dict(score_dict) # Access to dict is much faster jpayne@68: self.score_dict = score_dict jpayne@68: self.symmetric = symmetric jpayne@68: jpayne@68: def __call__(self, charA, charB): jpayne@68: """Call a dictionary match instance already created.""" jpayne@68: if self.symmetric and (charA, charB) not in self.score_dict: jpayne@68: # If the score dictionary is symmetric, then look up the jpayne@68: # score both ways. jpayne@68: charB, charA = charA, charB jpayne@68: return self.score_dict[(charA, charB)] jpayne@68: jpayne@68: jpayne@68: class affine_penalty: jpayne@68: """Create a gap function for use in an alignment.""" jpayne@68: jpayne@68: def __init__(self, open, extend, penalize_extend_when_opening=0): jpayne@68: """Initialize the class.""" jpayne@68: if open > 0 or extend > 0: jpayne@68: raise ValueError("Gap penalties should be non-positive.") jpayne@68: if not penalize_extend_when_opening and (extend < open): jpayne@68: raise ValueError( jpayne@68: "Gap opening penalty should be higher than " jpayne@68: "gap extension penalty (or equal)" jpayne@68: ) jpayne@68: self.open, self.extend = open, extend jpayne@68: self.penalize_extend_when_opening = penalize_extend_when_opening jpayne@68: jpayne@68: def __call__(self, index, length): jpayne@68: """Call a gap function instance already created.""" jpayne@68: return calc_affine_penalty( jpayne@68: length, self.open, self.extend, self.penalize_extend_when_opening jpayne@68: ) jpayne@68: jpayne@68: jpayne@68: def calc_affine_penalty(length, open, extend, penalize_extend_when_opening): jpayne@68: """Calculate a penalty score for the gap function.""" jpayne@68: if length <= 0: jpayne@68: return 0.0 jpayne@68: penalty = open + extend * length jpayne@68: if not penalize_extend_when_opening: jpayne@68: penalty -= extend jpayne@68: return penalty jpayne@68: jpayne@68: jpayne@68: def print_matrix(matrix): jpayne@68: """Print out a matrix for debugging purposes.""" jpayne@68: # Transpose the matrix and get the length of the values in each column. jpayne@68: matrixT = [[] for x in range(len(matrix[0]))] jpayne@68: for i in range(len(matrix)): jpayne@68: for j in range(len(matrix[i])): jpayne@68: matrixT[j].append(len(str(matrix[i][j]))) jpayne@68: ndigits = [max(x) for x in matrixT] jpayne@68: for i in range(len(matrix)): jpayne@68: # Using string formatting trick to add leading spaces, jpayne@68: print( jpayne@68: " ".join("%*s " % (ndigits[j], matrix[i][j]) for j in range(len(matrix[i]))) jpayne@68: ) jpayne@68: jpayne@68: jpayne@68: def format_alignment(align1, align2, score, begin, end, full_sequences=False): jpayne@68: """Format the alignment prettily into a string. jpayne@68: jpayne@68: IMPORTANT: Gap symbol must be "-" (or ['-'] for lists)! jpayne@68: jpayne@68: Since Biopython 1.71 identical matches are shown with a pipe jpayne@68: character, mismatches as a dot, and gaps as a space. jpayne@68: jpayne@68: Prior releases just used the pipe character to indicate the jpayne@68: aligned region (matches, mismatches and gaps). jpayne@68: jpayne@68: Also, in local alignments, if the alignment does not include jpayne@68: the whole sequences, now only the aligned part is shown, jpayne@68: together with the start positions of the aligned subsequences. jpayne@68: The start positions are 1-based; so start position n is the jpayne@68: n-th base/amino acid in the *un-aligned* sequence. jpayne@68: jpayne@68: NOTE: This is different to the alignment's begin/end values, jpayne@68: which give the Python indices (0-based) of the bases/amino acids jpayne@68: in the *aligned* sequences. jpayne@68: jpayne@68: If you want to restore the 'historic' behaviour, that means jpayne@68: displaying the whole sequences (including the non-aligned parts), jpayne@68: use ``full_sequences=True``. In this case, the non-aligned leading jpayne@68: and trailing parts are also indicated by spaces in the match-line. jpayne@68: """ jpayne@68: align_begin = begin jpayne@68: align_end = end jpayne@68: start1 = start2 = "" jpayne@68: start_m = begin # Begin of match line (how many spaces to include) jpayne@68: # For local alignments: jpayne@68: if not full_sequences and (begin != 0 or end != len(align1)): jpayne@68: # Calculate the actual start positions in the un-aligned sequences jpayne@68: # This will only work if the gap symbol is '-' or ['-']! jpayne@68: start1 = str(len(align1[:begin]) - align1[:begin].count("-") + 1) + " " jpayne@68: start2 = str(len(align2[:begin]) - align2[:begin].count("-") + 1) + " " jpayne@68: start_m = max(len(start1), len(start2)) jpayne@68: elif full_sequences: jpayne@68: start_m = 0 jpayne@68: begin = 0 jpayne@68: end = len(align1) jpayne@68: jpayne@68: if isinstance(align1, list): jpayne@68: # List elements will be separated by spaces, since they can be jpayne@68: # of different lengths jpayne@68: align1 = [a + " " for a in align1] jpayne@68: align2 = [a + " " for a in align2] jpayne@68: jpayne@68: s1_line = ["{:>{width}}".format(start1, width=start_m)] # seq1 line jpayne@68: m_line = [" " * start_m] # match line jpayne@68: s2_line = ["{:>{width}}".format(start2, width=start_m)] # seq2 line jpayne@68: jpayne@68: for n, (a, b) in enumerate(zip(align1[begin:end], align2[begin:end])): jpayne@68: # Since list elements can be of different length, we center them, jpayne@68: # using the maximum length of the two compared elements as width jpayne@68: m_len = max(len(a), len(b)) jpayne@68: s1_line.append("{:^{width}}".format(a, width=m_len)) jpayne@68: s2_line.append("{:^{width}}".format(b, width=m_len)) jpayne@68: if full_sequences and (n < align_begin or n >= align_end): jpayne@68: m_line.append("{:^{width}}".format(" ", width=m_len)) # space jpayne@68: continue jpayne@68: if a == b: jpayne@68: m_line.append("{:^{width}}".format("|", width=m_len)) # match jpayne@68: elif a.strip() == "-" or b.strip() == "-": jpayne@68: m_line.append("{:^{width}}".format(" ", width=m_len)) # gap jpayne@68: else: jpayne@68: m_line.append("{:^{width}}".format(".", width=m_len)) # mismatch jpayne@68: jpayne@68: s2_line.append(f"\n Score={score:g}\n") jpayne@68: return "\n".join(["".join(s1_line), "".join(m_line), "".join(s2_line)]) jpayne@68: jpayne@68: jpayne@68: # Try and load C implementations of functions. If I can't, jpayne@68: # then throw a warning and use the pure Python implementations. jpayne@68: # The redefinition is deliberate, thus the no quality assurance jpayne@68: # flag for when using flake8. jpayne@68: # Before, we secure access to the pure Python functions (for testing purposes): jpayne@68: jpayne@68: _python_make_score_matrix_fast = _make_score_matrix_fast jpayne@68: _python_rint = rint jpayne@68: jpayne@68: try: jpayne@68: from .cpairwise2 import rint, _make_score_matrix_fast # noqa jpayne@68: except ImportError: jpayne@68: warnings.warn( jpayne@68: "Import of C module failed. Falling back to pure Python " jpayne@68: "implementation. This may be slooow...", jpayne@68: BiopythonWarning, jpayne@68: ) jpayne@68: jpayne@68: if __name__ == "__main__": jpayne@68: from Bio._utils import run_doctest jpayne@68: jpayne@68: run_doctest()