jpayne@69: # Copyright 2002 by Jeffrey Chang. jpayne@69: # Copyright 2016, 2019, 2020 by Markus Piotrowski. jpayne@69: # All rights reserved. jpayne@69: # jpayne@69: # This file is part of the Biopython distribution and governed by your jpayne@69: # choice of the "Biopython License Agreement" or the "BSD 3-Clause License". jpayne@69: # Please see the LICENSE file that should have been included as part of this jpayne@69: # package. jpayne@69: jpayne@69: """Pairwise sequence alignment using a dynamic programming algorithm. jpayne@69: jpayne@69: This provides functions to get global and local alignments between two jpayne@69: sequences. A global alignment finds the best concordance between all jpayne@69: characters in two sequences. A local alignment finds just the jpayne@69: subsequences that align the best. Local alignments must have a positive jpayne@69: score to be reported and they will not be extended for 'zero counting' jpayne@69: matches. This means a local alignment will always start and end with jpayne@69: a positive counting match. jpayne@69: jpayne@69: When doing alignments, you can specify the match score and gap jpayne@69: penalties. The match score indicates the compatibility between an jpayne@69: alignment of two characters in the sequences. Highly compatible jpayne@69: characters should be given positive scores, and incompatible ones jpayne@69: should be given negative scores or 0. The gap penalties should be jpayne@69: negative. jpayne@69: jpayne@69: The names of the alignment functions in this module follow the jpayne@69: convention jpayne@69: XX jpayne@69: where is either "global" or "local" and XX is a 2 jpayne@69: character code indicating the parameters it takes. The first jpayne@69: character indicates the parameters for matches (and mismatches), and jpayne@69: the second indicates the parameters for gap penalties. jpayne@69: jpayne@69: The match parameters are:: jpayne@69: jpayne@69: CODE DESCRIPTION & OPTIONAL KEYWORDS jpayne@69: x No parameters. Identical characters have score of 1, otherwise 0. jpayne@69: m A match score is the score of identical chars, otherwise mismatch jpayne@69: score. Keywords ``match``, ``mismatch``. jpayne@69: d A dictionary returns the score of any pair of characters. jpayne@69: Keyword ``match_dict``. jpayne@69: c A callback function returns scores. Keyword ``match_fn``. jpayne@69: jpayne@69: The gap penalty parameters are:: jpayne@69: jpayne@69: CODE DESCRIPTION & OPTIONAL KEYWORDS jpayne@69: x No gap penalties. jpayne@69: s Same open and extend gap penalties for both sequences. jpayne@69: Keywords ``open``, ``extend``. jpayne@69: d The sequences have different open and extend gap penalties. jpayne@69: Keywords ``openA``, ``extendA``, ``openB``, ``extendB``. jpayne@69: c A callback function returns the gap penalties. jpayne@69: Keywords ``gap_A_fn``, ``gap_B_fn``. jpayne@69: jpayne@69: All the different alignment functions are contained in an object jpayne@69: ``align``. For example: jpayne@69: jpayne@69: >>> from Bio import pairwise2 jpayne@69: >>> alignments = pairwise2.align.globalxx("ACCGT", "ACG") jpayne@69: jpayne@69: For better readability, the required arguments can be used with optional keywords: jpayne@69: jpayne@69: >>> alignments = pairwise2.align.globalxx(sequenceA="ACCGT", sequenceB="ACG") jpayne@69: jpayne@69: The result is a list of the alignments between the two strings. Each alignment jpayne@69: is a named tuple consisting of the two aligned sequences, the score and the jpayne@69: start and end positions of the alignment: jpayne@69: jpayne@69: >>> print(alignments) jpayne@69: [Alignment(seqA='ACCGT', seqB='A-CG-', score=3.0, start=0, end=5), ... jpayne@69: jpayne@69: You can access each element of an alignment by index or name: jpayne@69: jpayne@69: >>> alignments[0][2] jpayne@69: 3.0 jpayne@69: >>> alignments[0].score jpayne@69: 3.0 jpayne@69: jpayne@69: For a nice printout of an alignment, use the ``format_alignment`` method of jpayne@69: the module: jpayne@69: jpayne@69: >>> from Bio.pairwise2 import format_alignment jpayne@69: >>> print(format_alignment(*alignments[0])) jpayne@69: ACCGT jpayne@69: | || jpayne@69: A-CG- jpayne@69: Score=3 jpayne@69: jpayne@69: jpayne@69: All alignment functions have the following arguments: jpayne@69: jpayne@69: - Two sequences: strings, Biopython sequence objects or lists. jpayne@69: Lists are useful for supplying sequences which contain residues that are jpayne@69: encoded by more than one letter. jpayne@69: jpayne@69: - ``penalize_extend_when_opening``: boolean (default: False). jpayne@69: Whether to count an extension penalty when opening a gap. If false, a gap of jpayne@69: 1 is only penalized an "open" penalty, otherwise it is penalized jpayne@69: "open+extend". jpayne@69: jpayne@69: - ``penalize_end_gaps``: boolean. jpayne@69: Whether to count the gaps at the ends of an alignment. By default, they are jpayne@69: counted for global alignments but not for local ones. Setting jpayne@69: ``penalize_end_gaps`` to (boolean, boolean) allows you to specify for the jpayne@69: two sequences separately whether gaps at the end of the alignment should be jpayne@69: counted. jpayne@69: jpayne@69: - ``gap_char``: string (default: ``'-'``). jpayne@69: Which character to use as a gap character in the alignment returned. If your jpayne@69: input sequences are lists, you must change this to ``['-']``. jpayne@69: jpayne@69: - ``force_generic``: boolean (default: False). jpayne@69: Always use the generic, non-cached, dynamic programming function (slow!). jpayne@69: For debugging. jpayne@69: jpayne@69: - ``score_only``: boolean (default: False). jpayne@69: Only get the best score, don't recover any alignments. The return value of jpayne@69: the function is the score. Faster and uses less memory. jpayne@69: jpayne@69: - ``one_alignment_only``: boolean (default: False). jpayne@69: Only recover one alignment. jpayne@69: jpayne@69: The other parameters of the alignment function depend on the function called. jpayne@69: Some examples: jpayne@69: jpayne@69: - Find the best global alignment between the two sequences. Identical jpayne@69: characters are given 1 point. No points are deducted for mismatches or gaps. jpayne@69: jpayne@69: >>> for a in pairwise2.align.globalxx("ACCGT", "ACG"): jpayne@69: ... print(format_alignment(*a)) jpayne@69: ACCGT jpayne@69: | || jpayne@69: A-CG- jpayne@69: Score=3 jpayne@69: jpayne@69: ACCGT jpayne@69: || | jpayne@69: AC-G- jpayne@69: Score=3 jpayne@69: jpayne@69: jpayne@69: - Same thing as before, but with a local alignment. Note that jpayne@69: ``format_alignment`` will only show the aligned parts of the sequences, jpayne@69: together with the starting positions. jpayne@69: jpayne@69: >>> for a in pairwise2.align.localxx("ACCGT", "ACG"): jpayne@69: ... print(format_alignment(*a)) jpayne@69: 1 ACCG jpayne@69: | || jpayne@69: 1 A-CG jpayne@69: Score=3 jpayne@69: jpayne@69: 1 ACCG jpayne@69: || | jpayne@69: 1 AC-G jpayne@69: Score=3 jpayne@69: jpayne@69: jpayne@69: To restore the 'historic' behaviour of ``format_alignemt``, i.e., showing jpayne@69: also the un-aligned parts of both sequences, use the new keyword parameter jpayne@69: ``full_sequences``: jpayne@69: jpayne@69: >>> for a in pairwise2.align.localxx("ACCGT", "ACG"): jpayne@69: ... print(format_alignment(*a, full_sequences=True)) jpayne@69: ACCGT jpayne@69: | || jpayne@69: A-CG- jpayne@69: Score=3 jpayne@69: jpayne@69: ACCGT jpayne@69: || | jpayne@69: AC-G- jpayne@69: Score=3 jpayne@69: jpayne@69: jpayne@69: jpayne@69: - Do a global alignment. Identical characters are given 2 points, 1 point is jpayne@69: deducted for each non-identical character. Don't penalize gaps. jpayne@69: jpayne@69: >>> for a in pairwise2.align.globalmx("ACCGT", "ACG", 2, -1): jpayne@69: ... print(format_alignment(*a)) jpayne@69: ACCGT jpayne@69: | || jpayne@69: A-CG- jpayne@69: Score=6 jpayne@69: jpayne@69: ACCGT jpayne@69: || | jpayne@69: AC-G- jpayne@69: Score=6 jpayne@69: jpayne@69: jpayne@69: - Same as above, except now 0.5 points are deducted when opening a gap, and jpayne@69: 0.1 points are deducted when extending it. jpayne@69: jpayne@69: >>> for a in pairwise2.align.globalms("ACCGT", "ACG", 2, -1, -.5, -.1): jpayne@69: ... print(format_alignment(*a)) jpayne@69: ACCGT jpayne@69: | || jpayne@69: A-CG- jpayne@69: Score=5 jpayne@69: jpayne@69: ACCGT jpayne@69: || | jpayne@69: AC-G- jpayne@69: Score=5 jpayne@69: jpayne@69: jpayne@69: - Note that you can use keywords to increase the readability, e.g.: jpayne@69: jpayne@69: >>> a = pairwise2.align.globalms("ACGT", "ACG", match=2, mismatch=-1, open=-.5, jpayne@69: ... extend=-.1) jpayne@69: jpayne@69: - Depending on the penalties, a gap in one sequence may be followed by a gap in jpayne@69: the other sequence.If you don't like this behaviour, increase the gap-open jpayne@69: penalty: jpayne@69: jpayne@69: >>> for a in pairwise2.align.globalms("A", "T", 5, -4, -1, -.1): jpayne@69: ... print(format_alignment(*a)) jpayne@69: A- jpayne@69: jpayne@69: -T jpayne@69: Score=-2 jpayne@69: jpayne@69: >>> for a in pairwise2.align.globalms("A", "T", 5, -4, -3, -.1): jpayne@69: ... print(format_alignment(*a)) jpayne@69: A jpayne@69: . jpayne@69: T jpayne@69: Score=-4 jpayne@69: jpayne@69: jpayne@69: - The alignment function can also use known matrices already included in jpayne@69: Biopython (in ``Bio.Align.substitution_matrices``): jpayne@69: jpayne@69: >>> from Bio.Align import substitution_matrices jpayne@69: >>> matrix = substitution_matrices.load("BLOSUM62") jpayne@69: >>> for a in pairwise2.align.globaldx("KEVLA", "EVL", matrix): jpayne@69: ... print(format_alignment(*a)) jpayne@69: KEVLA jpayne@69: ||| jpayne@69: -EVL- jpayne@69: Score=13 jpayne@69: jpayne@69: jpayne@69: - With the parameter ``c`` you can define your own match- and gap functions. jpayne@69: E.g. to define an affine logarithmic gap function and using it: jpayne@69: jpayne@69: >>> from math import log jpayne@69: >>> def gap_function(x, y): # x is gap position in seq, y is gap length jpayne@69: ... if y == 0: # No gap jpayne@69: ... return 0 jpayne@69: ... elif y == 1: # Gap open penalty jpayne@69: ... return -2 jpayne@69: ... return - (2 + y/4.0 + log(y)/2.0) jpayne@69: ... jpayne@69: >>> alignment = pairwise2.align.globalmc("ACCCCCGT", "ACG", 5, -4, jpayne@69: ... gap_function, gap_function) jpayne@69: jpayne@69: You can define different gap functions for each sequence. jpayne@69: Self-defined match functions must take the two residues to be compared and jpayne@69: return a score. jpayne@69: jpayne@69: To see a description of the parameters for a function, please look at jpayne@69: the docstring for the function via the help function, e.g. jpayne@69: type ``help(pairwise2.align.localds)`` at the Python prompt. jpayne@69: jpayne@69: """ # noqa: W291 jpayne@69: jpayne@69: import warnings jpayne@69: from collections import namedtuple jpayne@69: jpayne@69: from Bio import BiopythonWarning jpayne@69: from Bio import BiopythonDeprecationWarning jpayne@69: from Bio.Align import substitution_matrices jpayne@69: jpayne@69: warnings.warn( jpayne@69: "Bio.pairwise2 has been deprecated, and we intend to remove it in a " jpayne@69: "future release of Biopython. As an alternative, please consider using " jpayne@69: "Bio.Align.PairwiseAligner as a replacement, and contact the " jpayne@69: "Biopython developers if you still need the Bio.pairwise2 module.", jpayne@69: BiopythonDeprecationWarning, jpayne@69: ) jpayne@69: jpayne@69: jpayne@69: MAX_ALIGNMENTS = 1000 # maximum alignments recovered in traceback jpayne@69: jpayne@69: Alignment = namedtuple("Alignment", ("seqA, seqB, score, start, end")) jpayne@69: jpayne@69: jpayne@69: class align: jpayne@69: """Provide functions that do alignments. jpayne@69: jpayne@69: Alignment functions are called as: jpayne@69: jpayne@69: pairwise2.align.globalXX jpayne@69: jpayne@69: or jpayne@69: jpayne@69: pairwise2.align.localXX jpayne@69: jpayne@69: Where XX is a 2 character code indicating the match/mismatch parameters jpayne@69: (first character, either x, m, d or c) and the gap penalty parameters jpayne@69: (second character, either x, s, d, or c). jpayne@69: jpayne@69: For a detailed description read the main module's docstring (e.g., jpayne@69: type ``help(pairwise2)``). jpayne@69: To see a description of the parameters for a function, please jpayne@69: look at the docstring for the function, e.g. type jpayne@69: ``help(pairwise2.align.localds)`` at the Python prompt. jpayne@69: """ jpayne@69: jpayne@69: class alignment_function: jpayne@69: """Callable class which impersonates an alignment function. jpayne@69: jpayne@69: The constructor takes the name of the function. This class jpayne@69: will decode the name of the function to figure out how to jpayne@69: interpret the parameters. jpayne@69: """ jpayne@69: jpayne@69: # match code -> tuple of (parameters, docstring) jpayne@69: match2args = { jpayne@69: "x": ([], ""), jpayne@69: "m": ( jpayne@69: ["match", "mismatch"], jpayne@69: "match is the score to given to identical characters.\n" jpayne@69: "mismatch is the score given to non-identical ones.", jpayne@69: ), jpayne@69: "d": ( jpayne@69: ["match_dict"], jpayne@69: "match_dict is a dictionary where the keys are tuples\n" jpayne@69: "of pairs of characters and the values are the scores,\n" jpayne@69: "e.g. ('A', 'C') : 2.5.", jpayne@69: ), jpayne@69: "c": ( jpayne@69: ["match_fn"], jpayne@69: "match_fn is a callback function that takes two " jpayne@69: "characters and returns the score between them.", jpayne@69: ), jpayne@69: } jpayne@69: # penalty code -> tuple of (parameters, docstring) jpayne@69: penalty2args = { jpayne@69: "x": ([], ""), jpayne@69: "s": ( jpayne@69: ["open", "extend"], jpayne@69: "open and extend are the gap penalties when a gap is\n" jpayne@69: "opened and extended. They should be negative.", jpayne@69: ), jpayne@69: "d": ( jpayne@69: ["openA", "extendA", "openB", "extendB"], jpayne@69: "openA and extendA are the gap penalties for sequenceA,\n" jpayne@69: "and openB and extendB for sequenceB. The penalties\n" jpayne@69: "should be negative.", jpayne@69: ), jpayne@69: "c": ( jpayne@69: ["gap_A_fn", "gap_B_fn"], jpayne@69: "gap_A_fn and gap_B_fn are callback functions that takes\n" jpayne@69: "(1) the index where the gap is opened, and (2) the length\n" jpayne@69: "of the gap. They should return a gap penalty.", jpayne@69: ), jpayne@69: } jpayne@69: jpayne@69: def __init__(self, name): jpayne@69: """Check to make sure the name of the function is reasonable.""" jpayne@69: if name.startswith("global"): jpayne@69: if len(name) != 8: jpayne@69: raise AttributeError("function should be globalXX") jpayne@69: elif name.startswith("local"): jpayne@69: if len(name) != 7: jpayne@69: raise AttributeError("function should be localXX") jpayne@69: else: jpayne@69: raise AttributeError(name) jpayne@69: align_type, match_type, penalty_type = name[:-2], name[-2], name[-1] jpayne@69: try: jpayne@69: match_args, match_doc = self.match2args[match_type] jpayne@69: except KeyError: jpayne@69: raise AttributeError(f"unknown match type {match_type!r}") jpayne@69: try: jpayne@69: penalty_args, penalty_doc = self.penalty2args[penalty_type] jpayne@69: except KeyError: jpayne@69: raise AttributeError(f"unknown penalty type {penalty_type!r}") jpayne@69: jpayne@69: # Now get the names of the parameters to this function. jpayne@69: param_names = ["sequenceA", "sequenceB"] jpayne@69: param_names.extend(match_args) jpayne@69: param_names.extend(penalty_args) jpayne@69: self.function_name = name jpayne@69: self.align_type = align_type jpayne@69: self.param_names = param_names jpayne@69: jpayne@69: self.__name__ = self.function_name jpayne@69: # Set the doc string. jpayne@69: doc = f"{self.__name__}({', '.join(self.param_names)}) -> alignments\n" jpayne@69: doc += """\ jpayne@69: \nThe following parameters can also be used with optional jpayne@69: keywords of the same name.\n\n jpayne@69: sequenceA and sequenceB must be of the same type, either jpayne@69: strings, lists or Biopython sequence objects.\n jpayne@69: """ jpayne@69: if match_doc: jpayne@69: doc += f"\n{match_doc}\n" jpayne@69: if penalty_doc: jpayne@69: doc += f"\n{penalty_doc}\n" jpayne@69: doc += """\ jpayne@69: \nalignments is a list of named tuples (seqA, seqB, score, jpayne@69: begin, end). seqA and seqB are strings showing the alignment jpayne@69: between the sequences. score is the score of the alignment. jpayne@69: begin and end are indexes of seqA and seqB that indicate jpayne@69: where the alignment occurs. jpayne@69: """ jpayne@69: self.__doc__ = doc jpayne@69: jpayne@69: def decode(self, *args, **keywds): jpayne@69: """Decode the arguments for the _align function. jpayne@69: jpayne@69: keywds will get passed to it, so translate the arguments jpayne@69: to this function into forms appropriate for _align. jpayne@69: """ jpayne@69: keywds = keywds.copy() jpayne@69: jpayne@69: # Replace possible "keywords" with arguments: jpayne@69: args += (len(self.param_names) - len(args)) * (None,) jpayne@69: for key in keywds.copy(): jpayne@69: if key in self.param_names: jpayne@69: _index = self.param_names.index(key) jpayne@69: args = args[:_index] + (keywds[key],) + args[_index:] jpayne@69: del keywds[key] jpayne@69: args = tuple(arg for arg in args if arg is not None) jpayne@69: jpayne@69: if len(args) != len(self.param_names): jpayne@69: raise TypeError( jpayne@69: "%s takes exactly %d argument (%d given)" jpayne@69: % (self.function_name, len(self.param_names), len(args)) jpayne@69: ) jpayne@69: jpayne@69: i = 0 jpayne@69: while i < len(self.param_names): jpayne@69: if self.param_names[i] in [ jpayne@69: "sequenceA", jpayne@69: "sequenceB", jpayne@69: "gap_A_fn", jpayne@69: "gap_B_fn", jpayne@69: "match_fn", jpayne@69: ]: jpayne@69: keywds[self.param_names[i]] = args[i] jpayne@69: i += 1 jpayne@69: elif self.param_names[i] == "match": jpayne@69: assert self.param_names[i + 1] == "mismatch" jpayne@69: match, mismatch = args[i], args[i + 1] jpayne@69: keywds["match_fn"] = identity_match(match, mismatch) jpayne@69: i += 2 jpayne@69: elif self.param_names[i] == "match_dict": jpayne@69: keywds["match_fn"] = dictionary_match(args[i]) jpayne@69: i += 1 jpayne@69: elif self.param_names[i] == "open": jpayne@69: assert self.param_names[i + 1] == "extend" jpayne@69: open, extend = args[i], args[i + 1] jpayne@69: pe = keywds.get("penalize_extend_when_opening", 0) jpayne@69: keywds["gap_A_fn"] = affine_penalty(open, extend, pe) jpayne@69: keywds["gap_B_fn"] = affine_penalty(open, extend, pe) jpayne@69: i += 2 jpayne@69: elif self.param_names[i] == "openA": jpayne@69: assert self.param_names[i + 3] == "extendB" jpayne@69: openA, extendA, openB, extendB = args[i : i + 4] jpayne@69: pe = keywds.get("penalize_extend_when_opening", 0) jpayne@69: keywds["gap_A_fn"] = affine_penalty(openA, extendA, pe) jpayne@69: keywds["gap_B_fn"] = affine_penalty(openB, extendB, pe) jpayne@69: i += 4 jpayne@69: else: jpayne@69: raise ValueError(f"unknown parameter {self.param_names[i]!r}") jpayne@69: jpayne@69: # Here are the default parameters for _align. Assign jpayne@69: # these to keywds, unless already specified. jpayne@69: pe = keywds.get("penalize_extend_when_opening", 0) jpayne@69: default_params = [ jpayne@69: ("match_fn", identity_match(1, 0)), jpayne@69: ("gap_A_fn", affine_penalty(0, 0, pe)), jpayne@69: ("gap_B_fn", affine_penalty(0, 0, pe)), jpayne@69: ("penalize_extend_when_opening", 0), jpayne@69: ("penalize_end_gaps", self.align_type == "global"), jpayne@69: ("align_globally", self.align_type == "global"), jpayne@69: ("gap_char", "-"), jpayne@69: ("force_generic", 0), jpayne@69: ("score_only", 0), jpayne@69: ("one_alignment_only", 0), jpayne@69: ] jpayne@69: for name, default in default_params: jpayne@69: keywds[name] = keywds.get(name, default) jpayne@69: value = keywds["penalize_end_gaps"] jpayne@69: try: jpayne@69: n = len(value) jpayne@69: except TypeError: jpayne@69: keywds["penalize_end_gaps"] = tuple([value] * 2) jpayne@69: else: jpayne@69: assert n == 2 jpayne@69: return keywds jpayne@69: jpayne@69: def __call__(self, *args, **keywds): jpayne@69: """Call the alignment instance already created.""" jpayne@69: keywds = self.decode(*args, **keywds) jpayne@69: return _align(**keywds) jpayne@69: jpayne@69: def __getattr__(self, attr): jpayne@69: """Call alignment_function() to check and decode the attributes.""" jpayne@69: # The following 'magic' is needed to rewrite the class docstring jpayne@69: # dynamically: jpayne@69: wrapper = self.alignment_function(attr) jpayne@69: wrapper_type = type(wrapper) jpayne@69: wrapper_dict = wrapper_type.__dict__.copy() jpayne@69: wrapper_dict["__doc__"] = wrapper.__doc__ jpayne@69: new_alignment_function = type("alignment_function", (object,), wrapper_dict) jpayne@69: jpayne@69: return new_alignment_function(attr) jpayne@69: jpayne@69: jpayne@69: align = align() jpayne@69: jpayne@69: jpayne@69: def _align( jpayne@69: sequenceA, jpayne@69: sequenceB, jpayne@69: match_fn, jpayne@69: gap_A_fn, jpayne@69: gap_B_fn, jpayne@69: penalize_extend_when_opening, jpayne@69: penalize_end_gaps, jpayne@69: align_globally, jpayne@69: gap_char, jpayne@69: force_generic, jpayne@69: score_only, jpayne@69: one_alignment_only, jpayne@69: ): jpayne@69: """Return optimal alignments between two sequences (PRIVATE). jpayne@69: jpayne@69: This method either returns a list of optimal alignments (with the same jpayne@69: score) or just the optimal score. jpayne@69: """ jpayne@69: if not sequenceA or not sequenceB: jpayne@69: return [] jpayne@69: try: jpayne@69: sequenceA + gap_char jpayne@69: sequenceB + gap_char jpayne@69: except TypeError: jpayne@69: raise TypeError( jpayne@69: "both sequences must be of the same type, either " jpayne@69: "string/sequence object or list. Gap character must " jpayne@69: "fit the sequence type (string or list)" jpayne@69: ) jpayne@69: jpayne@69: if not isinstance(sequenceA, list): jpayne@69: sequenceA = str(sequenceA) jpayne@69: if not isinstance(sequenceB, list): jpayne@69: sequenceB = str(sequenceB) jpayne@69: if not align_globally and (penalize_end_gaps[0] or penalize_end_gaps[1]): jpayne@69: warnings.warn( jpayne@69: '"penalize_end_gaps" should not be used in local ' jpayne@69: "alignments. The resulting score may be wrong.", jpayne@69: BiopythonWarning, jpayne@69: ) jpayne@69: jpayne@69: if ( jpayne@69: (not force_generic) jpayne@69: and isinstance(gap_A_fn, affine_penalty) jpayne@69: and isinstance(gap_B_fn, affine_penalty) jpayne@69: ): jpayne@69: open_A, extend_A = gap_A_fn.open, gap_A_fn.extend jpayne@69: open_B, extend_B = gap_B_fn.open, gap_B_fn.extend jpayne@69: matrices = _make_score_matrix_fast( jpayne@69: sequenceA, jpayne@69: sequenceB, jpayne@69: match_fn, jpayne@69: open_A, jpayne@69: extend_A, jpayne@69: open_B, jpayne@69: extend_B, jpayne@69: penalize_extend_when_opening, jpayne@69: penalize_end_gaps, jpayne@69: align_globally, jpayne@69: score_only, jpayne@69: ) jpayne@69: else: jpayne@69: matrices = _make_score_matrix_generic( jpayne@69: sequenceA, jpayne@69: sequenceB, jpayne@69: match_fn, jpayne@69: gap_A_fn, jpayne@69: gap_B_fn, jpayne@69: penalize_end_gaps, jpayne@69: align_globally, jpayne@69: score_only, jpayne@69: ) jpayne@69: jpayne@69: score_matrix, trace_matrix, best_score = matrices jpayne@69: jpayne@69: # print("SCORE %s" % print_matrix(score_matrix)) jpayne@69: # print("TRACEBACK %s" % print_matrix(trace_matrix)) jpayne@69: jpayne@69: # If they only want the score, then return it. jpayne@69: if score_only: jpayne@69: return best_score jpayne@69: jpayne@69: starts = _find_start(score_matrix, best_score, align_globally) jpayne@69: jpayne@69: # Recover the alignments and return them. jpayne@69: alignments = _recover_alignments( jpayne@69: sequenceA, jpayne@69: sequenceB, jpayne@69: starts, jpayne@69: best_score, jpayne@69: score_matrix, jpayne@69: trace_matrix, jpayne@69: align_globally, jpayne@69: gap_char, jpayne@69: one_alignment_only, jpayne@69: gap_A_fn, jpayne@69: gap_B_fn, jpayne@69: ) jpayne@69: if not alignments: jpayne@69: # This may happen, see recover_alignments for explanation jpayne@69: score_matrix, trace_matrix = _reverse_matrices(score_matrix, trace_matrix) jpayne@69: starts = [(z, (y, x)) for z, (x, y) in starts] jpayne@69: alignments = _recover_alignments( jpayne@69: sequenceB, jpayne@69: sequenceA, jpayne@69: starts, jpayne@69: best_score, jpayne@69: score_matrix, jpayne@69: trace_matrix, jpayne@69: align_globally, jpayne@69: gap_char, jpayne@69: one_alignment_only, jpayne@69: gap_B_fn, jpayne@69: gap_A_fn, jpayne@69: reverse=True, jpayne@69: ) jpayne@69: return alignments jpayne@69: jpayne@69: jpayne@69: def _make_score_matrix_generic( jpayne@69: sequenceA, jpayne@69: sequenceB, jpayne@69: match_fn, jpayne@69: gap_A_fn, jpayne@69: gap_B_fn, jpayne@69: penalize_end_gaps, jpayne@69: align_globally, jpayne@69: score_only, jpayne@69: ): jpayne@69: """Generate a score and traceback matrix (PRIVATE). jpayne@69: jpayne@69: This implementation according to Needleman-Wunsch allows the usage of jpayne@69: general gap functions and is rather slow. It is automatically called if jpayne@69: you define your own gap functions. You can force the usage of this method jpayne@69: with ``force_generic=True``. jpayne@69: """ jpayne@69: local_max_score = 0 jpayne@69: # Create the score and traceback matrices. These should be in the jpayne@69: # shape: jpayne@69: # sequenceA (down) x sequenceB (across) jpayne@69: lenA, lenB = len(sequenceA), len(sequenceB) jpayne@69: score_matrix, trace_matrix = [], [] jpayne@69: for i in range(lenA + 1): jpayne@69: score_matrix.append([None] * (lenB + 1)) jpayne@69: if not score_only: jpayne@69: trace_matrix.append([None] * (lenB + 1)) jpayne@69: jpayne@69: # Initialize first row and column with gap scores. This is like opening up jpayne@69: # i gaps at the beginning of sequence A or B. jpayne@69: for i in range(lenA + 1): jpayne@69: if penalize_end_gaps[1]: # [1]:gap in sequence B jpayne@69: score = gap_B_fn(0, i) jpayne@69: else: jpayne@69: score = 0.0 jpayne@69: score_matrix[i][0] = score jpayne@69: for i in range(lenB + 1): jpayne@69: if penalize_end_gaps[0]: # [0]:gap in sequence A jpayne@69: score = gap_A_fn(0, i) jpayne@69: else: jpayne@69: score = 0.0 jpayne@69: score_matrix[0][i] = score jpayne@69: jpayne@69: # Fill in the score matrix. Each position in the matrix jpayne@69: # represents an alignment between a character from sequence A to jpayne@69: # one in sequence B. As I iterate through the matrix, find the jpayne@69: # alignment by choose the best of: jpayne@69: # 1) extending a previous alignment without gaps jpayne@69: # 2) adding a gap in sequenceA jpayne@69: # 3) adding a gap in sequenceB jpayne@69: for row in range(1, lenA + 1): jpayne@69: for col in range(1, lenB + 1): jpayne@69: # First, calculate the score that would occur by extending jpayne@69: # the alignment without gaps. jpayne@69: # fmt: off jpayne@69: nogap_score = ( jpayne@69: score_matrix[row - 1][col - 1] jpayne@69: + match_fn(sequenceA[row - 1], sequenceB[col - 1]) jpayne@69: ) jpayne@69: jpayne@69: # fmt: on jpayne@69: # Try to find a better score by opening gaps in sequenceA. jpayne@69: # Do this by checking alignments from each column in the row. jpayne@69: # Each column represents a different character to align from, jpayne@69: # and thus a different length gap. jpayne@69: # Although the gap function does not distinguish between opening jpayne@69: # and extending a gap, we distinguish them for the backtrace. jpayne@69: if not penalize_end_gaps[0] and row == lenA: jpayne@69: row_open = score_matrix[row][col - 1] jpayne@69: row_extend = max(score_matrix[row][x] for x in range(col)) jpayne@69: else: jpayne@69: row_open = score_matrix[row][col - 1] + gap_A_fn(row, 1) jpayne@69: row_extend = max( jpayne@69: score_matrix[row][x] + gap_A_fn(row, col - x) for x in range(col) jpayne@69: ) jpayne@69: jpayne@69: # Try to find a better score by opening gaps in sequenceB. jpayne@69: if not penalize_end_gaps[1] and col == lenB: jpayne@69: col_open = score_matrix[row - 1][col] jpayne@69: col_extend = max(score_matrix[x][col] for x in range(row)) jpayne@69: else: jpayne@69: col_open = score_matrix[row - 1][col] + gap_B_fn(col, 1) jpayne@69: col_extend = max( jpayne@69: score_matrix[x][col] + gap_B_fn(col, row - x) for x in range(row) jpayne@69: ) jpayne@69: jpayne@69: best_score = max(nogap_score, row_open, row_extend, col_open, col_extend) jpayne@69: local_max_score = max(local_max_score, best_score) jpayne@69: if not align_globally and best_score < 0: jpayne@69: score_matrix[row][col] = 0.0 jpayne@69: else: jpayne@69: score_matrix[row][col] = best_score jpayne@69: jpayne@69: # The backtrace is encoded binary. See _make_score_matrix_fast jpayne@69: # for details. jpayne@69: if not score_only: jpayne@69: trace_score = 0 jpayne@69: if rint(nogap_score) == rint(best_score): jpayne@69: trace_score += 2 jpayne@69: if rint(row_open) == rint(best_score): jpayne@69: trace_score += 1 jpayne@69: if rint(row_extend) == rint(best_score): jpayne@69: trace_score += 8 jpayne@69: if rint(col_open) == rint(best_score): jpayne@69: trace_score += 4 jpayne@69: if rint(col_extend) == rint(best_score): jpayne@69: trace_score += 16 jpayne@69: trace_matrix[row][col] = trace_score jpayne@69: jpayne@69: if not align_globally: jpayne@69: best_score = local_max_score jpayne@69: jpayne@69: return score_matrix, trace_matrix, best_score jpayne@69: jpayne@69: jpayne@69: def _make_score_matrix_fast( jpayne@69: sequenceA, jpayne@69: sequenceB, jpayne@69: match_fn, jpayne@69: open_A, jpayne@69: extend_A, jpayne@69: open_B, jpayne@69: extend_B, jpayne@69: penalize_extend_when_opening, jpayne@69: penalize_end_gaps, jpayne@69: align_globally, jpayne@69: score_only, jpayne@69: ): jpayne@69: """Generate a score and traceback matrix according to Gotoh (PRIVATE). jpayne@69: jpayne@69: This is an implementation of the Needleman-Wunsch dynamic programming jpayne@69: algorithm as modified by Gotoh, implementing affine gap penalties. jpayne@69: In short, we have three matrices, holding scores for alignments ending jpayne@69: in (1) a match/mismatch, (2) a gap in sequence A, and (3) a gap in jpayne@69: sequence B, respectively. However, we can combine them in one matrix, jpayne@69: which holds the best scores, and store only those values from the jpayne@69: other matrices that are actually used for the next step of calculation. jpayne@69: The traceback matrix holds the positions for backtracing the alignment. jpayne@69: """ jpayne@69: first_A_gap = calc_affine_penalty(1, open_A, extend_A, penalize_extend_when_opening) jpayne@69: first_B_gap = calc_affine_penalty(1, open_B, extend_B, penalize_extend_when_opening) jpayne@69: local_max_score = 0 jpayne@69: jpayne@69: # Create the score and traceback matrices. These should be in the jpayne@69: # shape: jpayne@69: # sequenceA (down) x sequenceB (across) jpayne@69: lenA, lenB = len(sequenceA), len(sequenceB) jpayne@69: score_matrix, trace_matrix = [], [] jpayne@69: for i in range(lenA + 1): jpayne@69: score_matrix.append([None] * (lenB + 1)) jpayne@69: if not score_only: jpayne@69: trace_matrix.append([None] * (lenB + 1)) jpayne@69: jpayne@69: # Initialize first row and column with gap scores. This is like opening up jpayne@69: # i gaps at the beginning of sequence A or B. jpayne@69: for i in range(lenA + 1): jpayne@69: if penalize_end_gaps[1]: # [1]:gap in sequence B jpayne@69: score = calc_affine_penalty( jpayne@69: i, open_B, extend_B, penalize_extend_when_opening jpayne@69: ) jpayne@69: else: jpayne@69: score = 0 jpayne@69: score_matrix[i][0] = score jpayne@69: for i in range(lenB + 1): jpayne@69: if penalize_end_gaps[0]: # [0]:gap in sequence A jpayne@69: score = calc_affine_penalty( jpayne@69: i, open_A, extend_A, penalize_extend_when_opening jpayne@69: ) jpayne@69: else: jpayne@69: score = 0 jpayne@69: score_matrix[0][i] = score jpayne@69: jpayne@69: # Now initialize the col 'matrix'. Actually this is only a one dimensional jpayne@69: # list, since we only need the col scores from the last row. jpayne@69: col_score = [0] # Best score, if actual alignment ends with gap in seqB jpayne@69: for i in range(1, lenB + 1): jpayne@69: col_score.append( jpayne@69: calc_affine_penalty(i, 2 * open_B, extend_B, penalize_extend_when_opening) jpayne@69: ) jpayne@69: jpayne@69: # The row 'matrix' is calculated on the fly. Here we only need the actual jpayne@69: # score. jpayne@69: # Now, filling up the score and traceback matrices: jpayne@69: for row in range(1, lenA + 1): jpayne@69: row_score = calc_affine_penalty( jpayne@69: row, 2 * open_A, extend_A, penalize_extend_when_opening jpayne@69: ) jpayne@69: for col in range(1, lenB + 1): jpayne@69: # Calculate the score that would occur by extending the jpayne@69: # alignment without gaps. jpayne@69: # fmt: off jpayne@69: nogap_score = ( jpayne@69: score_matrix[row - 1][col - 1] jpayne@69: + match_fn(sequenceA[row - 1], sequenceB[col - 1]) jpayne@69: ) jpayne@69: # fmt: on jpayne@69: # Check the score that would occur if there were a gap in jpayne@69: # sequence A. This could come from opening a new gap or jpayne@69: # extending an existing one. jpayne@69: # A gap in sequence A can also be opened if it follows a gap in jpayne@69: # sequence B: A- jpayne@69: # -B jpayne@69: if not penalize_end_gaps[0] and row == lenA: jpayne@69: row_open = score_matrix[row][col - 1] jpayne@69: row_extend = row_score jpayne@69: else: jpayne@69: row_open = score_matrix[row][col - 1] + first_A_gap jpayne@69: row_extend = row_score + extend_A jpayne@69: row_score = max(row_open, row_extend) jpayne@69: jpayne@69: # The same for sequence B: jpayne@69: if not penalize_end_gaps[1] and col == lenB: jpayne@69: col_open = score_matrix[row - 1][col] jpayne@69: col_extend = col_score[col] jpayne@69: else: jpayne@69: col_open = score_matrix[row - 1][col] + first_B_gap jpayne@69: col_extend = col_score[col] + extend_B jpayne@69: col_score[col] = max(col_open, col_extend) jpayne@69: jpayne@69: best_score = max(nogap_score, col_score[col], row_score) jpayne@69: local_max_score = max(local_max_score, best_score) jpayne@69: if not align_globally and best_score < 0: jpayne@69: score_matrix[row][col] = 0 jpayne@69: else: jpayne@69: score_matrix[row][col] = best_score jpayne@69: jpayne@69: # Now the trace_matrix. The edges of the backtrace are encoded jpayne@69: # binary: 1 = open gap in seqA, 2 = match/mismatch of seqA and jpayne@69: # seqB, 4 = open gap in seqB, 8 = extend gap in seqA, and jpayne@69: # 16 = extend gap in seqB. This values can be summed up. jpayne@69: # Thus, the trace score 7 means that the best score can either jpayne@69: # come from opening a gap in seqA (=1), pairing two characters jpayne@69: # of seqA and seqB (+2=3) or opening a gap in seqB (+4=7). jpayne@69: # However, if we only want the score we don't care about the trace. jpayne@69: if not score_only: jpayne@69: row_score_rint = rint(row_score) jpayne@69: col_score_rint = rint(col_score[col]) jpayne@69: row_trace_score = 0 jpayne@69: col_trace_score = 0 jpayne@69: if rint(row_open) == row_score_rint: jpayne@69: row_trace_score += 1 # Open gap in seqA jpayne@69: if rint(row_extend) == row_score_rint: jpayne@69: row_trace_score += 8 # Extend gap in seqA jpayne@69: if rint(col_open) == col_score_rint: jpayne@69: col_trace_score += 4 # Open gap in seqB jpayne@69: if rint(col_extend) == col_score_rint: jpayne@69: col_trace_score += 16 # Extend gap in seqB jpayne@69: jpayne@69: trace_score = 0 jpayne@69: best_score_rint = rint(best_score) jpayne@69: if rint(nogap_score) == best_score_rint: jpayne@69: trace_score += 2 # Align seqA with seqB jpayne@69: if row_score_rint == best_score_rint: jpayne@69: trace_score += row_trace_score jpayne@69: if col_score_rint == best_score_rint: jpayne@69: trace_score += col_trace_score jpayne@69: trace_matrix[row][col] = trace_score jpayne@69: jpayne@69: if not align_globally: jpayne@69: best_score = local_max_score jpayne@69: jpayne@69: return score_matrix, trace_matrix, best_score jpayne@69: jpayne@69: jpayne@69: def _recover_alignments( jpayne@69: sequenceA, jpayne@69: sequenceB, jpayne@69: starts, jpayne@69: best_score, jpayne@69: score_matrix, jpayne@69: trace_matrix, jpayne@69: align_globally, jpayne@69: gap_char, jpayne@69: one_alignment_only, jpayne@69: gap_A_fn, jpayne@69: gap_B_fn, jpayne@69: reverse=False, jpayne@69: ): jpayne@69: """Do the backtracing and return a list of alignments (PRIVATE). jpayne@69: jpayne@69: Recover the alignments by following the traceback matrix. This jpayne@69: is a recursive procedure, but it's implemented here iteratively jpayne@69: with a stack. jpayne@69: jpayne@69: sequenceA and sequenceB may be sequences, including strings, jpayne@69: lists, or list-like objects. In order to preserve the type of jpayne@69: the object, we need to use slices on the sequences instead of jpayne@69: indexes. For example, sequenceA[row] may return a type that's jpayne@69: not compatible with sequenceA, e.g. if sequenceA is a list and jpayne@69: sequenceA[row] is a string. Thus, avoid using indexes and use jpayne@69: slices, e.g. sequenceA[row:row+1]. Assume that client-defined jpayne@69: sequence classes preserve these semantics. jpayne@69: """ jpayne@69: lenA, lenB = len(sequenceA), len(sequenceB) jpayne@69: ali_seqA, ali_seqB = sequenceA[0:0], sequenceB[0:0] jpayne@69: tracebacks = [] jpayne@69: in_process = [] jpayne@69: jpayne@69: for start in starts: jpayne@69: score, (row, col) = start jpayne@69: begin = 0 jpayne@69: if align_globally: jpayne@69: end = None jpayne@69: else: jpayne@69: # If this start is a zero-extension: don't start here! jpayne@69: if (score, (row - 1, col - 1)) in starts: jpayne@69: continue jpayne@69: # Local alignments should start with a positive score! jpayne@69: if score <= 0: jpayne@69: continue jpayne@69: # Local alignments should not end with a gap!: jpayne@69: trace = trace_matrix[row][col] jpayne@69: if (trace - trace % 2) % 4 == 2: # Trace contains 'nogap', fine! jpayne@69: trace_matrix[row][col] = 2 jpayne@69: # If not, don't start here! jpayne@69: else: jpayne@69: continue jpayne@69: end = -max(lenA - row, lenB - col) jpayne@69: if not end: jpayne@69: end = None jpayne@69: col_distance = lenB - col jpayne@69: row_distance = lenA - row jpayne@69: jpayne@69: # fmt: off jpayne@69: ali_seqA = ( jpayne@69: (col_distance - row_distance) * gap_char jpayne@69: + sequenceA[lenA - 1 : row - 1 : -1] jpayne@69: ) jpayne@69: ali_seqB = ( jpayne@69: (row_distance - col_distance) * gap_char jpayne@69: + sequenceB[lenB - 1 : col - 1 : -1] jpayne@69: ) jpayne@69: # fmt: on jpayne@69: in_process += [ jpayne@69: (ali_seqA, ali_seqB, end, row, col, False, trace_matrix[row][col]) jpayne@69: ] jpayne@69: while in_process and len(tracebacks) < MAX_ALIGNMENTS: jpayne@69: # Although we allow a gap in seqB to be followed by a gap in seqA, jpayne@69: # we don't want to allow it the other way round, since this would jpayne@69: # give redundant alignments of type: A- vs. -A jpayne@69: # -B B- jpayne@69: # Thus we need to keep track if a gap in seqA was opened (col_gap) jpayne@69: # and stop the backtrace (dead_end) if a gap in seqB follows. jpayne@69: # jpayne@69: # Attention: This may fail, if the gap-penalties for both strands are jpayne@69: # different. In this case the second alignment may be the only optimal jpayne@69: # alignment. Thus it can happen that no alignment is returned. For jpayne@69: # this case a workaround was implemented, which reverses the input and jpayne@69: # the matrices (this happens in _reverse_matrices) and repeats the jpayne@69: # backtrace. The variable 'reverse' keeps track of this. jpayne@69: dead_end = False jpayne@69: ali_seqA, ali_seqB, end, row, col, col_gap, trace = in_process.pop() jpayne@69: jpayne@69: while (row > 0 or col > 0) and not dead_end: jpayne@69: cache = (ali_seqA[:], ali_seqB[:], end, row, col, col_gap) jpayne@69: jpayne@69: # If trace is empty we have reached at least one border of the jpayne@69: # matrix or the end of a local alignment. Just add the rest of jpayne@69: # the sequence(s) and fill with gaps if necessary. jpayne@69: if not trace: jpayne@69: if col and col_gap: jpayne@69: dead_end = True jpayne@69: else: jpayne@69: ali_seqA, ali_seqB = _finish_backtrace( jpayne@69: sequenceA, sequenceB, ali_seqA, ali_seqB, row, col, gap_char jpayne@69: ) jpayne@69: break jpayne@69: elif trace % 2 == 1: # = row open = open gap in seqA jpayne@69: trace -= 1 jpayne@69: if col_gap: jpayne@69: dead_end = True jpayne@69: else: jpayne@69: col -= 1 jpayne@69: ali_seqA += gap_char jpayne@69: ali_seqB += sequenceB[col : col + 1] jpayne@69: col_gap = False jpayne@69: elif trace % 4 == 2: # = match/mismatch of seqA with seqB jpayne@69: trace -= 2 jpayne@69: row -= 1 jpayne@69: col -= 1 jpayne@69: ali_seqA += sequenceA[row : row + 1] jpayne@69: ali_seqB += sequenceB[col : col + 1] jpayne@69: col_gap = False jpayne@69: elif trace % 8 == 4: # = col open = open gap in seqB jpayne@69: trace -= 4 jpayne@69: row -= 1 jpayne@69: ali_seqA += sequenceA[row : row + 1] jpayne@69: ali_seqB += gap_char jpayne@69: col_gap = True jpayne@69: elif trace in (8, 24): # = row extend = extend gap in seqA jpayne@69: trace -= 8 jpayne@69: if col_gap: jpayne@69: dead_end = True jpayne@69: else: jpayne@69: col_gap = False jpayne@69: # We need to find the starting point of the extended gap jpayne@69: x = _find_gap_open( jpayne@69: sequenceA, jpayne@69: sequenceB, jpayne@69: ali_seqA, jpayne@69: ali_seqB, jpayne@69: end, jpayne@69: row, jpayne@69: col, jpayne@69: col_gap, jpayne@69: gap_char, jpayne@69: score_matrix, jpayne@69: trace_matrix, jpayne@69: in_process, jpayne@69: gap_A_fn, jpayne@69: col, jpayne@69: row, jpayne@69: "col", jpayne@69: best_score, jpayne@69: align_globally, jpayne@69: ) jpayne@69: ali_seqA, ali_seqB, row, col, in_process, dead_end = x jpayne@69: elif trace == 16: # = col extend = extend gap in seqB jpayne@69: trace -= 16 jpayne@69: col_gap = True jpayne@69: x = _find_gap_open( jpayne@69: sequenceA, jpayne@69: sequenceB, jpayne@69: ali_seqA, jpayne@69: ali_seqB, jpayne@69: end, jpayne@69: row, jpayne@69: col, jpayne@69: col_gap, jpayne@69: gap_char, jpayne@69: score_matrix, jpayne@69: trace_matrix, jpayne@69: in_process, jpayne@69: gap_B_fn, jpayne@69: row, jpayne@69: col, jpayne@69: "row", jpayne@69: best_score, jpayne@69: align_globally, jpayne@69: ) jpayne@69: ali_seqA, ali_seqB, row, col, in_process, dead_end = x jpayne@69: jpayne@69: if trace: # There is another path to follow... jpayne@69: cache += (trace,) jpayne@69: in_process.append(cache) jpayne@69: trace = trace_matrix[row][col] jpayne@69: if not align_globally: jpayne@69: if score_matrix[row][col] == best_score: jpayne@69: # We have gone through a 'zero-score' extension, discard it jpayne@69: dead_end = True jpayne@69: elif score_matrix[row][col] <= 0: jpayne@69: # We have reached the end of the backtrace jpayne@69: begin = max(row, col) jpayne@69: trace = 0 jpayne@69: if not dead_end: jpayne@69: if not reverse: jpayne@69: tracebacks.append((ali_seqA[::-1], ali_seqB[::-1], score, begin, end)) jpayne@69: else: jpayne@69: tracebacks.append((ali_seqB[::-1], ali_seqA[::-1], score, begin, end)) jpayne@69: if one_alignment_only: jpayne@69: break jpayne@69: return _clean_alignments(tracebacks) jpayne@69: jpayne@69: jpayne@69: def _find_start(score_matrix, best_score, align_globally): jpayne@69: """Return a list of starting points (score, (row, col)) (PRIVATE). jpayne@69: jpayne@69: Indicating every possible place to start the tracebacks. jpayne@69: """ jpayne@69: nrows, ncols = len(score_matrix), len(score_matrix[0]) jpayne@69: # In this implementation of the global algorithm, the start will always be jpayne@69: # the bottom right corner of the matrix. jpayne@69: if align_globally: jpayne@69: starts = [(best_score, (nrows - 1, ncols - 1))] jpayne@69: else: jpayne@69: # For local alignments, there may be many different start points. jpayne@69: starts = [] jpayne@69: tolerance = 0 # XXX do anything with this? jpayne@69: # Now find all the positions within some tolerance of the best jpayne@69: # score. jpayne@69: for row in range(nrows): jpayne@69: for col in range(ncols): jpayne@69: score = score_matrix[row][col] jpayne@69: if rint(abs(score - best_score)) <= rint(tolerance): jpayne@69: starts.append((score, (row, col))) jpayne@69: return starts jpayne@69: jpayne@69: jpayne@69: def _reverse_matrices(score_matrix, trace_matrix): jpayne@69: """Reverse score and trace matrices (PRIVATE).""" jpayne@69: reverse_score_matrix = [] jpayne@69: reverse_trace_matrix = [] jpayne@69: # fmt: off jpayne@69: reverse_trace = { jpayne@69: 1: 4, 2: 2, 3: 6, 4: 1, 5: 5, 6: 3, 7: 7, 8: 16, 9: 20, 10: 18, 11: 22, 12: 17, jpayne@69: 13: 21, 14: 19, 15: 23, 16: 8, 17: 12, 18: 10, 19: 14, 20: 9, 21: 13, 22: 11, jpayne@69: 23: 15, 24: 24, 25: 28, 26: 26, 27: 30, 28: 25, 29: 29, 30: 27, 31: 31, jpayne@69: None: None, jpayne@69: } jpayne@69: # fmt: on jpayne@69: for col in range(len(score_matrix[0])): jpayne@69: new_score_row = [] jpayne@69: new_trace_row = [] jpayne@69: for row in range(len(score_matrix)): jpayne@69: new_score_row.append(score_matrix[row][col]) jpayne@69: new_trace_row.append(reverse_trace[trace_matrix[row][col]]) jpayne@69: reverse_score_matrix.append(new_score_row) jpayne@69: reverse_trace_matrix.append(new_trace_row) jpayne@69: return reverse_score_matrix, reverse_trace_matrix jpayne@69: jpayne@69: jpayne@69: def _clean_alignments(alignments): jpayne@69: """Take a list of alignments and return a cleaned version (PRIVATE). jpayne@69: jpayne@69: Remove duplicates, make sure begin and end are set correctly, remove jpayne@69: empty alignments. jpayne@69: """ jpayne@69: unique_alignments = [] jpayne@69: for align in alignments: jpayne@69: if align not in unique_alignments: jpayne@69: unique_alignments.append(align) jpayne@69: i = 0 jpayne@69: while i < len(unique_alignments): jpayne@69: seqA, seqB, score, begin, end = unique_alignments[i] jpayne@69: # Make sure end is set reasonably. jpayne@69: if end is None: # global alignment jpayne@69: end = len(seqA) jpayne@69: elif end < 0: jpayne@69: end = end + len(seqA) jpayne@69: # If there's no alignment here, get rid of it. jpayne@69: if begin >= end: jpayne@69: del unique_alignments[i] jpayne@69: continue jpayne@69: unique_alignments[i] = Alignment(seqA, seqB, score, begin, end) jpayne@69: i += 1 jpayne@69: return unique_alignments jpayne@69: jpayne@69: jpayne@69: def _finish_backtrace(sequenceA, sequenceB, ali_seqA, ali_seqB, row, col, gap_char): jpayne@69: """Add remaining sequences and fill with gaps if necessary (PRIVATE).""" jpayne@69: if row: jpayne@69: ali_seqA += sequenceA[row - 1 :: -1] jpayne@69: if col: jpayne@69: ali_seqB += sequenceB[col - 1 :: -1] jpayne@69: if row > col: jpayne@69: ali_seqB += gap_char * (len(ali_seqA) - len(ali_seqB)) jpayne@69: elif col > row: jpayne@69: ali_seqA += gap_char * (len(ali_seqB) - len(ali_seqA)) jpayne@69: return ali_seqA, ali_seqB jpayne@69: jpayne@69: jpayne@69: def _find_gap_open( jpayne@69: sequenceA, jpayne@69: sequenceB, jpayne@69: ali_seqA, jpayne@69: ali_seqB, jpayne@69: end, jpayne@69: row, jpayne@69: col, jpayne@69: col_gap, jpayne@69: gap_char, jpayne@69: score_matrix, jpayne@69: trace_matrix, jpayne@69: in_process, jpayne@69: gap_fn, jpayne@69: target, jpayne@69: index, jpayne@69: direction, jpayne@69: best_score, jpayne@69: align_globally, jpayne@69: ): jpayne@69: """Find the starting point(s) of the extended gap (PRIVATE).""" jpayne@69: dead_end = False jpayne@69: target_score = score_matrix[row][col] jpayne@69: for n in range(target): jpayne@69: if direction == "col": jpayne@69: col -= 1 jpayne@69: ali_seqA += gap_char jpayne@69: ali_seqB += sequenceB[col : col + 1] jpayne@69: else: jpayne@69: row -= 1 jpayne@69: ali_seqA += sequenceA[row : row + 1] jpayne@69: ali_seqB += gap_char jpayne@69: actual_score = score_matrix[row][col] + gap_fn(index, n + 1) jpayne@69: if not align_globally and score_matrix[row][col] == best_score: jpayne@69: # We have run through a 'zero-score' extension and discard it jpayne@69: dead_end = True jpayne@69: break jpayne@69: if rint(actual_score) == rint(target_score) and n > 0: jpayne@69: if not trace_matrix[row][col]: jpayne@69: break jpayne@69: else: jpayne@69: in_process.append( jpayne@69: ( jpayne@69: ali_seqA[:], jpayne@69: ali_seqB[:], jpayne@69: end, jpayne@69: row, jpayne@69: col, jpayne@69: col_gap, jpayne@69: trace_matrix[row][col], jpayne@69: ) jpayne@69: ) jpayne@69: if not trace_matrix[row][col]: jpayne@69: dead_end = True jpayne@69: return ali_seqA, ali_seqB, row, col, in_process, dead_end jpayne@69: jpayne@69: jpayne@69: _PRECISION = 1000 jpayne@69: jpayne@69: jpayne@69: def rint(x, precision=_PRECISION): jpayne@69: """Print number with declared precision.""" jpayne@69: return int(x * precision + 0.5) jpayne@69: jpayne@69: jpayne@69: class identity_match: jpayne@69: """Create a match function for use in an alignment. jpayne@69: jpayne@69: match and mismatch are the scores to give when two residues are equal jpayne@69: or unequal. By default, match is 1 and mismatch is 0. jpayne@69: """ jpayne@69: jpayne@69: def __init__(self, match=1, mismatch=0): jpayne@69: """Initialize the class.""" jpayne@69: self.match = match jpayne@69: self.mismatch = mismatch jpayne@69: jpayne@69: def __call__(self, charA, charB): jpayne@69: """Call a match function instance already created.""" jpayne@69: if charA == charB: jpayne@69: return self.match jpayne@69: return self.mismatch jpayne@69: jpayne@69: jpayne@69: class dictionary_match: jpayne@69: """Create a match function for use in an alignment. jpayne@69: jpayne@69: Attributes: jpayne@69: - score_dict - A dictionary where the keys are tuples (residue 1, jpayne@69: residue 2) and the values are the match scores between those residues. jpayne@69: - symmetric - A flag that indicates whether the scores are symmetric. jpayne@69: jpayne@69: """ jpayne@69: jpayne@69: def __init__(self, score_dict, symmetric=1): jpayne@69: """Initialize the class.""" jpayne@69: if isinstance(score_dict, substitution_matrices.Array): jpayne@69: score_dict = dict(score_dict) # Access to dict is much faster jpayne@69: self.score_dict = score_dict jpayne@69: self.symmetric = symmetric jpayne@69: jpayne@69: def __call__(self, charA, charB): jpayne@69: """Call a dictionary match instance already created.""" jpayne@69: if self.symmetric and (charA, charB) not in self.score_dict: jpayne@69: # If the score dictionary is symmetric, then look up the jpayne@69: # score both ways. jpayne@69: charB, charA = charA, charB jpayne@69: return self.score_dict[(charA, charB)] jpayne@69: jpayne@69: jpayne@69: class affine_penalty: jpayne@69: """Create a gap function for use in an alignment.""" jpayne@69: jpayne@69: def __init__(self, open, extend, penalize_extend_when_opening=0): jpayne@69: """Initialize the class.""" jpayne@69: if open > 0 or extend > 0: jpayne@69: raise ValueError("Gap penalties should be non-positive.") jpayne@69: if not penalize_extend_when_opening and (extend < open): jpayne@69: raise ValueError( jpayne@69: "Gap opening penalty should be higher than " jpayne@69: "gap extension penalty (or equal)" jpayne@69: ) jpayne@69: self.open, self.extend = open, extend jpayne@69: self.penalize_extend_when_opening = penalize_extend_when_opening jpayne@69: jpayne@69: def __call__(self, index, length): jpayne@69: """Call a gap function instance already created.""" jpayne@69: return calc_affine_penalty( jpayne@69: length, self.open, self.extend, self.penalize_extend_when_opening jpayne@69: ) jpayne@69: jpayne@69: jpayne@69: def calc_affine_penalty(length, open, extend, penalize_extend_when_opening): jpayne@69: """Calculate a penalty score for the gap function.""" jpayne@69: if length <= 0: jpayne@69: return 0.0 jpayne@69: penalty = open + extend * length jpayne@69: if not penalize_extend_when_opening: jpayne@69: penalty -= extend jpayne@69: return penalty jpayne@69: jpayne@69: jpayne@69: def print_matrix(matrix): jpayne@69: """Print out a matrix for debugging purposes.""" jpayne@69: # Transpose the matrix and get the length of the values in each column. jpayne@69: matrixT = [[] for x in range(len(matrix[0]))] jpayne@69: for i in range(len(matrix)): jpayne@69: for j in range(len(matrix[i])): jpayne@69: matrixT[j].append(len(str(matrix[i][j]))) jpayne@69: ndigits = [max(x) for x in matrixT] jpayne@69: for i in range(len(matrix)): jpayne@69: # Using string formatting trick to add leading spaces, jpayne@69: print( jpayne@69: " ".join("%*s " % (ndigits[j], matrix[i][j]) for j in range(len(matrix[i]))) jpayne@69: ) jpayne@69: jpayne@69: jpayne@69: def format_alignment(align1, align2, score, begin, end, full_sequences=False): jpayne@69: """Format the alignment prettily into a string. jpayne@69: jpayne@69: IMPORTANT: Gap symbol must be "-" (or ['-'] for lists)! jpayne@69: jpayne@69: Since Biopython 1.71 identical matches are shown with a pipe jpayne@69: character, mismatches as a dot, and gaps as a space. jpayne@69: jpayne@69: Prior releases just used the pipe character to indicate the jpayne@69: aligned region (matches, mismatches and gaps). jpayne@69: jpayne@69: Also, in local alignments, if the alignment does not include jpayne@69: the whole sequences, now only the aligned part is shown, jpayne@69: together with the start positions of the aligned subsequences. jpayne@69: The start positions are 1-based; so start position n is the jpayne@69: n-th base/amino acid in the *un-aligned* sequence. jpayne@69: jpayne@69: NOTE: This is different to the alignment's begin/end values, jpayne@69: which give the Python indices (0-based) of the bases/amino acids jpayne@69: in the *aligned* sequences. jpayne@69: jpayne@69: If you want to restore the 'historic' behaviour, that means jpayne@69: displaying the whole sequences (including the non-aligned parts), jpayne@69: use ``full_sequences=True``. In this case, the non-aligned leading jpayne@69: and trailing parts are also indicated by spaces in the match-line. jpayne@69: """ jpayne@69: align_begin = begin jpayne@69: align_end = end jpayne@69: start1 = start2 = "" jpayne@69: start_m = begin # Begin of match line (how many spaces to include) jpayne@69: # For local alignments: jpayne@69: if not full_sequences and (begin != 0 or end != len(align1)): jpayne@69: # Calculate the actual start positions in the un-aligned sequences jpayne@69: # This will only work if the gap symbol is '-' or ['-']! jpayne@69: start1 = str(len(align1[:begin]) - align1[:begin].count("-") + 1) + " " jpayne@69: start2 = str(len(align2[:begin]) - align2[:begin].count("-") + 1) + " " jpayne@69: start_m = max(len(start1), len(start2)) jpayne@69: elif full_sequences: jpayne@69: start_m = 0 jpayne@69: begin = 0 jpayne@69: end = len(align1) jpayne@69: jpayne@69: if isinstance(align1, list): jpayne@69: # List elements will be separated by spaces, since they can be jpayne@69: # of different lengths jpayne@69: align1 = [a + " " for a in align1] jpayne@69: align2 = [a + " " for a in align2] jpayne@69: jpayne@69: s1_line = ["{:>{width}}".format(start1, width=start_m)] # seq1 line jpayne@69: m_line = [" " * start_m] # match line jpayne@69: s2_line = ["{:>{width}}".format(start2, width=start_m)] # seq2 line jpayne@69: jpayne@69: for n, (a, b) in enumerate(zip(align1[begin:end], align2[begin:end])): jpayne@69: # Since list elements can be of different length, we center them, jpayne@69: # using the maximum length of the two compared elements as width jpayne@69: m_len = max(len(a), len(b)) jpayne@69: s1_line.append("{:^{width}}".format(a, width=m_len)) jpayne@69: s2_line.append("{:^{width}}".format(b, width=m_len)) jpayne@69: if full_sequences and (n < align_begin or n >= align_end): jpayne@69: m_line.append("{:^{width}}".format(" ", width=m_len)) # space jpayne@69: continue jpayne@69: if a == b: jpayne@69: m_line.append("{:^{width}}".format("|", width=m_len)) # match jpayne@69: elif a.strip() == "-" or b.strip() == "-": jpayne@69: m_line.append("{:^{width}}".format(" ", width=m_len)) # gap jpayne@69: else: jpayne@69: m_line.append("{:^{width}}".format(".", width=m_len)) # mismatch jpayne@69: jpayne@69: s2_line.append(f"\n Score={score:g}\n") jpayne@69: return "\n".join(["".join(s1_line), "".join(m_line), "".join(s2_line)]) jpayne@69: jpayne@69: jpayne@69: # Try and load C implementations of functions. If I can't, jpayne@69: # then throw a warning and use the pure Python implementations. jpayne@69: # The redefinition is deliberate, thus the no quality assurance jpayne@69: # flag for when using flake8. jpayne@69: # Before, we secure access to the pure Python functions (for testing purposes): jpayne@69: jpayne@69: _python_make_score_matrix_fast = _make_score_matrix_fast jpayne@69: _python_rint = rint jpayne@69: jpayne@69: try: jpayne@69: from .cpairwise2 import rint, _make_score_matrix_fast # noqa jpayne@69: except ImportError: jpayne@69: warnings.warn( jpayne@69: "Import of C module failed. Falling back to pure Python " jpayne@69: "implementation. This may be slooow...", jpayne@69: BiopythonWarning, jpayne@69: ) jpayne@69: jpayne@69: if __name__ == "__main__": jpayne@69: from Bio._utils import run_doctest jpayne@69: jpayne@69: run_doctest()