Mercurial > repos > rliterman > csp2
diff CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/lib/python3.8/site-packages/Bio/bgzf.py @ 69:33d812a61356
planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author | jpayne |
---|---|
date | Tue, 18 Mar 2025 17:55:14 -0400 |
parents | |
children |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/lib/python3.8/site-packages/Bio/bgzf.py Tue Mar 18 17:55:14 2025 -0400 @@ -0,0 +1,957 @@ +#!/usr/bin/env python +# Copyright 2010-2018 by Peter Cock. +# All rights reserved. +# +# This file is part of the Biopython distribution and governed by your +# choice of the "Biopython License Agreement" or the "BSD 3-Clause License". +# Please see the LICENSE file that should have been included as part of this +# package. +r"""Read and write BGZF compressed files (the GZIP variant used in BAM). + +The SAM/BAM file format (Sequence Alignment/Map) comes in a plain text +format (SAM), and a compressed binary format (BAM). The latter uses a +modified form of gzip compression called BGZF (Blocked GNU Zip Format), +which can be applied to any file format to provide compression with +efficient random access. BGZF is described together with the SAM/BAM +file format at https://samtools.sourceforge.net/SAM1.pdf + +Please read the text below about 'virtual offsets' before using BGZF +files for random access. + +Aim of this module +------------------ +The Python gzip library can be used to read BGZF files, since for +decompression they are just (specialised) gzip files. What this +module aims to facilitate is random access to BGZF files (using the +'virtual offset' idea), and writing BGZF files (which means using +suitably sized gzip blocks and writing the extra 'BC' field in the +gzip headers). As in the gzip library, the zlib library is used +internally. + +In addition to being required for random access to and writing of +BAM files, the BGZF format can also be used on other sequential +data (in the sense of one record after another), such as most of +the sequence data formats supported in Bio.SeqIO (like FASTA, +FASTQ, GenBank, etc) or large MAF alignments. + +The Bio.SeqIO indexing functions use this module to support BGZF files. + +Technical Introduction to BGZF +------------------------------ +The gzip file format allows multiple compressed blocks, each of which +could be a stand alone gzip file. As an interesting bonus, this means +you can use Unix ``cat`` to combine two or more gzip files into one by +concatenating them. Also, each block can have one of several compression +levels (including uncompressed, which actually takes up a little bit +more space due to the gzip header). + +What the BAM designers realised was that while random access to data +stored in traditional gzip files was slow, breaking the file into +gzip blocks would allow fast random access to each block. To access +a particular piece of the decompressed data, you just need to know +which block it starts in (the offset of the gzip block start), and +how far into the (decompressed) contents of the block you need to +read. + +One problem with this is finding the gzip block sizes efficiently. +You can do it with a standard gzip file, but it requires every block +to be decompressed -- and that would be rather slow. Additionally +typical gzip files may use very large blocks. + +All that differs in BGZF is that compressed size of each gzip block +is limited to 2^16 bytes, and an extra 'BC' field in the gzip header +records this size. Traditional decompression tools can ignore this, +and unzip the file just like any other gzip file. + +The point of this is you can look at the first BGZF block, find out +how big it is from this 'BC' header, and thus seek immediately to +the second block, and so on. + +The BAM indexing scheme records read positions using a 64 bit +'virtual offset', comprising ``coffset << 16 | uoffset``, where ``coffset`` +is the file offset of the BGZF block containing the start of the read +(unsigned integer using up to 64-16 = 48 bits), and ``uoffset`` is the +offset within the (decompressed) block (unsigned 16 bit integer). + +This limits you to BAM files where the last block starts by 2^48 +bytes, or 256 petabytes, and the decompressed size of each block +is at most 2^16 bytes, or 64kb. Note that this matches the BGZF +'BC' field size which limits the compressed size of each block to +2^16 bytes, allowing for BAM files to use BGZF with no gzip +compression (useful for intermediate files in memory to reduce +CPU load). + +Warning about namespaces +------------------------ +It is considered a bad idea to use "from XXX import ``*``" in Python, because +it pollutes the namespace. This is a real issue with Bio.bgzf (and the +standard Python library gzip) because they contain a function called open +i.e. Suppose you do this: + +>>> from Bio.bgzf import * +>>> print(open.__module__) +Bio.bgzf + +Or, + +>>> from gzip import * +>>> print(open.__module__) +gzip + +Notice that the open function has been replaced. You can "fix" this if you +need to by importing the built-in open function: + +>>> from builtins import open + +However, what we recommend instead is to use the explicit namespace, e.g. + +>>> from Bio import bgzf +>>> print(bgzf.open.__module__) +Bio.bgzf + + +Examples +-------- +This is an ordinary GenBank file compressed using BGZF, so it can +be decompressed using gzip, + +>>> import gzip +>>> handle = gzip.open("GenBank/NC_000932.gb.bgz", "r") +>>> assert 0 == handle.tell() +>>> line = handle.readline() +>>> assert 80 == handle.tell() +>>> line = handle.readline() +>>> assert 143 == handle.tell() +>>> data = handle.read(70000) +>>> assert 70143 == handle.tell() +>>> handle.close() + +We can also access the file using the BGZF reader - but pay +attention to the file offsets which will be explained below: + +>>> handle = BgzfReader("GenBank/NC_000932.gb.bgz", "r") +>>> assert 0 == handle.tell() +>>> print(handle.readline().rstrip()) +LOCUS NC_000932 154478 bp DNA circular PLN 15-APR-2009 +>>> assert 80 == handle.tell() +>>> print(handle.readline().rstrip()) +DEFINITION Arabidopsis thaliana chloroplast, complete genome. +>>> assert 143 == handle.tell() +>>> data = handle.read(70000) +>>> assert 987828735 == handle.tell() +>>> print(handle.readline().rstrip()) +f="GeneID:844718" +>>> print(handle.readline().rstrip()) + CDS complement(join(84337..84771,85454..85843)) +>>> offset = handle.seek(make_virtual_offset(55074, 126)) +>>> print(handle.readline().rstrip()) + 68521 tatgtcattc gaaattgtat aaagacaact cctatttaat agagctattt gtgcaagtat +>>> handle.close() + +Notice the handle's offset looks different as a BGZF file. This +brings us to the key point about BGZF, which is the block structure: + +>>> handle = open("GenBank/NC_000932.gb.bgz", "rb") +>>> for values in BgzfBlocks(handle): +... print("Raw start %i, raw length %i; data start %i, data length %i" % values) +Raw start 0, raw length 15073; data start 0, data length 65536 +Raw start 15073, raw length 17857; data start 65536, data length 65536 +Raw start 32930, raw length 22144; data start 131072, data length 65536 +Raw start 55074, raw length 22230; data start 196608, data length 65536 +Raw start 77304, raw length 14939; data start 262144, data length 43478 +Raw start 92243, raw length 28; data start 305622, data length 0 +>>> handle.close() + +In this example the first three blocks are 'full' and hold 65536 bytes +of uncompressed data. The fourth block isn't full and holds 43478 bytes. +Finally there is a special empty fifth block which takes 28 bytes on +disk and serves as an 'end of file' (EOF) marker. If this is missing, +it is possible your BGZF file is incomplete. + +By reading ahead 70,000 bytes we moved into the second BGZF block, +and at that point the BGZF virtual offsets start to look different +to a simple offset into the decompressed data as exposed by the gzip +library. + +As an example, consider seeking to the decompressed position 196734. +Since 196734 = 65536 + 65536 + 65536 + 126 = 65536*3 + 126, this +is equivalent to jumping the first three blocks (which in this +specific example are all size 65536 after decompression - which +does not always hold) and starting at byte 126 of the fourth block +(after decompression). For BGZF, we need to know the fourth block's +offset of 55074 and the offset within the block of 126 to get the +BGZF virtual offset. + +>>> print(55074 << 16 | 126) +3609329790 +>>> print(bgzf.make_virtual_offset(55074, 126)) +3609329790 + +Thus for this BGZF file, decompressed position 196734 corresponds +to the virtual offset 3609329790. However, another BGZF file with +different contents would have compressed more or less efficiently, +so the compressed blocks would be different sizes. What this means +is the mapping between the uncompressed offset and the compressed +virtual offset depends on the BGZF file you are using. + +If you are accessing a BGZF file via this module, just use the +handle.tell() method to note the virtual offset of a position you +may later want to return to using handle.seek(). + +The catch with BGZF virtual offsets is while they can be compared +(which offset comes first in the file), you cannot safely subtract +them to get the size of the data between them, nor add/subtract +a relative offset. + +Of course you can parse this file with Bio.SeqIO using BgzfReader, +although there isn't any benefit over using gzip.open(...), unless +you want to index BGZF compressed sequence files: + +>>> from Bio import SeqIO +>>> handle = BgzfReader("GenBank/NC_000932.gb.bgz") +>>> record = SeqIO.read(handle, "genbank") +>>> handle.close() +>>> print(record.id) +NC_000932.1 + +Text Mode +--------- + +Like the standard library gzip.open(...), the BGZF code defaults to opening +files in binary mode. + +You can request the file be opened in text mode, but beware that this is hard +coded to the simple "latin1" (aka "iso-8859-1") encoding (which includes all +the ASCII characters), which works well with most Western European languages. +However, it is not fully compatible with the more widely used UTF-8 encoding. + +In variable width encodings like UTF-8, some single characters in the unicode +text output are represented by multiple bytes in the raw binary form. This is +problematic with BGZF, as we cannot always decode each block in isolation - a +single unicode character could be split over two blocks. This can even happen +with fixed width unicode encodings, as the BGZF block size is not fixed. + +Therefore, this module is currently restricted to only support single byte +unicode encodings, such as ASCII, "latin1" (which is a superset of ASCII), or +potentially other character maps (not implemented). + +Furthermore, unlike the default text mode on Python 3, we do not attempt to +implement universal new line mode. This transforms the various operating system +new line conventions like Windows (CR LF or "\r\n"), Unix (just LF, "\n"), or +old Macs (just CR, "\r"), into just LF ("\n"). Here we have the same problem - +is "\r" at the end of a block an incomplete Windows style new line? + +Instead, you will get the CR ("\r") and LF ("\n") characters as is. + +If your data is in UTF-8 or any other incompatible encoding, you must use +binary mode, and decode the appropriate fragments yourself. +""" + +import struct +import sys +import zlib + +from builtins import open as _open + +_bgzf_magic = b"\x1f\x8b\x08\x04" +_bgzf_header = b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00\x42\x43\x02\x00" +_bgzf_eof = b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00BC\x02\x00\x1b\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00" +_bytes_BC = b"BC" + + +def open(filename, mode="rb"): + r"""Open a BGZF file for reading, writing or appending. + + If text mode is requested, in order to avoid multi-byte characters, this is + hard coded to use the "latin1" encoding, and "\r" and "\n" are passed as is + (without implementing universal new line mode). + + If your data is in UTF-8 or any other incompatible encoding, you must use + binary mode, and decode the appropriate fragments yourself. + """ + if "r" in mode.lower(): + return BgzfReader(filename, mode) + elif "w" in mode.lower() or "a" in mode.lower(): + return BgzfWriter(filename, mode) + else: + raise ValueError(f"Bad mode {mode!r}") + + +def make_virtual_offset(block_start_offset, within_block_offset): + """Compute a BGZF virtual offset from block start and within block offsets. + + The BAM indexing scheme records read positions using a 64 bit + 'virtual offset', comprising in C terms: + + block_start_offset << 16 | within_block_offset + + Here block_start_offset is the file offset of the BGZF block + start (unsigned integer using up to 64-16 = 48 bits), and + within_block_offset within the (decompressed) block (unsigned + 16 bit integer). + + >>> make_virtual_offset(0, 0) + 0 + >>> make_virtual_offset(0, 1) + 1 + >>> make_virtual_offset(0, 2**16 - 1) + 65535 + >>> make_virtual_offset(0, 2**16) + Traceback (most recent call last): + ... + ValueError: Require 0 <= within_block_offset < 2**16, got 65536 + + >>> 65536 == make_virtual_offset(1, 0) + True + >>> 65537 == make_virtual_offset(1, 1) + True + >>> 131071 == make_virtual_offset(1, 2**16 - 1) + True + + >>> 6553600000 == make_virtual_offset(100000, 0) + True + >>> 6553600001 == make_virtual_offset(100000, 1) + True + >>> 6553600010 == make_virtual_offset(100000, 10) + True + + >>> make_virtual_offset(2**48, 0) + Traceback (most recent call last): + ... + ValueError: Require 0 <= block_start_offset < 2**48, got 281474976710656 + + """ + if within_block_offset < 0 or within_block_offset >= 65536: + raise ValueError( + "Require 0 <= within_block_offset < 2**16, got %i" % within_block_offset + ) + if block_start_offset < 0 or block_start_offset >= 281474976710656: + raise ValueError( + "Require 0 <= block_start_offset < 2**48, got %i" % block_start_offset + ) + return (block_start_offset << 16) | within_block_offset + + +def split_virtual_offset(virtual_offset): + """Divides a 64-bit BGZF virtual offset into block start & within block offsets. + + >>> (100000, 0) == split_virtual_offset(6553600000) + True + >>> (100000, 10) == split_virtual_offset(6553600010) + True + + """ + start = virtual_offset >> 16 + return start, virtual_offset ^ (start << 16) + + +def BgzfBlocks(handle): + """Low level debugging function to inspect BGZF blocks. + + Expects a BGZF compressed file opened in binary read mode using + the builtin open function. Do not use a handle from this bgzf + module or the gzip module's open function which will decompress + the file. + + Returns the block start offset (see virtual offsets), the block + length (add these for the start of the next block), and the + decompressed length of the blocks contents (limited to 65536 in + BGZF), as an iterator - one tuple per BGZF block. + + >>> from builtins import open + >>> handle = open("SamBam/ex1.bam", "rb") + >>> for values in BgzfBlocks(handle): + ... print("Raw start %i, raw length %i; data start %i, data length %i" % values) + Raw start 0, raw length 18239; data start 0, data length 65536 + Raw start 18239, raw length 18223; data start 65536, data length 65536 + Raw start 36462, raw length 18017; data start 131072, data length 65536 + Raw start 54479, raw length 17342; data start 196608, data length 65536 + Raw start 71821, raw length 17715; data start 262144, data length 65536 + Raw start 89536, raw length 17728; data start 327680, data length 65536 + Raw start 107264, raw length 17292; data start 393216, data length 63398 + Raw start 124556, raw length 28; data start 456614, data length 0 + >>> handle.close() + + Indirectly we can tell this file came from an old version of + samtools because all the blocks (except the final one and the + dummy empty EOF marker block) are 65536 bytes. Later versions + avoid splitting a read between two blocks, and give the header + its own block (useful to speed up replacing the header). You + can see this in ex1_refresh.bam created using samtools 0.1.18: + + samtools view -b ex1.bam > ex1_refresh.bam + + >>> handle = open("SamBam/ex1_refresh.bam", "rb") + >>> for values in BgzfBlocks(handle): + ... print("Raw start %i, raw length %i; data start %i, data length %i" % values) + Raw start 0, raw length 53; data start 0, data length 38 + Raw start 53, raw length 18195; data start 38, data length 65434 + Raw start 18248, raw length 18190; data start 65472, data length 65409 + Raw start 36438, raw length 18004; data start 130881, data length 65483 + Raw start 54442, raw length 17353; data start 196364, data length 65519 + Raw start 71795, raw length 17708; data start 261883, data length 65411 + Raw start 89503, raw length 17709; data start 327294, data length 65466 + Raw start 107212, raw length 17390; data start 392760, data length 63854 + Raw start 124602, raw length 28; data start 456614, data length 0 + >>> handle.close() + + The above example has no embedded SAM header (thus the first block + is very small at just 38 bytes of decompressed data), while the next + example does (a larger block of 103 bytes). Notice that the rest of + the blocks show the same sizes (they contain the same read data): + + >>> handle = open("SamBam/ex1_header.bam", "rb") + >>> for values in BgzfBlocks(handle): + ... print("Raw start %i, raw length %i; data start %i, data length %i" % values) + Raw start 0, raw length 104; data start 0, data length 103 + Raw start 104, raw length 18195; data start 103, data length 65434 + Raw start 18299, raw length 18190; data start 65537, data length 65409 + Raw start 36489, raw length 18004; data start 130946, data length 65483 + Raw start 54493, raw length 17353; data start 196429, data length 65519 + Raw start 71846, raw length 17708; data start 261948, data length 65411 + Raw start 89554, raw length 17709; data start 327359, data length 65466 + Raw start 107263, raw length 17390; data start 392825, data length 63854 + Raw start 124653, raw length 28; data start 456679, data length 0 + >>> handle.close() + + """ + if isinstance(handle, BgzfReader): + raise TypeError("Function BgzfBlocks expects a binary handle") + data_start = 0 + while True: + start_offset = handle.tell() + try: + block_length, data = _load_bgzf_block(handle) + except StopIteration: + break + data_len = len(data) + yield start_offset, block_length, data_start, data_len + data_start += data_len + + +def _load_bgzf_block(handle, text_mode=False): + """Load the next BGZF block of compressed data (PRIVATE). + + Returns a tuple (block size and data), or at end of file + will raise StopIteration. + """ + magic = handle.read(4) + if not magic: + # End of file - should we signal this differently now? + # See https://www.python.org/dev/peps/pep-0479/ + raise StopIteration + if magic != _bgzf_magic: + raise ValueError( + r"A BGZF (e.g. a BAM file) block should start with " + r"%r, not %r; handle.tell() now says %r" + % (_bgzf_magic, magic, handle.tell()) + ) + gzip_mod_time, gzip_extra_flags, gzip_os, extra_len = struct.unpack( + "<LBBH", handle.read(8) + ) + + block_size = None + x_len = 0 + while x_len < extra_len: + subfield_id = handle.read(2) + subfield_len = struct.unpack("<H", handle.read(2))[0] # uint16_t + subfield_data = handle.read(subfield_len) + x_len += subfield_len + 4 + if subfield_id == _bytes_BC: + if subfield_len != 2: + raise ValueError("Wrong BC payload length") + if block_size is not None: + raise ValueError("Two BC subfields?") + block_size = struct.unpack("<H", subfield_data)[0] + 1 # uint16_t + if x_len != extra_len: + raise ValueError(f"x_len and extra_len differ {x_len}, {extra_len}") + if block_size is None: + raise ValueError("Missing BC, this isn't a BGZF file!") + # Now comes the compressed data, CRC, and length of uncompressed data. + deflate_size = block_size - 1 - extra_len - 19 + d = zlib.decompressobj(-15) # Negative window size means no headers + data = d.decompress(handle.read(deflate_size)) + d.flush() + expected_crc = handle.read(4) + expected_size = struct.unpack("<I", handle.read(4))[0] + if expected_size != len(data): + raise RuntimeError("Decompressed to %i, not %i" % (len(data), expected_size)) + # Should cope with a mix of Python platforms... + crc = zlib.crc32(data) + if crc < 0: + crc = struct.pack("<i", crc) + else: + crc = struct.pack("<I", crc) + if expected_crc != crc: + raise RuntimeError(f"CRC is {crc}, not {expected_crc}") + if text_mode: + # Note ISO-8859-1 aka Latin-1 preserves first 256 chars + # (i.e. ASCII), but critically is a single byte encoding + return block_size, data.decode("latin-1") + else: + return block_size, data + + +class BgzfReader: + r"""BGZF reader, acts like a read only handle but seek/tell differ. + + Let's use the BgzfBlocks function to have a peek at the BGZF blocks + in an example BAM file, + + >>> from builtins import open + >>> handle = open("SamBam/ex1.bam", "rb") + >>> for values in BgzfBlocks(handle): + ... print("Raw start %i, raw length %i; data start %i, data length %i" % values) + Raw start 0, raw length 18239; data start 0, data length 65536 + Raw start 18239, raw length 18223; data start 65536, data length 65536 + Raw start 36462, raw length 18017; data start 131072, data length 65536 + Raw start 54479, raw length 17342; data start 196608, data length 65536 + Raw start 71821, raw length 17715; data start 262144, data length 65536 + Raw start 89536, raw length 17728; data start 327680, data length 65536 + Raw start 107264, raw length 17292; data start 393216, data length 63398 + Raw start 124556, raw length 28; data start 456614, data length 0 + >>> handle.close() + + Now let's see how to use this block information to jump to + specific parts of the decompressed BAM file: + + >>> handle = BgzfReader("SamBam/ex1.bam", "rb") + >>> assert 0 == handle.tell() + >>> magic = handle.read(4) + >>> assert 4 == handle.tell() + + So far nothing so strange, we got the magic marker used at the + start of a decompressed BAM file, and the handle position makes + sense. Now however, let's jump to the end of this block and 4 + bytes into the next block by reading 65536 bytes, + + >>> data = handle.read(65536) + >>> len(data) + 65536 + >>> assert 1195311108 == handle.tell() + + Expecting 4 + 65536 = 65540 were you? Well this is a BGZF 64-bit + virtual offset, which means: + + >>> split_virtual_offset(1195311108) + (18239, 4) + + You should spot 18239 as the start of the second BGZF block, while + the 4 is the offset into this block. See also make_virtual_offset, + + >>> make_virtual_offset(18239, 4) + 1195311108 + + Let's jump back to almost the start of the file, + + >>> make_virtual_offset(0, 2) + 2 + >>> handle.seek(2) + 2 + >>> handle.close() + + Note that you can use the max_cache argument to limit the number of + BGZF blocks cached in memory. The default is 100, and since each + block can be up to 64kb, the default cache could take up to 6MB of + RAM. The cache is not important for reading through the file in one + pass, but is important for improving performance of random access. + """ + + def __init__(self, filename=None, mode="r", fileobj=None, max_cache=100): + r"""Initialize the class for reading a BGZF file. + + You would typically use the top level ``bgzf.open(...)`` function + which will call this class internally. Direct use is discouraged. + + Either the ``filename`` (string) or ``fileobj`` (input file object in + binary mode) arguments must be supplied, but not both. + + Argument ``mode`` controls if the data will be returned as strings in + text mode ("rt", "tr", or default "r"), or bytes binary mode ("rb" + or "br"). The argument name matches the built-in ``open(...)`` and + standard library ``gzip.open(...)`` function. + + If text mode is requested, in order to avoid multi-byte characters, + this is hard coded to use the "latin1" encoding, and "\r" and "\n" + are passed as is (without implementing universal new line mode). There + is no ``encoding`` argument. + + If your data is in UTF-8 or any other incompatible encoding, you must + use binary mode, and decode the appropriate fragments yourself. + + Argument ``max_cache`` controls the maximum number of BGZF blocks to + cache in memory. Each can be up to 64kb thus the default of 100 blocks + could take up to 6MB of RAM. This is important for efficient random + access, a small value is fine for reading the file in one pass. + """ + # TODO - Assuming we can seek, check for 28 bytes EOF empty block + # and if missing warn about possible truncation (as in samtools)? + if max_cache < 1: + raise ValueError("Use max_cache with a minimum of 1") + # Must open the BGZF file in binary mode, but we may want to + # treat the contents as either text or binary (unicode or + # bytes under Python 3) + if filename and fileobj: + raise ValueError("Supply either filename or fileobj, not both") + # Want to reject output modes like w, a, x, + + if mode.lower() not in ("r", "tr", "rt", "rb", "br"): + raise ValueError( + "Must use a read mode like 'r' (default), 'rt', or 'rb' for binary" + ) + # If an open file was passed, make sure it was opened in binary mode. + if fileobj: + if fileobj.read(0) != b"": + raise ValueError("fileobj not opened in binary mode") + handle = fileobj + else: + handle = _open(filename, "rb") + self._text = "b" not in mode.lower() + if self._text: + self._newline = "\n" + else: + self._newline = b"\n" + self._handle = handle + self.max_cache = max_cache + self._buffers = {} + self._block_start_offset = None + self._block_raw_length = None + self._load_block(handle.tell()) + + def _load_block(self, start_offset=None): + if start_offset is None: + # If the file is being read sequentially, then _handle.tell() + # should be pointing at the start of the next block. + # However, if seek has been used, we can't assume that. + start_offset = self._block_start_offset + self._block_raw_length + if start_offset == self._block_start_offset: + self._within_block_offset = 0 + return + elif start_offset in self._buffers: + # Already in cache + self._buffer, self._block_raw_length = self._buffers[start_offset] + self._within_block_offset = 0 + self._block_start_offset = start_offset + return + # Must hit the disk... first check cache limits, + while len(self._buffers) >= self.max_cache: + # TODO - Implement LRU cache removal? + self._buffers.popitem() + # Now load the block + handle = self._handle + if start_offset is not None: + handle.seek(start_offset) + self._block_start_offset = handle.tell() + try: + block_size, self._buffer = _load_bgzf_block(handle, self._text) + except StopIteration: + # EOF + block_size = 0 + if self._text: + self._buffer = "" + else: + self._buffer = b"" + self._within_block_offset = 0 + self._block_raw_length = block_size + # Finally save the block in our cache, + self._buffers[self._block_start_offset] = self._buffer, block_size + + def tell(self): + """Return a 64-bit unsigned BGZF virtual offset.""" + if 0 < self._within_block_offset and self._within_block_offset == len( + self._buffer + ): + # Special case where we're right at the end of a (non empty) block. + # For non-maximal blocks could give two possible virtual offsets, + # but for a maximal block can't use 65536 as the within block + # offset. Therefore for consistency, use the next block and a + # within block offset of zero. + return (self._block_start_offset + self._block_raw_length) << 16 + else: + # return make_virtual_offset(self._block_start_offset, + # self._within_block_offset) + # TODO - Include bounds checking as in make_virtual_offset? + return (self._block_start_offset << 16) | self._within_block_offset + + def seek(self, virtual_offset): + """Seek to a 64-bit unsigned BGZF virtual offset.""" + # Do this inline to avoid a function call, + # start_offset, within_block = split_virtual_offset(virtual_offset) + start_offset = virtual_offset >> 16 + within_block = virtual_offset ^ (start_offset << 16) + if start_offset != self._block_start_offset: + # Don't need to load the block if already there + # (this avoids a function call since _load_block would do nothing) + self._load_block(start_offset) + if start_offset != self._block_start_offset: + raise ValueError("start_offset not loaded correctly") + if within_block > len(self._buffer): + if not (within_block == 0 and len(self._buffer) == 0): + raise ValueError( + "Within offset %i but block size only %i" + % (within_block, len(self._buffer)) + ) + self._within_block_offset = within_block + # assert virtual_offset == self.tell(), \ + # "Did seek to %i (%i, %i), but tell says %i (%i, %i)" \ + # % (virtual_offset, start_offset, within_block, + # self.tell(), self._block_start_offset, + # self._within_block_offset) + return virtual_offset + + def read(self, size=-1): + """Read method for the BGZF module.""" + if size < 0: + raise NotImplementedError("Don't be greedy, that could be massive!") + + result = "" if self._text else b"" + while size and self._block_raw_length: + if self._within_block_offset + size <= len(self._buffer): + # This may leave us right at the end of a block + # (lazy loading, don't load the next block unless we have too) + data = self._buffer[ + self._within_block_offset : self._within_block_offset + size + ] + self._within_block_offset += size + if not data: + raise ValueError("Must be at least 1 byte") + result += data + break + else: + data = self._buffer[self._within_block_offset :] + size -= len(data) + self._load_block() # will reset offsets + result += data + + return result + + def readline(self): + """Read a single line for the BGZF file.""" + result = "" if self._text else b"" + while self._block_raw_length: + i = self._buffer.find(self._newline, self._within_block_offset) + # Three cases to consider, + if i == -1: + # No newline, need to read in more data + data = self._buffer[self._within_block_offset :] + self._load_block() # will reset offsets + result += data + elif i + 1 == len(self._buffer): + # Found new line, but right at end of block (SPECIAL) + data = self._buffer[self._within_block_offset :] + # Must now load the next block to ensure tell() works + self._load_block() # will reset offsets + if not data: + raise ValueError("Must be at least 1 byte") + result += data + break + else: + # Found new line, not at end of block (easy case, no IO) + data = self._buffer[self._within_block_offset : i + 1] + self._within_block_offset = i + 1 + # assert data.endswith(self._newline) + result += data + break + + return result + + def __next__(self): + """Return the next line.""" + line = self.readline() + if not line: + raise StopIteration + return line + + def __iter__(self): + """Iterate over the lines in the BGZF file.""" + return self + + def close(self): + """Close BGZF file.""" + self._handle.close() + self._buffer = None + self._block_start_offset = None + self._buffers = None + + def seekable(self): + """Return True indicating the BGZF supports random access.""" + return True + + def isatty(self): + """Return True if connected to a TTY device.""" + return False + + def fileno(self): + """Return integer file descriptor.""" + return self._handle.fileno() + + def __enter__(self): + """Open a file operable with WITH statement.""" + return self + + def __exit__(self, type, value, traceback): + """Close a file with WITH statement.""" + self.close() + + +class BgzfWriter: + """Define a BGZFWriter object.""" + + def __init__(self, filename=None, mode="w", fileobj=None, compresslevel=6): + """Initilize the class.""" + if filename and fileobj: + raise ValueError("Supply either filename or fileobj, not both") + # If an open file was passed, make sure it was opened in binary mode. + if fileobj: + if fileobj.read(0) != b"": + raise ValueError("fileobj not opened in binary mode") + handle = fileobj + else: + if "w" not in mode.lower() and "a" not in mode.lower(): + raise ValueError(f"Must use write or append mode, not {mode!r}") + if "a" in mode.lower(): + handle = _open(filename, "ab") + else: + handle = _open(filename, "wb") + self._text = "b" not in mode.lower() + self._handle = handle + self._buffer = b"" + self.compresslevel = compresslevel + + def _write_block(self, block): + """Write provided data to file as a single BGZF compressed block (PRIVATE).""" + # print("Saving %i bytes" % len(block)) + if len(block) > 65536: + raise ValueError(f"{len(block)} Block length > 65536") + # Giving a negative window bits means no gzip/zlib headers, + # -15 used in samtools + c = zlib.compressobj( + self.compresslevel, zlib.DEFLATED, -15, zlib.DEF_MEM_LEVEL, 0 + ) + compressed = c.compress(block) + c.flush() + del c + if len(compressed) > 65536: + raise RuntimeError( + "TODO - Didn't compress enough, try less data in this block" + ) + crc = zlib.crc32(block) + # Should cope with a mix of Python platforms... + if crc < 0: + crc = struct.pack("<i", crc) + else: + crc = struct.pack("<I", crc) + bsize = struct.pack("<H", len(compressed) + 25) # includes -1 + crc = struct.pack("<I", zlib.crc32(block) & 0xFFFFFFFF) + uncompressed_length = struct.pack("<I", len(block)) + # Fixed 16 bytes, + # gzip magic bytes (4) mod time (4), + # gzip flag (1), os (1), extra length which is six (2), + # sub field which is BC (2), sub field length of two (2), + # Variable data, + # 2 bytes: block length as BC sub field (2) + # X bytes: the data + # 8 bytes: crc (4), uncompressed data length (4) + data = _bgzf_header + bsize + compressed + crc + uncompressed_length + self._handle.write(data) + + def write(self, data): + """Write method for the class.""" + # TODO - Check bytes vs unicode + if isinstance(data, str): + # When reading we can't cope with multi-byte characters + # being split between BGZF blocks, so we restrict to a + # single byte encoding - like ASCII or latin-1. + # On output we could probably allow any encoding, as we + # don't care about splitting unicode characters between blocks + data = data.encode("latin-1") + # block_size = 2**16 = 65536 + data_len = len(data) + if len(self._buffer) + data_len < 65536: + # print("Cached %r" % data) + self._buffer += data + else: + # print("Got %r, writing out some data..." % data) + self._buffer += data + while len(self._buffer) >= 65536: + self._write_block(self._buffer[:65536]) + self._buffer = self._buffer[65536:] + + def flush(self): + """Flush data explicitally.""" + while len(self._buffer) >= 65536: + self._write_block(self._buffer[:65535]) + self._buffer = self._buffer[65535:] + self._write_block(self._buffer) + self._buffer = b"" + self._handle.flush() + + def close(self): + """Flush data, write 28 bytes BGZF EOF marker, and close BGZF file. + + samtools will look for a magic EOF marker, just a 28 byte empty BGZF + block, and if it is missing warns the BAM file may be truncated. In + addition to samtools writing this block, so too does bgzip - so this + implementation does too. + """ + if self._buffer: + self.flush() + self._handle.write(_bgzf_eof) + self._handle.flush() + self._handle.close() + + def tell(self): + """Return a BGZF 64-bit virtual offset.""" + return make_virtual_offset(self._handle.tell(), len(self._buffer)) + + def seekable(self): + """Return True indicating the BGZF supports random access.""" + # Not seekable, but we do support tell... + return False + + def isatty(self): + """Return True if connected to a TTY device.""" + return False + + def fileno(self): + """Return integer file descriptor.""" + return self._handle.fileno() + + def __enter__(self): + """Open a file operable with WITH statement.""" + return self + + def __exit__(self, type, value, traceback): + """Close a file with WITH statement.""" + self.close() + + +if __name__ == "__main__": + if len(sys.argv) > 1: + print("Call this with no arguments and pipe uncompressed data in on stdin") + print("and it will produce BGZF compressed data on stdout. e.g.") + print("") + print("./bgzf.py < example.fastq > example.fastq.bgz") + print("") + print("The extension convention of *.bgz is to distinugish these from *.gz") + print("used for standard gzipped files without the block structure of BGZF.") + print("You can use the standard gunzip command to decompress BGZF files,") + print("if it complains about the extension try something like this:") + print("") + print("cat example.fastq.bgz | gunzip > example.fastq") + print("") + print("See also the tool bgzip that comes with samtools") + sys.exit(0) + + # Ensure we have binary mode handles + # (leave stderr as default text mode) + stdin = sys.stdin.buffer + stdout = sys.stdout.buffer + + sys.stderr.write("Producing BGZF output from stdin...\n") + w = BgzfWriter(fileobj=stdout) + while True: + data = stdin.read(65536) + w.write(data) + if not data: + break + # Doing close will write an empty BGZF block as EOF marker: + w.close() + sys.stderr.write("BGZF data produced\n")