#!/usr/bin/env python
# Copyright 2010-2018 by Peter Cock.
# All rights reserved.
#
# This file is part of the Biopython distribution and governed by your
# choice of the "Biopython License Agreement" or the "BSD 3-Clause License".
# Please see the LICENSE file that should have been included as part of this
# package.
r"""Read and write BGZF compressed files (the GZIP variant used in BAM).

The SAM/BAM file format (Sequence Alignment/Map) comes in a plain text
format (SAM), and a compressed binary format (BAM). The latter uses a
modified form of gzip compression called BGZF (Blocked GNU Zip Format),
which can be applied to any file format to provide compression with
efficient random access. BGZF is described together with the SAM/BAM
file format at https://samtools.sourceforge.net/SAM1.pdf

Please read the text below about 'virtual offsets' before using BGZF
files for random access.

Aim of this module
------------------
The Python gzip library can be used to read BGZF files, since for
decompression they are just (specialised) gzip files. What this
module aims to facilitate is random access to BGZF files (using the
'virtual offset' idea), and writing BGZF files (which means using
suitably sized gzip blocks and writing the extra 'BC' field in the
gzip headers). As in the gzip library, the zlib library is used
internally.
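As a concrete illustration of the "suitably sized gzip blocks" and the extra
'BC' field, here is a minimal standalone sketch (this is NOT this module's
BgzfWriter, and the helper name ``bgzf_block`` is invented for the example):
it packs a payload into a single BGZF block, which plain gzip can then
decompress because each block is a complete gzip member.

```python
import gzip
import struct
import zlib

def bgzf_block(data, compresslevel=6):
    """Pack bytes into one BGZF block (sketch; payload must stay small)."""
    c = zlib.compressobj(compresslevel, zlib.DEFLATED, -15)  # raw deflate
    compressed = c.compress(data) + c.flush()
    # 18 byte gzip header with FEXTRA set, carrying the 6 byte 'BC'
    # subfield whose uint16 value is the total block size minus one.
    bsize = 18 + len(compressed) + 8  # header + payload + CRC32/ISIZE footer
    header = (
        b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00"  # gzip + FEXTRA
        b"BC\x02\x00" + struct.pack("<H", bsize - 1)
    )
    footer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data))
    return header + compressed + footer

# Each block is a complete gzip member, so gzip alone can read it:
block = bgzf_block(b"ACGT" * 10)
print(gzip.decompress(block))
```

A real BGZF writer must also keep each compressed block under 2^16 bytes and
append the special 28-byte empty EOF block; this sketch does neither.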

In addition to being required for random access to and writing of
BAM files, the BGZF format can also be used on other sequential
data (in the sense of one record after another), such as most of
the sequence data formats supported in Bio.SeqIO (like FASTA,
FASTQ, GenBank, etc) or large MAF alignments.

The Bio.SeqIO indexing functions use this module to support BGZF files.

Technical Introduction to BGZF
------------------------------
The gzip file format allows multiple compressed blocks, each of which
could be a stand-alone gzip file. As an interesting bonus, this means
you can use Unix ``cat`` to combine two or more gzip files into one by
concatenating them. Also, each block can have one of several compression
levels (including uncompressed, which actually takes up a little bit
more space due to the gzip header).

What the BAM designers realised was that while random access to data
stored in traditional gzip files was slow, breaking the file into
gzip blocks would allow fast random access to each block. To access
a particular piece of the decompressed data, you just need to know
which block it starts in (the offset of the gzip block start), and
how far into the (decompressed) contents of the block you need to
read.

One problem with this is finding the gzip block sizes efficiently.
You can do it with a standard gzip file, but it requires every block
to be decompressed -- and that would be rather slow. Additionally,
typical gzip files may use very large blocks.
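The concatenation trick described above can be checked directly with the
standard library (a sketch; each ``gzip.compress`` call writes one complete
gzip member, and joining them byte-for-byte is exactly what Unix ``cat``
does with two .gz files):

```python
import gzip

# Compress two chunks independently, giving two complete gzip members.
part1 = gzip.compress(b"first record\n")
part2 = gzip.compress(b"second record\n")

# Decompressing the concatenation yields the concatenated payloads,
# because the gzip format permits multi-member streams.
combined = part1 + part2
print(gzip.decompress(combined))
```

This multi-member behaviour is what makes BGZF files readable by ordinary
gzip tools despite being made of many small blocks.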

All that differs in BGZF is that the compressed size of each gzip block
is limited to 2^16 bytes, and an extra 'BC' field in the gzip header
records this size. Traditional decompression tools can ignore this,
and unzip the file just like any other gzip file.

The point of this is you can look at the first BGZF block, find out
how big it is from this 'BC' header, and thus seek immediately to
the second block, and so on.

The BAM indexing scheme records read positions using a 64 bit
'virtual offset', comprising ``coffset << 16 | uoffset``, where ``coffset``
is the file offset of the BGZF block containing the start of the read
(unsigned integer using up to 64-16 = 48 bits), and ``uoffset`` is the
offset within the (decompressed) block (unsigned 16 bit integer).

This limits you to BAM files where the last block starts within the
first 2^48 bytes, or 256 petabytes, and the decompressed size of each
block is at most 2^16 bytes, or 64kb. Note that this matches the BGZF
'BC' field size which limits the compressed size of each block to
2^16 bytes, allowing for BAM files to use BGZF with no gzip
compression (useful for intermediate files in memory to reduce
CPU load).

Warning about namespaces
------------------------
It is considered a bad idea to use "from XXX import ``*``" in Python, because
it pollutes the namespace. This is a real issue with Bio.bgzf (and the
standard Python library gzip) because they both contain a function called
open.
Suppose you do this:

>>> from Bio.bgzf import *
>>> print(open.__module__)
Bio.bgzf

Or,

>>> from gzip import *
>>> print(open.__module__)
gzip

Notice that the open function has been replaced. You can "fix" this if you
need to by importing the built-in open function:

>>> from builtins import open

However, what we recommend instead is to use the explicit namespace, e.g.

>>> from Bio import bgzf
>>> print(bgzf.open.__module__)
Bio.bgzf


Examples
--------
This is an ordinary GenBank file compressed using BGZF, so it can
be decompressed using gzip,

>>> import gzip
>>> handle = gzip.open("GenBank/NC_000932.gb.bgz", "r")
>>> assert 0 == handle.tell()
>>> line = handle.readline()
>>> assert 80 == handle.tell()
>>> line = handle.readline()
>>> assert 143 == handle.tell()
>>> data = handle.read(70000)
>>> assert 70143 == handle.tell()
>>> handle.close()

We can also access the file using the BGZF reader - but pay
attention to the file offsets which will be explained below:

>>> handle = BgzfReader("GenBank/NC_000932.gb.bgz", "r")
>>> assert 0 == handle.tell()
>>> print(handle.readline().rstrip())
LOCUS       NC_000932             154478 bp    DNA     circular PLN 15-APR-2009
>>> assert 80 == handle.tell()
>>> print(handle.readline().rstrip())
DEFINITION  Arabidopsis thaliana chloroplast, complete genome.
>>> assert 143 == handle.tell()
>>> data = handle.read(70000)
>>> assert 987828735 == handle.tell()
>>> print(handle.readline().rstrip())
f="GeneID:844718"
>>> print(handle.readline().rstrip())
     CDS             complement(join(84337..84771,85454..85843))
>>> offset = handle.seek(make_virtual_offset(55074, 126))
>>> print(handle.readline().rstrip())
    68521 tatgtcattc gaaattgtat aaagacaact cctatttaat agagctattt gtgcaagtat
>>> handle.close()

Notice the handle's offset looks different as a BGZF file. This
brings us to the key point about BGZF, which is the block structure:

>>> handle = open("GenBank/NC_000932.gb.bgz", "rb")
>>> for values in BgzfBlocks(handle):
...     print("Raw start %i, raw length %i; data start %i, data length %i" % values)
Raw start 0, raw length 15073; data start 0, data length 65536
Raw start 15073, raw length 17857; data start 65536, data length 65536
Raw start 32930, raw length 22144; data start 131072, data length 65536
Raw start 55074, raw length 22230; data start 196608, data length 65536
Raw start 77304, raw length 14939; data start 262144, data length 43478
Raw start 92243, raw length 28; data start 305622, data length 0
>>> handle.close()

In this example the first four blocks are 'full' and hold 65536 bytes
of uncompressed data. The fifth block isn't full and holds 43478 bytes.
Finally there is a special empty sixth block which takes 28 bytes on
disk and serves as an 'end of file' (EOF) marker. If this is missing,
it is possible your BGZF file is incomplete.
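The 28-byte EOF marker just mentioned can be checked for explicitly, much as
samtools does when warning about truncated files. A standalone sketch (the
helper name ``has_bgzf_eof`` is invented here; the marker bytes match this
module's ``_bgzf_eof`` constant, and the check assumes the file holds at
least 28 bytes):

```python
import gzip
import io

# The empty BGZF block used as an end-of-file marker (28 bytes).
BGZF_EOF = (
    b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00BC\x02\x00"
    b"\x1b\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00"
)

# The marker really is a complete, empty gzip member:
assert gzip.decompress(BGZF_EOF) == b""

def has_bgzf_eof(handle):
    """Return True if the binary handle ends with the BGZF EOF block."""
    handle.seek(-28, io.SEEK_END)  # assumes the file holds >= 28 bytes
    return handle.read(28) == BGZF_EOF
```

An interrupted write or incomplete download typically loses this trailing
block, so a False result here is a cheap truncation warning.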

By reading ahead 70,000 bytes we moved into the second BGZF block,
and at that point the BGZF virtual offsets start to look different
to a simple offset into the decompressed data as exposed by the gzip
library.

As an example, consider seeking to the decompressed position 196734.
Since 196734 = 65536 + 65536 + 65536 + 126 = 65536*3 + 126, this
is equivalent to jumping the first three blocks (which in this
specific example are all size 65536 after decompression - which
does not always hold) and starting at byte 126 of the fourth block
(after decompression). For BGZF, we need to know the fourth block's
offset of 55074 and the offset within the block of 126 to get the
BGZF virtual offset.

>>> print(55074 << 16 | 126)
3609329790
>>> print(bgzf.make_virtual_offset(55074, 126))
3609329790

Thus for this BGZF file, decompressed position 196734 corresponds
to the virtual offset 3609329790. However, another BGZF file with
different contents would have compressed more or less efficiently,
so the compressed blocks would be different sizes. What this means
is the mapping between the uncompressed offset and the compressed
virtual offset depends on the BGZF file you are using.

If you are accessing a BGZF file via this module, just use the
handle.tell() method to note the virtual offset of a position you
may later want to return to using handle.seek().

The catch with BGZF virtual offsets is that while they can be compared
(which offset comes first in the file), you cannot safely subtract
them to get the size of the data between them, nor add/subtract
a relative offset.
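The bit arithmetic described above is easy to sketch standalone (this simply
mirrors what make_virtual_offset and split_virtual_offset below implement;
the short function names here are invented for the example):

```python
def pack_voffset(coffset, uoffset):
    # 48-bit compressed block start offset, 16-bit within-block offset.
    assert 0 <= coffset < 1 << 48 and 0 <= uoffset < 1 << 16
    return (coffset << 16) | uoffset

def unpack_voffset(voffset):
    # Recover the two components from a 64-bit virtual offset.
    return voffset >> 16, voffset & 0xFFFF

# The worked example from the text above:
print(pack_voffset(55074, 126))    # 3609329790
print(unpack_voffset(3609329790))  # (55074, 126)
```

Virtual offsets still sort in file order, but their differences are not byte
counts: crossing a block boundary jumps the value by at least 2**16 however
little data lay in between, which is why relative arithmetic is unsafe.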

Of course you can parse this file with Bio.SeqIO using BgzfReader,
although there isn't any benefit over using gzip.open(...), unless
you want to index BGZF compressed sequence files:

>>> from Bio import SeqIO
>>> handle = BgzfReader("GenBank/NC_000932.gb.bgz")
>>> record = SeqIO.read(handle, "genbank")
>>> handle.close()
>>> print(record.id)
NC_000932.1

Text Mode
---------

Like the standard library gzip.open(...), the BGZF code defaults to opening
files in binary mode.

You can request the file be opened in text mode, but beware that this is hard
coded to the simple "latin1" (aka "iso-8859-1") encoding (which includes all
the ASCII characters), which works well with most Western European languages.
However, it is not fully compatible with the more widely used UTF-8 encoding.

In variable width encodings like UTF-8, some single characters in the unicode
text output are represented by multiple bytes in the raw binary form. This is
problematic with BGZF, as we cannot always decode each block in isolation - a
single unicode character could be split over two blocks. This can even happen
with fixed width unicode encodings, as the BGZF block size is not fixed.

Therefore, this module is currently restricted to only support single byte
unicode encodings, such as ASCII, "latin1" (which is a superset of ASCII), or
potentially other character maps (not implemented).

Furthermore, unlike the default text mode on Python 3, we do not attempt to
implement universal new line mode.
This transforms the various operating system
new line conventions like Windows (CR LF or "\r\n"), Unix (just LF, "\n"), or
old Macs (just CR, "\r"), into just LF ("\n"). Here we have the same problem -
is "\r" at the end of a block an incomplete Windows style new line?

Instead, you will get the CR ("\r") and LF ("\n") characters as is.

If your data is in UTF-8 or any other incompatible encoding, you must use
binary mode, and decode the appropriate fragments yourself.
"""

import struct
import sys
import zlib

from builtins import open as _open

_bgzf_magic = b"\x1f\x8b\x08\x04"
_bgzf_header = b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00\x42\x43\x02\x00"
_bgzf_eof = (
    b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00BC\x02\x00"
    b"\x1b\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00"
)
_bytes_BC = b"BC"


def open(filename, mode="rb"):
    r"""Open a BGZF file for reading, writing or appending.

    If text mode is requested, in order to avoid multi-byte characters, this is
    hard coded to use the "latin1" encoding, and "\r" and "\n" are passed as is
    (without implementing universal new line mode).

    If your data is in UTF-8 or any other incompatible encoding, you must use
    binary mode, and decode the appropriate fragments yourself.
jpayne@68: """ jpayne@68: if "r" in mode.lower(): jpayne@68: return BgzfReader(filename, mode) jpayne@68: elif "w" in mode.lower() or "a" in mode.lower(): jpayne@68: return BgzfWriter(filename, mode) jpayne@68: else: jpayne@68: raise ValueError(f"Bad mode {mode!r}") jpayne@68: jpayne@68: jpayne@68: def make_virtual_offset(block_start_offset, within_block_offset): jpayne@68: """Compute a BGZF virtual offset from block start and within block offsets. jpayne@68: jpayne@68: The BAM indexing scheme records read positions using a 64 bit jpayne@68: 'virtual offset', comprising in C terms: jpayne@68: jpayne@68: block_start_offset << 16 | within_block_offset jpayne@68: jpayne@68: Here block_start_offset is the file offset of the BGZF block jpayne@68: start (unsigned integer using up to 64-16 = 48 bits), and jpayne@68: within_block_offset within the (decompressed) block (unsigned jpayne@68: 16 bit integer). jpayne@68: jpayne@68: >>> make_virtual_offset(0, 0) jpayne@68: 0 jpayne@68: >>> make_virtual_offset(0, 1) jpayne@68: 1 jpayne@68: >>> make_virtual_offset(0, 2**16 - 1) jpayne@68: 65535 jpayne@68: >>> make_virtual_offset(0, 2**16) jpayne@68: Traceback (most recent call last): jpayne@68: ... jpayne@68: ValueError: Require 0 <= within_block_offset < 2**16, got 65536 jpayne@68: jpayne@68: >>> 65536 == make_virtual_offset(1, 0) jpayne@68: True jpayne@68: >>> 65537 == make_virtual_offset(1, 1) jpayne@68: True jpayne@68: >>> 131071 == make_virtual_offset(1, 2**16 - 1) jpayne@68: True jpayne@68: jpayne@68: >>> 6553600000 == make_virtual_offset(100000, 0) jpayne@68: True jpayne@68: >>> 6553600001 == make_virtual_offset(100000, 1) jpayne@68: True jpayne@68: >>> 6553600010 == make_virtual_offset(100000, 10) jpayne@68: True jpayne@68: jpayne@68: >>> make_virtual_offset(2**48, 0) jpayne@68: Traceback (most recent call last): jpayne@68: ... 
    ValueError: Require 0 <= block_start_offset < 2**48, got 281474976710656

    """
    if within_block_offset < 0 or within_block_offset >= 65536:
        raise ValueError(
            "Require 0 <= within_block_offset < 2**16, got %i" % within_block_offset
        )
    if block_start_offset < 0 or block_start_offset >= 281474976710656:
        raise ValueError(
            "Require 0 <= block_start_offset < 2**48, got %i" % block_start_offset
        )
    return (block_start_offset << 16) | within_block_offset


def split_virtual_offset(virtual_offset):
    """Divide a 64-bit BGZF virtual offset into block start & within block offsets.

    >>> (100000, 0) == split_virtual_offset(6553600000)
    True
    >>> (100000, 10) == split_virtual_offset(6553600010)
    True

    """
    start = virtual_offset >> 16
    return start, virtual_offset ^ (start << 16)


def BgzfBlocks(handle):
    """Low level debugging function to inspect BGZF blocks.

    Expects a BGZF compressed file opened in binary read mode using
    the builtin open function. Do not use a handle from this bgzf
    module or the gzip module's open function which will decompress
    the file.

    Returns the block start offset (see virtual offsets), the block
    length (add these for the start of the next block), and the
    decompressed length of the block's contents (limited to 65536 in
    BGZF), as an iterator - one tuple per BGZF block.

    >>> from builtins import open
    >>> handle = open("SamBam/ex1.bam", "rb")
    >>> for values in BgzfBlocks(handle):
    ...     print("Raw start %i, raw length %i; data start %i, data length %i" % values)
    Raw start 0, raw length 18239; data start 0, data length 65536
    Raw start 18239, raw length 18223; data start 65536, data length 65536
    Raw start 36462, raw length 18017; data start 131072, data length 65536
    Raw start 54479, raw length 17342; data start 196608, data length 65536
    Raw start 71821, raw length 17715; data start 262144, data length 65536
    Raw start 89536, raw length 17728; data start 327680, data length 65536
    Raw start 107264, raw length 17292; data start 393216, data length 63398
    Raw start 124556, raw length 28; data start 456614, data length 0
    >>> handle.close()

    Indirectly we can tell this file came from an old version of
    samtools because all the blocks (except the final one and the
    dummy empty EOF marker block) are 65536 bytes. Later versions
    avoid splitting a read between two blocks, and give the header
    its own block (useful to speed up replacing the header). You
    can see this in ex1_refresh.bam created using samtools 0.1.18:

        samtools view -b ex1.bam > ex1_refresh.bam

    >>> handle = open("SamBam/ex1_refresh.bam", "rb")
    >>> for values in BgzfBlocks(handle):
    ...     print("Raw start %i, raw length %i; data start %i, data length %i" % values)
    Raw start 0, raw length 53; data start 0, data length 38
    Raw start 53, raw length 18195; data start 38, data length 65434
    Raw start 18248, raw length 18190; data start 65472, data length 65409
    Raw start 36438, raw length 18004; data start 130881, data length 65483
    Raw start 54442, raw length 17353; data start 196364, data length 65519
    Raw start 71795, raw length 17708; data start 261883, data length 65411
    Raw start 89503, raw length 17709; data start 327294, data length 65466
    Raw start 107212, raw length 17390; data start 392760, data length 63854
    Raw start 124602, raw length 28; data start 456614, data length 0
    >>> handle.close()

    The above example has no embedded SAM header (thus the first block
    is very small at just 38 bytes of decompressed data), while the next
    example does (a larger block of 103 bytes). Notice that the rest of
    the blocks show the same sizes (they contain the same read data):

    >>> handle = open("SamBam/ex1_header.bam", "rb")
    >>> for values in BgzfBlocks(handle):
    ...     print("Raw start %i, raw length %i; data start %i, data length %i" % values)
    Raw start 0, raw length 104; data start 0, data length 103
    Raw start 104, raw length 18195; data start 103, data length 65434
    Raw start 18299, raw length 18190; data start 65537, data length 65409
    Raw start 36489, raw length 18004; data start 130946, data length 65483
    Raw start 54493, raw length 17353; data start 196429, data length 65519
    Raw start 71846, raw length 17708; data start 261948, data length 65411
    Raw start 89554, raw length 17709; data start 327359, data length 65466
    Raw start 107263, raw length 17390; data start 392825, data length 63854
    Raw start 124653, raw length 28; data start 456679, data length 0
    >>> handle.close()

    """
    if isinstance(handle, BgzfReader):
        raise TypeError("Function BgzfBlocks expects a binary handle")
    data_start = 0
    while True:
        start_offset = handle.tell()
        try:
            block_length, data = _load_bgzf_block(handle)
        except StopIteration:
            break
        data_len = len(data)
        yield start_offset, block_length, data_start, data_len
        data_start += data_len


def _load_bgzf_block(handle, text_mode=False):
    """Load the next BGZF block of compressed data (PRIVATE).

    Returns a tuple (block size and data), or at end of file
    will raise StopIteration.
    """
    magic = handle.read(4)
    if not magic:
        # End of file - should we signal this differently now?
        # See https://www.python.org/dev/peps/pep-0479/
        raise StopIteration
    if magic != _bgzf_magic:
        raise ValueError(
            r"A BGZF (e.g. a BAM file) block should start with "
            r"%r, not %r; handle.tell() now says %r"
            % (_bgzf_magic, magic, handle.tell())
        )
    gzip_mod_time, gzip_extra_flags, gzip_os, extra_len = struct.unpack(
        "<LBBH", handle.read(8)
    )
    block_size = None
    x_len = 0
    while x_len < extra_len:
        subfield_id = handle.read(2)
        subfield_len = struct.unpack("<H", handle.read(2))[0]  # uint16_t
        subfield_data = handle.read(subfield_len)
        x_len += subfield_len + 4
        if subfield_id == _bytes_BC:
            if subfield_len != 2:
                raise ValueError("Wrong BC payload length")
            if block_size is not None:
                raise ValueError("Two BC subfields?")
            block_size = struct.unpack("<H", subfield_data)[0] + 1  # uint16_t
    if x_len != extra_len:
        raise ValueError("x_len and extra_len differ")
    if block_size is None:
        raise ValueError("Missing BC, this isn't a BGZF file!")
    # Now comes the compressed data, CRC, and length of the uncompressed data.
    deflate_size = block_size - 1 - extra_len - 19
    d = zlib.decompressobj(-15)  # Negative window size means no headers
    data = d.decompress(handle.read(deflate_size)) + d.flush()
    expected_crc = handle.read(4)
    expected_size = struct.unpack("<I", handle.read(4))[0]
    if expected_size != len(data):
        raise RuntimeError("Decompressed to %i, not %i" % (len(data), expected_size))
    # Should cope with a mix of Python platforms...
    crc = zlib.crc32(data)
    if crc < 0:
        crc = crc & 0xFFFFFFFF
    expected_crc = struct.unpack("<I", expected_crc)[0]
    if expected_crc != crc:
        raise RuntimeError(f"CRC is {crc:x}, not {expected_crc:x}")
    if text_mode:
        # Note ISO-8859-1 aka Latin-1 preserves the first 256 characters
        return block_size, data.decode("latin-1")
    else:
        return block_size, data


class BgzfReader:
    r"""BGZF reader, acts like a read only handle but seek/tell differ.

    Let's use the BgzfBlocks function to have a peek at the BGZF blocks
    in an example BAM file,

    >>> from builtins import open
    >>> handle = open("SamBam/ex1.bam", "rb")
    >>> for values in BgzfBlocks(handle):
    ...     print("Raw start %i, raw length %i; data start %i, data length %i" % values)
    Raw start 0, raw length 18239; data start 0, data length 65536
    Raw start 18239, raw length 18223; data start 65536, data length 65536
    Raw start 36462, raw length 18017; data start 131072, data length 65536
    Raw start 54479, raw length 17342; data start 196608, data length 65536
    Raw start 71821, raw length 17715; data start 262144, data length 65536
    Raw start 89536, raw length 17728; data start 327680, data length 65536
    Raw start 107264, raw length 17292; data start 393216, data length 63398
    Raw start 124556, raw length 28; data start 456614, data length 0
    >>> handle.close()

    Now let's see how to use this block information to jump to
    specific parts of the decompressed BAM file:

    >>> handle = BgzfReader("SamBam/ex1.bam", "rb")
    >>> assert 0 == handle.tell()
    >>> magic = handle.read(4)
    >>> assert 4 == handle.tell()

    So far nothing so strange, we got the magic marker used at the
    start of a decompressed BAM file, and the handle position makes
    sense. Now however, let's jump to the end of this block and 4
    bytes into the next block by reading 65536 bytes,

    >>> data = handle.read(65536)
    >>> len(data)
    65536
    >>> assert 1195311108 == handle.tell()

    Expecting 4 + 65536 = 65540 were you?
    Well this is a BGZF 64-bit
    virtual offset, which means:

    >>> split_virtual_offset(1195311108)
    (18239, 4)

    You should spot 18239 as the start of the second BGZF block, while
    the 4 is the offset into this block. See also make_virtual_offset,

    >>> make_virtual_offset(18239, 4)
    1195311108

    Let's jump back to almost the start of the file,

    >>> make_virtual_offset(0, 2)
    2
    >>> handle.seek(2)
    2
    >>> handle.close()

    Note that you can use the max_cache argument to limit the number of
    BGZF blocks cached in memory. The default is 100, and since each
    block can be up to 64kb, the default cache could take up to 6MB of
    RAM. The cache is not important for reading through the file in one
    pass, but is important for improving performance of random access.
    """

    def __init__(self, filename=None, mode="r", fileobj=None, max_cache=100):
        r"""Initialize the class for reading a BGZF file.

        You would typically use the top level ``bgzf.open(...)`` function
        which will call this class internally. Direct use is discouraged.

        Either the ``filename`` (string) or ``fileobj`` (input file object in
        binary mode) arguments must be supplied, but not both.

        Argument ``mode`` controls if the data will be returned as strings in
        text mode ("rt", "tr", or default "r"), or as bytes in binary mode
        ("rb" or "br"). The argument name matches the built-in ``open(...)``
        and standard library ``gzip.open(...)`` function.

        If text mode is requested, in order to avoid multi-byte characters,
        this is hard coded to use the "latin1" encoding, and "\r" and "\n"
        are passed as is (without implementing universal new line mode). There
        is no ``encoding`` argument.

        If your data is in UTF-8 or any other incompatible encoding, you must
        use binary mode, and decode the appropriate fragments yourself.

        Argument ``max_cache`` controls the maximum number of BGZF blocks to
        cache in memory. Each can be up to 64kb, so the default of 100 blocks
        could take up to 6MB of RAM. This is important for efficient random
        access; a small value is fine for reading the file in one pass.
        """
        # TODO - Assuming we can seek, check for 28 bytes EOF empty block
        # and if missing warn about possible truncation (as in samtools)?
        if max_cache < 1:
            raise ValueError("Use max_cache with a minimum of 1")
        # Must open the BGZF file in binary mode, but we may want to
        # treat the contents as either text or binary (unicode or
        # bytes under Python 3)
        if filename and fileobj:
            raise ValueError("Supply either filename or fileobj, not both")
        # Want to reject output modes like w, a, x, +
        if mode.lower() not in ("r", "tr", "rt", "rb", "br"):
            raise ValueError(
                "Must use a read mode like 'r' (default), 'rt', or 'rb' for binary"
            )
        # If an open file was passed, make sure it was opened in binary mode.
        if fileobj:
            if fileobj.read(0) != b"":
                raise ValueError("fileobj not opened in binary mode")
            handle = fileobj
        else:
            handle = _open(filename, "rb")
        self._text = "b" not in mode.lower()
        if self._text:
            self._newline = "\n"
        else:
            self._newline = b"\n"
        self._handle = handle
        self.max_cache = max_cache
        self._buffers = {}
        self._block_start_offset = None
        self._block_raw_length = None
        self._load_block(handle.tell())

    def _load_block(self, start_offset=None):
        if start_offset is None:
            # If the file is being read sequentially, then _handle.tell()
            # should be pointing at the start of the next block.
            # However, if seek has been used, we can't assume that.
            start_offset = self._block_start_offset + self._block_raw_length
        if start_offset == self._block_start_offset:
            self._within_block_offset = 0
            return
        elif start_offset in self._buffers:
            # Already in cache
            self._buffer, self._block_raw_length = self._buffers[start_offset]
            self._within_block_offset = 0
            self._block_start_offset = start_offset
            return
        # Must hit the disk... first check cache limits,
        while len(self._buffers) >= self.max_cache:
            # TODO - Implement LRU cache removal?
            self._buffers.popitem()
        # Now load the block
        handle = self._handle
        if start_offset is not None:
            handle.seek(start_offset)
        self._block_start_offset = handle.tell()
        try:
            block_size, self._buffer = _load_bgzf_block(handle, self._text)
        except StopIteration:
            # EOF
            block_size = 0
            if self._text:
                self._buffer = ""
            else:
                self._buffer = b""
        self._within_block_offset = 0
        self._block_raw_length = block_size
        # Finally save the block in our cache,
        self._buffers[self._block_start_offset] = self._buffer, block_size

    def tell(self):
        """Return a 64-bit unsigned BGZF virtual offset."""
        if 0 < self._within_block_offset and self._within_block_offset == len(
            self._buffer
        ):
            # Special case where we're right at the end of a (non empty) block.
            # For non-maximal blocks could give two possible virtual offsets,
            # but for a maximal block can't use 65536 as the within block
            # offset. Therefore for consistency, use the next block and a
            # within block offset of zero.
            return (self._block_start_offset + self._block_raw_length) << 16
        else:
            # return make_virtual_offset(self._block_start_offset,
            #                            self._within_block_offset)
            # TODO - Include bounds checking as in make_virtual_offset?
            return (self._block_start_offset << 16) | self._within_block_offset

    def seek(self, virtual_offset):
        """Seek to a 64-bit unsigned BGZF virtual offset."""
        # Do this inline to avoid a function call,
        # start_offset, within_block = split_virtual_offset(virtual_offset)
        start_offset = virtual_offset >> 16
        within_block = virtual_offset ^ (start_offset << 16)
        if start_offset != self._block_start_offset:
            # Don't need to load the block if already there
            # (this avoids a function call since _load_block would do nothing)
            self._load_block(start_offset)
            if start_offset != self._block_start_offset:
                raise ValueError("start_offset not loaded correctly")
        if within_block > len(self._buffer):
            if not (within_block == 0 and len(self._buffer) == 0):
                raise ValueError(
                    "Within offset %i but block size only %i"
                    % (within_block, len(self._buffer))
                )
        self._within_block_offset = within_block
        # assert virtual_offset == self.tell(), \
        #     "Did seek to %i (%i, %i), but tell says %i (%i, %i)" \
        #     % (virtual_offset, start_offset, within_block,
        #        self.tell(), self._block_start_offset,
        #        self._within_block_offset)
        return virtual_offset

    def read(self, size=-1):
        """Read method for the BGZF reader."""
        if size < 0:
            raise NotImplementedError("Don't be greedy, that could be massive!")

        result = "" if self._text else b""
        while size and self._block_raw_length:
            if self._within_block_offset + size <= len(self._buffer):
                # This may leave us right at the end of a block
                # (lazy loading, don't load the next block unless we have to)
                data = self._buffer[
                    self._within_block_offset : self._within_block_offset + size
                ]
                self._within_block_offset += size
                if not data:
                    raise ValueError("Must be at least 1 byte")
                result += data
                break
            else:
                data = self._buffer[self._within_block_offset :]
                size -= len(data)
                self._load_block()  # will reset offsets
                result += data

        return result

    def readline(self):
        """Read a single line from the BGZF file."""
        result = "" if self._text else b""
        while self._block_raw_length:
            i = self._buffer.find(self._newline, self._within_block_offset)
            # Three cases to consider,
            if i == -1:
                # No newline, need to read in more data
                data = self._buffer[self._within_block_offset :]
                self._load_block()  # will reset offsets
                result += data
            elif i + 1 == len(self._buffer):
                # Found newline, but right at end of block (SPECIAL)
                data = self._buffer[self._within_block_offset :]
                # Must now load the next block to ensure tell() works
                self._load_block()  # will reset offsets
                if not data:
                    raise ValueError("Must be at least 1 byte")
                result += data
                break
            else:
                # Found newline, not at end of block (easy case, no IO)
                data = self._buffer[self._within_block_offset : i + 1]
                self._within_block_offset = i + 1
                # assert data.endswith(self._newline)
                result += data
                break

        return result

    def __next__(self):
        """Return the next line."""
        line = self.readline()
        if not line:
            raise StopIteration
        return line

    def __iter__(self):
        """Iterate over the lines in the BGZF file."""
        return self
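The bit arithmetic behind ``tell()`` and ``seek()`` above can be sketched in isolation: a virtual offset packs the gzip block's start offset in the compressed file into the upper 48 bits and the within-block offset into the lower 16 bits. The helper names below are illustrative only; the module's own ``make_virtual_offset`` and ``split_virtual_offset`` play the same roles.

```python
def pack_virtual_offset(block_start_offset, within_block_offset):
    """Combine a compressed block start offset and a within-block offset."""
    if not 0 <= within_block_offset < 65536:
        raise ValueError("within_block_offset must fit in 16 bits")
    return (block_start_offset << 16) | within_block_offset


def unpack_virtual_offset(virtual_offset):
    """Recover (block_start_offset, within_block_offset) from a virtual offset."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


# A record 42 bytes into the decompressed contents of the gzip block
# that starts at compressed file offset 4096:
voffset = pack_virtual_offset(4096, 42)
assert unpack_virtual_offset(voffset) == (4096, 42)
```

Note that virtual offsets are therefore not byte counts into the decompressed data; they only compare equal when both the block start and the within-block offset match, which is why ``tell()`` normalises end-of-block positions to the start of the next block.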
    def close(self):
        """Close BGZF file."""
        self._handle.close()
        self._buffer = None
        self._block_start_offset = None
        self._buffers = None

    def seekable(self):
        """Return True indicating the BGZF reader supports random access."""
        return True

    def isatty(self):
        """Return True if connected to a TTY device."""
        return False

    def fileno(self):
        """Return integer file descriptor."""
        return self._handle.fileno()

    def __enter__(self):
        """Return self for use in a with statement."""
        return self

    def __exit__(self, type, value, traceback):
        """Close the file at the end of a with statement."""
        self.close()


class BgzfWriter:
    """Class to write BGZF compressed files."""

    def __init__(self, filename=None, mode="w", fileobj=None, compresslevel=6):
        """Initialize the class."""
        if filename and fileobj:
            raise ValueError("Supply either filename or fileobj, not both")
        # If an open file was passed, make sure it was opened in binary mode.
        if fileobj:
            if fileobj.read(0) != b"":
                raise ValueError("fileobj not opened in binary mode")
            handle = fileobj
        else:
            if "w" not in mode.lower() and "a" not in mode.lower():
                raise ValueError(f"Must use write or append mode, not {mode!r}")
            if "a" in mode.lower():
                handle = _open(filename, "ab")
            else:
                handle = _open(filename, "wb")
        self._text = "b" not in mode.lower()
        self._handle = handle
        self._buffer = b""
        self.compresslevel = compresslevel

    def _write_block(self, block):
        """Write provided data to file as a single BGZF compressed block (PRIVATE)."""
        # print("Saving %i bytes" % len(block))
        if len(block) > 65536:
            raise ValueError(f"Block length {len(block)} > 65536")
        # Giving a negative window bits means no gzip/zlib headers,
        # -15 used in samtools
        c = zlib.compressobj(
            self.compresslevel, zlib.DEFLATED, -15, zlib.DEF_MEM_LEVEL, 0
        )
        compressed = c.compress(block) + c.flush()
        del c
        if len(compressed) > 65536:
            raise RuntimeError(
                "TODO - Didn't compress enough, try less data in this block"
            )
        crc = zlib.crc32(block)
        # Should cope with a mix of Python platforms...
        if crc < 0:
            crc = struct.pack("<i", crc)
        else:
            crc = struct.pack("<I", crc)
        bsize = struct.pack("<H", len(compressed) + 25)  # includes -1
        uncompressed_length = struct.pack("<I", len(block))
        # Fixed 16 bytes,
        # gzip magic bytes (4), mod time (4),
        # gzip flag (1), os (1), extra length which is six (2),
        # sub field which is BC (2), sub field length of two (2),
        # Variable data,
        # 2 bytes: block length as BC sub field (2)
        # X bytes: the compressed data
        # 8 bytes: crc (4), uncompressed data length (4)
        data = _bgzf_header + bsize + compressed + crc + uncompressed_length
        self._handle.write(data)

    def write(self, data):
        """Write method for the BGZF writer."""
        if isinstance(data, str):
            # When reading we can't cope with multi-byte characters
            # spanning between BGZF blocks, so we restrict output to a
            # single byte encoding like ASCII or latin-1.
            data = data.encode("latin-1")
        self._buffer += data
        while len(self._buffer) >= 65536:
            self._write_block(self._buffer[:65536])
            self._buffer = self._buffer[65536:]

    def flush(self):
        """Flush data explicitly."""
        while len(self._buffer) >= 65536:
            self._write_block(self._buffer[:65535])
            self._buffer = self._buffer[65535:]
        self._write_block(self._buffer)
        self._buffer = b""
        self._handle.flush()

    def close(self):
        """Flush data, write 28 bytes BGZF EOF marker, and close BGZF file.

        samtools will look for a magic EOF marker, just a 28 byte empty BGZF
        block, and if it is missing warns the BAM file may be truncated. In
        addition to samtools writing this block, so too does bgzip - so this
        implementation does too.
        """
        if self._buffer:
            self.flush()
        self._handle.write(_bgzf_eof)
        self._handle.flush()
        self._handle.close()

    def tell(self):
        """Return a BGZF 64-bit virtual offset."""
        return make_virtual_offset(self._handle.tell(), len(self._buffer))

    def seekable(self):
        """Return False; BGZF writers support tell() but not seek()."""
        # Not seekable, but we do support tell...
        return False

    def isatty(self):
        """Return True if connected to a TTY device."""
        return False

    def fileno(self):
        """Return integer file descriptor."""
        return self._handle.fileno()

    def __enter__(self):
        """Return self for use in a with statement."""
        return self

    def __exit__(self, type, value, traceback):
        """Close the file at the end of a with statement."""
        self.close()


if __name__ == "__main__":
    if len(sys.argv) > 1:
        print("Call this with no arguments and pipe uncompressed data in on stdin")
        print("and it will produce BGZF compressed data on stdout. e.g.")
        print("")
        print("./bgzf.py < example.fastq > example.fastq.bgz")
        print("")
        print("The extension convention of *.bgz is to distinguish these from *.gz")
        print("used for standard gzipped files without the block structure of BGZF.")
        print("You can use the standard gunzip command to decompress BGZF files,")
        print("if it complains about the extension try something like this:")
        print("")
        print("cat example.fastq.bgz | gunzip > example.fastq")
        print("")
        print("See also the tool bgzip that comes with samtools")
        sys.exit(0)

    # Ensure we have binary mode handles
    # (leave stderr as default text mode)
    stdin = sys.stdin.buffer
    stdout = sys.stdout.buffer

    sys.stderr.write("Producing BGZF output from stdin...\n")
    w = BgzfWriter(fileobj=stdout)
    while True:
        data = stdin.read(65536)
        w.write(data)
        if not data:
            break
    # Doing close will write an empty BGZF block as EOF marker:
    w.close()
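The 28-byte EOF marker that ``close()`` appends can be checked with the standard library alone. The sketch below mirrors the module's ``_bgzf_eof`` constant (the ``BGZF_EOF`` name is local to this example): the marker is itself a complete BGZF block whose deflate payload is the empty string, with the 'BC' extra subfield holding the total block size minus one.

```python
import struct
import zlib

# The 28-byte empty BGZF block samtools expects at EOF
# (mirrors the module's _bgzf_eof constant).
BGZF_EOF = (
    b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00\x42\x43"
    b"\x02\x00\x1b\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00"
)

assert len(BGZF_EOF) == 28
# Gzip magic bytes, deflate method, FEXTRA flag set:
assert BGZF_EOF[:4] == b"\x1f\x8b\x08\x04"
# The BC subfield at bytes 16-17 stores the total block size minus one:
assert struct.unpack("<H", BGZF_EOF[16:18])[0] + 1 == len(BGZF_EOF)
# The deflate payload between the 18-byte header and the 8-byte
# CRC/length trailer decompresses to the empty string:
assert zlib.decompress(BGZF_EOF[18:-8], -15) == b""
```

This is why a truncated BAM file is detectable: if the file does not end with exactly these 28 bytes, samtools warns that the download or write may have been cut short.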
    sys.stderr.write("BGZF data produced\n")
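Finally, the block layout produced by ``_write_block`` can be exercised without this module at all. The ``bgzf_block`` helper below is a hypothetical standalone mirror of that method (the header constant matches the module's ``_bgzf_header``), and since every BGZF block is a complete gzip member, the standard ``gzip`` module can decompress a concatenation of blocks:

```python
import gzip
import struct
import zlib

# Fixed 16-byte BGZF header: gzip magic, mtime 0, FEXTRA flag set, and a
# 6-byte extra field declaring the 'BC' subfield (mirrors _bgzf_header).
BGZF_HEADER = b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00\x42\x43\x02\x00"


def bgzf_block(data, compresslevel=6):
    """Pack one payload (at most 65536 bytes) as a single BGZF block.

    Hypothetical standalone sketch of BgzfWriter._write_block above.
    """
    c = zlib.compressobj(compresslevel, zlib.DEFLATED, -15, zlib.DEF_MEM_LEVEL, 0)
    compressed = c.compress(data) + c.flush()
    # The BC subfield records the total block size minus one:
    bsize = struct.pack("<H", len(compressed) + 25)
    crc = struct.pack("<I", zlib.crc32(data) & 0xFFFFFFFF)
    return BGZF_HEADER + bsize + compressed + crc + struct.pack("<I", len(data))


raw = bgzf_block(b"hello ") + bgzf_block(b"world")
# Each block is a valid gzip member, so plain gzip can decompress the pair:
assert gzip.decompress(raw) == b"hello world"
```

This round trip also illustrates why random access works: the reader can jump to the start of the second block (here ``len(bgzf_block(b"hello "))`` bytes in) and decompress it on its own, without touching the first block.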