csp2: CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/lib/python3.8/site-packages/Bio/bgzf.py annotate

annotate CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/lib/python3.8/site-packages/Bio/bgzf.py @ 69:33d812a61356

planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d

author	jpayne
date	Tue, 18 Mar 2025 17:55:14 -0400
parents
children

rev	line source
jpayne@69	1 #!/usr/bin/env python
jpayne@69	2 # Copyright 2010-2018 by Peter Cock.
jpayne@69	3 # All rights reserved.
jpayne@69	4 #
jpayne@69	5 # This file is part of the Biopython distribution and governed by your
jpayne@69	6 # choice of the "Biopython License Agreement" or the "BSD 3-Clause License".
jpayne@69	7 # Please see the LICENSE file that should have been included as part of this
jpayne@69	8 # package.
jpayne@69	9 r"""Read and write BGZF compressed files (the GZIP variant used in BAM).
jpayne@69	10
jpayne@69	11 The SAM/BAM file format (Sequence Alignment/Map) comes in a plain text
jpayne@69	12 format (SAM), and a compressed binary format (BAM). The latter uses a
jpayne@69	13 modified form of gzip compression called BGZF (Blocked GNU Zip Format),
jpayne@69	14 which can be applied to any file format to provide compression with
jpayne@69	15 efficient random access. BGZF is described together with the SAM/BAM
jpayne@69	16 file format at https://samtools.sourceforge.net/SAM1.pdf
jpayne@69	17
jpayne@69	18 Please read the text below about 'virtual offsets' before using BGZF
jpayne@69	19 files for random access.
jpayne@69	20
jpayne@69	21 Aim of this module
jpayne@69	22 ------------------
jpayne@69	23 The Python gzip library can be used to read BGZF files, since for
jpayne@69	24 decompression they are just (specialised) gzip files. What this
jpayne@69	25 module aims to facilitate is random access to BGZF files (using the
jpayne@69	26 'virtual offset' idea), and writing BGZF files (which means using
jpayne@69	27 suitably sized gzip blocks and writing the extra 'BC' field in the
jpayne@69	28 gzip headers). As in the gzip library, the zlib library is used
jpayne@69	29 internally.
jpayne@69	30
jpayne@69	31 In addition to being required for random access to and writing of
jpayne@69	32 BAM files, the BGZF format can also be used on other sequential
jpayne@69	33 data (in the sense of one record after another), such as most of
jpayne@69	34 the sequence data formats supported in Bio.SeqIO (like FASTA,
jpayne@69	35 FASTQ, GenBank, etc) or large MAF alignments.
jpayne@69	36
jpayne@69	37 The Bio.SeqIO indexing functions use this module to support BGZF files.
jpayne@69	38
jpayne@69	39 Technical Introduction to BGZF
jpayne@69	40 ------------------------------
jpayne@69	41 The gzip file format allows multiple compressed blocks, each of which
jpayne@69	42 could be a stand alone gzip file. As an interesting bonus, this means
jpayne@69	43 you can use Unix ``cat`` to combine two or more gzip files into one by
jpayne@69	44 concatenating them. Also, each block can have one of several compression
jpayne@69	45 levels (including uncompressed, which actually takes up a little bit
jpayne@69	46 more space due to the gzip header).
jpayne@69	47
jpayne@69	48 What the BAM designers realised was that while random access to data
jpayne@69	49 stored in traditional gzip files was slow, breaking the file into
jpayne@69	50 gzip blocks would allow fast random access to each block. To access
jpayne@69	51 a particular piece of the decompressed data, you just need to know
jpayne@69	52 which block it starts in (the offset of the gzip block start), and
jpayne@69	53 how far into the (decompressed) contents of the block you need to
jpayne@69	54 read.
jpayne@69	55
jpayne@69	56 One problem with this is finding the gzip block sizes efficiently.
jpayne@69	57 You can do it with a standard gzip file, but it requires every block
jpayne@69	58 to be decompressed -- and that would be rather slow. Additionally
jpayne@69	59 typical gzip files may use very large blocks.
jpayne@69	60
jpayne@69	61 All that differs in BGZF is that compressed size of each gzip block
jpayne@69	62 is limited to 2^16 bytes, and an extra 'BC' field in the gzip header
jpayne@69	63 records this size. Traditional decompression tools can ignore this,
jpayne@69	64 and unzip the file just like any other gzip file.
jpayne@69	65
jpayne@69	66 The point of this is you can look at the first BGZF block, find out
jpayne@69	67 how big it is from this 'BC' header, and thus seek immediately to
jpayne@69	68 the second block, and so on.
jpayne@69	69
jpayne@69	70 The BAM indexing scheme records read positions using a 64 bit
jpayne@69	71 'virtual offset', comprising ``coffset << 16 \| uoffset``, where ``coffset``
jpayne@69	72 is the file offset of the BGZF block containing the start of the read
jpayne@69	73 (unsigned integer using up to 64-16 = 48 bits), and ``uoffset`` is the
jpayne@69	74 offset within the (decompressed) block (unsigned 16 bit integer).
jpayne@69	75
jpayne@69	76 This limits you to BAM files where the last block starts by 2^48
jpayne@69	77 bytes, or 256 petabytes, and the decompressed size of each block
jpayne@69	78 is at most 2^16 bytes, or 64kb. Note that this matches the BGZF
jpayne@69	79 'BC' field size which limits the compressed size of each block to
jpayne@69	80 2^16 bytes, allowing for BAM files to use BGZF with no gzip
jpayne@69	81 compression (useful for intermediate files in memory to reduce
jpayne@69	82 CPU load).
jpayne@69	83
jpayne@69	84 Warning about namespaces
jpayne@69	85 ------------------------
jpayne@69	86 It is considered a bad idea to use "from XXX import ``*``" in Python, because
jpayne@69	87 it pollutes the namespace. This is a real issue with Bio.bgzf (and the
jpayne@69	88 standard Python library gzip) because they contain a function called open
jpayne@69	89 i.e. Suppose you do this:
jpayne@69	90
jpayne@69	91 >>> from Bio.bgzf import *
jpayne@69	92 >>> print(open.__module__)
jpayne@69	93 Bio.bgzf
jpayne@69	94
jpayne@69	95 Or,
jpayne@69	96
jpayne@69	97 >>> from gzip import *
jpayne@69	98 >>> print(open.__module__)
jpayne@69	99 gzip
jpayne@69	100
jpayne@69	101 Notice that the open function has been replaced. You can "fix" this if you
jpayne@69	102 need to by importing the built-in open function:
jpayne@69	103
jpayne@69	104 >>> from builtins import open
jpayne@69	105
jpayne@69	106 However, what we recommend instead is to use the explicit namespace, e.g.
jpayne@69	107
jpayne@69	108 >>> from Bio import bgzf
jpayne@69	109 >>> print(bgzf.open.__module__)
jpayne@69	110 Bio.bgzf
jpayne@69	111
jpayne@69	112
jpayne@69	113 Examples
jpayne@69	114 --------
jpayne@69	115 This is an ordinary GenBank file compressed using BGZF, so it can
jpayne@69	116 be decompressed using gzip,
jpayne@69	117
jpayne@69	118 >>> import gzip
jpayne@69	119 >>> handle = gzip.open("GenBank/NC_000932.gb.bgz", "r")
jpayne@69	120 >>> assert 0 == handle.tell()
jpayne@69	121 >>> line = handle.readline()
jpayne@69	122 >>> assert 80 == handle.tell()
jpayne@69	123 >>> line = handle.readline()
jpayne@69	124 >>> assert 143 == handle.tell()
jpayne@69	125 >>> data = handle.read(70000)
jpayne@69	126 >>> assert 70143 == handle.tell()
jpayne@69	127 >>> handle.close()
jpayne@69	128
jpayne@69	129 We can also access the file using the BGZF reader - but pay
jpayne@69	130 attention to the file offsets which will be explained below:
jpayne@69	131
jpayne@69	132 >>> handle = BgzfReader("GenBank/NC_000932.gb.bgz", "r")
jpayne@69	133 >>> assert 0 == handle.tell()
jpayne@69	134 >>> print(handle.readline().rstrip())
jpayne@69	135 LOCUS NC_000932 154478 bp DNA circular PLN 15-APR-2009
jpayne@69	136 >>> assert 80 == handle.tell()
jpayne@69	137 >>> print(handle.readline().rstrip())
jpayne@69	138 DEFINITION Arabidopsis thaliana chloroplast, complete genome.
jpayne@69	139 >>> assert 143 == handle.tell()
jpayne@69	140 >>> data = handle.read(70000)
jpayne@69	141 >>> assert 987828735 == handle.tell()
jpayne@69	142 >>> print(handle.readline().rstrip())
jpayne@69	143 f="GeneID:844718"
jpayne@69	144 >>> print(handle.readline().rstrip())
jpayne@69	145 CDS complement(join(84337..84771,85454..85843))
jpayne@69	146 >>> offset = handle.seek(make_virtual_offset(55074, 126))
jpayne@69	147 >>> print(handle.readline().rstrip())
jpayne@69	148 68521 tatgtcattc gaaattgtat aaagacaact cctatttaat agagctattt gtgcaagtat
jpayne@69	149 >>> handle.close()
jpayne@69	150
jpayne@69	151 Notice the handle's offset looks different as a BGZF file. This
jpayne@69	152 brings us to the key point about BGZF, which is the block structure:
jpayne@69	153
jpayne@69	154 >>> handle = open("GenBank/NC_000932.gb.bgz", "rb")
jpayne@69	155 >>> for values in BgzfBlocks(handle):
jpayne@69	156 ... print("Raw start %i, raw length %i; data start %i, data length %i" % values)
jpayne@69	157 Raw start 0, raw length 15073; data start 0, data length 65536
jpayne@69	158 Raw start 15073, raw length 17857; data start 65536, data length 65536
jpayne@69	159 Raw start 32930, raw length 22144; data start 131072, data length 65536
jpayne@69	160 Raw start 55074, raw length 22230; data start 196608, data length 65536
jpayne@69	161 Raw start 77304, raw length 14939; data start 262144, data length 43478
jpayne@69	162 Raw start 92243, raw length 28; data start 305622, data length 0
jpayne@69	163 >>> handle.close()
jpayne@69	164
jpayne@69	165 In this example the first three blocks are 'full' and hold 65536 bytes
jpayne@69	166 of uncompressed data. The fourth block isn't full and holds 43478 bytes.
jpayne@69	167 Finally there is a special empty fifth block which takes 28 bytes on
jpayne@69	168 disk and serves as an 'end of file' (EOF) marker. If this is missing,
jpayne@69	169 it is possible your BGZF file is incomplete.
jpayne@69	170
jpayne@69	171 By reading ahead 70,000 bytes we moved into the second BGZF block,
jpayne@69	172 and at that point the BGZF virtual offsets start to look different
jpayne@69	173 to a simple offset into the decompressed data as exposed by the gzip
jpayne@69	174 library.
jpayne@69	175
jpayne@69	176 As an example, consider seeking to the decompressed position 196734.
jpayne@69	177 Since 196734 = 65536 + 65536 + 65536 + 126 = 65536*3 + 126, this
jpayne@69	178 is equivalent to jumping the first three blocks (which in this
jpayne@69	179 specific example are all size 65536 after decompression - which
jpayne@69	180 does not always hold) and starting at byte 126 of the fourth block
jpayne@69	181 (after decompression). For BGZF, we need to know the fourth block's
jpayne@69	182 offset of 55074 and the offset within the block of 126 to get the
jpayne@69	183 BGZF virtual offset.
jpayne@69	184
jpayne@69	185 >>> print(55074 << 16 \| 126)
jpayne@69	186 3609329790
jpayne@69	187 >>> print(bgzf.make_virtual_offset(55074, 126))
jpayne@69	188 3609329790
jpayne@69	189
jpayne@69	190 Thus for this BGZF file, decompressed position 196734 corresponds
jpayne@69	191 to the virtual offset 3609329790. However, another BGZF file with
jpayne@69	192 different contents would have compressed more or less efficiently,
jpayne@69	193 so the compressed blocks would be different sizes. What this means
jpayne@69	194 is the mapping between the uncompressed offset and the compressed
jpayne@69	195 virtual offset depends on the BGZF file you are using.
jpayne@69	196
jpayne@69	197 If you are accessing a BGZF file via this module, just use the
jpayne@69	198 handle.tell() method to note the virtual offset of a position you
jpayne@69	199 may later want to return to using handle.seek().
jpayne@69	200
jpayne@69	201 The catch with BGZF virtual offsets is while they can be compared
jpayne@69	202 (which offset comes first in the file), you cannot safely subtract
jpayne@69	203 them to get the size of the data between them, nor add/subtract
jpayne@69	204 a relative offset.
jpayne@69	205
jpayne@69	206 Of course you can parse this file with Bio.SeqIO using BgzfReader,
jpayne@69	207 although there isn't any benefit over using gzip.open(...), unless
jpayne@69	208 you want to index BGZF compressed sequence files:
jpayne@69	209
jpayne@69	210 >>> from Bio import SeqIO
jpayne@69	211 >>> handle = BgzfReader("GenBank/NC_000932.gb.bgz")
jpayne@69	212 >>> record = SeqIO.read(handle, "genbank")
jpayne@69	213 >>> handle.close()
jpayne@69	214 >>> print(record.id)
jpayne@69	215 NC_000932.1
jpayne@69	216
jpayne@69	217 Text Mode
jpayne@69	218 ---------
jpayne@69	219
jpayne@69	220 Like the standard library gzip.open(...), the BGZF code defaults to opening
jpayne@69	221 files in binary mode.
jpayne@69	222
jpayne@69	223 You can request the file be opened in text mode, but beware that this is hard
jpayne@69	224 coded to the simple "latin1" (aka "iso-8859-1") encoding (which includes all
jpayne@69	225 the ASCII characters), which works well with most Western European languages.
jpayne@69	226 However, it is not fully compatible with the more widely used UTF-8 encoding.
jpayne@69	227
jpayne@69	228 In variable width encodings like UTF-8, some single characters in the unicode
jpayne@69	229 text output are represented by multiple bytes in the raw binary form. This is
jpayne@69	230 problematic with BGZF, as we cannot always decode each block in isolation - a
jpayne@69	231 single unicode character could be split over two blocks. This can even happen
jpayne@69	232 with fixed width unicode encodings, as the BGZF block size is not fixed.
jpayne@69	233
jpayne@69	234 Therefore, this module is currently restricted to only support single byte
jpayne@69	235 unicode encodings, such as ASCII, "latin1" (which is a superset of ASCII), or
jpayne@69	236 potentially other character maps (not implemented).
jpayne@69	237
jpayne@69	238 Furthermore, unlike the default text mode on Python 3, we do not attempt to
jpayne@69	239 implement universal new line mode. This transforms the various operating system
jpayne@69	240 new line conventions like Windows (CR LF or "\r\n"), Unix (just LF, "\n"), or
jpayne@69	241 old Macs (just CR, "\r"), into just LF ("\n"). Here we have the same problem -
jpayne@69	242 is "\r" at the end of a block an incomplete Windows style new line?
jpayne@69	243
jpayne@69	244 Instead, you will get the CR ("\r") and LF ("\n") characters as is.
jpayne@69	245
jpayne@69	246 If your data is in UTF-8 or any other incompatible encoding, you must use
jpayne@69	247 binary mode, and decode the appropriate fragments yourself.
jpayne@69	248 """
jpayne@69	249
jpayne@69	250 import struct
jpayne@69	251 import sys
jpayne@69	252 import zlib
jpayne@69	253
jpayne@69	254 from builtins import open as _open
jpayne@69	255
jpayne@69	256 _bgzf_magic = b"\x1f\x8b\x08\x04"
jpayne@69	257 _bgzf_header = b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00\x42\x43\x02\x00"
jpayne@69	258 _bgzf_eof = b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00BC\x02\x00\x1b\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00"
jpayne@69	259 _bytes_BC = b"BC"
jpayne@69	260
jpayne@69	261
jpayne@69	262 def open(filename, mode="rb"):
jpayne@69	263 r"""Open a BGZF file for reading, writing or appending.
jpayne@69	264
jpayne@69	265 If text mode is requested, in order to avoid multi-byte characters, this is
jpayne@69	266 hard coded to use the "latin1" encoding, and "\r" and "\n" are passed as is
jpayne@69	267 (without implementing universal new line mode).
jpayne@69	268
jpayne@69	269 If your data is in UTF-8 or any other incompatible encoding, you must use
jpayne@69	270 binary mode, and decode the appropriate fragments yourself.
jpayne@69	271 """
jpayne@69	272 if "r" in mode.lower():
jpayne@69	273 return BgzfReader(filename, mode)
jpayne@69	274 elif "w" in mode.lower() or "a" in mode.lower():
jpayne@69	275 return BgzfWriter(filename, mode)
jpayne@69	276 else:
jpayne@69	277 raise ValueError(f"Bad mode {mode!r}")
jpayne@69	278
jpayne@69	279
jpayne@69	280 def make_virtual_offset(block_start_offset, within_block_offset):
jpayne@69	281 """Compute a BGZF virtual offset from block start and within block offsets.
jpayne@69	282
jpayne@69	283 The BAM indexing scheme records read positions using a 64 bit
jpayne@69	284 'virtual offset', comprising in C terms:
jpayne@69	285
jpayne@69	286 block_start_offset << 16 \| within_block_offset
jpayne@69	287
jpayne@69	288 Here block_start_offset is the file offset of the BGZF block
jpayne@69	289 start (unsigned integer using up to 64-16 = 48 bits), and
jpayne@69	290 within_block_offset within the (decompressed) block (unsigned
jpayne@69	291 16 bit integer).
jpayne@69	292
jpayne@69	293 >>> make_virtual_offset(0, 0)
jpayne@69	294 0
jpayne@69	295 >>> make_virtual_offset(0, 1)
jpayne@69	296 1
jpayne@69	297 >>> make_virtual_offset(0, 2**16 - 1)
jpayne@69	298 65535
jpayne@69	299 >>> make_virtual_offset(0, 2**16)
jpayne@69	300 Traceback (most recent call last):
jpayne@69	301 ...
jpayne@69	302 ValueError: Require 0 <= within_block_offset < 2**16, got 65536
jpayne@69	303
jpayne@69	304 >>> 65536 == make_virtual_offset(1, 0)
jpayne@69	305 True
jpayne@69	306 >>> 65537 == make_virtual_offset(1, 1)
jpayne@69	307 True
jpayne@69	308 >>> 131071 == make_virtual_offset(1, 2**16 - 1)
jpayne@69	309 True
jpayne@69	310
jpayne@69	311 >>> 6553600000 == make_virtual_offset(100000, 0)
jpayne@69	312 True
jpayne@69	313 >>> 6553600001 == make_virtual_offset(100000, 1)
jpayne@69	314 True
jpayne@69	315 >>> 6553600010 == make_virtual_offset(100000, 10)
jpayne@69	316 True
jpayne@69	317
jpayne@69	318 >>> make_virtual_offset(2**48, 0)
jpayne@69	319 Traceback (most recent call last):
jpayne@69	320 ...
jpayne@69	321 ValueError: Require 0 <= block_start_offset < 2**48, got 281474976710656
jpayne@69	322
jpayne@69	323 """
jpayne@69	324 if within_block_offset < 0 or within_block_offset >= 65536:
jpayne@69	325 raise ValueError(
jpayne@69	326 "Require 0 <= within_block_offset < 2**16, got %i" % within_block_offset
jpayne@69	327 )
jpayne@69	328 if block_start_offset < 0 or block_start_offset >= 281474976710656:
jpayne@69	329 raise ValueError(
jpayne@69	330 "Require 0 <= block_start_offset < 2**48, got %i" % block_start_offset
jpayne@69	331 )
jpayne@69	332 return (block_start_offset << 16) \| within_block_offset
jpayne@69	333
jpayne@69	334
jpayne@69	335 def split_virtual_offset(virtual_offset):
jpayne@69	336 """Divides a 64-bit BGZF virtual offset into block start & within block offsets.
jpayne@69	337
jpayne@69	338 >>> (100000, 0) == split_virtual_offset(6553600000)
jpayne@69	339 True
jpayne@69	340 >>> (100000, 10) == split_virtual_offset(6553600010)
jpayne@69	341 True
jpayne@69	342
jpayne@69	343 """
jpayne@69	344 start = virtual_offset >> 16
jpayne@69	345 return start, virtual_offset ^ (start << 16)
jpayne@69	346
jpayne@69	347
jpayne@69	348 def BgzfBlocks(handle):
jpayne@69	349 """Low level debugging function to inspect BGZF blocks.
jpayne@69	350
jpayne@69	351 Expects a BGZF compressed file opened in binary read mode using
jpayne@69	352 the builtin open function. Do not use a handle from this bgzf
jpayne@69	353 module or the gzip module's open function which will decompress
jpayne@69	354 the file.
jpayne@69	355
jpayne@69	356 Returns the block start offset (see virtual offsets), the block
jpayne@69	357 length (add these for the start of the next block), and the
jpayne@69	358 decompressed length of the blocks contents (limited to 65536 in
jpayne@69	359 BGZF), as an iterator - one tuple per BGZF block.
jpayne@69	360
jpayne@69	361 >>> from builtins import open
jpayne@69	362 >>> handle = open("SamBam/ex1.bam", "rb")
jpayne@69	363 >>> for values in BgzfBlocks(handle):
jpayne@69	364 ... print("Raw start %i, raw length %i; data start %i, data length %i" % values)
jpayne@69	365 Raw start 0, raw length 18239; data start 0, data length 65536
jpayne@69	366 Raw start 18239, raw length 18223; data start 65536, data length 65536
jpayne@69	367 Raw start 36462, raw length 18017; data start 131072, data length 65536
jpayne@69	368 Raw start 54479, raw length 17342; data start 196608, data length 65536
jpayne@69	369 Raw start 71821, raw length 17715; data start 262144, data length 65536
jpayne@69	370 Raw start 89536, raw length 17728; data start 327680, data length 65536
jpayne@69	371 Raw start 107264, raw length 17292; data start 393216, data length 63398
jpayne@69	372 Raw start 124556, raw length 28; data start 456614, data length 0
jpayne@69	373 >>> handle.close()
jpayne@69	374
jpayne@69	375 Indirectly we can tell this file came from an old version of
jpayne@69	376 samtools because all the blocks (except the final one and the
jpayne@69	377 dummy empty EOF marker block) are 65536 bytes. Later versions
jpayne@69	378 avoid splitting a read between two blocks, and give the header
jpayne@69	379 its own block (useful to speed up replacing the header). You
jpayne@69	380 can see this in ex1_refresh.bam created using samtools 0.1.18:
jpayne@69	381
jpayne@69	382 samtools view -b ex1.bam > ex1_refresh.bam
jpayne@69	383
jpayne@69	384 >>> handle = open("SamBam/ex1_refresh.bam", "rb")
jpayne@69	385 >>> for values in BgzfBlocks(handle):
jpayne@69	386 ... print("Raw start %i, raw length %i; data start %i, data length %i" % values)
jpayne@69	387 Raw start 0, raw length 53; data start 0, data length 38
jpayne@69	388 Raw start 53, raw length 18195; data start 38, data length 65434
jpayne@69	389 Raw start 18248, raw length 18190; data start 65472, data length 65409
jpayne@69	390 Raw start 36438, raw length 18004; data start 130881, data length 65483
jpayne@69	391 Raw start 54442, raw length 17353; data start 196364, data length 65519
jpayne@69	392 Raw start 71795, raw length 17708; data start 261883, data length 65411
jpayne@69	393 Raw start 89503, raw length 17709; data start 327294, data length 65466
jpayne@69	394 Raw start 107212, raw length 17390; data start 392760, data length 63854
jpayne@69	395 Raw start 124602, raw length 28; data start 456614, data length 0
jpayne@69	396 >>> handle.close()
jpayne@69	397
jpayne@69	398 The above example has no embedded SAM header (thus the first block
jpayne@69	399 is very small at just 38 bytes of decompressed data), while the next
jpayne@69	400 example does (a larger block of 103 bytes). Notice that the rest of
jpayne@69	401 the blocks show the same sizes (they contain the same read data):
jpayne@69	402
jpayne@69	403 >>> handle = open("SamBam/ex1_header.bam", "rb")
jpayne@69	404 >>> for values in BgzfBlocks(handle):
jpayne@69	405 ... print("Raw start %i, raw length %i; data start %i, data length %i" % values)
jpayne@69	406 Raw start 0, raw length 104; data start 0, data length 103
jpayne@69	407 Raw start 104, raw length 18195; data start 103, data length 65434
jpayne@69	408 Raw start 18299, raw length 18190; data start 65537, data length 65409
jpayne@69	409 Raw start 36489, raw length 18004; data start 130946, data length 65483
jpayne@69	410 Raw start 54493, raw length 17353; data start 196429, data length 65519
jpayne@69	411 Raw start 71846, raw length 17708; data start 261948, data length 65411
jpayne@69	412 Raw start 89554, raw length 17709; data start 327359, data length 65466
jpayne@69	413 Raw start 107263, raw length 17390; data start 392825, data length 63854
jpayne@69	414 Raw start 124653, raw length 28; data start 456679, data length 0
jpayne@69	415 >>> handle.close()
jpayne@69	416
jpayne@69	417 """
jpayne@69	418 if isinstance(handle, BgzfReader):
jpayne@69	419 raise TypeError("Function BgzfBlocks expects a binary handle")
jpayne@69	420 data_start = 0
jpayne@69	421 while True:
jpayne@69	422 start_offset = handle.tell()
jpayne@69	423 try:
jpayne@69	424 block_length, data = _load_bgzf_block(handle)
jpayne@69	425 except StopIteration:
jpayne@69	426 break
jpayne@69	427 data_len = len(data)
jpayne@69	428 yield start_offset, block_length, data_start, data_len
jpayne@69	429 data_start += data_len
jpayne@69	430
jpayne@69	431
jpayne@69	432 def _load_bgzf_block(handle, text_mode=False):
jpayne@69	433 """Load the next BGZF block of compressed data (PRIVATE).
jpayne@69	434
jpayne@69	435 Returns a tuple (block size and data), or at end of file
jpayne@69	436 will raise StopIteration.
jpayne@69	437 """
jpayne@69	438 magic = handle.read(4)
jpayne@69	439 if not magic:
jpayne@69	440 # End of file - should we signal this differently now?
jpayne@69	441 # See https://www.python.org/dev/peps/pep-0479/
jpayne@69	442 raise StopIteration
jpayne@69	443 if magic != _bgzf_magic:
jpayne@69	444 raise ValueError(
jpayne@69	445 r"A BGZF (e.g. a BAM file) block should start with "
jpayne@69	446 r"%r, not %r; handle.tell() now says %r"
jpayne@69	447 % (_bgzf_magic, magic, handle.tell())
jpayne@69	448 )
jpayne@69	449 gzip_mod_time, gzip_extra_flags, gzip_os, extra_len = struct.unpack(
jpayne@69	450 "<LBBH", handle.read(8)
jpayne@69	451 )
jpayne@69	452
jpayne@69	453 block_size = None
jpayne@69	454 x_len = 0
jpayne@69	455 while x_len < extra_len:
jpayne@69	456 subfield_id = handle.read(2)
jpayne@69	457 subfield_len = struct.unpack("<H", handle.read(2))[0] # uint16_t
jpayne@69	458 subfield_data = handle.read(subfield_len)
jpayne@69	459 x_len += subfield_len + 4
jpayne@69	460 if subfield_id == _bytes_BC:
jpayne@69	461 if subfield_len != 2:
jpayne@69	462 raise ValueError("Wrong BC payload length")
jpayne@69	463 if block_size is not None:
jpayne@69	464 raise ValueError("Two BC subfields?")
jpayne@69	465 block_size = struct.unpack("<H", subfield_data)[0] + 1 # uint16_t
jpayne@69	466 if x_len != extra_len:
jpayne@69	467 raise ValueError(f"x_len and extra_len differ {x_len}, {extra_len}")
jpayne@69	468 if block_size is None:
jpayne@69	469 raise ValueError("Missing BC, this isn't a BGZF file!")
jpayne@69	470 # Now comes the compressed data, CRC, and length of uncompressed data.
jpayne@69	471 deflate_size = block_size - 1 - extra_len - 19
jpayne@69	472 d = zlib.decompressobj(-15) # Negative window size means no headers
jpayne@69	473 data = d.decompress(handle.read(deflate_size)) + d.flush()
jpayne@69	474 expected_crc = handle.read(4)
jpayne@69	475 expected_size = struct.unpack("<I", handle.read(4))[0]
jpayne@69	476 if expected_size != len(data):
jpayne@69	477 raise RuntimeError("Decompressed to %i, not %i" % (len(data), expected_size))
jpayne@69	478 # Should cope with a mix of Python platforms...
jpayne@69	479 crc = zlib.crc32(data)
jpayne@69	480 if crc < 0:
jpayne@69	481 crc = struct.pack("<i", crc)
jpayne@69	482 else:
jpayne@69	483 crc = struct.pack("<I", crc)
jpayne@69	484 if expected_crc != crc:
jpayne@69	485 raise RuntimeError(f"CRC is {crc}, not {expected_crc}")
jpayne@69	486 if text_mode:
jpayne@69	487 # Note ISO-8859-1 aka Latin-1 preserves first 256 chars
jpayne@69	488 # (i.e. ASCII), but critically is a single byte encoding
jpayne@69	489 return block_size, data.decode("latin-1")
jpayne@69	490 else:
jpayne@69	491 return block_size, data
jpayne@69	492
jpayne@69	493
jpayne@69	494 class BgzfReader:
jpayne@69	495 r"""BGZF reader, acts like a read only handle but seek/tell differ.
jpayne@69	496
jpayne@69	497 Let's use the BgzfBlocks function to have a peek at the BGZF blocks
jpayne@69	498 in an example BAM file,
jpayne@69	499
jpayne@69	500 >>> from builtins import open
jpayne@69	501 >>> handle = open("SamBam/ex1.bam", "rb")
jpayne@69	502 >>> for values in BgzfBlocks(handle):
jpayne@69	503 ... print("Raw start %i, raw length %i; data start %i, data length %i" % values)
jpayne@69	504 Raw start 0, raw length 18239; data start 0, data length 65536
jpayne@69	505 Raw start 18239, raw length 18223; data start 65536, data length 65536
jpayne@69	506 Raw start 36462, raw length 18017; data start 131072, data length 65536
jpayne@69	507 Raw start 54479, raw length 17342; data start 196608, data length 65536
jpayne@69	508 Raw start 71821, raw length 17715; data start 262144, data length 65536
jpayne@69	509 Raw start 89536, raw length 17728; data start 327680, data length 65536
jpayne@69	510 Raw start 107264, raw length 17292; data start 393216, data length 63398
jpayne@69	511 Raw start 124556, raw length 28; data start 456614, data length 0
jpayne@69	512 >>> handle.close()
jpayne@69	513
jpayne@69	514 Now let's see how to use this block information to jump to
jpayne@69	515 specific parts of the decompressed BAM file:
jpayne@69	516
jpayne@69	517 >>> handle = BgzfReader("SamBam/ex1.bam", "rb")
jpayne@69	518 >>> assert 0 == handle.tell()
jpayne@69	519 >>> magic = handle.read(4)
jpayne@69	520 >>> assert 4 == handle.tell()
jpayne@69	521
jpayne@69	522 So far nothing so strange, we got the magic marker used at the
jpayne@69	523 start of a decompressed BAM file, and the handle position makes
jpayne@69	524 sense. Now however, let's jump to the end of this block and 4
jpayne@69	525 bytes into the next block by reading 65536 bytes,
jpayne@69	526
jpayne@69	527 >>> data = handle.read(65536)
jpayne@69	528 >>> len(data)
jpayne@69	529 65536
jpayne@69	530 >>> assert 1195311108 == handle.tell()
jpayne@69	531
jpayne@69	532 Expecting 4 + 65536 = 65540 were you? Well this is a BGZF 64-bit
jpayne@69	533 virtual offset, which means:
jpayne@69	534
jpayne@69	535 >>> split_virtual_offset(1195311108)
jpayne@69	536 (18239, 4)
jpayne@69	537
jpayne@69	538 You should spot 18239 as the start of the second BGZF block, while
jpayne@69	539 the 4 is the offset into this block. See also make_virtual_offset,
jpayne@69	540
jpayne@69	541 >>> make_virtual_offset(18239, 4)
jpayne@69	542 1195311108
jpayne@69	543
jpayne@69	544 Let's jump back to almost the start of the file,
jpayne@69	545
jpayne@69	546 >>> make_virtual_offset(0, 2)
jpayne@69	547 2
jpayne@69	548 >>> handle.seek(2)
jpayne@69	549 2
jpayne@69	550 >>> handle.close()
jpayne@69	551
jpayne@69	552 Note that you can use the max_cache argument to limit the number of
jpayne@69	553 BGZF blocks cached in memory. The default is 100, and since each
jpayne@69	554 block can be up to 64kb, the default cache could take up to 6MB of
jpayne@69	555 RAM. The cache is not important for reading through the file in one
jpayne@69	556 pass, but is important for improving performance of random access.
jpayne@69	557 """
jpayne@69	558
jpayne@69	559 def __init__(self, filename=None, mode="r", fileobj=None, max_cache=100):
jpayne@69	560 r"""Initialize the class for reading a BGZF file.
jpayne@69	561
jpayne@69	562 You would typically use the top level ``bgzf.open(...)`` function
jpayne@69	563 which will call this class internally. Direct use is discouraged.
jpayne@69	564
jpayne@69	565 Either the ``filename`` (string) or ``fileobj`` (input file object in
jpayne@69	566 binary mode) arguments must be supplied, but not both.
jpayne@69	567
jpayne@69	568 Argument ``mode`` controls if the data will be returned as strings in
jpayne@69	569 text mode ("rt", "tr", or default "r"), or bytes binary mode ("rb"
jpayne@69	570 or "br"). The argument name matches the built-in ``open(...)`` and
jpayne@69	571 standard library ``gzip.open(...)`` function.
jpayne@69	572
jpayne@69	573 If text mode is requested, in order to avoid multi-byte characters,
jpayne@69	574 this is hard coded to use the "latin1" encoding, and "\r" and "\n"
jpayne@69	575 are passed as is (without implementing universal new line mode). There
jpayne@69	576 is no ``encoding`` argument.
jpayne@69	577
jpayne@69	578 If your data is in UTF-8 or any other incompatible encoding, you must
jpayne@69	579 use binary mode, and decode the appropriate fragments yourself.
jpayne@69	580
jpayne@69	581 Argument ``max_cache`` controls the maximum number of BGZF blocks to
jpayne@69	582 cache in memory. Each can be up to 64kb thus the default of 100 blocks
jpayne@69	583 could take up to 6MB of RAM. This is important for efficient random
jpayne@69	584 access, a small value is fine for reading the file in one pass.
jpayne@69	585 """
jpayne@69	586 # TODO - Assuming we can seek, check for 28 bytes EOF empty block
jpayne@69	587 # and if missing warn about possible truncation (as in samtools)?
jpayne@69	588 if max_cache < 1:
jpayne@69	589 raise ValueError("Use max_cache with a minimum of 1")
jpayne@69	590 # Must open the BGZF file in binary mode, but we may want to
jpayne@69	591 # treat the contents as either text or binary (unicode or
jpayne@69	592 # bytes under Python 3)
jpayne@69	593 if filename and fileobj:
jpayne@69	594 raise ValueError("Supply either filename or fileobj, not both")
jpayne@69	595 # Want to reject output modes like w, a, x, +
jpayne@69	596 if mode.lower() not in ("r", "tr", "rt", "rb", "br"):
jpayne@69	597 raise ValueError(
jpayne@69	598 "Must use a read mode like 'r' (default), 'rt', or 'rb' for binary"
jpayne@69	599 )
jpayne@69	600 # If an open file was passed, make sure it was opened in binary mode.
jpayne@69	601 if fileobj:
jpayne@69	602 if fileobj.read(0) != b"":
jpayne@69	603 raise ValueError("fileobj not opened in binary mode")
jpayne@69	604 handle = fileobj
jpayne@69	605 else:
jpayne@69	606 handle = _open(filename, "rb")
jpayne@69	607 self._text = "b" not in mode.lower()
jpayne@69	608 if self._text:
jpayne@69	609 self._newline = "\n"
jpayne@69	610 else:
jpayne@69	611 self._newline = b"\n"
jpayne@69	612 self._handle = handle
jpayne@69	613 self.max_cache = max_cache
jpayne@69	614 self._buffers = {}
jpayne@69	615 self._block_start_offset = None
jpayne@69	616 self._block_raw_length = None
jpayne@69	617 self._load_block(handle.tell())
jpayne@69	618
jpayne@69	619 def _load_block(self, start_offset=None):
jpayne@69	620 if start_offset is None:
jpayne@69	621 # If the file is being read sequentially, then _handle.tell()
jpayne@69	622 # should be pointing at the start of the next block.
jpayne@69	623 # However, if seek has been used, we can't assume that.
jpayne@69	624 start_offset = self._block_start_offset + self._block_raw_length
jpayne@69	625 if start_offset == self._block_start_offset:
jpayne@69	626 self._within_block_offset = 0
jpayne@69	627 return
jpayne@69	628 elif start_offset in self._buffers:
jpayne@69	629 # Already in cache
jpayne@69	630 self._buffer, self._block_raw_length = self._buffers[start_offset]
jpayne@69	631 self._within_block_offset = 0
jpayne@69	632 self._block_start_offset = start_offset
jpayne@69	633 return
jpayne@69	634 # Must hit the disk... first check cache limits,
jpayne@69	635 while len(self._buffers) >= self.max_cache:
jpayne@69	636 # TODO - Implement LRU cache removal?
jpayne@69	637 self._buffers.popitem()
jpayne@69	638 # Now load the block
jpayne@69	639 handle = self._handle
jpayne@69	640 if start_offset is not None:
jpayne@69	641 handle.seek(start_offset)
jpayne@69	642 self._block_start_offset = handle.tell()
jpayne@69	643 try:
jpayne@69	644 block_size, self._buffer = _load_bgzf_block(handle, self._text)
jpayne@69	645 except StopIteration:
jpayne@69	646 # EOF
jpayne@69	647 block_size = 0
jpayne@69	648 if self._text:
jpayne@69	649 self._buffer = ""
jpayne@69	650 else:
jpayne@69	651 self._buffer = b""
jpayne@69	652 self._within_block_offset = 0
jpayne@69	653 self._block_raw_length = block_size
jpayne@69	654 # Finally save the block in our cache,
jpayne@69	655 self._buffers[self._block_start_offset] = self._buffer, block_size
jpayne@69	656
jpayne@69	657 def tell(self):
jpayne@69	658 """Return a 64-bit unsigned BGZF virtual offset."""
jpayne@69	659 if 0 < self._within_block_offset and self._within_block_offset == len(
jpayne@69	660 self._buffer
jpayne@69	661 ):
jpayne@69	662 # Special case where we're right at the end of a (non empty) block.
jpayne@69	663 # For non-maximal blocks could give two possible virtual offsets,
jpayne@69	664 # but for a maximal block can't use 65536 as the within block
jpayne@69	665 # offset. Therefore for consistency, use the next block and a
jpayne@69	666 # within block offset of zero.
jpayne@69	667 return (self._block_start_offset + self._block_raw_length) << 16
jpayne@69	668 else:
jpayne@69	669 # return make_virtual_offset(self._block_start_offset,
jpayne@69	670 # self._within_block_offset)
jpayne@69	671 # TODO - Include bounds checking as in make_virtual_offset?
jpayne@69	672 return (self._block_start_offset << 16) \| self._within_block_offset
jpayne@69	673
jpayne@69	674 def seek(self, virtual_offset):
jpayne@69	675 """Seek to a 64-bit unsigned BGZF virtual offset."""
jpayne@69	676 # Do this inline to avoid a function call,
jpayne@69	677 # start_offset, within_block = split_virtual_offset(virtual_offset)
jpayne@69	678 start_offset = virtual_offset >> 16
jpayne@69	679 within_block = virtual_offset ^ (start_offset << 16)
jpayne@69	680 if start_offset != self._block_start_offset:
jpayne@69	681 # Don't need to load the block if already there
jpayne@69	682 # (this avoids a function call since _load_block would do nothing)
jpayne@69	683 self._load_block(start_offset)
jpayne@69	684 if start_offset != self._block_start_offset:
jpayne@69	685 raise ValueError("start_offset not loaded correctly")
jpayne@69	686 if within_block > len(self._buffer):
jpayne@69	687 if not (within_block == 0 and len(self._buffer) == 0):
jpayne@69	688 raise ValueError(
jpayne@69	689 "Within offset %i but block size only %i"
jpayne@69	690 % (within_block, len(self._buffer))
jpayne@69	691 )
jpayne@69	692 self._within_block_offset = within_block
jpayne@69	693 # assert virtual_offset == self.tell(), \
jpayne@69	694 # "Did seek to %i (%i, %i), but tell says %i (%i, %i)" \
jpayne@69	695 # % (virtual_offset, start_offset, within_block,
jpayne@69	696 # self.tell(), self._block_start_offset,
jpayne@69	697 # self._within_block_offset)
jpayne@69	698 return virtual_offset
jpayne@69	699
jpayne@69	700 def read(self, size=-1):
jpayne@69	701 """Read method for the BGZF module."""
jpayne@69	702 if size < 0:
jpayne@69	703 raise NotImplementedError("Don't be greedy, that could be massive!")
jpayne@69	704
jpayne@69	705 result = "" if self._text else b""
jpayne@69	706 while size and self._block_raw_length:
jpayne@69	707 if self._within_block_offset + size <= len(self._buffer):
jpayne@69	708 # This may leave us right at the end of a block
jpayne@69	709 # (lazy loading, don't load the next block unless we have too)
jpayne@69	710 data = self._buffer[
jpayne@69	711 self._within_block_offset : self._within_block_offset + size
jpayne@69	712 ]
jpayne@69	713 self._within_block_offset += size
jpayne@69	714 if not data:
jpayne@69	715 raise ValueError("Must be at least 1 byte")
jpayne@69	716 result += data
jpayne@69	717 break
jpayne@69	718 else:
jpayne@69	719 data = self._buffer[self._within_block_offset :]
jpayne@69	720 size -= len(data)
jpayne@69	721 self._load_block() # will reset offsets
jpayne@69	722 result += data
jpayne@69	723
jpayne@69	724 return result
jpayne@69	725
jpayne@69	726 def readline(self):
jpayne@69	727 """Read a single line for the BGZF file."""
jpayne@69	728 result = "" if self._text else b""
jpayne@69	729 while self._block_raw_length:
jpayne@69	730 i = self._buffer.find(self._newline, self._within_block_offset)
jpayne@69	731 # Three cases to consider,
jpayne@69	732 if i == -1:
jpayne@69	733 # No newline, need to read in more data
jpayne@69	734 data = self._buffer[self._within_block_offset :]
jpayne@69	735 self._load_block() # will reset offsets
jpayne@69	736 result += data
jpayne@69	737 elif i + 1 == len(self._buffer):
jpayne@69	738 # Found new line, but right at end of block (SPECIAL)
jpayne@69	739 data = self._buffer[self._within_block_offset :]
jpayne@69	740 # Must now load the next block to ensure tell() works
jpayne@69	741 self._load_block() # will reset offsets
jpayne@69	742 if not data:
jpayne@69	743 raise ValueError("Must be at least 1 byte")
jpayne@69	744 result += data
jpayne@69	745 break
jpayne@69	746 else:
jpayne@69	747 # Found new line, not at end of block (easy case, no IO)
jpayne@69	748 data = self._buffer[self._within_block_offset : i + 1]
jpayne@69	749 self._within_block_offset = i + 1
jpayne@69	750 # assert data.endswith(self._newline)
jpayne@69	751 result += data
jpayne@69	752 break
jpayne@69	753
jpayne@69	754 return result
jpayne@69	755
jpayne@69	756 def __next__(self):
jpayne@69	757 """Return the next line."""
jpayne@69	758 line = self.readline()
jpayne@69	759 if not line:
jpayne@69	760 raise StopIteration
jpayne@69	761 return line
jpayne@69	762
jpayne@69	763 def __iter__(self):
jpayne@69	764 """Iterate over the lines in the BGZF file."""
jpayne@69	765 return self
jpayne@69	766
jpayne@69	767 def close(self):
jpayne@69	768 """Close BGZF file."""
jpayne@69	769 self._handle.close()
jpayne@69	770 self._buffer = None
jpayne@69	771 self._block_start_offset = None
jpayne@69	772 self._buffers = None
jpayne@69	773
jpayne@69	774 def seekable(self):
jpayne@69	775 """Return True indicating the BGZF supports random access."""
jpayne@69	776 return True
jpayne@69	777
jpayne@69	778 def isatty(self):
jpayne@69	779 """Return True if connected to a TTY device."""
jpayne@69	780 return False
jpayne@69	781
jpayne@69	782 def fileno(self):
jpayne@69	783 """Return integer file descriptor."""
jpayne@69	784 return self._handle.fileno()
jpayne@69	785
jpayne@69	786 def __enter__(self):
jpayne@69	787 """Open a file operable with WITH statement."""
jpayne@69	788 return self
jpayne@69	789
jpayne@69	790 def __exit__(self, type, value, traceback):
jpayne@69	791 """Close a file with WITH statement."""
jpayne@69	792 self.close()
jpayne@69	793
jpayne@69	794
jpayne@69	795 class BgzfWriter:
jpayne@69	796 """Define a BGZFWriter object."""
jpayne@69	797
jpayne@69	798 def __init__(self, filename=None, mode="w", fileobj=None, compresslevel=6):
jpayne@69	799 """Initilize the class."""
jpayne@69	800 if filename and fileobj:
jpayne@69	801 raise ValueError("Supply either filename or fileobj, not both")
jpayne@69	802 # If an open file was passed, make sure it was opened in binary mode.
jpayne@69	803 if fileobj:
jpayne@69	804 if fileobj.read(0) != b"":
jpayne@69	805 raise ValueError("fileobj not opened in binary mode")
jpayne@69	806 handle = fileobj
jpayne@69	807 else:
jpayne@69	808 if "w" not in mode.lower() and "a" not in mode.lower():
jpayne@69	809 raise ValueError(f"Must use write or append mode, not {mode!r}")
jpayne@69	810 if "a" in mode.lower():
jpayne@69	811 handle = _open(filename, "ab")
jpayne@69	812 else:
jpayne@69	813 handle = _open(filename, "wb")
jpayne@69	814 self._text = "b" not in mode.lower()
jpayne@69	815 self._handle = handle
jpayne@69	816 self._buffer = b""
jpayne@69	817 self.compresslevel = compresslevel
jpayne@69	818
jpayne@69	819 def _write_block(self, block):
jpayne@69	820 """Write provided data to file as a single BGZF compressed block (PRIVATE)."""
jpayne@69	821 # print("Saving %i bytes" % len(block))
jpayne@69	822 if len(block) > 65536:
jpayne@69	823 raise ValueError(f"{len(block)} Block length > 65536")
jpayne@69	824 # Giving a negative window bits means no gzip/zlib headers,
jpayne@69	825 # -15 used in samtools
jpayne@69	826 c = zlib.compressobj(
jpayne@69	827 self.compresslevel, zlib.DEFLATED, -15, zlib.DEF_MEM_LEVEL, 0
jpayne@69	828 )
jpayne@69	829 compressed = c.compress(block) + c.flush()
jpayne@69	830 del c
jpayne@69	831 if len(compressed) > 65536:
jpayne@69	832 raise RuntimeError(
jpayne@69	833 "TODO - Didn't compress enough, try less data in this block"
jpayne@69	834 )
jpayne@69	835 crc = zlib.crc32(block)
jpayne@69	836 # Should cope with a mix of Python platforms...
jpayne@69	837 if crc < 0:
jpayne@69	838 crc = struct.pack("<i", crc)
jpayne@69	839 else:
jpayne@69	840 crc = struct.pack("<I", crc)
jpayne@69	841 bsize = struct.pack("<H", len(compressed) + 25) # includes -1
jpayne@69	842 crc = struct.pack("<I", zlib.crc32(block) & 0xFFFFFFFF)
jpayne@69	843 uncompressed_length = struct.pack("<I", len(block))
jpayne@69	844 # Fixed 16 bytes,
jpayne@69	845 # gzip magic bytes (4) mod time (4),
jpayne@69	846 # gzip flag (1), os (1), extra length which is six (2),
jpayne@69	847 # sub field which is BC (2), sub field length of two (2),
jpayne@69	848 # Variable data,
jpayne@69	849 # 2 bytes: block length as BC sub field (2)
jpayne@69	850 # X bytes: the data
jpayne@69	851 # 8 bytes: crc (4), uncompressed data length (4)
jpayne@69	852 data = _bgzf_header + bsize + compressed + crc + uncompressed_length
jpayne@69	853 self._handle.write(data)
jpayne@69	854
jpayne@69	855 def write(self, data):
jpayne@69	856 """Write method for the class."""
jpayne@69	857 # TODO - Check bytes vs unicode
jpayne@69	858 if isinstance(data, str):
jpayne@69	859 # When reading we can't cope with multi-byte characters
jpayne@69	860 # being split between BGZF blocks, so we restrict to a
jpayne@69	861 # single byte encoding - like ASCII or latin-1.
jpayne@69	862 # On output we could probably allow any encoding, as we
jpayne@69	863 # don't care about splitting unicode characters between blocks
jpayne@69	864 data = data.encode("latin-1")
jpayne@69	865 # block_size = 2**16 = 65536
jpayne@69	866 data_len = len(data)
jpayne@69	867 if len(self._buffer) + data_len < 65536:
jpayne@69	868 # print("Cached %r" % data)
jpayne@69	869 self._buffer += data
jpayne@69	870 else:
jpayne@69	871 # print("Got %r, writing out some data..." % data)
jpayne@69	872 self._buffer += data
jpayne@69	873 while len(self._buffer) >= 65536:
jpayne@69	874 self._write_block(self._buffer[:65536])
jpayne@69	875 self._buffer = self._buffer[65536:]
jpayne@69	876
jpayne@69	877 def flush(self):
jpayne@69	878 """Flush data explicitally."""
jpayne@69	879 while len(self._buffer) >= 65536:
jpayne@69	880 self._write_block(self._buffer[:65535])
jpayne@69	881 self._buffer = self._buffer[65535:]
jpayne@69	882 self._write_block(self._buffer)
jpayne@69	883 self._buffer = b""
jpayne@69	884 self._handle.flush()
jpayne@69	885
jpayne@69	886 def close(self):
jpayne@69	887 """Flush data, write 28 bytes BGZF EOF marker, and close BGZF file.
jpayne@69	888
jpayne@69	889 samtools will look for a magic EOF marker, just a 28 byte empty BGZF
jpayne@69	890 block, and if it is missing warns the BAM file may be truncated. In
jpayne@69	891 addition to samtools writing this block, so too does bgzip - so this
jpayne@69	892 implementation does too.
jpayne@69	893 """
jpayne@69	894 if self._buffer:
jpayne@69	895 self.flush()
jpayne@69	896 self._handle.write(_bgzf_eof)
jpayne@69	897 self._handle.flush()
jpayne@69	898 self._handle.close()
jpayne@69	899
jpayne@69	900 def tell(self):
jpayne@69	901 """Return a BGZF 64-bit virtual offset."""
jpayne@69	902 return make_virtual_offset(self._handle.tell(), len(self._buffer))
jpayne@69	903
jpayne@69	904 def seekable(self):
jpayne@69	905 """Return True indicating the BGZF supports random access."""
jpayne@69	906 # Not seekable, but we do support tell...
jpayne@69	907 return False
jpayne@69	908
jpayne@69	909 def isatty(self):
jpayne@69	910 """Return True if connected to a TTY device."""
jpayne@69	911 return False
jpayne@69	912
jpayne@69	913 def fileno(self):
jpayne@69	914 """Return integer file descriptor."""
jpayne@69	915 return self._handle.fileno()
jpayne@69	916
jpayne@69	917 def __enter__(self):
jpayne@69	918 """Open a file operable with WITH statement."""
jpayne@69	919 return self
jpayne@69	920
jpayne@69	921 def __exit__(self, type, value, traceback):
jpayne@69	922 """Close a file with WITH statement."""
jpayne@69	923 self.close()
jpayne@69	924
jpayne@69	925
jpayne@69	926 if __name__ == "__main__":
jpayne@69	927 if len(sys.argv) > 1:
jpayne@69	928 print("Call this with no arguments and pipe uncompressed data in on stdin")
jpayne@69	929 print("and it will produce BGZF compressed data on stdout. e.g.")
jpayne@69	930 print("")
jpayne@69	931 print("./bgzf.py < example.fastq > example.fastq.bgz")
jpayne@69	932 print("")
jpayne@69	933 print("The extension convention of .bgz is to distinugish these from .gz")
jpayne@69	934 print("used for standard gzipped files without the block structure of BGZF.")
jpayne@69	935 print("You can use the standard gunzip command to decompress BGZF files,")
jpayne@69	936 print("if it complains about the extension try something like this:")
jpayne@69	937 print("")
jpayne@69	938 print("cat example.fastq.bgz \| gunzip > example.fastq")
jpayne@69	939 print("")
jpayne@69	940 print("See also the tool bgzip that comes with samtools")
jpayne@69	941 sys.exit(0)
jpayne@69	942
jpayne@69	943 # Ensure we have binary mode handles
jpayne@69	944 # (leave stderr as default text mode)
jpayne@69	945 stdin = sys.stdin.buffer
jpayne@69	946 stdout = sys.stdout.buffer
jpayne@69	947
jpayne@69	948 sys.stderr.write("Producing BGZF output from stdin...\n")
jpayne@69	949 w = BgzfWriter(fileobj=stdout)
jpayne@69	950 while True:
jpayne@69	951 data = stdin.read(65536)
jpayne@69	952 w.write(data)
jpayne@69	953 if not data:
jpayne@69	954 break
jpayne@69	955 # Doing close will write an empty BGZF block as EOF marker:
jpayne@69	956 w.close()
jpayne@69	957 sys.stderr.write("BGZF data produced\n")

Mercurial > repos > rliterman > csp2

annotate CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/lib/python3.8/site-packages/Bio/bgzf.py @ 69:33d812a61356