CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/lib/python3.8/site-packages/Bio/bgzf.py @ 69:33d812a61356

planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author: jpayne
date: Tue, 18 Mar 2025 17:55:14 -0400
#!/usr/bin/env python
# Copyright 2010-2018 by Peter Cock.
# All rights reserved.
#
# This file is part of the Biopython distribution and governed by your
# choice of the "Biopython License Agreement" or the "BSD 3-Clause License".
# Please see the LICENSE file that should have been included as part of this
# package.
r"""Read and write BGZF compressed files (the GZIP variant used in BAM).

The SAM/BAM file format (Sequence Alignment/Map) comes in a plain text
format (SAM), and a compressed binary format (BAM). The latter uses a
modified form of gzip compression called BGZF (Blocked GNU Zip Format),
which can be applied to any file format to provide compression with
efficient random access. BGZF is described together with the SAM/BAM
file format at https://samtools.sourceforge.net/SAM1.pdf

Please read the text below about 'virtual offsets' before using BGZF
files for random access.

Aim of this module
------------------
The Python gzip library can be used to read BGZF files, since for
decompression they are just (specialised) gzip files. What this
module aims to facilitate is random access to BGZF files (using the
'virtual offset' idea), and writing BGZF files (which means using
suitably sized gzip blocks and writing the extra 'BC' field in the
gzip headers). As in the gzip library, the zlib library is used
internally.

In addition to being required for random access to and writing of
BAM files, the BGZF format can also be used on other sequential
data (in the sense of one record after another), such as most of
the sequence data formats supported in Bio.SeqIO (like FASTA,
FASTQ, GenBank, etc) or large MAF alignments.

The Bio.SeqIO indexing functions use this module to support BGZF files.

Technical Introduction to BGZF
------------------------------
The gzip file format allows multiple compressed blocks, each of which
could be a stand-alone gzip file. As an interesting bonus, this means
you can use Unix ``cat`` to combine two or more gzip files into one by
concatenating them. Also, each block can have one of several compression
levels (including uncompressed, which actually takes up a little bit
more space due to the gzip header).

What the BAM designers realised was that while random access to data
stored in traditional gzip files was slow, breaking the file into
gzip blocks would allow fast random access to each block. To access
a particular piece of the decompressed data, you just need to know
which block it starts in (the offset of the gzip block start), and
how far into the (decompressed) contents of the block you need to
read.

One problem with this is finding the gzip block sizes efficiently.
You can do it with a standard gzip file, but it requires every block
to be decompressed -- and that would be rather slow. Additionally,
typical gzip files may use very large blocks.

All that differs in BGZF is that the compressed size of each gzip block
is limited to 2^16 bytes, and an extra 'BC' field in the gzip header
records this size. Traditional decompression tools can ignore this,
and unzip the file just like any other gzip file.

The point of this is you can look at the first BGZF block, find out
how big it is from this 'BC' header, and thus seek immediately to
the second block, and so on.

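As a sketch of that trick (plain Python, not a Biopython API; it assumes
the layout written by this module, where the 'BC' subfield is the only
gzip extra subfield, so XLEN is 6 and the two-byte BSIZE value sits at
bytes 16-17 of each block; robust code should scan the subfields):

```python
import struct
from io import BytesIO

def block_offsets(handle):
    """Yield (raw offset, raw size) of each BGZF block, no decompression."""
    while True:
        start = handle.tell()
        header = handle.read(18)
        if len(header) < 18:
            return  # end of file
        # BSIZE is the total block size minus one, little-endian uint16
        bsize = struct.unpack("<H", header[16:18])[0] + 1
        yield start, bsize
        handle.seek(start + bsize)  # jump straight to the next block

# The 28-byte BGZF EOF marker is itself a valid (empty) block:
eof_block = (b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00BC"
             b"\x02\x00\x1b\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00")
print(list(block_offsets(BytesIO(eof_block * 2))))  # [(0, 28), (28, 28)]
```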
The BAM indexing scheme records read positions using a 64 bit
'virtual offset', comprising ``coffset << 16 | uoffset``, where ``coffset``
is the file offset of the BGZF block containing the start of the read
(unsigned integer using up to 64-16 = 48 bits), and ``uoffset`` is the
offset within the (decompressed) block (unsigned 16 bit integer).

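In plain Python this packing and unpacking is just shifts and masks - a
minimal sketch (the helper names here are illustrative, not part of this
module):

```python
def pack_voffset(coffset, uoffset):
    # coffset: raw file offset of the block (up to 48 bits)
    # uoffset: offset into the decompressed block (16 bits)
    assert 0 <= coffset < 1 << 48 and 0 <= uoffset < 1 << 16
    return (coffset << 16) | uoffset

def unpack_voffset(voffset):
    # The low 16 bits are the within-block offset, the rest the block start.
    return voffset >> 16, voffset & 0xFFFF

voffset = pack_voffset(55074, 126)
print(voffset)                  # 3609329790
print(unpack_voffset(voffset))  # (55074, 126)
```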
This limits you to BAM files where the last block starts by 2^48
bytes, or 256 petabytes, and the decompressed size of each block
is at most 2^16 bytes, or 64kb. Note that this matches the BGZF
'BC' field size which limits the compressed size of each block to
2^16 bytes, allowing for BAM files to use BGZF with no gzip
compression (useful for intermediate files in memory to reduce
CPU load).

Warning about namespaces
------------------------
It is considered a bad idea to use "from XXX import ``*``" in Python, because
it pollutes the namespace. This is a real issue with Bio.bgzf (and the
standard Python library gzip) because they contain a function called
``open``. Suppose you do this:

>>> from Bio.bgzf import *
>>> print(open.__module__)
Bio.bgzf

Or,

>>> from gzip import *
>>> print(open.__module__)
gzip

Notice that the open function has been replaced. You can "fix" this if you
need to by importing the built-in open function:

>>> from builtins import open

However, what we recommend instead is to use the explicit namespace, e.g.

>>> from Bio import bgzf
>>> print(bgzf.open.__module__)
Bio.bgzf


Examples
--------
This is an ordinary GenBank file compressed using BGZF, so it can
be decompressed using gzip,

>>> import gzip
>>> handle = gzip.open("GenBank/NC_000932.gb.bgz", "r")
>>> assert 0 == handle.tell()
>>> line = handle.readline()
>>> assert 80 == handle.tell()
>>> line = handle.readline()
>>> assert 143 == handle.tell()
>>> data = handle.read(70000)
>>> assert 70143 == handle.tell()
>>> handle.close()

We can also access the file using the BGZF reader - but pay
attention to the file offsets which will be explained below:

>>> handle = BgzfReader("GenBank/NC_000932.gb.bgz", "r")
>>> assert 0 == handle.tell()
>>> print(handle.readline().rstrip())
LOCUS       NC_000932             154478 bp    DNA     circular PLN 15-APR-2009
>>> assert 80 == handle.tell()
>>> print(handle.readline().rstrip())
DEFINITION  Arabidopsis thaliana chloroplast, complete genome.
>>> assert 143 == handle.tell()
>>> data = handle.read(70000)
>>> assert 987828735 == handle.tell()
>>> print(handle.readline().rstrip())
f="GeneID:844718"
>>> print(handle.readline().rstrip())
     CDS             complement(join(84337..84771,85454..85843))
>>> offset = handle.seek(make_virtual_offset(55074, 126))
>>> print(handle.readline().rstrip())
    68521 tatgtcattc gaaattgtat aaagacaact cctatttaat agagctattt gtgcaagtat
>>> handle.close()

Notice the handle's offset looks different as a BGZF file. This
brings us to the key point about BGZF, which is the block structure:

>>> handle = open("GenBank/NC_000932.gb.bgz", "rb")
>>> for values in BgzfBlocks(handle):
...     print("Raw start %i, raw length %i; data start %i, data length %i" % values)
Raw start 0, raw length 15073; data start 0, data length 65536
Raw start 15073, raw length 17857; data start 65536, data length 65536
Raw start 32930, raw length 22144; data start 131072, data length 65536
Raw start 55074, raw length 22230; data start 196608, data length 65536
Raw start 77304, raw length 14939; data start 262144, data length 43478
Raw start 92243, raw length 28; data start 305622, data length 0
>>> handle.close()

In this example the first four blocks are 'full' and hold 65536 bytes
of uncompressed data. The fifth block isn't full and holds 43478 bytes.
Finally there is a special empty sixth block which takes 28 bytes on
disk and serves as an 'end of file' (EOF) marker. If this is missing,
it is possible your BGZF file is incomplete.

By reading ahead 70,000 bytes we moved into the second BGZF block,
and at that point the BGZF virtual offsets start to look different
to a simple offset into the decompressed data as exposed by the gzip
library.

As an example, consider seeking to the decompressed position 196734.
Since 196734 = 65536 + 65536 + 65536 + 126 = 65536*3 + 126, this
is equivalent to jumping the first three blocks (which in this
specific example are all size 65536 after decompression - which
does not always hold) and starting at byte 126 of the fourth block
(after decompression). For BGZF, we need to know the fourth block's
offset of 55074 and the offset within the block of 126 to get the
BGZF virtual offset.

>>> print(55074 << 16 | 126)
3609329790
>>> print(bgzf.make_virtual_offset(55074, 126))
3609329790

Thus for this BGZF file, decompressed position 196734 corresponds
to the virtual offset 3609329790. However, another BGZF file with
different contents would have compressed more or less efficiently,
so the compressed blocks would be different sizes. What this means
is the mapping between the uncompressed offset and the compressed
virtual offset depends on the BGZF file you are using.

If you are accessing a BGZF file via this module, just use the
handle.tell() method to note the virtual offset of a position you
may later want to return to using handle.seek().

The catch with BGZF virtual offsets is while they can be compared
(which offset comes first in the file), you cannot safely subtract
them to get the size of the data between them, nor add/subtract
a relative offset.
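To see why, a small sketch using the block table from the example above:
the last byte of the third block and the first byte of the fourth block
are adjacent in the decompressed data, yet their virtual offsets differ
by over a billion:

```python
# Adjacent decompressed bytes, wildly different virtual offsets
# (block offsets 32930 and 55074 taken from the example above).
last_byte_block3 = (32930 << 16) | 65535  # decompressed position 196607
first_byte_block4 = (55074 << 16) | 0     # decompressed position 196608
print(first_byte_block4 - last_byte_block3)  # 1451163649, not 1!
```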

Of course you can parse this file with Bio.SeqIO using BgzfReader,
although there isn't any benefit over using gzip.open(...), unless
you want to index BGZF compressed sequence files:

>>> from Bio import SeqIO
>>> handle = BgzfReader("GenBank/NC_000932.gb.bgz")
>>> record = SeqIO.read(handle, "genbank")
>>> handle.close()
>>> print(record.id)
NC_000932.1

Text Mode
---------

Like the standard library gzip.open(...), the BGZF code defaults to opening
files in binary mode.

You can request the file be opened in text mode, but beware that this is hard
coded to the simple "latin1" (aka "iso-8859-1") encoding (which includes all
the ASCII characters), which works well with most Western European languages.
However, it is not fully compatible with the more widely used UTF-8 encoding.

In variable width encodings like UTF-8, some single characters in the unicode
text output are represented by multiple bytes in the raw binary form. This is
problematic with BGZF, as we cannot always decode each block in isolation - a
single unicode character could be split over two blocks. This can even happen
with fixed width unicode encodings, as the BGZF block size is not fixed.

Therefore, this module is currently restricted to only support single byte
unicode encodings, such as ASCII, "latin1" (which is a superset of ASCII), or
potentially other character maps (not implemented).

Furthermore, unlike the default text mode on Python 3, we do not attempt to
implement universal new line mode. This transforms the various operating system
new line conventions like Windows (CR LF or "\r\n"), Unix (just LF, "\n"), or
old Macs (just CR, "\r"), into just LF ("\n"). Here we have the same problem -
is "\r" at the end of a block an incomplete Windows style new line?

Instead, you will get the CR ("\r") and LF ("\n") characters as is.

If your data is in UTF-8 or any other incompatible encoding, you must use
binary mode, and decode the appropriate fragments yourself.
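For example, one way to do that (a sketch of user code, not part of this
module) is to feed the binary chunks through an incremental UTF-8
decoder, which buffers any multi-byte character split across reads or
BGZF block boundaries:

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()
# In practice these chunks would come from handle.read(...) in binary mode;
# here 'café' is deliberately split in the middle of a two-byte character.
chunks = [b"caf\xc3", b"\xa9"]
text = "".join(decoder.decode(chunk) for chunk in chunks)
text += decoder.decode(b"", final=True)  # flush any buffered bytes
print(text)  # café
```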
"""

import struct
import sys
import zlib

from builtins import open as _open

_bgzf_magic = b"\x1f\x8b\x08\x04"
_bgzf_header = b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00\x42\x43\x02\x00"
_bgzf_eof = b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00BC\x02\x00\x1b\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00"
_bytes_BC = b"BC"


def open(filename, mode="rb"):
    r"""Open a BGZF file for reading, writing or appending.

    If text mode is requested, in order to avoid multi-byte characters, this is
    hard coded to use the "latin1" encoding, and "\r" and "\n" are passed as is
    (without implementing universal new line mode).

    If your data is in UTF-8 or any other incompatible encoding, you must use
    binary mode, and decode the appropriate fragments yourself.
    """
    if "r" in mode.lower():
        return BgzfReader(filename, mode)
    elif "w" in mode.lower() or "a" in mode.lower():
        return BgzfWriter(filename, mode)
    else:
        raise ValueError(f"Bad mode {mode!r}")


def make_virtual_offset(block_start_offset, within_block_offset):
    """Compute a BGZF virtual offset from block start and within block offsets.

    The BAM indexing scheme records read positions using a 64 bit
    'virtual offset', comprising in C terms:

    block_start_offset << 16 | within_block_offset

    Here block_start_offset is the file offset of the BGZF block
    start (unsigned integer using up to 64-16 = 48 bits), and
    within_block_offset is the offset within the (decompressed)
    block (unsigned 16 bit integer).

    >>> make_virtual_offset(0, 0)
    0
    >>> make_virtual_offset(0, 1)
    1
    >>> make_virtual_offset(0, 2**16 - 1)
    65535
    >>> make_virtual_offset(0, 2**16)
    Traceback (most recent call last):
    ...
    ValueError: Require 0 <= within_block_offset < 2**16, got 65536

    >>> 65536 == make_virtual_offset(1, 0)
    True
    >>> 65537 == make_virtual_offset(1, 1)
    True
    >>> 131071 == make_virtual_offset(1, 2**16 - 1)
    True

    >>> 6553600000 == make_virtual_offset(100000, 0)
    True
    >>> 6553600001 == make_virtual_offset(100000, 1)
    True
    >>> 6553600010 == make_virtual_offset(100000, 10)
    True

    >>> make_virtual_offset(2**48, 0)
    Traceback (most recent call last):
    ...
    ValueError: Require 0 <= block_start_offset < 2**48, got 281474976710656

    """
    if within_block_offset < 0 or within_block_offset >= 65536:
        raise ValueError(
            "Require 0 <= within_block_offset < 2**16, got %i" % within_block_offset
        )
    if block_start_offset < 0 or block_start_offset >= 281474976710656:
        raise ValueError(
            "Require 0 <= block_start_offset < 2**48, got %i" % block_start_offset
        )
    return (block_start_offset << 16) | within_block_offset


def split_virtual_offset(virtual_offset):
    """Divides a 64-bit BGZF virtual offset into block start & within block offsets.

    >>> (100000, 0) == split_virtual_offset(6553600000)
    True
    >>> (100000, 10) == split_virtual_offset(6553600010)
    True

    """
    start = virtual_offset >> 16
    return start, virtual_offset ^ (start << 16)


def BgzfBlocks(handle):
    """Low level debugging function to inspect BGZF blocks.

    Expects a BGZF compressed file opened in binary read mode using
    the builtin open function. Do not use a handle from this bgzf
    module or the gzip module's open function which will decompress
    the file.

    Returns the block start offset (see virtual offsets), the block
    length (add these for the start of the next block), and the
    decompressed length of the blocks contents (limited to 65536 in
    BGZF), as an iterator - one tuple per BGZF block.

    >>> from builtins import open
    >>> handle = open("SamBam/ex1.bam", "rb")
    >>> for values in BgzfBlocks(handle):
    ...     print("Raw start %i, raw length %i; data start %i, data length %i" % values)
    Raw start 0, raw length 18239; data start 0, data length 65536
    Raw start 18239, raw length 18223; data start 65536, data length 65536
    Raw start 36462, raw length 18017; data start 131072, data length 65536
    Raw start 54479, raw length 17342; data start 196608, data length 65536
    Raw start 71821, raw length 17715; data start 262144, data length 65536
    Raw start 89536, raw length 17728; data start 327680, data length 65536
    Raw start 107264, raw length 17292; data start 393216, data length 63398
    Raw start 124556, raw length 28; data start 456614, data length 0
    >>> handle.close()

    Indirectly we can tell this file came from an old version of
    samtools because all the blocks (except the final one and the
    dummy empty EOF marker block) are 65536 bytes. Later versions
    avoid splitting a read between two blocks, and give the header
    its own block (useful to speed up replacing the header). You
    can see this in ex1_refresh.bam created using samtools 0.1.18:

        samtools view -b ex1.bam > ex1_refresh.bam

    >>> handle = open("SamBam/ex1_refresh.bam", "rb")
    >>> for values in BgzfBlocks(handle):
    ...     print("Raw start %i, raw length %i; data start %i, data length %i" % values)
    Raw start 0, raw length 53; data start 0, data length 38
    Raw start 53, raw length 18195; data start 38, data length 65434
    Raw start 18248, raw length 18190; data start 65472, data length 65409
    Raw start 36438, raw length 18004; data start 130881, data length 65483
    Raw start 54442, raw length 17353; data start 196364, data length 65519
    Raw start 71795, raw length 17708; data start 261883, data length 65411
    Raw start 89503, raw length 17709; data start 327294, data length 65466
    Raw start 107212, raw length 17390; data start 392760, data length 63854
    Raw start 124602, raw length 28; data start 456614, data length 0
    >>> handle.close()

    The above example has no embedded SAM header (thus the first block
    is very small at just 38 bytes of decompressed data), while the next
    example does (a larger block of 103 bytes). Notice that the rest of
    the blocks show the same sizes (they contain the same read data):

    >>> handle = open("SamBam/ex1_header.bam", "rb")
    >>> for values in BgzfBlocks(handle):
    ...     print("Raw start %i, raw length %i; data start %i, data length %i" % values)
    Raw start 0, raw length 104; data start 0, data length 103
    Raw start 104, raw length 18195; data start 103, data length 65434
    Raw start 18299, raw length 18190; data start 65537, data length 65409
    Raw start 36489, raw length 18004; data start 130946, data length 65483
    Raw start 54493, raw length 17353; data start 196429, data length 65519
    Raw start 71846, raw length 17708; data start 261948, data length 65411
    Raw start 89554, raw length 17709; data start 327359, data length 65466
    Raw start 107263, raw length 17390; data start 392825, data length 63854
    Raw start 124653, raw length 28; data start 456679, data length 0
    >>> handle.close()

    """
    if isinstance(handle, BgzfReader):
        raise TypeError("Function BgzfBlocks expects a binary handle")
    data_start = 0
    while True:
        start_offset = handle.tell()
        try:
            block_length, data = _load_bgzf_block(handle)
        except StopIteration:
            break
        data_len = len(data)
        yield start_offset, block_length, data_start, data_len
        data_start += data_len


def _load_bgzf_block(handle, text_mode=False):
    """Load the next BGZF block of compressed data (PRIVATE).

    Returns a tuple (block size and data), or at end of file
    will raise StopIteration.
    """
    magic = handle.read(4)
    if not magic:
        # End of file - should we signal this differently now?
        # See https://www.python.org/dev/peps/pep-0479/
        raise StopIteration
    if magic != _bgzf_magic:
        raise ValueError(
            r"A BGZF (e.g. a BAM file) block should start with "
            r"%r, not %r; handle.tell() now says %r"
            % (_bgzf_magic, magic, handle.tell())
        )
    gzip_mod_time, gzip_extra_flags, gzip_os, extra_len = struct.unpack(
        "<LBBH", handle.read(8)
    )

    block_size = None
    x_len = 0
    while x_len < extra_len:
        subfield_id = handle.read(2)
        subfield_len = struct.unpack("<H", handle.read(2))[0]  # uint16_t
        subfield_data = handle.read(subfield_len)
        x_len += subfield_len + 4
        if subfield_id == _bytes_BC:
            if subfield_len != 2:
                raise ValueError("Wrong BC payload length")
            if block_size is not None:
                raise ValueError("Two BC subfields?")
            block_size = struct.unpack("<H", subfield_data)[0] + 1  # uint16_t
    if x_len != extra_len:
        raise ValueError(f"x_len and extra_len differ {x_len}, {extra_len}")
    if block_size is None:
        raise ValueError("Missing BC, this isn't a BGZF file!")
    # Now comes the compressed data, CRC, and length of uncompressed data.
    deflate_size = block_size - 1 - extra_len - 19
    d = zlib.decompressobj(-15)  # Negative window size means no headers
    data = d.decompress(handle.read(deflate_size)) + d.flush()
    expected_crc = handle.read(4)
    expected_size = struct.unpack("<I", handle.read(4))[0]
    if expected_size != len(data):
        raise RuntimeError("Decompressed to %i, not %i" % (len(data), expected_size))
    # Should cope with a mix of Python platforms...
    crc = zlib.crc32(data)
    if crc < 0:
        crc = struct.pack("<i", crc)
    else:
        crc = struct.pack("<I", crc)
    if expected_crc != crc:
        raise RuntimeError(f"CRC is {crc}, not {expected_crc}")
    if text_mode:
        # Note ISO-8859-1 aka Latin-1 preserves first 256 chars
        # (i.e. ASCII), but critically is a single byte encoding
        return block_size, data.decode("latin-1")
    else:
        return block_size, data


class BgzfReader:
    r"""BGZF reader, acts like a read only handle but seek/tell differ.

    Let's use the BgzfBlocks function to have a peek at the BGZF blocks
    in an example BAM file,

    >>> from builtins import open
    >>> handle = open("SamBam/ex1.bam", "rb")
    >>> for values in BgzfBlocks(handle):
    ...     print("Raw start %i, raw length %i; data start %i, data length %i" % values)
    Raw start 0, raw length 18239; data start 0, data length 65536
    Raw start 18239, raw length 18223; data start 65536, data length 65536
    Raw start 36462, raw length 18017; data start 131072, data length 65536
    Raw start 54479, raw length 17342; data start 196608, data length 65536
    Raw start 71821, raw length 17715; data start 262144, data length 65536
    Raw start 89536, raw length 17728; data start 327680, data length 65536
    Raw start 107264, raw length 17292; data start 393216, data length 63398
    Raw start 124556, raw length 28; data start 456614, data length 0
    >>> handle.close()

    Now let's see how to use this block information to jump to
    specific parts of the decompressed BAM file:

    >>> handle = BgzfReader("SamBam/ex1.bam", "rb")
    >>> assert 0 == handle.tell()
    >>> magic = handle.read(4)
    >>> assert 4 == handle.tell()

    So far nothing so strange, we got the magic marker used at the
    start of a decompressed BAM file, and the handle position makes
    sense. Now however, let's jump to the end of this block and 4
    bytes into the next block by reading 65536 bytes,

    >>> data = handle.read(65536)
    >>> len(data)
    65536
    >>> assert 1195311108 == handle.tell()

    Expecting 4 + 65536 = 65540 were you? Well this is a BGZF 64-bit
    virtual offset, which means:

    >>> split_virtual_offset(1195311108)
    (18239, 4)

    You should spot 18239 as the start of the second BGZF block, while
    the 4 is the offset into this block. See also make_virtual_offset,

    >>> make_virtual_offset(18239, 4)
    1195311108

    Let's jump back to almost the start of the file,

    >>> make_virtual_offset(0, 2)
    2
    >>> handle.seek(2)
    2
    >>> handle.close()

    Note that you can use the max_cache argument to limit the number of
    BGZF blocks cached in memory. The default is 100, and since each
    block can be up to 64kb, the default cache could take up to 6MB of
    RAM. The cache is not important for reading through the file in one
    pass, but is important for improving performance of random access.
    """

    def __init__(self, filename=None, mode="r", fileobj=None, max_cache=100):
        r"""Initialize the class for reading a BGZF file.

        You would typically use the top level ``bgzf.open(...)`` function
        which will call this class internally. Direct use is discouraged.

        Either the ``filename`` (string) or ``fileobj`` (input file object in
        binary mode) arguments must be supplied, but not both.

        Argument ``mode`` controls if the data will be returned as strings in
        text mode ("rt", "tr", or default "r"), or bytes binary mode ("rb"
        or "br"). The argument name matches the built-in ``open(...)`` and
        standard library ``gzip.open(...)`` function.

        If text mode is requested, in order to avoid multi-byte characters,
        this is hard coded to use the "latin1" encoding, and "\r" and "\n"
        are passed as is (without implementing universal new line mode). There
        is no ``encoding`` argument.

        If your data is in UTF-8 or any other incompatible encoding, you must
        use binary mode, and decode the appropriate fragments yourself.

        Argument ``max_cache`` controls the maximum number of BGZF blocks to
        cache in memory. Each can be up to 64kb thus the default of 100 blocks
        could take up to 6MB of RAM. This is important for efficient random
        access, a small value is fine for reading the file in one pass.
        """
        # TODO - Assuming we can seek, check for 28 bytes EOF empty block
        # and if missing warn about possible truncation (as in samtools)?
        if max_cache < 1:
            raise ValueError("Use max_cache with a minimum of 1")
        # Must open the BGZF file in binary mode, but we may want to
        # treat the contents as either text or binary (unicode or
        # bytes under Python 3)
        if filename and fileobj:
            raise ValueError("Supply either filename or fileobj, not both")
        # Want to reject output modes like w, a, x, +
        if mode.lower() not in ("r", "tr", "rt", "rb", "br"):
            raise ValueError(
                "Must use a read mode like 'r' (default), 'rt', or 'rb' for binary"
            )
        # If an open file was passed, make sure it was opened in binary mode.
        if fileobj:
            if fileobj.read(0) != b"":
                raise ValueError("fileobj not opened in binary mode")
            handle = fileobj
        else:
            handle = _open(filename, "rb")
        self._text = "b" not in mode.lower()
        if self._text:
            self._newline = "\n"
        else:
            self._newline = b"\n"
        self._handle = handle
        self.max_cache = max_cache
        self._buffers = {}
        self._block_start_offset = None
        self._block_raw_length = None
        self._load_block(handle.tell())

    def _load_block(self, start_offset=None):
        if start_offset is None:
            # If the file is being read sequentially, then _handle.tell()
            # should be pointing at the start of the next block.
            # However, if seek has been used, we can't assume that.
            start_offset = self._block_start_offset + self._block_raw_length
        if start_offset == self._block_start_offset:
            self._within_block_offset = 0
            return
        elif start_offset in self._buffers:
            # Already in cache
            self._buffer, self._block_raw_length = self._buffers[start_offset]
            self._within_block_offset = 0
            self._block_start_offset = start_offset
            return
        # Must hit the disk... first check cache limits,
        while len(self._buffers) >= self.max_cache:
            # TODO - Implement LRU cache removal?
            self._buffers.popitem()
        # Now load the block
        handle = self._handle
        if start_offset is not None:
            handle.seek(start_offset)
        self._block_start_offset = handle.tell()
        try:
            block_size, self._buffer = _load_bgzf_block(handle, self._text)
        except StopIteration:
            # EOF
            block_size = 0
            if self._text:
                self._buffer = ""
            else:
                self._buffer = b""
        self._within_block_offset = 0
        self._block_raw_length = block_size
        # Finally save the block in our cache,
        self._buffers[self._block_start_offset] = self._buffer, block_size

    def tell(self):
        """Return a 64-bit unsigned BGZF virtual offset."""
        if 0 < self._within_block_offset and self._within_block_offset == len(
            self._buffer
        ):
            # Special case where we're right at the end of a (non empty) block.
            # For non-maximal blocks could give two possible virtual offsets,
            # but for a maximal block can't use 65536 as the within block
            # offset. Therefore for consistency, use the next block and a
            # within block offset of zero.
            return (self._block_start_offset + self._block_raw_length) << 16
        else:
            # return make_virtual_offset(self._block_start_offset,
            #                            self._within_block_offset)
            # TODO - Include bounds checking as in make_virtual_offset?
            return (self._block_start_offset << 16) | self._within_block_offset

    def seek(self, virtual_offset):
        """Seek to a 64-bit unsigned BGZF virtual offset."""
        # Do this inline to avoid a function call,
        # start_offset, within_block = split_virtual_offset(virtual_offset)
        start_offset = virtual_offset >> 16
        within_block = virtual_offset ^ (start_offset << 16)
        if start_offset != self._block_start_offset:
            # Don't need to load the block if already there
            # (this avoids a function call since _load_block would do nothing)
            self._load_block(start_offset)
            if start_offset != self._block_start_offset:
                raise ValueError("start_offset not loaded correctly")
        if within_block > len(self._buffer):
            if not (within_block == 0 and len(self._buffer) == 0):
                raise ValueError(
                    "Within offset %i but block size only %i"
                    % (within_block, len(self._buffer))
                )
        self._within_block_offset = within_block
        # assert virtual_offset == self.tell(), \
        #     "Did seek to %i (%i, %i), but tell says %i (%i, %i)" \
        #     % (virtual_offset, start_offset, within_block,
        #        self.tell(), self._block_start_offset,
        #        self._within_block_offset)
        return virtual_offset

    def read(self, size=-1):
        """Read method for the BGZF module."""
        if size < 0:
            raise NotImplementedError("Don't be greedy, that could be massive!")

        result = "" if self._text else b""
        while size and self._block_raw_length:
            if self._within_block_offset + size <= len(self._buffer):
                # This may leave us right at the end of a block
                # (lazy loading, don't load the next block unless we have to)
                data = self._buffer[
                    self._within_block_offset : self._within_block_offset + size
                ]
                self._within_block_offset += size
                if not data:
                    raise ValueError("Must be at least 1 byte")
                result += data
                break
            else:
                data = self._buffer[self._within_block_offset :]
                size -= len(data)
                self._load_block()  # will reset offsets
                result += data

        return result

    def readline(self):
        """Read a single line from the BGZF file."""
        result = "" if self._text else b""
        while self._block_raw_length:
            i = self._buffer.find(self._newline, self._within_block_offset)
            # Three cases to consider,
            if i == -1:
                # No newline, need to read in more data
                data = self._buffer[self._within_block_offset :]
                self._load_block()  # will reset offsets
                result += data
            elif i + 1 == len(self._buffer):
                # Found new line, but right at end of block (SPECIAL)
                data = self._buffer[self._within_block_offset :]
                # Must now load the next block to ensure tell() works
                self._load_block()  # will reset offsets
                if not data:
                    raise ValueError("Must be at least 1 byte")
                result += data
                break
            else:
                # Found new line, not at end of block (easy case, no IO)
                data = self._buffer[self._within_block_offset : i + 1]
                self._within_block_offset = i + 1
                # assert data.endswith(self._newline)
                result += data
                break

        return result

756 def __next__(self):
757 """Return the next line."""
758 line = self.readline()
759 if not line:
760 raise StopIteration
761 return line
762
763 def __iter__(self):
764 """Iterate over the lines in the BGZF file."""
765 return self
766
767 def close(self):
768 """Close BGZF file."""
769 self._handle.close()
770 self._buffer = None
771 self._block_start_offset = None
772 self._buffers = None
773
774 def seekable(self):
775 """Return True, the BGZF reader supports random access."""
776 return True
777
778 def isatty(self):
779 """Return True if connected to a TTY device."""
780 return False
781
782 def fileno(self):
783 """Return integer file descriptor."""
784 return self._handle.fileno()
785
786 def __enter__(self):
787 """Support the with statement (context manager entry)."""
788 return self
789
790 def __exit__(self, exc_type, exc_value, traceback):
791 """Close the file when leaving the with statement."""
792 self.close()
793
794
795 class BgzfWriter:
796 """Write a BGZF compressed file, one block of up to 64 KiB at a time."""
797
798 def __init__(self, filename=None, mode="w", fileobj=None, compresslevel=6):
799 """Initialize the class."""
800 if filename and fileobj:
801 raise ValueError("Supply either filename or fileobj, not both")
802 # If an open file was passed, make sure it was opened in binary mode.
803 if fileobj:
804 if fileobj.read(0) != b"":
805 raise ValueError("fileobj not opened in binary mode")
806 handle = fileobj
807 else:
808 if "w" not in mode.lower() and "a" not in mode.lower():
809 raise ValueError(f"Must use write or append mode, not {mode!r}")
810 if "a" in mode.lower():
811 handle = _open(filename, "ab")
812 else:
813 handle = _open(filename, "wb")
814 self._text = "b" not in mode.lower()
815 self._handle = handle
816 self._buffer = b""
817 self.compresslevel = compresslevel
818
819 def _write_block(self, block):
820 """Write provided data to file as a single BGZF compressed block (PRIVATE)."""
821 # print("Saving %i bytes" % len(block))
822 if len(block) > 65536:
823 raise ValueError(f"Block length {len(block)} exceeds the BGZF limit of 65536")
824 # Giving a negative window bits means no gzip/zlib headers,
825 # -15 used in samtools
826 c = zlib.compressobj(
827 self.compresslevel, zlib.DEFLATED, -15, zlib.DEF_MEM_LEVEL, 0
828 )
829 compressed = c.compress(block) + c.flush()
830 del c
831 if len(compressed) > 65536:
832 raise RuntimeError(
833 "TODO - Didn't compress enough, try less data in this block"
834 )
835 # Mask to an unsigned 32-bit value so struct.pack("<I") behaves
836 # the same across Python platforms.
837 crc = struct.pack("<I", zlib.crc32(block) & 0xFFFFFFFF)
838 # BSIZE sub field: total block size minus one (18 byte header plus
839 # compressed data plus 8 byte footer is len(compressed) + 26)
840 bsize = struct.pack("<H", len(compressed) + 25)
841 uncompressed_length = struct.pack("<I", len(block))
844 # Fixed 16 bytes,
845 # gzip magic bytes (4) mod time (4),
846 # gzip flag (1), os (1), extra length which is six (2),
847 # sub field which is BC (2), sub field length of two (2),
848 # Variable data,
849 # 2 bytes: block length as BC sub field (2)
850 # X bytes: the data
851 # 8 bytes: crc (4), uncompressed data length (4)
852 data = _bgzf_header + bsize + compressed + crc + uncompressed_length
853 self._handle.write(data)
854
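The block layout written above can be reproduced with just the standard library. This sketch (the header constant mirrors the module's `_bgzf_header`; `bgzf_block` is a hypothetical standalone helper) builds one block and checks that any plain gzip reader can decompress it:

```python
import gzip
import struct
import zlib

# 16 byte BGZF header: gzip magic + method + FEXTRA flag (4), zero
# mtime (4), XFL (1), OS (1), XLEN=6 (2), then the 'BC' sub field id (2)
# announcing a 2 byte payload (2) - the BSIZE value follows separately.
BGZF_HEADER = b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00\x42\x43\x02\x00"

def bgzf_block(data, compresslevel=6):
    """Compress data (at most 65536 bytes) into a single BGZF block."""
    # Negative window bits means raw deflate, no zlib header/trailer.
    c = zlib.compressobj(compresslevel, zlib.DEFLATED, -15, zlib.DEF_MEM_LEVEL, 0)
    compressed = c.compress(data) + c.flush()
    bsize = struct.pack("<H", len(compressed) + 25)  # total block size minus one
    crc = struct.pack("<I", zlib.crc32(data) & 0xFFFFFFFF)
    return BGZF_HEADER + bsize + compressed + crc + struct.pack("<I", len(data))

block = bgzf_block(b"BGZF is just gzip with benefits")
# A BGZF block is a valid gzip member, so gzip can decompress it:
assert gzip.decompress(block) == b"BGZF is just gzip with benefits"
```

This is why `gunzip` can decompress `.bgz` files directly, as the command line help at the bottom of this file notes.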
855 def write(self, data):
856 """Write data to the BGZF file, buffering it into 64 KiB blocks."""
857 # TODO - Check bytes vs unicode
858 if isinstance(data, str):
859 # When reading we can't cope with multi-byte characters
860 # being split between BGZF blocks, so we restrict to a
861 # single byte encoding - like ASCII or latin-1.
862 # On output we could probably allow any encoding, as we
863 # don't care about splitting unicode characters between blocks
864 data = data.encode("latin-1")
865 # block_size = 2**16 = 65536
866 data_len = len(data)
867 if len(self._buffer) + data_len < 65536:
868 # print("Cached %r" % data)
869 self._buffer += data
870 else:
871 # print("Got %r, writing out some data..." % data)
872 self._buffer += data
873 while len(self._buffer) >= 65536:
874 self._write_block(self._buffer[:65536])
875 self._buffer = self._buffer[65536:]
876
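The buffering rule in `write()` above, accumulate bytes and peel off full 64 KiB blocks, can be exercised in isolation. `split_into_blocks` below is a hypothetical helper for illustration, not part of this module:

```python
def split_into_blocks(buffer, block_size=65536):
    """Split buffered bytes into full-sized blocks plus a remainder."""
    blocks = []
    # Same loop shape as BgzfWriter.write: only full blocks are emitted,
    # any shorter tail stays buffered for the next write or flush.
    while len(buffer) >= block_size:
        blocks.append(buffer[:block_size])
        buffer = buffer[block_size:]
    return blocks, buffer

blocks, rest = split_into_blocks(b"x" * 70000)
assert [len(b) for b in blocks] == [65536]
assert len(rest) == 4464
```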
877 def flush(self):
878 """Flush any buffered data explicitly."""
879 while len(self._buffer) >= 65536:
880 self._write_block(self._buffer[:65536])
881 self._buffer = self._buffer[65536:]
882 self._write_block(self._buffer)
883 self._buffer = b""
884 self._handle.flush()
885
886 def close(self):
887 """Flush data, write 28 bytes BGZF EOF marker, and close BGZF file.
888
889 samtools will look for a magic EOF marker, just a 28 byte empty BGZF
890 block, and if it is missing warns that the BAM file may be truncated.
891 Both samtools and bgzip write this block, so this implementation
892 does too.
893 """
894 if self._buffer:
895 self.flush()
896 self._handle.write(_bgzf_eof)
897 self._handle.flush()
898 self._handle.close()
899
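The 28 byte EOF marker mentioned in `close()` is itself a valid, empty BGZF block. This sketch (the constant mirrors the module's `_bgzf_eof`; `ends_with_bgzf_eof` is a hypothetical helper) shows both properties:

```python
import gzip

# Empty BGZF block used by samtools/bgzip as an end-of-file marker.
BGZF_EOF = (
    b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00\x42\x43"
    b"\x02\x00\x1b\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00"
)

assert len(BGZF_EOF) == 28
# It decompresses to nothing, so a trailing marker is harmless
# to ordinary gzip readers:
assert gzip.decompress(BGZF_EOF) == b""

def ends_with_bgzf_eof(path):
    """Check whether a file finishes with the BGZF EOF marker."""
    with open(path, "rb") as handle:
        handle.seek(0, 2)            # seek to the end of the file
        if handle.tell() < 28:
            return False
        handle.seek(-28, 2)          # back up over the last 28 bytes
        return handle.read(28) == BGZF_EOF
```

A check along these lines is what lets samtools warn about truncated BAM files without decompressing them.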
900 def tell(self):
901 """Return a BGZF 64-bit virtual offset."""
902 return make_virtual_offset(self._handle.tell(), len(self._buffer))
903
904 def seekable(self):
905 """Return False, writing is sequential (though tell is supported)."""
906 return False
908
909 def isatty(self):
910 """Return True if connected to a TTY device."""
911 return False
912
913 def fileno(self):
914 """Return integer file descriptor."""
915 return self._handle.fileno()
916
917 def __enter__(self):
918 """Support the with statement (context manager entry)."""
919 return self
920
921 def __exit__(self, exc_type, exc_value, traceback):
922 """Close the file when leaving the with statement."""
923 self.close()
924
925
926 if __name__ == "__main__":
927 if len(sys.argv) > 1:
928 print("Call this with no arguments and pipe uncompressed data in on stdin")
929 print("and it will produce BGZF compressed data on stdout. e.g.")
930 print("")
931 print("./bgzf.py < example.fastq > example.fastq.bgz")
932 print("")
933 print("The extension convention of *.bgz is to distinguish these from *.gz")
934 print("used for standard gzipped files without the block structure of BGZF.")
935 print("You can use the standard gunzip command to decompress BGZF files,")
936 print("if it complains about the extension try something like this:")
937 print("")
938 print("cat example.fastq.bgz | gunzip > example.fastq")
939 print("")
940 print("See also the tool bgzip that comes with samtools")
941 sys.exit(0)
942
943 # Ensure we have binary mode handles
944 # (leave stderr as default text mode)
945 stdin = sys.stdin.buffer
946 stdout = sys.stdout.buffer
947
948 sys.stderr.write("Producing BGZF output from stdin...\n")
949 w = BgzfWriter(fileobj=stdout)
950 while True:
951 data = stdin.read(65536)
952 if not data:
953 break
954 w.write(data)
955 # Doing close will write an empty BGZF block as EOF marker:
956 w.close()
957 sys.stderr.write("BGZF data produced\n")