Mercurial > repos > rliterman > csp2
comparison CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/lib/python3.8/site-packages/Bio/bgzf.py @ 69:33d812a61356
planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author | jpayne |
---|---|
date | Tue, 18 Mar 2025 17:55:14 -0400 |
parents | |
children |
comparison
equal
deleted
inserted
replaced
67:0e9998148a16 | 69:33d812a61356 |
---|---|
1 #!/usr/bin/env python | |
2 # Copyright 2010-2018 by Peter Cock. | |
3 # All rights reserved. | |
4 # | |
5 # This file is part of the Biopython distribution and governed by your | |
6 # choice of the "Biopython License Agreement" or the "BSD 3-Clause License". | |
7 # Please see the LICENSE file that should have been included as part of this | |
8 # package. | |
9 r"""Read and write BGZF compressed files (the GZIP variant used in BAM). | |
10 | |
11 The SAM/BAM file format (Sequence Alignment/Map) comes in a plain text | |
12 format (SAM), and a compressed binary format (BAM). The latter uses a | |
13 modified form of gzip compression called BGZF (Blocked GNU Zip Format), | |
14 which can be applied to any file format to provide compression with | |
15 efficient random access. BGZF is described together with the SAM/BAM | |
16 file format at https://samtools.sourceforge.net/SAM1.pdf | |
17 | |
18 Please read the text below about 'virtual offsets' before using BGZF | |
19 files for random access. | |
20 | |
21 Aim of this module | |
22 ------------------ | |
23 The Python gzip library can be used to read BGZF files, since for | |
24 decompression they are just (specialised) gzip files. What this | |
25 module aims to facilitate is random access to BGZF files (using the | |
26 'virtual offset' idea), and writing BGZF files (which means using | |
27 suitably sized gzip blocks and writing the extra 'BC' field in the | |
28 gzip headers). As in the gzip library, the zlib library is used | |
29 internally. | |
30 | |
31 In addition to being required for random access to and writing of | |
32 BAM files, the BGZF format can also be used on other sequential | |
33 data (in the sense of one record after another), such as most of | |
34 the sequence data formats supported in Bio.SeqIO (like FASTA, | |
35 FASTQ, GenBank, etc) or large MAF alignments. | |
36 | |
37 The Bio.SeqIO indexing functions use this module to support BGZF files. | |
38 | |
39 Technical Introduction to BGZF | |
40 ------------------------------ | |
41 The gzip file format allows multiple compressed blocks, each of which | |
42 could be a stand alone gzip file. As an interesting bonus, this means | |
43 you can use Unix ``cat`` to combine two or more gzip files into one by | |
44 concatenating them. Also, each block can have one of several compression | |
45 levels (including uncompressed, which actually takes up a little bit | |
46 more space due to the gzip header). | |
47 | |
48 What the BAM designers realised was that while random access to data | |
49 stored in traditional gzip files was slow, breaking the file into | |
50 gzip blocks would allow fast random access to each block. To access | |
51 a particular piece of the decompressed data, you just need to know | |
52 which block it starts in (the offset of the gzip block start), and | |
53 how far into the (decompressed) contents of the block you need to | |
54 read. | |
55 | |
56 One problem with this is finding the gzip block sizes efficiently. | |
57 You can do it with a standard gzip file, but it requires every block | |
58 to be decompressed -- and that would be rather slow. Additionally | |
59 typical gzip files may use very large blocks. | |
60 | |
61 All that differs in BGZF is that compressed size of each gzip block | |
62 is limited to 2^16 bytes, and an extra 'BC' field in the gzip header | |
63 records this size. Traditional decompression tools can ignore this, | |
64 and unzip the file just like any other gzip file. | |
65 | |
66 The point of this is you can look at the first BGZF block, find out | |
67 how big it is from this 'BC' header, and thus seek immediately to | |
68 the second block, and so on. | |
69 | |
70 The BAM indexing scheme records read positions using a 64 bit | |
71 'virtual offset', comprising ``coffset << 16 | uoffset``, where ``coffset`` | |
72 is the file offset of the BGZF block containing the start of the read | |
73 (unsigned integer using up to 64-16 = 48 bits), and ``uoffset`` is the | |
74 offset within the (decompressed) block (unsigned 16 bit integer). | |
75 | |
76 This limits you to BAM files where the last block starts by 2^48 | |
77 bytes, or 256 petabytes, and the decompressed size of each block | |
78 is at most 2^16 bytes, or 64kb. Note that this matches the BGZF | |
79 'BC' field size which limits the compressed size of each block to | |
80 2^16 bytes, allowing for BAM files to use BGZF with no gzip | |
81 compression (useful for intermediate files in memory to reduce | |
82 CPU load). | |
83 | |
84 Warning about namespaces | |
85 ------------------------ | |
86 It is considered a bad idea to use "from XXX import ``*``" in Python, because | |
87 it pollutes the namespace. This is a real issue with Bio.bgzf (and the | |
88 standard Python library gzip) because they contain a function called open | |
89 i.e. Suppose you do this: | |
90 | |
91 >>> from Bio.bgzf import * | |
92 >>> print(open.__module__) | |
93 Bio.bgzf | |
94 | |
95 Or, | |
96 | |
97 >>> from gzip import * | |
98 >>> print(open.__module__) | |
99 gzip | |
100 | |
101 Notice that the open function has been replaced. You can "fix" this if you | |
102 need to by importing the built-in open function: | |
103 | |
104 >>> from builtins import open | |
105 | |
106 However, what we recommend instead is to use the explicit namespace, e.g. | |
107 | |
108 >>> from Bio import bgzf | |
109 >>> print(bgzf.open.__module__) | |
110 Bio.bgzf | |
111 | |
112 | |
113 Examples | |
114 -------- | |
115 This is an ordinary GenBank file compressed using BGZF, so it can | |
116 be decompressed using gzip, | |
117 | |
118 >>> import gzip | |
119 >>> handle = gzip.open("GenBank/NC_000932.gb.bgz", "r") | |
120 >>> assert 0 == handle.tell() | |
121 >>> line = handle.readline() | |
122 >>> assert 80 == handle.tell() | |
123 >>> line = handle.readline() | |
124 >>> assert 143 == handle.tell() | |
125 >>> data = handle.read(70000) | |
126 >>> assert 70143 == handle.tell() | |
127 >>> handle.close() | |
128 | |
129 We can also access the file using the BGZF reader - but pay | |
130 attention to the file offsets which will be explained below: | |
131 | |
132 >>> handle = BgzfReader("GenBank/NC_000932.gb.bgz", "r") | |
133 >>> assert 0 == handle.tell() | |
134 >>> print(handle.readline().rstrip()) | |
135 LOCUS NC_000932 154478 bp DNA circular PLN 15-APR-2009 | |
136 >>> assert 80 == handle.tell() | |
137 >>> print(handle.readline().rstrip()) | |
138 DEFINITION Arabidopsis thaliana chloroplast, complete genome. | |
139 >>> assert 143 == handle.tell() | |
140 >>> data = handle.read(70000) | |
141 >>> assert 987828735 == handle.tell() | |
142 >>> print(handle.readline().rstrip()) | |
143 f="GeneID:844718" | |
144 >>> print(handle.readline().rstrip()) | |
145 CDS complement(join(84337..84771,85454..85843)) | |
146 >>> offset = handle.seek(make_virtual_offset(55074, 126)) | |
147 >>> print(handle.readline().rstrip()) | |
148 68521 tatgtcattc gaaattgtat aaagacaact cctatttaat agagctattt gtgcaagtat | |
149 >>> handle.close() | |
150 | |
151 Notice the handle's offset looks different as a BGZF file. This | |
152 brings us to the key point about BGZF, which is the block structure: | |
153 | |
154 >>> handle = open("GenBank/NC_000932.gb.bgz", "rb") | |
155 >>> for values in BgzfBlocks(handle): | |
156 ... print("Raw start %i, raw length %i; data start %i, data length %i" % values) | |
157 Raw start 0, raw length 15073; data start 0, data length 65536 | |
158 Raw start 15073, raw length 17857; data start 65536, data length 65536 | |
159 Raw start 32930, raw length 22144; data start 131072, data length 65536 | |
160 Raw start 55074, raw length 22230; data start 196608, data length 65536 | |
161 Raw start 77304, raw length 14939; data start 262144, data length 43478 | |
162 Raw start 92243, raw length 28; data start 305622, data length 0 | |
163 >>> handle.close() | |
164 | |
165 In this example the first three blocks are 'full' and hold 65536 bytes | |
166 of uncompressed data. The fourth block isn't full and holds 43478 bytes. | |
167 Finally there is a special empty fifth block which takes 28 bytes on | |
168 disk and serves as an 'end of file' (EOF) marker. If this is missing, | |
169 it is possible your BGZF file is incomplete. | |
170 | |
171 By reading ahead 70,000 bytes we moved into the second BGZF block, | |
172 and at that point the BGZF virtual offsets start to look different | |
173 to a simple offset into the decompressed data as exposed by the gzip | |
174 library. | |
175 | |
176 As an example, consider seeking to the decompressed position 196734. | |
177 Since 196734 = 65536 + 65536 + 65536 + 126 = 65536*3 + 126, this | |
178 is equivalent to jumping the first three blocks (which in this | |
179 specific example are all size 65536 after decompression - which | |
180 does not always hold) and starting at byte 126 of the fourth block | |
181 (after decompression). For BGZF, we need to know the fourth block's | |
182 offset of 55074 and the offset within the block of 126 to get the | |
183 BGZF virtual offset. | |
184 | |
185 >>> print(55074 << 16 | 126) | |
186 3609329790 | |
187 >>> print(bgzf.make_virtual_offset(55074, 126)) | |
188 3609329790 | |
189 | |
190 Thus for this BGZF file, decompressed position 196734 corresponds | |
191 to the virtual offset 3609329790. However, another BGZF file with | |
192 different contents would have compressed more or less efficiently, | |
193 so the compressed blocks would be different sizes. What this means | |
194 is the mapping between the uncompressed offset and the compressed | |
195 virtual offset depends on the BGZF file you are using. | |
196 | |
197 If you are accessing a BGZF file via this module, just use the | |
198 handle.tell() method to note the virtual offset of a position you | |
199 may later want to return to using handle.seek(). | |
200 | |
201 The catch with BGZF virtual offsets is while they can be compared | |
202 (which offset comes first in the file), you cannot safely subtract | |
203 them to get the size of the data between them, nor add/subtract | |
204 a relative offset. | |
205 | |
206 Of course you can parse this file with Bio.SeqIO using BgzfReader, | |
207 although there isn't any benefit over using gzip.open(...), unless | |
208 you want to index BGZF compressed sequence files: | |
209 | |
210 >>> from Bio import SeqIO | |
211 >>> handle = BgzfReader("GenBank/NC_000932.gb.bgz") | |
212 >>> record = SeqIO.read(handle, "genbank") | |
213 >>> handle.close() | |
214 >>> print(record.id) | |
215 NC_000932.1 | |
216 | |
217 Text Mode | |
218 --------- | |
219 | |
220 Like the standard library gzip.open(...), the BGZF code defaults to opening | |
221 files in binary mode. | |
222 | |
223 You can request the file be opened in text mode, but beware that this is hard | |
224 coded to the simple "latin1" (aka "iso-8859-1") encoding (which includes all | |
225 the ASCII characters), which works well with most Western European languages. | |
226 However, it is not fully compatible with the more widely used UTF-8 encoding. | |
227 | |
228 In variable width encodings like UTF-8, some single characters in the unicode | |
229 text output are represented by multiple bytes in the raw binary form. This is | |
230 problematic with BGZF, as we cannot always decode each block in isolation - a | |
231 single unicode character could be split over two blocks. This can even happen | |
232 with fixed width unicode encodings, as the BGZF block size is not fixed. | |
233 | |
234 Therefore, this module is currently restricted to only support single byte | |
235 unicode encodings, such as ASCII, "latin1" (which is a superset of ASCII), or | |
236 potentially other character maps (not implemented). | |
237 | |
238 Furthermore, unlike the default text mode on Python 3, we do not attempt to | |
239 implement universal new line mode. This transforms the various operating system | |
240 new line conventions like Windows (CR LF or "\r\n"), Unix (just LF, "\n"), or | |
241 old Macs (just CR, "\r"), into just LF ("\n"). Here we have the same problem - | |
242 is "\r" at the end of a block an incomplete Windows style new line? | |
243 | |
244 Instead, you will get the CR ("\r") and LF ("\n") characters as is. | |
245 | |
246 If your data is in UTF-8 or any other incompatible encoding, you must use | |
247 binary mode, and decode the appropriate fragments yourself. | |
248 """ | |
249 | |
250 import struct | |
251 import sys | |
252 import zlib | |
253 | |
254 from builtins import open as _open | |
255 | |
256 _bgzf_magic = b"\x1f\x8b\x08\x04" | |
257 _bgzf_header = b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00\x42\x43\x02\x00" | |
258 _bgzf_eof = b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00BC\x02\x00\x1b\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00" | |
259 _bytes_BC = b"BC" | |
260 | |
261 | |
262 def open(filename, mode="rb"): | |
263 r"""Open a BGZF file for reading, writing or appending. | |
264 | |
265 If text mode is requested, in order to avoid multi-byte characters, this is | |
266 hard coded to use the "latin1" encoding, and "\r" and "\n" are passed as is | |
267 (without implementing universal new line mode). | |
268 | |
269 If your data is in UTF-8 or any other incompatible encoding, you must use | |
270 binary mode, and decode the appropriate fragments yourself. | |
271 """ | |
272 if "r" in mode.lower(): | |
273 return BgzfReader(filename, mode) | |
274 elif "w" in mode.lower() or "a" in mode.lower(): | |
275 return BgzfWriter(filename, mode) | |
276 else: | |
277 raise ValueError(f"Bad mode {mode!r}") | |
278 | |
279 | |
280 def make_virtual_offset(block_start_offset, within_block_offset): | |
281 """Compute a BGZF virtual offset from block start and within block offsets. | |
282 | |
283 The BAM indexing scheme records read positions using a 64 bit | |
284 'virtual offset', comprising in C terms: | |
285 | |
286 block_start_offset << 16 | within_block_offset | |
287 | |
288 Here block_start_offset is the file offset of the BGZF block | |
289 start (unsigned integer using up to 64-16 = 48 bits), and | |
290 within_block_offset within the (decompressed) block (unsigned | |
291 16 bit integer). | |
292 | |
293 >>> make_virtual_offset(0, 0) | |
294 0 | |
295 >>> make_virtual_offset(0, 1) | |
296 1 | |
297 >>> make_virtual_offset(0, 2**16 - 1) | |
298 65535 | |
299 >>> make_virtual_offset(0, 2**16) | |
300 Traceback (most recent call last): | |
301 ... | |
302 ValueError: Require 0 <= within_block_offset < 2**16, got 65536 | |
303 | |
304 >>> 65536 == make_virtual_offset(1, 0) | |
305 True | |
306 >>> 65537 == make_virtual_offset(1, 1) | |
307 True | |
308 >>> 131071 == make_virtual_offset(1, 2**16 - 1) | |
309 True | |
310 | |
311 >>> 6553600000 == make_virtual_offset(100000, 0) | |
312 True | |
313 >>> 6553600001 == make_virtual_offset(100000, 1) | |
314 True | |
315 >>> 6553600010 == make_virtual_offset(100000, 10) | |
316 True | |
317 | |
318 >>> make_virtual_offset(2**48, 0) | |
319 Traceback (most recent call last): | |
320 ... | |
321 ValueError: Require 0 <= block_start_offset < 2**48, got 281474976710656 | |
322 | |
323 """ | |
324 if within_block_offset < 0 or within_block_offset >= 65536: | |
325 raise ValueError( | |
326 "Require 0 <= within_block_offset < 2**16, got %i" % within_block_offset | |
327 ) | |
328 if block_start_offset < 0 or block_start_offset >= 281474976710656: | |
329 raise ValueError( | |
330 "Require 0 <= block_start_offset < 2**48, got %i" % block_start_offset | |
331 ) | |
332 return (block_start_offset << 16) | within_block_offset | |
333 | |
334 | |
335 def split_virtual_offset(virtual_offset): | |
336 """Divides a 64-bit BGZF virtual offset into block start & within block offsets. | |
337 | |
338 >>> (100000, 0) == split_virtual_offset(6553600000) | |
339 True | |
340 >>> (100000, 10) == split_virtual_offset(6553600010) | |
341 True | |
342 | |
343 """ | |
344 start = virtual_offset >> 16 | |
345 return start, virtual_offset ^ (start << 16) | |
346 | |
347 | |
348 def BgzfBlocks(handle): | |
349 """Low level debugging function to inspect BGZF blocks. | |
350 | |
351 Expects a BGZF compressed file opened in binary read mode using | |
352 the builtin open function. Do not use a handle from this bgzf | |
353 module or the gzip module's open function which will decompress | |
354 the file. | |
355 | |
356 Returns the block start offset (see virtual offsets), the block | |
357 length (add these for the start of the next block), and the | |
358 decompressed length of the blocks contents (limited to 65536 in | |
359 BGZF), as an iterator - one tuple per BGZF block. | |
360 | |
361 >>> from builtins import open | |
362 >>> handle = open("SamBam/ex1.bam", "rb") | |
363 >>> for values in BgzfBlocks(handle): | |
364 ... print("Raw start %i, raw length %i; data start %i, data length %i" % values) | |
365 Raw start 0, raw length 18239; data start 0, data length 65536 | |
366 Raw start 18239, raw length 18223; data start 65536, data length 65536 | |
367 Raw start 36462, raw length 18017; data start 131072, data length 65536 | |
368 Raw start 54479, raw length 17342; data start 196608, data length 65536 | |
369 Raw start 71821, raw length 17715; data start 262144, data length 65536 | |
370 Raw start 89536, raw length 17728; data start 327680, data length 65536 | |
371 Raw start 107264, raw length 17292; data start 393216, data length 63398 | |
372 Raw start 124556, raw length 28; data start 456614, data length 0 | |
373 >>> handle.close() | |
374 | |
375 Indirectly we can tell this file came from an old version of | |
376 samtools because all the blocks (except the final one and the | |
377 dummy empty EOF marker block) are 65536 bytes. Later versions | |
378 avoid splitting a read between two blocks, and give the header | |
379 its own block (useful to speed up replacing the header). You | |
380 can see this in ex1_refresh.bam created using samtools 0.1.18: | |
381 | |
382 samtools view -b ex1.bam > ex1_refresh.bam | |
383 | |
384 >>> handle = open("SamBam/ex1_refresh.bam", "rb") | |
385 >>> for values in BgzfBlocks(handle): | |
386 ... print("Raw start %i, raw length %i; data start %i, data length %i" % values) | |
387 Raw start 0, raw length 53; data start 0, data length 38 | |
388 Raw start 53, raw length 18195; data start 38, data length 65434 | |
389 Raw start 18248, raw length 18190; data start 65472, data length 65409 | |
390 Raw start 36438, raw length 18004; data start 130881, data length 65483 | |
391 Raw start 54442, raw length 17353; data start 196364, data length 65519 | |
392 Raw start 71795, raw length 17708; data start 261883, data length 65411 | |
393 Raw start 89503, raw length 17709; data start 327294, data length 65466 | |
394 Raw start 107212, raw length 17390; data start 392760, data length 63854 | |
395 Raw start 124602, raw length 28; data start 456614, data length 0 | |
396 >>> handle.close() | |
397 | |
398 The above example has no embedded SAM header (thus the first block | |
399 is very small at just 38 bytes of decompressed data), while the next | |
400 example does (a larger block of 103 bytes). Notice that the rest of | |
401 the blocks show the same sizes (they contain the same read data): | |
402 | |
403 >>> handle = open("SamBam/ex1_header.bam", "rb") | |
404 >>> for values in BgzfBlocks(handle): | |
405 ... print("Raw start %i, raw length %i; data start %i, data length %i" % values) | |
406 Raw start 0, raw length 104; data start 0, data length 103 | |
407 Raw start 104, raw length 18195; data start 103, data length 65434 | |
408 Raw start 18299, raw length 18190; data start 65537, data length 65409 | |
409 Raw start 36489, raw length 18004; data start 130946, data length 65483 | |
410 Raw start 54493, raw length 17353; data start 196429, data length 65519 | |
411 Raw start 71846, raw length 17708; data start 261948, data length 65411 | |
412 Raw start 89554, raw length 17709; data start 327359, data length 65466 | |
413 Raw start 107263, raw length 17390; data start 392825, data length 63854 | |
414 Raw start 124653, raw length 28; data start 456679, data length 0 | |
415 >>> handle.close() | |
416 | |
417 """ | |
418 if isinstance(handle, BgzfReader): | |
419 raise TypeError("Function BgzfBlocks expects a binary handle") | |
420 data_start = 0 | |
421 while True: | |
422 start_offset = handle.tell() | |
423 try: | |
424 block_length, data = _load_bgzf_block(handle) | |
425 except StopIteration: | |
426 break | |
427 data_len = len(data) | |
428 yield start_offset, block_length, data_start, data_len | |
429 data_start += data_len | |
430 | |
431 | |
432 def _load_bgzf_block(handle, text_mode=False): | |
433 """Load the next BGZF block of compressed data (PRIVATE). | |
434 | |
435 Returns a tuple (block size and data), or at end of file | |
436 will raise StopIteration. | |
437 """ | |
438 magic = handle.read(4) | |
439 if not magic: | |
440 # End of file - should we signal this differently now? | |
441 # See https://www.python.org/dev/peps/pep-0479/ | |
442 raise StopIteration | |
443 if magic != _bgzf_magic: | |
444 raise ValueError( | |
445 r"A BGZF (e.g. a BAM file) block should start with " | |
446 r"%r, not %r; handle.tell() now says %r" | |
447 % (_bgzf_magic, magic, handle.tell()) | |
448 ) | |
449 gzip_mod_time, gzip_extra_flags, gzip_os, extra_len = struct.unpack( | |
450 "<LBBH", handle.read(8) | |
451 ) | |
452 | |
453 block_size = None | |
454 x_len = 0 | |
455 while x_len < extra_len: | |
456 subfield_id = handle.read(2) | |
457 subfield_len = struct.unpack("<H", handle.read(2))[0] # uint16_t | |
458 subfield_data = handle.read(subfield_len) | |
459 x_len += subfield_len + 4 | |
460 if subfield_id == _bytes_BC: | |
461 if subfield_len != 2: | |
462 raise ValueError("Wrong BC payload length") | |
463 if block_size is not None: | |
464 raise ValueError("Two BC subfields?") | |
465 block_size = struct.unpack("<H", subfield_data)[0] + 1 # uint16_t | |
466 if x_len != extra_len: | |
467 raise ValueError(f"x_len and extra_len differ {x_len}, {extra_len}") | |
468 if block_size is None: | |
469 raise ValueError("Missing BC, this isn't a BGZF file!") | |
470 # Now comes the compressed data, CRC, and length of uncompressed data. | |
471 deflate_size = block_size - 1 - extra_len - 19 | |
472 d = zlib.decompressobj(-15) # Negative window size means no headers | |
473 data = d.decompress(handle.read(deflate_size)) + d.flush() | |
474 expected_crc = handle.read(4) | |
475 expected_size = struct.unpack("<I", handle.read(4))[0] | |
476 if expected_size != len(data): | |
477 raise RuntimeError("Decompressed to %i, not %i" % (len(data), expected_size)) | |
478 # Should cope with a mix of Python platforms... | |
479 crc = zlib.crc32(data) | |
480 if crc < 0: | |
481 crc = struct.pack("<i", crc) | |
482 else: | |
483 crc = struct.pack("<I", crc) | |
484 if expected_crc != crc: | |
485 raise RuntimeError(f"CRC is {crc}, not {expected_crc}") | |
486 if text_mode: | |
487 # Note ISO-8859-1 aka Latin-1 preserves first 256 chars | |
488 # (i.e. ASCII), but critically is a single byte encoding | |
489 return block_size, data.decode("latin-1") | |
490 else: | |
491 return block_size, data | |
492 | |
493 | |
494 class BgzfReader: | |
495 r"""BGZF reader, acts like a read only handle but seek/tell differ. | |
496 | |
497 Let's use the BgzfBlocks function to have a peek at the BGZF blocks | |
498 in an example BAM file, | |
499 | |
500 >>> from builtins import open | |
501 >>> handle = open("SamBam/ex1.bam", "rb") | |
502 >>> for values in BgzfBlocks(handle): | |
503 ... print("Raw start %i, raw length %i; data start %i, data length %i" % values) | |
504 Raw start 0, raw length 18239; data start 0, data length 65536 | |
505 Raw start 18239, raw length 18223; data start 65536, data length 65536 | |
506 Raw start 36462, raw length 18017; data start 131072, data length 65536 | |
507 Raw start 54479, raw length 17342; data start 196608, data length 65536 | |
508 Raw start 71821, raw length 17715; data start 262144, data length 65536 | |
509 Raw start 89536, raw length 17728; data start 327680, data length 65536 | |
510 Raw start 107264, raw length 17292; data start 393216, data length 63398 | |
511 Raw start 124556, raw length 28; data start 456614, data length 0 | |
512 >>> handle.close() | |
513 | |
514 Now let's see how to use this block information to jump to | |
515 specific parts of the decompressed BAM file: | |
516 | |
517 >>> handle = BgzfReader("SamBam/ex1.bam", "rb") | |
518 >>> assert 0 == handle.tell() | |
519 >>> magic = handle.read(4) | |
520 >>> assert 4 == handle.tell() | |
521 | |
522 So far nothing so strange, we got the magic marker used at the | |
523 start of a decompressed BAM file, and the handle position makes | |
524 sense. Now however, let's jump to the end of this block and 4 | |
525 bytes into the next block by reading 65536 bytes, | |
526 | |
527 >>> data = handle.read(65536) | |
528 >>> len(data) | |
529 65536 | |
530 >>> assert 1195311108 == handle.tell() | |
531 | |
532 Expecting 4 + 65536 = 65540 were you? Well this is a BGZF 64-bit | |
533 virtual offset, which means: | |
534 | |
535 >>> split_virtual_offset(1195311108) | |
536 (18239, 4) | |
537 | |
538 You should spot 18239 as the start of the second BGZF block, while | |
539 the 4 is the offset into this block. See also make_virtual_offset, | |
540 | |
541 >>> make_virtual_offset(18239, 4) | |
542 1195311108 | |
543 | |
544 Let's jump back to almost the start of the file, | |
545 | |
546 >>> make_virtual_offset(0, 2) | |
547 2 | |
548 >>> handle.seek(2) | |
549 2 | |
550 >>> handle.close() | |
551 | |
552 Note that you can use the max_cache argument to limit the number of | |
553 BGZF blocks cached in memory. The default is 100, and since each | |
554 block can be up to 64kb, the default cache could take up to 6MB of | |
555 RAM. The cache is not important for reading through the file in one | |
556 pass, but is important for improving performance of random access. | |
557 """ | |
558 | |
559 def __init__(self, filename=None, mode="r", fileobj=None, max_cache=100): | |
560 r"""Initialize the class for reading a BGZF file. | |
561 | |
562 You would typically use the top level ``bgzf.open(...)`` function | |
563 which will call this class internally. Direct use is discouraged. | |
564 | |
565 Either the ``filename`` (string) or ``fileobj`` (input file object in | |
566 binary mode) arguments must be supplied, but not both. | |
567 | |
568 Argument ``mode`` controls if the data will be returned as strings in | |
569 text mode ("rt", "tr", or default "r"), or bytes binary mode ("rb" | |
570 or "br"). The argument name matches the built-in ``open(...)`` and | |
571 standard library ``gzip.open(...)`` function. | |
572 | |
573 If text mode is requested, in order to avoid multi-byte characters, | |
574 this is hard coded to use the "latin1" encoding, and "\r" and "\n" | |
575 are passed as is (without implementing universal new line mode). There | |
576 is no ``encoding`` argument. | |
577 | |
578 If your data is in UTF-8 or any other incompatible encoding, you must | |
579 use binary mode, and decode the appropriate fragments yourself. | |
580 | |
581 Argument ``max_cache`` controls the maximum number of BGZF blocks to | |
582 cache in memory. Each can be up to 64kb thus the default of 100 blocks | |
583 could take up to 6MB of RAM. This is important for efficient random | |
584 access, a small value is fine for reading the file in one pass. | |
585 """ | |
586 # TODO - Assuming we can seek, check for 28 bytes EOF empty block | |
587 # and if missing warn about possible truncation (as in samtools)? | |
588 if max_cache < 1: | |
589 raise ValueError("Use max_cache with a minimum of 1") | |
590 # Must open the BGZF file in binary mode, but we may want to | |
591 # treat the contents as either text or binary (unicode or | |
592 # bytes under Python 3) | |
593 if filename and fileobj: | |
594 raise ValueError("Supply either filename or fileobj, not both") | |
595 # Want to reject output modes like w, a, x, + | |
596 if mode.lower() not in ("r", "tr", "rt", "rb", "br"): | |
597 raise ValueError( | |
598 "Must use a read mode like 'r' (default), 'rt', or 'rb' for binary" | |
599 ) | |
600 # If an open file was passed, make sure it was opened in binary mode. | |
601 if fileobj: | |
602 if fileobj.read(0) != b"": | |
603 raise ValueError("fileobj not opened in binary mode") | |
604 handle = fileobj | |
605 else: | |
606 handle = _open(filename, "rb") | |
607 self._text = "b" not in mode.lower() | |
608 if self._text: | |
609 self._newline = "\n" | |
610 else: | |
611 self._newline = b"\n" | |
612 self._handle = handle | |
613 self.max_cache = max_cache | |
614 self._buffers = {} | |
615 self._block_start_offset = None | |
616 self._block_raw_length = None | |
617 self._load_block(handle.tell()) | |
618 | |
619 def _load_block(self, start_offset=None): | |
620 if start_offset is None: | |
621 # If the file is being read sequentially, then _handle.tell() | |
622 # should be pointing at the start of the next block. | |
623 # However, if seek has been used, we can't assume that. | |
624 start_offset = self._block_start_offset + self._block_raw_length | |
625 if start_offset == self._block_start_offset: | |
626 self._within_block_offset = 0 | |
627 return | |
628 elif start_offset in self._buffers: | |
629 # Already in cache | |
630 self._buffer, self._block_raw_length = self._buffers[start_offset] | |
631 self._within_block_offset = 0 | |
632 self._block_start_offset = start_offset | |
633 return | |
634 # Must hit the disk... first check cache limits, | |
635 while len(self._buffers) >= self.max_cache: | |
636 # TODO - Implement LRU cache removal? | |
637 self._buffers.popitem() | |
638 # Now load the block | |
639 handle = self._handle | |
640 if start_offset is not None: | |
641 handle.seek(start_offset) | |
642 self._block_start_offset = handle.tell() | |
643 try: | |
644 block_size, self._buffer = _load_bgzf_block(handle, self._text) | |
645 except StopIteration: | |
646 # EOF | |
647 block_size = 0 | |
648 if self._text: | |
649 self._buffer = "" | |
650 else: | |
651 self._buffer = b"" | |
652 self._within_block_offset = 0 | |
653 self._block_raw_length = block_size | |
654 # Finally save the block in our cache, | |
655 self._buffers[self._block_start_offset] = self._buffer, block_size | |
656 | |
657 def tell(self): | |
658 """Return a 64-bit unsigned BGZF virtual offset.""" | |
659 if 0 < self._within_block_offset and self._within_block_offset == len( | |
660 self._buffer | |
661 ): | |
662 # Special case where we're right at the end of a (non empty) block. | |
663 # For non-maximal blocks could give two possible virtual offsets, | |
664 # but for a maximal block can't use 65536 as the within block | |
665 # offset. Therefore for consistency, use the next block and a | |
666 # within block offset of zero. | |
667 return (self._block_start_offset + self._block_raw_length) << 16 | |
668 else: | |
669 # return make_virtual_offset(self._block_start_offset, | |
670 # self._within_block_offset) | |
671 # TODO - Include bounds checking as in make_virtual_offset? | |
672 return (self._block_start_offset << 16) | self._within_block_offset | |
673 | |
674 def seek(self, virtual_offset): | |
675 """Seek to a 64-bit unsigned BGZF virtual offset.""" | |
676 # Do this inline to avoid a function call, | |
677 # start_offset, within_block = split_virtual_offset(virtual_offset) | |
678 start_offset = virtual_offset >> 16 | |
679 within_block = virtual_offset ^ (start_offset << 16) | |
680 if start_offset != self._block_start_offset: | |
681 # Don't need to load the block if already there | |
682 # (this avoids a function call since _load_block would do nothing) | |
683 self._load_block(start_offset) | |
684 if start_offset != self._block_start_offset: | |
685 raise ValueError("start_offset not loaded correctly") | |
686 if within_block > len(self._buffer): | |
687 if not (within_block == 0 and len(self._buffer) == 0): | |
688 raise ValueError( | |
689 "Within offset %i but block size only %i" | |
690 % (within_block, len(self._buffer)) | |
691 ) | |
692 self._within_block_offset = within_block | |
693 # assert virtual_offset == self.tell(), \ | |
694 # "Did seek to %i (%i, %i), but tell says %i (%i, %i)" \ | |
695 # % (virtual_offset, start_offset, within_block, | |
696 # self.tell(), self._block_start_offset, | |
697 # self._within_block_offset) | |
698 return virtual_offset | |
699 | |
700 def read(self, size=-1): | |
701 """Read method for the BGZF module.""" | |
702 if size < 0: | |
703 raise NotImplementedError("Don't be greedy, that could be massive!") | |
704 | |
705 result = "" if self._text else b"" | |
706 while size and self._block_raw_length: | |
707 if self._within_block_offset + size <= len(self._buffer): | |
708 # This may leave us right at the end of a block | |
709 # (lazy loading, don't load the next block unless we have too) | |
710 data = self._buffer[ | |
711 self._within_block_offset : self._within_block_offset + size | |
712 ] | |
713 self._within_block_offset += size | |
714 if not data: | |
715 raise ValueError("Must be at least 1 byte") | |
716 result += data | |
717 break | |
718 else: | |
719 data = self._buffer[self._within_block_offset :] | |
720 size -= len(data) | |
721 self._load_block() # will reset offsets | |
722 result += data | |
723 | |
724 return result | |
725 | |
726 def readline(self): | |
727 """Read a single line for the BGZF file.""" | |
728 result = "" if self._text else b"" | |
729 while self._block_raw_length: | |
730 i = self._buffer.find(self._newline, self._within_block_offset) | |
731 # Three cases to consider, | |
732 if i == -1: | |
733 # No newline, need to read in more data | |
734 data = self._buffer[self._within_block_offset :] | |
735 self._load_block() # will reset offsets | |
736 result += data | |
737 elif i + 1 == len(self._buffer): | |
738 # Found new line, but right at end of block (SPECIAL) | |
739 data = self._buffer[self._within_block_offset :] | |
740 # Must now load the next block to ensure tell() works | |
741 self._load_block() # will reset offsets | |
742 if not data: | |
743 raise ValueError("Must be at least 1 byte") | |
744 result += data | |
745 break | |
746 else: | |
747 # Found new line, not at end of block (easy case, no IO) | |
748 data = self._buffer[self._within_block_offset : i + 1] | |
749 self._within_block_offset = i + 1 | |
750 # assert data.endswith(self._newline) | |
751 result += data | |
752 break | |
753 | |
754 return result | |
755 | |
756 def __next__(self): | |
757 """Return the next line.""" | |
758 line = self.readline() | |
759 if not line: | |
760 raise StopIteration | |
761 return line | |
762 | |
763 def __iter__(self): | |
764 """Iterate over the lines in the BGZF file.""" | |
765 return self | |
766 | |
767 def close(self): | |
768 """Close BGZF file.""" | |
769 self._handle.close() | |
770 self._buffer = None | |
771 self._block_start_offset = None | |
772 self._buffers = None | |
773 | |
774 def seekable(self): | |
775 """Return True indicating the BGZF supports random access.""" | |
776 return True | |
777 | |
778 def isatty(self): | |
779 """Return True if connected to a TTY device.""" | |
780 return False | |
781 | |
782 def fileno(self): | |
783 """Return integer file descriptor.""" | |
784 return self._handle.fileno() | |
785 | |
786 def __enter__(self): | |
787 """Open a file operable with WITH statement.""" | |
788 return self | |
789 | |
790 def __exit__(self, type, value, traceback): | |
791 """Close a file with WITH statement.""" | |
792 self.close() | |
793 | |
794 | |
795 class BgzfWriter: | |
796 """Define a BGZFWriter object.""" | |
797 | |
798 def __init__(self, filename=None, mode="w", fileobj=None, compresslevel=6): | |
799 """Initilize the class.""" | |
800 if filename and fileobj: | |
801 raise ValueError("Supply either filename or fileobj, not both") | |
802 # If an open file was passed, make sure it was opened in binary mode. | |
803 if fileobj: | |
804 if fileobj.read(0) != b"": | |
805 raise ValueError("fileobj not opened in binary mode") | |
806 handle = fileobj | |
807 else: | |
808 if "w" not in mode.lower() and "a" not in mode.lower(): | |
809 raise ValueError(f"Must use write or append mode, not {mode!r}") | |
810 if "a" in mode.lower(): | |
811 handle = _open(filename, "ab") | |
812 else: | |
813 handle = _open(filename, "wb") | |
814 self._text = "b" not in mode.lower() | |
815 self._handle = handle | |
816 self._buffer = b"" | |
817 self.compresslevel = compresslevel | |
818 | |
819 def _write_block(self, block): | |
820 """Write provided data to file as a single BGZF compressed block (PRIVATE).""" | |
821 # print("Saving %i bytes" % len(block)) | |
822 if len(block) > 65536: | |
823 raise ValueError(f"{len(block)} Block length > 65536") | |
824 # Giving a negative window bits means no gzip/zlib headers, | |
825 # -15 used in samtools | |
826 c = zlib.compressobj( | |
827 self.compresslevel, zlib.DEFLATED, -15, zlib.DEF_MEM_LEVEL, 0 | |
828 ) | |
829 compressed = c.compress(block) + c.flush() | |
830 del c | |
831 if len(compressed) > 65536: | |
832 raise RuntimeError( | |
833 "TODO - Didn't compress enough, try less data in this block" | |
834 ) | |
835 crc = zlib.crc32(block) | |
836 # Should cope with a mix of Python platforms... | |
837 if crc < 0: | |
838 crc = struct.pack("<i", crc) | |
839 else: | |
840 crc = struct.pack("<I", crc) | |
841 bsize = struct.pack("<H", len(compressed) + 25) # includes -1 | |
842 crc = struct.pack("<I", zlib.crc32(block) & 0xFFFFFFFF) | |
843 uncompressed_length = struct.pack("<I", len(block)) | |
844 # Fixed 16 bytes, | |
845 # gzip magic bytes (4) mod time (4), | |
846 # gzip flag (1), os (1), extra length which is six (2), | |
847 # sub field which is BC (2), sub field length of two (2), | |
848 # Variable data, | |
849 # 2 bytes: block length as BC sub field (2) | |
850 # X bytes: the data | |
851 # 8 bytes: crc (4), uncompressed data length (4) | |
852 data = _bgzf_header + bsize + compressed + crc + uncompressed_length | |
853 self._handle.write(data) | |
854 | |
855 def write(self, data): | |
856 """Write method for the class.""" | |
857 # TODO - Check bytes vs unicode | |
858 if isinstance(data, str): | |
859 # When reading we can't cope with multi-byte characters | |
860 # being split between BGZF blocks, so we restrict to a | |
861 # single byte encoding - like ASCII or latin-1. | |
862 # On output we could probably allow any encoding, as we | |
863 # don't care about splitting unicode characters between blocks | |
864 data = data.encode("latin-1") | |
865 # block_size = 2**16 = 65536 | |
866 data_len = len(data) | |
867 if len(self._buffer) + data_len < 65536: | |
868 # print("Cached %r" % data) | |
869 self._buffer += data | |
870 else: | |
871 # print("Got %r, writing out some data..." % data) | |
872 self._buffer += data | |
873 while len(self._buffer) >= 65536: | |
874 self._write_block(self._buffer[:65536]) | |
875 self._buffer = self._buffer[65536:] | |
876 | |
877 def flush(self): | |
878 """Flush data explicitally.""" | |
879 while len(self._buffer) >= 65536: | |
880 self._write_block(self._buffer[:65535]) | |
881 self._buffer = self._buffer[65535:] | |
882 self._write_block(self._buffer) | |
883 self._buffer = b"" | |
884 self._handle.flush() | |
885 | |
886 def close(self): | |
887 """Flush data, write 28 bytes BGZF EOF marker, and close BGZF file. | |
888 | |
889 samtools will look for a magic EOF marker, just a 28 byte empty BGZF | |
890 block, and if it is missing warns the BAM file may be truncated. In | |
891 addition to samtools writing this block, so too does bgzip - so this | |
892 implementation does too. | |
893 """ | |
894 if self._buffer: | |
895 self.flush() | |
896 self._handle.write(_bgzf_eof) | |
897 self._handle.flush() | |
898 self._handle.close() | |
899 | |
900 def tell(self): | |
901 """Return a BGZF 64-bit virtual offset.""" | |
902 return make_virtual_offset(self._handle.tell(), len(self._buffer)) | |
903 | |
904 def seekable(self): | |
905 """Return True indicating the BGZF supports random access.""" | |
906 # Not seekable, but we do support tell... | |
907 return False | |
908 | |
909 def isatty(self): | |
910 """Return True if connected to a TTY device.""" | |
911 return False | |
912 | |
913 def fileno(self): | |
914 """Return integer file descriptor.""" | |
915 return self._handle.fileno() | |
916 | |
917 def __enter__(self): | |
918 """Open a file operable with WITH statement.""" | |
919 return self | |
920 | |
921 def __exit__(self, type, value, traceback): | |
922 """Close a file with WITH statement.""" | |
923 self.close() | |
924 | |
925 | |
926 if __name__ == "__main__": | |
927 if len(sys.argv) > 1: | |
928 print("Call this with no arguments and pipe uncompressed data in on stdin") | |
929 print("and it will produce BGZF compressed data on stdout. e.g.") | |
930 print("") | |
931 print("./bgzf.py < example.fastq > example.fastq.bgz") | |
932 print("") | |
933 print("The extension convention of *.bgz is to distinugish these from *.gz") | |
934 print("used for standard gzipped files without the block structure of BGZF.") | |
935 print("You can use the standard gunzip command to decompress BGZF files,") | |
936 print("if it complains about the extension try something like this:") | |
937 print("") | |
938 print("cat example.fastq.bgz | gunzip > example.fastq") | |
939 print("") | |
940 print("See also the tool bgzip that comes with samtools") | |
941 sys.exit(0) | |
942 | |
943 # Ensure we have binary mode handles | |
944 # (leave stderr as default text mode) | |
945 stdin = sys.stdin.buffer | |
946 stdout = sys.stdout.buffer | |
947 | |
948 sys.stderr.write("Producing BGZF output from stdin...\n") | |
949 w = BgzfWriter(fileobj=stdout) | |
950 while True: | |
951 data = stdin.read(65536) | |
952 w.write(data) | |
953 if not data: | |
954 break | |
955 # Doing close will write an empty BGZF block as EOF marker: | |
956 w.close() | |
957 sys.stderr.write("BGZF data produced\n") |