jpayne@69
|
1 #!/usr/bin/env python
|
jpayne@69
|
2 # Copyright 2010-2018 by Peter Cock.
|
jpayne@69
|
3 # All rights reserved.
|
jpayne@69
|
4 #
|
jpayne@69
|
5 # This file is part of the Biopython distribution and governed by your
|
jpayne@69
|
6 # choice of the "Biopython License Agreement" or the "BSD 3-Clause License".
|
jpayne@69
|
7 # Please see the LICENSE file that should have been included as part of this
|
jpayne@69
|
8 # package.
|
jpayne@69
|
9 r"""Read and write BGZF compressed files (the GZIP variant used in BAM).
|
jpayne@69
|
10
|
jpayne@69
|
11 The SAM/BAM file format (Sequence Alignment/Map) comes in a plain text
|
jpayne@69
|
12 format (SAM), and a compressed binary format (BAM). The latter uses a
|
jpayne@69
|
13 modified form of gzip compression called BGZF (Blocked GNU Zip Format),
|
jpayne@69
|
14 which can be applied to any file format to provide compression with
|
jpayne@69
|
15 efficient random access. BGZF is described together with the SAM/BAM
|
jpayne@69
|
16 file format at https://samtools.sourceforge.net/SAM1.pdf
|
jpayne@69
|
17
|
jpayne@69
|
18 Please read the text below about 'virtual offsets' before using BGZF
|
jpayne@69
|
19 files for random access.
|
jpayne@69
|
20
|
jpayne@69
|
21 Aim of this module
|
jpayne@69
|
22 ------------------
|
jpayne@69
|
23 The Python gzip library can be used to read BGZF files, since for
|
jpayne@69
|
24 decompression they are just (specialised) gzip files. What this
|
jpayne@69
|
25 module aims to facilitate is random access to BGZF files (using the
|
jpayne@69
|
26 'virtual offset' idea), and writing BGZF files (which means using
|
jpayne@69
|
27 suitably sized gzip blocks and writing the extra 'BC' field in the
|
jpayne@69
|
28 gzip headers). As in the gzip library, the zlib library is used
|
jpayne@69
|
29 internally.
|
jpayne@69
|
30
|
jpayne@69
|
31 In addition to being required for random access to and writing of
|
jpayne@69
|
32 BAM files, the BGZF format can also be used on other sequential
|
jpayne@69
|
33 data (in the sense of one record after another), such as most of
|
jpayne@69
|
34 the sequence data formats supported in Bio.SeqIO (like FASTA,
|
jpayne@69
|
35 FASTQ, GenBank, etc) or large MAF alignments.
|
jpayne@69
|
36
|
jpayne@69
|
37 The Bio.SeqIO indexing functions use this module to support BGZF files.
|
jpayne@69
|
38
|
jpayne@69
|
39 Technical Introduction to BGZF
|
jpayne@69
|
40 ------------------------------
|
jpayne@69
|
41 The gzip file format allows multiple compressed blocks, each of which
|
jpayne@69
|
42 could be a stand alone gzip file. As an interesting bonus, this means
|
jpayne@69
|
43 you can use Unix ``cat`` to combine two or more gzip files into one by
|
jpayne@69
|
44 concatenating them. Also, each block can have one of several compression
|
jpayne@69
|
45 levels (including uncompressed, which actually takes up a little bit
|
jpayne@69
|
46 more space due to the gzip header).
|
jpayne@69
|
47
|
jpayne@69
|
48 What the BAM designers realised was that while random access to data
|
jpayne@69
|
49 stored in traditional gzip files was slow, breaking the file into
|
jpayne@69
|
50 gzip blocks would allow fast random access to each block. To access
|
jpayne@69
|
51 a particular piece of the decompressed data, you just need to know
|
jpayne@69
|
52 which block it starts in (the offset of the gzip block start), and
|
jpayne@69
|
53 how far into the (decompressed) contents of the block you need to
|
jpayne@69
|
54 read.
|
jpayne@69
|
55
|
jpayne@69
|
56 One problem with this is finding the gzip block sizes efficiently.
|
jpayne@69
|
57 You can do it with a standard gzip file, but it requires every block
|
jpayne@69
|
58 to be decompressed -- and that would be rather slow. Additionally
|
jpayne@69
|
59 typical gzip files may use very large blocks.
|
jpayne@69
|
60
|
jpayne@69
|
61 All that differs in BGZF is that compressed size of each gzip block
|
jpayne@69
|
62 is limited to 2^16 bytes, and an extra 'BC' field in the gzip header
|
jpayne@69
|
63 records this size. Traditional decompression tools can ignore this,
|
jpayne@69
|
64 and unzip the file just like any other gzip file.
|
jpayne@69
|
65
|
jpayne@69
|
66 The point of this is you can look at the first BGZF block, find out
|
jpayne@69
|
67 how big it is from this 'BC' header, and thus seek immediately to
|
jpayne@69
|
68 the second block, and so on.
|
jpayne@69
|
69
|
jpayne@69
|
70 The BAM indexing scheme records read positions using a 64 bit
|
jpayne@69
|
71 'virtual offset', comprising ``coffset << 16 | uoffset``, where ``coffset``
|
jpayne@69
|
72 is the file offset of the BGZF block containing the start of the read
|
jpayne@69
|
73 (unsigned integer using up to 64-16 = 48 bits), and ``uoffset`` is the
|
jpayne@69
|
74 offset within the (decompressed) block (unsigned 16 bit integer).
|
jpayne@69
|
75
|
jpayne@69
|
76 This limits you to BAM files where the last block starts by 2^48
|
jpayne@69
|
77 bytes, or 256 petabytes, and the decompressed size of each block
|
jpayne@69
|
78 is at most 2^16 bytes, or 64kb. Note that this matches the BGZF
|
jpayne@69
|
79 'BC' field size which limits the compressed size of each block to
|
jpayne@69
|
80 2^16 bytes, allowing for BAM files to use BGZF with no gzip
|
jpayne@69
|
81 compression (useful for intermediate files in memory to reduce
|
jpayne@69
|
82 CPU load).
|
jpayne@69
|
83
|
jpayne@69
|
84 Warning about namespaces
|
jpayne@69
|
85 ------------------------
|
jpayne@69
|
86 It is considered a bad idea to use "from XXX import ``*``" in Python, because
|
jpayne@69
|
87 it pollutes the namespace. This is a real issue with Bio.bgzf (and the
|
jpayne@69
|
88 standard Python library gzip) because they contain a function called open
|
jpayne@69
|
89 i.e. Suppose you do this:
|
jpayne@69
|
90
|
jpayne@69
|
91 >>> from Bio.bgzf import *
|
jpayne@69
|
92 >>> print(open.__module__)
|
jpayne@69
|
93 Bio.bgzf
|
jpayne@69
|
94
|
jpayne@69
|
95 Or,
|
jpayne@69
|
96
|
jpayne@69
|
97 >>> from gzip import *
|
jpayne@69
|
98 >>> print(open.__module__)
|
jpayne@69
|
99 gzip
|
jpayne@69
|
100
|
jpayne@69
|
101 Notice that the open function has been replaced. You can "fix" this if you
|
jpayne@69
|
102 need to by importing the built-in open function:
|
jpayne@69
|
103
|
jpayne@69
|
104 >>> from builtins import open
|
jpayne@69
|
105
|
jpayne@69
|
106 However, what we recommend instead is to use the explicit namespace, e.g.
|
jpayne@69
|
107
|
jpayne@69
|
108 >>> from Bio import bgzf
|
jpayne@69
|
109 >>> print(bgzf.open.__module__)
|
jpayne@69
|
110 Bio.bgzf
|
jpayne@69
|
111
|
jpayne@69
|
112
|
jpayne@69
|
113 Examples
|
jpayne@69
|
114 --------
|
jpayne@69
|
115 This is an ordinary GenBank file compressed using BGZF, so it can
|
jpayne@69
|
116 be decompressed using gzip,
|
jpayne@69
|
117
|
jpayne@69
|
118 >>> import gzip
|
jpayne@69
|
119 >>> handle = gzip.open("GenBank/NC_000932.gb.bgz", "r")
|
jpayne@69
|
120 >>> assert 0 == handle.tell()
|
jpayne@69
|
121 >>> line = handle.readline()
|
jpayne@69
|
122 >>> assert 80 == handle.tell()
|
jpayne@69
|
123 >>> line = handle.readline()
|
jpayne@69
|
124 >>> assert 143 == handle.tell()
|
jpayne@69
|
125 >>> data = handle.read(70000)
|
jpayne@69
|
126 >>> assert 70143 == handle.tell()
|
jpayne@69
|
127 >>> handle.close()
|
jpayne@69
|
128
|
jpayne@69
|
129 We can also access the file using the BGZF reader - but pay
|
jpayne@69
|
130 attention to the file offsets which will be explained below:
|
jpayne@69
|
131
|
jpayne@69
|
132 >>> handle = BgzfReader("GenBank/NC_000932.gb.bgz", "r")
|
jpayne@69
|
133 >>> assert 0 == handle.tell()
|
jpayne@69
|
134 >>> print(handle.readline().rstrip())
|
jpayne@69
|
135 LOCUS NC_000932 154478 bp DNA circular PLN 15-APR-2009
|
jpayne@69
|
136 >>> assert 80 == handle.tell()
|
jpayne@69
|
137 >>> print(handle.readline().rstrip())
|
jpayne@69
|
138 DEFINITION Arabidopsis thaliana chloroplast, complete genome.
|
jpayne@69
|
139 >>> assert 143 == handle.tell()
|
jpayne@69
|
140 >>> data = handle.read(70000)
|
jpayne@69
|
141 >>> assert 987828735 == handle.tell()
|
jpayne@69
|
142 >>> print(handle.readline().rstrip())
|
jpayne@69
|
143 f="GeneID:844718"
|
jpayne@69
|
144 >>> print(handle.readline().rstrip())
|
jpayne@69
|
145 CDS complement(join(84337..84771,85454..85843))
|
jpayne@69
|
146 >>> offset = handle.seek(make_virtual_offset(55074, 126))
|
jpayne@69
|
147 >>> print(handle.readline().rstrip())
|
jpayne@69
|
148 68521 tatgtcattc gaaattgtat aaagacaact cctatttaat agagctattt gtgcaagtat
|
jpayne@69
|
149 >>> handle.close()
|
jpayne@69
|
150
|
jpayne@69
|
151 Notice the handle's offset looks different as a BGZF file. This
|
jpayne@69
|
152 brings us to the key point about BGZF, which is the block structure:
|
jpayne@69
|
153
|
jpayne@69
|
154 >>> handle = open("GenBank/NC_000932.gb.bgz", "rb")
|
jpayne@69
|
155 >>> for values in BgzfBlocks(handle):
|
jpayne@69
|
156 ... print("Raw start %i, raw length %i; data start %i, data length %i" % values)
|
jpayne@69
|
157 Raw start 0, raw length 15073; data start 0, data length 65536
|
jpayne@69
|
158 Raw start 15073, raw length 17857; data start 65536, data length 65536
|
jpayne@69
|
159 Raw start 32930, raw length 22144; data start 131072, data length 65536
|
jpayne@69
|
160 Raw start 55074, raw length 22230; data start 196608, data length 65536
|
jpayne@69
|
161 Raw start 77304, raw length 14939; data start 262144, data length 43478
|
jpayne@69
|
162 Raw start 92243, raw length 28; data start 305622, data length 0
|
jpayne@69
|
163 >>> handle.close()
|
jpayne@69
|
164
|
jpayne@69
|
165 In this example the first three blocks are 'full' and hold 65536 bytes
|
jpayne@69
|
166 of uncompressed data. The fourth block isn't full and holds 43478 bytes.
|
jpayne@69
|
167 Finally there is a special empty fifth block which takes 28 bytes on
|
jpayne@69
|
168 disk and serves as an 'end of file' (EOF) marker. If this is missing,
|
jpayne@69
|
169 it is possible your BGZF file is incomplete.
|
jpayne@69
|
170
|
jpayne@69
|
171 By reading ahead 70,000 bytes we moved into the second BGZF block,
|
jpayne@69
|
172 and at that point the BGZF virtual offsets start to look different
|
jpayne@69
|
173 to a simple offset into the decompressed data as exposed by the gzip
|
jpayne@69
|
174 library.
|
jpayne@69
|
175
|
jpayne@69
|
176 As an example, consider seeking to the decompressed position 196734.
|
jpayne@69
|
177 Since 196734 = 65536 + 65536 + 65536 + 126 = 65536*3 + 126, this
|
jpayne@69
|
178 is equivalent to jumping the first three blocks (which in this
|
jpayne@69
|
179 specific example are all size 65536 after decompression - which
|
jpayne@69
|
180 does not always hold) and starting at byte 126 of the fourth block
|
jpayne@69
|
181 (after decompression). For BGZF, we need to know the fourth block's
|
jpayne@69
|
182 offset of 55074 and the offset within the block of 126 to get the
|
jpayne@69
|
183 BGZF virtual offset.
|
jpayne@69
|
184
|
jpayne@69
|
185 >>> print(55074 << 16 | 126)
|
jpayne@69
|
186 3609329790
|
jpayne@69
|
187 >>> print(bgzf.make_virtual_offset(55074, 126))
|
jpayne@69
|
188 3609329790
|
jpayne@69
|
189
|
jpayne@69
|
190 Thus for this BGZF file, decompressed position 196734 corresponds
|
jpayne@69
|
191 to the virtual offset 3609329790. However, another BGZF file with
|
jpayne@69
|
192 different contents would have compressed more or less efficiently,
|
jpayne@69
|
193 so the compressed blocks would be different sizes. What this means
|
jpayne@69
|
194 is the mapping between the uncompressed offset and the compressed
|
jpayne@69
|
195 virtual offset depends on the BGZF file you are using.
|
jpayne@69
|
196
|
jpayne@69
|
197 If you are accessing a BGZF file via this module, just use the
|
jpayne@69
|
198 handle.tell() method to note the virtual offset of a position you
|
jpayne@69
|
199 may later want to return to using handle.seek().
|
jpayne@69
|
200
|
jpayne@69
|
201 The catch with BGZF virtual offsets is while they can be compared
|
jpayne@69
|
202 (which offset comes first in the file), you cannot safely subtract
|
jpayne@69
|
203 them to get the size of the data between them, nor add/subtract
|
jpayne@69
|
204 a relative offset.
|
jpayne@69
|
205
|
jpayne@69
|
206 Of course you can parse this file with Bio.SeqIO using BgzfReader,
|
jpayne@69
|
207 although there isn't any benefit over using gzip.open(...), unless
|
jpayne@69
|
208 you want to index BGZF compressed sequence files:
|
jpayne@69
|
209
|
jpayne@69
|
210 >>> from Bio import SeqIO
|
jpayne@69
|
211 >>> handle = BgzfReader("GenBank/NC_000932.gb.bgz")
|
jpayne@69
|
212 >>> record = SeqIO.read(handle, "genbank")
|
jpayne@69
|
213 >>> handle.close()
|
jpayne@69
|
214 >>> print(record.id)
|
jpayne@69
|
215 NC_000932.1
|
jpayne@69
|
216
|
jpayne@69
|
217 Text Mode
|
jpayne@69
|
218 ---------
|
jpayne@69
|
219
|
jpayne@69
|
220 Like the standard library gzip.open(...), the BGZF code defaults to opening
|
jpayne@69
|
221 files in binary mode.
|
jpayne@69
|
222
|
jpayne@69
|
223 You can request the file be opened in text mode, but beware that this is hard
|
jpayne@69
|
224 coded to the simple "latin1" (aka "iso-8859-1") encoding (which includes all
|
jpayne@69
|
225 the ASCII characters), which works well with most Western European languages.
|
jpayne@69
|
226 However, it is not fully compatible with the more widely used UTF-8 encoding.
|
jpayne@69
|
227
|
jpayne@69
|
228 In variable width encodings like UTF-8, some single characters in the unicode
|
jpayne@69
|
229 text output are represented by multiple bytes in the raw binary form. This is
|
jpayne@69
|
230 problematic with BGZF, as we cannot always decode each block in isolation - a
|
jpayne@69
|
231 single unicode character could be split over two blocks. This can even happen
|
jpayne@69
|
232 with fixed width unicode encodings, as the BGZF block size is not fixed.
|
jpayne@69
|
233
|
jpayne@69
|
234 Therefore, this module is currently restricted to only support single byte
|
jpayne@69
|
235 unicode encodings, such as ASCII, "latin1" (which is a superset of ASCII), or
|
jpayne@69
|
236 potentially other character maps (not implemented).
|
jpayne@69
|
237
|
jpayne@69
|
238 Furthermore, unlike the default text mode on Python 3, we do not attempt to
|
jpayne@69
|
239 implement universal new line mode. This transforms the various operating system
|
jpayne@69
|
240 new line conventions like Windows (CR LF or "\r\n"), Unix (just LF, "\n"), or
|
jpayne@69
|
241 old Macs (just CR, "\r"), into just LF ("\n"). Here we have the same problem -
|
jpayne@69
|
242 is "\r" at the end of a block an incomplete Windows style new line?
|
jpayne@69
|
243
|
jpayne@69
|
244 Instead, you will get the CR ("\r") and LF ("\n") characters as is.
|
jpayne@69
|
245
|
jpayne@69
|
246 If your data is in UTF-8 or any other incompatible encoding, you must use
|
jpayne@69
|
247 binary mode, and decode the appropriate fragments yourself.
|
jpayne@69
|
248 """
|
jpayne@69
|
249
|
jpayne@69
|
250 import struct
|
jpayne@69
|
251 import sys
|
jpayne@69
|
252 import zlib
|
jpayne@69
|
253
|
jpayne@69
|
254 from builtins import open as _open
|
jpayne@69
|
255
|
jpayne@69
|
256 _bgzf_magic = b"\x1f\x8b\x08\x04"
|
jpayne@69
|
257 _bgzf_header = b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00\x42\x43\x02\x00"
|
jpayne@69
|
258 _bgzf_eof = b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00BC\x02\x00\x1b\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00"
|
jpayne@69
|
259 _bytes_BC = b"BC"
|
jpayne@69
|
260
|
jpayne@69
|
261
|
jpayne@69
|
262 def open(filename, mode="rb"):
|
jpayne@69
|
263 r"""Open a BGZF file for reading, writing or appending.
|
jpayne@69
|
264
|
jpayne@69
|
265 If text mode is requested, in order to avoid multi-byte characters, this is
|
jpayne@69
|
266 hard coded to use the "latin1" encoding, and "\r" and "\n" are passed as is
|
jpayne@69
|
267 (without implementing universal new line mode).
|
jpayne@69
|
268
|
jpayne@69
|
269 If your data is in UTF-8 or any other incompatible encoding, you must use
|
jpayne@69
|
270 binary mode, and decode the appropriate fragments yourself.
|
jpayne@69
|
271 """
|
jpayne@69
|
272 if "r" in mode.lower():
|
jpayne@69
|
273 return BgzfReader(filename, mode)
|
jpayne@69
|
274 elif "w" in mode.lower() or "a" in mode.lower():
|
jpayne@69
|
275 return BgzfWriter(filename, mode)
|
jpayne@69
|
276 else:
|
jpayne@69
|
277 raise ValueError(f"Bad mode {mode!r}")
|
jpayne@69
|
278
|
jpayne@69
|
279
|
jpayne@69
|
280 def make_virtual_offset(block_start_offset, within_block_offset):
|
jpayne@69
|
281 """Compute a BGZF virtual offset from block start and within block offsets.
|
jpayne@69
|
282
|
jpayne@69
|
283 The BAM indexing scheme records read positions using a 64 bit
|
jpayne@69
|
284 'virtual offset', comprising in C terms:
|
jpayne@69
|
285
|
jpayne@69
|
286 block_start_offset << 16 | within_block_offset
|
jpayne@69
|
287
|
jpayne@69
|
288 Here block_start_offset is the file offset of the BGZF block
|
jpayne@69
|
289 start (unsigned integer using up to 64-16 = 48 bits), and
|
jpayne@69
|
290 within_block_offset within the (decompressed) block (unsigned
|
jpayne@69
|
291 16 bit integer).
|
jpayne@69
|
292
|
jpayne@69
|
293 >>> make_virtual_offset(0, 0)
|
jpayne@69
|
294 0
|
jpayne@69
|
295 >>> make_virtual_offset(0, 1)
|
jpayne@69
|
296 1
|
jpayne@69
|
297 >>> make_virtual_offset(0, 2**16 - 1)
|
jpayne@69
|
298 65535
|
jpayne@69
|
299 >>> make_virtual_offset(0, 2**16)
|
jpayne@69
|
300 Traceback (most recent call last):
|
jpayne@69
|
301 ...
|
jpayne@69
|
302 ValueError: Require 0 <= within_block_offset < 2**16, got 65536
|
jpayne@69
|
303
|
jpayne@69
|
304 >>> 65536 == make_virtual_offset(1, 0)
|
jpayne@69
|
305 True
|
jpayne@69
|
306 >>> 65537 == make_virtual_offset(1, 1)
|
jpayne@69
|
307 True
|
jpayne@69
|
308 >>> 131071 == make_virtual_offset(1, 2**16 - 1)
|
jpayne@69
|
309 True
|
jpayne@69
|
310
|
jpayne@69
|
311 >>> 6553600000 == make_virtual_offset(100000, 0)
|
jpayne@69
|
312 True
|
jpayne@69
|
313 >>> 6553600001 == make_virtual_offset(100000, 1)
|
jpayne@69
|
314 True
|
jpayne@69
|
315 >>> 6553600010 == make_virtual_offset(100000, 10)
|
jpayne@69
|
316 True
|
jpayne@69
|
317
|
jpayne@69
|
318 >>> make_virtual_offset(2**48, 0)
|
jpayne@69
|
319 Traceback (most recent call last):
|
jpayne@69
|
320 ...
|
jpayne@69
|
321 ValueError: Require 0 <= block_start_offset < 2**48, got 281474976710656
|
jpayne@69
|
322
|
jpayne@69
|
323 """
|
jpayne@69
|
324 if within_block_offset < 0 or within_block_offset >= 65536:
|
jpayne@69
|
325 raise ValueError(
|
jpayne@69
|
326 "Require 0 <= within_block_offset < 2**16, got %i" % within_block_offset
|
jpayne@69
|
327 )
|
jpayne@69
|
328 if block_start_offset < 0 or block_start_offset >= 281474976710656:
|
jpayne@69
|
329 raise ValueError(
|
jpayne@69
|
330 "Require 0 <= block_start_offset < 2**48, got %i" % block_start_offset
|
jpayne@69
|
331 )
|
jpayne@69
|
332 return (block_start_offset << 16) | within_block_offset
|
jpayne@69
|
333
|
jpayne@69
|
334
|
jpayne@69
|
335 def split_virtual_offset(virtual_offset):
|
jpayne@69
|
336 """Divides a 64-bit BGZF virtual offset into block start & within block offsets.
|
jpayne@69
|
337
|
jpayne@69
|
338 >>> (100000, 0) == split_virtual_offset(6553600000)
|
jpayne@69
|
339 True
|
jpayne@69
|
340 >>> (100000, 10) == split_virtual_offset(6553600010)
|
jpayne@69
|
341 True
|
jpayne@69
|
342
|
jpayne@69
|
343 """
|
jpayne@69
|
344 start = virtual_offset >> 16
|
jpayne@69
|
345 return start, virtual_offset ^ (start << 16)
|
jpayne@69
|
346
|
jpayne@69
|
347
|
jpayne@69
|
348 def BgzfBlocks(handle):
|
jpayne@69
|
349 """Low level debugging function to inspect BGZF blocks.
|
jpayne@69
|
350
|
jpayne@69
|
351 Expects a BGZF compressed file opened in binary read mode using
|
jpayne@69
|
352 the builtin open function. Do not use a handle from this bgzf
|
jpayne@69
|
353 module or the gzip module's open function which will decompress
|
jpayne@69
|
354 the file.
|
jpayne@69
|
355
|
jpayne@69
|
356 Returns the block start offset (see virtual offsets), the block
|
jpayne@69
|
357 length (add these for the start of the next block), and the
|
jpayne@69
|
358 decompressed length of the blocks contents (limited to 65536 in
|
jpayne@69
|
359 BGZF), as an iterator - one tuple per BGZF block.
|
jpayne@69
|
360
|
jpayne@69
|
361 >>> from builtins import open
|
jpayne@69
|
362 >>> handle = open("SamBam/ex1.bam", "rb")
|
jpayne@69
|
363 >>> for values in BgzfBlocks(handle):
|
jpayne@69
|
364 ... print("Raw start %i, raw length %i; data start %i, data length %i" % values)
|
jpayne@69
|
365 Raw start 0, raw length 18239; data start 0, data length 65536
|
jpayne@69
|
366 Raw start 18239, raw length 18223; data start 65536, data length 65536
|
jpayne@69
|
367 Raw start 36462, raw length 18017; data start 131072, data length 65536
|
jpayne@69
|
368 Raw start 54479, raw length 17342; data start 196608, data length 65536
|
jpayne@69
|
369 Raw start 71821, raw length 17715; data start 262144, data length 65536
|
jpayne@69
|
370 Raw start 89536, raw length 17728; data start 327680, data length 65536
|
jpayne@69
|
371 Raw start 107264, raw length 17292; data start 393216, data length 63398
|
jpayne@69
|
372 Raw start 124556, raw length 28; data start 456614, data length 0
|
jpayne@69
|
373 >>> handle.close()
|
jpayne@69
|
374
|
jpayne@69
|
375 Indirectly we can tell this file came from an old version of
|
jpayne@69
|
376 samtools because all the blocks (except the final one and the
|
jpayne@69
|
377 dummy empty EOF marker block) are 65536 bytes. Later versions
|
jpayne@69
|
378 avoid splitting a read between two blocks, and give the header
|
jpayne@69
|
379 its own block (useful to speed up replacing the header). You
|
jpayne@69
|
380 can see this in ex1_refresh.bam created using samtools 0.1.18:
|
jpayne@69
|
381
|
jpayne@69
|
382 samtools view -b ex1.bam > ex1_refresh.bam
|
jpayne@69
|
383
|
jpayne@69
|
384 >>> handle = open("SamBam/ex1_refresh.bam", "rb")
|
jpayne@69
|
385 >>> for values in BgzfBlocks(handle):
|
jpayne@69
|
386 ... print("Raw start %i, raw length %i; data start %i, data length %i" % values)
|
jpayne@69
|
387 Raw start 0, raw length 53; data start 0, data length 38
|
jpayne@69
|
388 Raw start 53, raw length 18195; data start 38, data length 65434
|
jpayne@69
|
389 Raw start 18248, raw length 18190; data start 65472, data length 65409
|
jpayne@69
|
390 Raw start 36438, raw length 18004; data start 130881, data length 65483
|
jpayne@69
|
391 Raw start 54442, raw length 17353; data start 196364, data length 65519
|
jpayne@69
|
392 Raw start 71795, raw length 17708; data start 261883, data length 65411
|
jpayne@69
|
393 Raw start 89503, raw length 17709; data start 327294, data length 65466
|
jpayne@69
|
394 Raw start 107212, raw length 17390; data start 392760, data length 63854
|
jpayne@69
|
395 Raw start 124602, raw length 28; data start 456614, data length 0
|
jpayne@69
|
396 >>> handle.close()
|
jpayne@69
|
397
|
jpayne@69
|
398 The above example has no embedded SAM header (thus the first block
|
jpayne@69
|
399 is very small at just 38 bytes of decompressed data), while the next
|
jpayne@69
|
400 example does (a larger block of 103 bytes). Notice that the rest of
|
jpayne@69
|
401 the blocks show the same sizes (they contain the same read data):
|
jpayne@69
|
402
|
jpayne@69
|
403 >>> handle = open("SamBam/ex1_header.bam", "rb")
|
jpayne@69
|
404 >>> for values in BgzfBlocks(handle):
|
jpayne@69
|
405 ... print("Raw start %i, raw length %i; data start %i, data length %i" % values)
|
jpayne@69
|
406 Raw start 0, raw length 104; data start 0, data length 103
|
jpayne@69
|
407 Raw start 104, raw length 18195; data start 103, data length 65434
|
jpayne@69
|
408 Raw start 18299, raw length 18190; data start 65537, data length 65409
|
jpayne@69
|
409 Raw start 36489, raw length 18004; data start 130946, data length 65483
|
jpayne@69
|
410 Raw start 54493, raw length 17353; data start 196429, data length 65519
|
jpayne@69
|
411 Raw start 71846, raw length 17708; data start 261948, data length 65411
|
jpayne@69
|
412 Raw start 89554, raw length 17709; data start 327359, data length 65466
|
jpayne@69
|
413 Raw start 107263, raw length 17390; data start 392825, data length 63854
|
jpayne@69
|
414 Raw start 124653, raw length 28; data start 456679, data length 0
|
jpayne@69
|
415 >>> handle.close()
|
jpayne@69
|
416
|
jpayne@69
|
417 """
|
jpayne@69
|
418 if isinstance(handle, BgzfReader):
|
jpayne@69
|
419 raise TypeError("Function BgzfBlocks expects a binary handle")
|
jpayne@69
|
420 data_start = 0
|
jpayne@69
|
421 while True:
|
jpayne@69
|
422 start_offset = handle.tell()
|
jpayne@69
|
423 try:
|
jpayne@69
|
424 block_length, data = _load_bgzf_block(handle)
|
jpayne@69
|
425 except StopIteration:
|
jpayne@69
|
426 break
|
jpayne@69
|
427 data_len = len(data)
|
jpayne@69
|
428 yield start_offset, block_length, data_start, data_len
|
jpayne@69
|
429 data_start += data_len
|
jpayne@69
|
430
|
jpayne@69
|
431
|
jpayne@69
|
432 def _load_bgzf_block(handle, text_mode=False):
|
jpayne@69
|
433 """Load the next BGZF block of compressed data (PRIVATE).
|
jpayne@69
|
434
|
jpayne@69
|
435 Returns a tuple (block size and data), or at end of file
|
jpayne@69
|
436 will raise StopIteration.
|
jpayne@69
|
437 """
|
jpayne@69
|
438 magic = handle.read(4)
|
jpayne@69
|
439 if not magic:
|
jpayne@69
|
440 # End of file - should we signal this differently now?
|
jpayne@69
|
441 # See https://www.python.org/dev/peps/pep-0479/
|
jpayne@69
|
442 raise StopIteration
|
jpayne@69
|
443 if magic != _bgzf_magic:
|
jpayne@69
|
444 raise ValueError(
|
jpayne@69
|
445 r"A BGZF (e.g. a BAM file) block should start with "
|
jpayne@69
|
446 r"%r, not %r; handle.tell() now says %r"
|
jpayne@69
|
447 % (_bgzf_magic, magic, handle.tell())
|
jpayne@69
|
448 )
|
jpayne@69
|
449 gzip_mod_time, gzip_extra_flags, gzip_os, extra_len = struct.unpack(
|
jpayne@69
|
450 "<LBBH", handle.read(8)
|
jpayne@69
|
451 )
|
jpayne@69
|
452
|
jpayne@69
|
453 block_size = None
|
jpayne@69
|
454 x_len = 0
|
jpayne@69
|
455 while x_len < extra_len:
|
jpayne@69
|
456 subfield_id = handle.read(2)
|
jpayne@69
|
457 subfield_len = struct.unpack("<H", handle.read(2))[0] # uint16_t
|
jpayne@69
|
458 subfield_data = handle.read(subfield_len)
|
jpayne@69
|
459 x_len += subfield_len + 4
|
jpayne@69
|
460 if subfield_id == _bytes_BC:
|
jpayne@69
|
461 if subfield_len != 2:
|
jpayne@69
|
462 raise ValueError("Wrong BC payload length")
|
jpayne@69
|
463 if block_size is not None:
|
jpayne@69
|
464 raise ValueError("Two BC subfields?")
|
jpayne@69
|
465 block_size = struct.unpack("<H", subfield_data)[0] + 1 # uint16_t
|
jpayne@69
|
466 if x_len != extra_len:
|
jpayne@69
|
467 raise ValueError(f"x_len and extra_len differ {x_len}, {extra_len}")
|
jpayne@69
|
468 if block_size is None:
|
jpayne@69
|
469 raise ValueError("Missing BC, this isn't a BGZF file!")
|
jpayne@69
|
470 # Now comes the compressed data, CRC, and length of uncompressed data.
|
jpayne@69
|
471 deflate_size = block_size - 1 - extra_len - 19
|
jpayne@69
|
472 d = zlib.decompressobj(-15) # Negative window size means no headers
|
jpayne@69
|
473 data = d.decompress(handle.read(deflate_size)) + d.flush()
|
jpayne@69
|
474 expected_crc = handle.read(4)
|
jpayne@69
|
475 expected_size = struct.unpack("<I", handle.read(4))[0]
|
jpayne@69
|
476 if expected_size != len(data):
|
jpayne@69
|
477 raise RuntimeError("Decompressed to %i, not %i" % (len(data), expected_size))
|
jpayne@69
|
478 # Should cope with a mix of Python platforms...
|
jpayne@69
|
479 crc = zlib.crc32(data)
|
jpayne@69
|
480 if crc < 0:
|
jpayne@69
|
481 crc = struct.pack("<i", crc)
|
jpayne@69
|
482 else:
|
jpayne@69
|
483 crc = struct.pack("<I", crc)
|
jpayne@69
|
484 if expected_crc != crc:
|
jpayne@69
|
485 raise RuntimeError(f"CRC is {crc}, not {expected_crc}")
|
jpayne@69
|
486 if text_mode:
|
jpayne@69
|
487 # Note ISO-8859-1 aka Latin-1 preserves first 256 chars
|
jpayne@69
|
488 # (i.e. ASCII), but critically is a single byte encoding
|
jpayne@69
|
489 return block_size, data.decode("latin-1")
|
jpayne@69
|
490 else:
|
jpayne@69
|
491 return block_size, data
|
jpayne@69
|
492
|
jpayne@69
|
493
|
jpayne@69
|
494 class BgzfReader:
|
jpayne@69
|
495 r"""BGZF reader, acts like a read only handle but seek/tell differ.
|
jpayne@69
|
496
|
jpayne@69
|
497 Let's use the BgzfBlocks function to have a peek at the BGZF blocks
|
jpayne@69
|
498 in an example BAM file,
|
jpayne@69
|
499
|
jpayne@69
|
500 >>> from builtins import open
|
jpayne@69
|
501 >>> handle = open("SamBam/ex1.bam", "rb")
|
jpayne@69
|
502 >>> for values in BgzfBlocks(handle):
|
jpayne@69
|
503 ... print("Raw start %i, raw length %i; data start %i, data length %i" % values)
|
jpayne@69
|
504 Raw start 0, raw length 18239; data start 0, data length 65536
|
jpayne@69
|
505 Raw start 18239, raw length 18223; data start 65536, data length 65536
|
jpayne@69
|
506 Raw start 36462, raw length 18017; data start 131072, data length 65536
|
jpayne@69
|
507 Raw start 54479, raw length 17342; data start 196608, data length 65536
|
jpayne@69
|
508 Raw start 71821, raw length 17715; data start 262144, data length 65536
|
jpayne@69
|
509 Raw start 89536, raw length 17728; data start 327680, data length 65536
|
jpayne@69
|
510 Raw start 107264, raw length 17292; data start 393216, data length 63398
|
jpayne@69
|
511 Raw start 124556, raw length 28; data start 456614, data length 0
|
jpayne@69
|
512 >>> handle.close()
|
jpayne@69
|
513
|
jpayne@69
|
514 Now let's see how to use this block information to jump to
|
jpayne@69
|
515 specific parts of the decompressed BAM file:
|
jpayne@69
|
516
|
jpayne@69
|
517 >>> handle = BgzfReader("SamBam/ex1.bam", "rb")
|
jpayne@69
|
518 >>> assert 0 == handle.tell()
|
jpayne@69
|
519 >>> magic = handle.read(4)
|
jpayne@69
|
520 >>> assert 4 == handle.tell()
|
jpayne@69
|
521
|
jpayne@69
|
522 So far nothing so strange, we got the magic marker used at the
|
jpayne@69
|
523 start of a decompressed BAM file, and the handle position makes
|
jpayne@69
|
524 sense. Now however, let's jump to the end of this block and 4
|
jpayne@69
|
525 bytes into the next block by reading 65536 bytes,
|
jpayne@69
|
526
|
jpayne@69
|
527 >>> data = handle.read(65536)
|
jpayne@69
|
528 >>> len(data)
|
jpayne@69
|
529 65536
|
jpayne@69
|
530 >>> assert 1195311108 == handle.tell()
|
jpayne@69
|
531
|
jpayne@69
|
532 Expecting 4 + 65536 = 65540 were you? Well this is a BGZF 64-bit
|
jpayne@69
|
533 virtual offset, which means:
|
jpayne@69
|
534
|
jpayne@69
|
535 >>> split_virtual_offset(1195311108)
|
jpayne@69
|
536 (18239, 4)
|
jpayne@69
|
537
|
jpayne@69
|
538 You should spot 18239 as the start of the second BGZF block, while
|
jpayne@69
|
539 the 4 is the offset into this block. See also make_virtual_offset,
|
jpayne@69
|
540
|
jpayne@69
|
541 >>> make_virtual_offset(18239, 4)
|
jpayne@69
|
542 1195311108
|
jpayne@69
|
543
|
jpayne@69
|
544 Let's jump back to almost the start of the file,
|
jpayne@69
|
545
|
jpayne@69
|
546 >>> make_virtual_offset(0, 2)
|
jpayne@69
|
547 2
|
jpayne@69
|
548 >>> handle.seek(2)
|
jpayne@69
|
549 2
|
jpayne@69
|
550 >>> handle.close()
|
jpayne@69
|
551
|
jpayne@69
|
552 Note that you can use the max_cache argument to limit the number of
|
jpayne@69
|
553 BGZF blocks cached in memory. The default is 100, and since each
|
jpayne@69
|
554 block can be up to 64kb, the default cache could take up to 6MB of
|
jpayne@69
|
555 RAM. The cache is not important for reading through the file in one
|
jpayne@69
|
556 pass, but is important for improving performance of random access.
|
jpayne@69
|
557 """
|
jpayne@69
|
558
|
jpayne@69
|
559 def __init__(self, filename=None, mode="r", fileobj=None, max_cache=100):
|
jpayne@69
|
560 r"""Initialize the class for reading a BGZF file.
|
jpayne@69
|
561
|
jpayne@69
|
562 You would typically use the top level ``bgzf.open(...)`` function
|
jpayne@69
|
563 which will call this class internally. Direct use is discouraged.
|
jpayne@69
|
564
|
jpayne@69
|
565 Either the ``filename`` (string) or ``fileobj`` (input file object in
|
jpayne@69
|
566 binary mode) arguments must be supplied, but not both.
|
jpayne@69
|
567
|
jpayne@69
|
568 Argument ``mode`` controls if the data will be returned as strings in
|
jpayne@69
|
569 text mode ("rt", "tr", or default "r"), or bytes binary mode ("rb"
|
jpayne@69
|
570 or "br"). The argument name matches the built-in ``open(...)`` and
|
jpayne@69
|
571 standard library ``gzip.open(...)`` function.
|
jpayne@69
|
572
|
jpayne@69
|
573 If text mode is requested, in order to avoid multi-byte characters,
|
jpayne@69
|
574 this is hard coded to use the "latin1" encoding, and "\r" and "\n"
|
jpayne@69
|
575 are passed as is (without implementing universal new line mode). There
|
jpayne@69
|
576 is no ``encoding`` argument.
|
jpayne@69
|
577
|
jpayne@69
|
578 If your data is in UTF-8 or any other incompatible encoding, you must
|
jpayne@69
|
579 use binary mode, and decode the appropriate fragments yourself.
|
jpayne@69
|
580
|
jpayne@69
|
581 Argument ``max_cache`` controls the maximum number of BGZF blocks to
|
jpayne@69
|
582 cache in memory. Each can be up to 64kb thus the default of 100 blocks
|
jpayne@69
|
583 could take up to 6MB of RAM. This is important for efficient random
|
jpayne@69
|
584 access, a small value is fine for reading the file in one pass.
|
jpayne@69
|
585 """
|
jpayne@69
|
586 # TODO - Assuming we can seek, check for 28 bytes EOF empty block
|
jpayne@69
|
587 # and if missing warn about possible truncation (as in samtools)?
|
jpayne@69
|
588 if max_cache < 1:
|
jpayne@69
|
589 raise ValueError("Use max_cache with a minimum of 1")
|
jpayne@69
|
590 # Must open the BGZF file in binary mode, but we may want to
|
jpayne@69
|
591 # treat the contents as either text or binary (unicode or
|
jpayne@69
|
592 # bytes under Python 3)
|
jpayne@69
|
593 if filename and fileobj:
|
jpayne@69
|
594 raise ValueError("Supply either filename or fileobj, not both")
|
jpayne@69
|
595 # Want to reject output modes like w, a, x, +
|
jpayne@69
|
596 if mode.lower() not in ("r", "tr", "rt", "rb", "br"):
|
jpayne@69
|
597 raise ValueError(
|
jpayne@69
|
598 "Must use a read mode like 'r' (default), 'rt', or 'rb' for binary"
|
jpayne@69
|
599 )
|
jpayne@69
|
600 # If an open file was passed, make sure it was opened in binary mode.
|
jpayne@69
|
601 if fileobj:
|
jpayne@69
|
602 if fileobj.read(0) != b"":
|
jpayne@69
|
603 raise ValueError("fileobj not opened in binary mode")
|
jpayne@69
|
604 handle = fileobj
|
jpayne@69
|
605 else:
|
jpayne@69
|
606 handle = _open(filename, "rb")
|
jpayne@69
|
607 self._text = "b" not in mode.lower()
|
jpayne@69
|
608 if self._text:
|
jpayne@69
|
609 self._newline = "\n"
|
jpayne@69
|
610 else:
|
jpayne@69
|
611 self._newline = b"\n"
|
jpayne@69
|
612 self._handle = handle
|
jpayne@69
|
613 self.max_cache = max_cache
|
jpayne@69
|
614 self._buffers = {}
|
jpayne@69
|
615 self._block_start_offset = None
|
jpayne@69
|
616 self._block_raw_length = None
|
jpayne@69
|
617 self._load_block(handle.tell())
|
jpayne@69
|
618
|
jpayne@69
|
619 def _load_block(self, start_offset=None):
|
jpayne@69
|
620 if start_offset is None:
|
jpayne@69
|
621 # If the file is being read sequentially, then _handle.tell()
|
jpayne@69
|
622 # should be pointing at the start of the next block.
|
jpayne@69
|
623 # However, if seek has been used, we can't assume that.
|
jpayne@69
|
624 start_offset = self._block_start_offset + self._block_raw_length
|
jpayne@69
|
625 if start_offset == self._block_start_offset:
|
jpayne@69
|
626 self._within_block_offset = 0
|
jpayne@69
|
627 return
|
jpayne@69
|
628 elif start_offset in self._buffers:
|
jpayne@69
|
629 # Already in cache
|
jpayne@69
|
630 self._buffer, self._block_raw_length = self._buffers[start_offset]
|
jpayne@69
|
631 self._within_block_offset = 0
|
jpayne@69
|
632 self._block_start_offset = start_offset
|
jpayne@69
|
633 return
|
jpayne@69
|
634 # Must hit the disk... first check cache limits,
|
jpayne@69
|
635 while len(self._buffers) >= self.max_cache:
|
jpayne@69
|
636 # TODO - Implement LRU cache removal?
|
jpayne@69
|
637 self._buffers.popitem()
|
jpayne@69
|
638 # Now load the block
|
jpayne@69
|
639 handle = self._handle
|
jpayne@69
|
640 if start_offset is not None:
|
jpayne@69
|
641 handle.seek(start_offset)
|
jpayne@69
|
642 self._block_start_offset = handle.tell()
|
jpayne@69
|
643 try:
|
jpayne@69
|
644 block_size, self._buffer = _load_bgzf_block(handle, self._text)
|
jpayne@69
|
645 except StopIteration:
|
jpayne@69
|
646 # EOF
|
jpayne@69
|
647 block_size = 0
|
jpayne@69
|
648 if self._text:
|
jpayne@69
|
649 self._buffer = ""
|
jpayne@69
|
650 else:
|
jpayne@69
|
651 self._buffer = b""
|
jpayne@69
|
652 self._within_block_offset = 0
|
jpayne@69
|
653 self._block_raw_length = block_size
|
jpayne@69
|
654 # Finally save the block in our cache,
|
jpayne@69
|
655 self._buffers[self._block_start_offset] = self._buffer, block_size
|
jpayne@69
|
656
|
jpayne@69
|
657 def tell(self):
|
jpayne@69
|
658 """Return a 64-bit unsigned BGZF virtual offset."""
|
jpayne@69
|
659 if 0 < self._within_block_offset and self._within_block_offset == len(
|
jpayne@69
|
660 self._buffer
|
jpayne@69
|
661 ):
|
jpayne@69
|
662 # Special case where we're right at the end of a (non empty) block.
|
jpayne@69
|
663 # For non-maximal blocks could give two possible virtual offsets,
|
jpayne@69
|
664 # but for a maximal block can't use 65536 as the within block
|
jpayne@69
|
665 # offset. Therefore for consistency, use the next block and a
|
jpayne@69
|
666 # within block offset of zero.
|
jpayne@69
|
667 return (self._block_start_offset + self._block_raw_length) << 16
|
jpayne@69
|
668 else:
|
jpayne@69
|
669 # return make_virtual_offset(self._block_start_offset,
|
jpayne@69
|
670 # self._within_block_offset)
|
jpayne@69
|
671 # TODO - Include bounds checking as in make_virtual_offset?
|
jpayne@69
|
672 return (self._block_start_offset << 16) | self._within_block_offset
|
jpayne@69
|
673
|
jpayne@69
|
674 def seek(self, virtual_offset):
|
jpayne@69
|
675 """Seek to a 64-bit unsigned BGZF virtual offset."""
|
jpayne@69
|
676 # Do this inline to avoid a function call,
|
jpayne@69
|
677 # start_offset, within_block = split_virtual_offset(virtual_offset)
|
jpayne@69
|
678 start_offset = virtual_offset >> 16
|
jpayne@69
|
679 within_block = virtual_offset ^ (start_offset << 16)
|
jpayne@69
|
680 if start_offset != self._block_start_offset:
|
jpayne@69
|
681 # Don't need to load the block if already there
|
jpayne@69
|
682 # (this avoids a function call since _load_block would do nothing)
|
jpayne@69
|
683 self._load_block(start_offset)
|
jpayne@69
|
684 if start_offset != self._block_start_offset:
|
jpayne@69
|
685 raise ValueError("start_offset not loaded correctly")
|
jpayne@69
|
686 if within_block > len(self._buffer):
|
jpayne@69
|
687 if not (within_block == 0 and len(self._buffer) == 0):
|
jpayne@69
|
688 raise ValueError(
|
jpayne@69
|
689 "Within offset %i but block size only %i"
|
jpayne@69
|
690 % (within_block, len(self._buffer))
|
jpayne@69
|
691 )
|
jpayne@69
|
692 self._within_block_offset = within_block
|
jpayne@69
|
693 # assert virtual_offset == self.tell(), \
|
jpayne@69
|
694 # "Did seek to %i (%i, %i), but tell says %i (%i, %i)" \
|
jpayne@69
|
695 # % (virtual_offset, start_offset, within_block,
|
jpayne@69
|
696 # self.tell(), self._block_start_offset,
|
jpayne@69
|
697 # self._within_block_offset)
|
jpayne@69
|
698 return virtual_offset
|
jpayne@69
|
699
|
jpayne@69
|
700 def read(self, size=-1):
|
jpayne@69
|
701 """Read method for the BGZF module."""
|
jpayne@69
|
702 if size < 0:
|
jpayne@69
|
703 raise NotImplementedError("Don't be greedy, that could be massive!")
|
jpayne@69
|
704
|
jpayne@69
|
705 result = "" if self._text else b""
|
jpayne@69
|
706 while size and self._block_raw_length:
|
jpayne@69
|
707 if self._within_block_offset + size <= len(self._buffer):
|
jpayne@69
|
708 # This may leave us right at the end of a block
|
jpayne@69
|
709 # (lazy loading, don't load the next block unless we have too)
|
jpayne@69
|
710 data = self._buffer[
|
jpayne@69
|
711 self._within_block_offset : self._within_block_offset + size
|
jpayne@69
|
712 ]
|
jpayne@69
|
713 self._within_block_offset += size
|
jpayne@69
|
714 if not data:
|
jpayne@69
|
715 raise ValueError("Must be at least 1 byte")
|
jpayne@69
|
716 result += data
|
jpayne@69
|
717 break
|
jpayne@69
|
718 else:
|
jpayne@69
|
719 data = self._buffer[self._within_block_offset :]
|
jpayne@69
|
720 size -= len(data)
|
jpayne@69
|
721 self._load_block() # will reset offsets
|
jpayne@69
|
722 result += data
|
jpayne@69
|
723
|
jpayne@69
|
724 return result
|
jpayne@69
|
725
|
jpayne@69
|
726 def readline(self):
|
jpayne@69
|
727 """Read a single line for the BGZF file."""
|
jpayne@69
|
728 result = "" if self._text else b""
|
jpayne@69
|
729 while self._block_raw_length:
|
jpayne@69
|
730 i = self._buffer.find(self._newline, self._within_block_offset)
|
jpayne@69
|
731 # Three cases to consider,
|
jpayne@69
|
732 if i == -1:
|
jpayne@69
|
733 # No newline, need to read in more data
|
jpayne@69
|
734 data = self._buffer[self._within_block_offset :]
|
jpayne@69
|
735 self._load_block() # will reset offsets
|
jpayne@69
|
736 result += data
|
jpayne@69
|
737 elif i + 1 == len(self._buffer):
|
jpayne@69
|
738 # Found new line, but right at end of block (SPECIAL)
|
jpayne@69
|
739 data = self._buffer[self._within_block_offset :]
|
jpayne@69
|
740 # Must now load the next block to ensure tell() works
|
jpayne@69
|
741 self._load_block() # will reset offsets
|
jpayne@69
|
742 if not data:
|
jpayne@69
|
743 raise ValueError("Must be at least 1 byte")
|
jpayne@69
|
744 result += data
|
jpayne@69
|
745 break
|
jpayne@69
|
746 else:
|
jpayne@69
|
747 # Found new line, not at end of block (easy case, no IO)
|
jpayne@69
|
748 data = self._buffer[self._within_block_offset : i + 1]
|
jpayne@69
|
749 self._within_block_offset = i + 1
|
jpayne@69
|
750 # assert data.endswith(self._newline)
|
jpayne@69
|
751 result += data
|
jpayne@69
|
752 break
|
jpayne@69
|
753
|
jpayne@69
|
754 return result
|
jpayne@69
|
755
|
jpayne@69
|
756 def __next__(self):
|
jpayne@69
|
757 """Return the next line."""
|
jpayne@69
|
758 line = self.readline()
|
jpayne@69
|
759 if not line:
|
jpayne@69
|
760 raise StopIteration
|
jpayne@69
|
761 return line
|
jpayne@69
|
762
|
jpayne@69
|
763 def __iter__(self):
|
jpayne@69
|
764 """Iterate over the lines in the BGZF file."""
|
jpayne@69
|
765 return self
|
jpayne@69
|
766
|
jpayne@69
|
767 def close(self):
|
jpayne@69
|
768 """Close BGZF file."""
|
jpayne@69
|
769 self._handle.close()
|
jpayne@69
|
770 self._buffer = None
|
jpayne@69
|
771 self._block_start_offset = None
|
jpayne@69
|
772 self._buffers = None
|
jpayne@69
|
773
|
jpayne@69
|
774 def seekable(self):
|
jpayne@69
|
775 """Return True indicating the BGZF supports random access."""
|
jpayne@69
|
776 return True
|
jpayne@69
|
777
|
jpayne@69
|
778 def isatty(self):
|
jpayne@69
|
779 """Return True if connected to a TTY device."""
|
jpayne@69
|
780 return False
|
jpayne@69
|
781
|
jpayne@69
|
782 def fileno(self):
|
jpayne@69
|
783 """Return integer file descriptor."""
|
jpayne@69
|
784 return self._handle.fileno()
|
jpayne@69
|
785
|
jpayne@69
|
786 def __enter__(self):
|
jpayne@69
|
787 """Open a file operable with WITH statement."""
|
jpayne@69
|
788 return self
|
jpayne@69
|
789
|
jpayne@69
|
790 def __exit__(self, type, value, traceback):
|
jpayne@69
|
791 """Close a file with WITH statement."""
|
jpayne@69
|
792 self.close()
|
jpayne@69
|
793
|
jpayne@69
|
794
|
jpayne@69
|
795 class BgzfWriter:
|
jpayne@69
|
796 """Define a BGZFWriter object."""
|
jpayne@69
|
797
|
jpayne@69
|
798 def __init__(self, filename=None, mode="w", fileobj=None, compresslevel=6):
|
jpayne@69
|
799 """Initilize the class."""
|
jpayne@69
|
800 if filename and fileobj:
|
jpayne@69
|
801 raise ValueError("Supply either filename or fileobj, not both")
|
jpayne@69
|
802 # If an open file was passed, make sure it was opened in binary mode.
|
jpayne@69
|
803 if fileobj:
|
jpayne@69
|
804 if fileobj.read(0) != b"":
|
jpayne@69
|
805 raise ValueError("fileobj not opened in binary mode")
|
jpayne@69
|
806 handle = fileobj
|
jpayne@69
|
807 else:
|
jpayne@69
|
808 if "w" not in mode.lower() and "a" not in mode.lower():
|
jpayne@69
|
809 raise ValueError(f"Must use write or append mode, not {mode!r}")
|
jpayne@69
|
810 if "a" in mode.lower():
|
jpayne@69
|
811 handle = _open(filename, "ab")
|
jpayne@69
|
812 else:
|
jpayne@69
|
813 handle = _open(filename, "wb")
|
jpayne@69
|
814 self._text = "b" not in mode.lower()
|
jpayne@69
|
815 self._handle = handle
|
jpayne@69
|
816 self._buffer = b""
|
jpayne@69
|
817 self.compresslevel = compresslevel
|
jpayne@69
|
818
|
jpayne@69
|
819 def _write_block(self, block):
|
jpayne@69
|
820 """Write provided data to file as a single BGZF compressed block (PRIVATE)."""
|
jpayne@69
|
821 # print("Saving %i bytes" % len(block))
|
jpayne@69
|
822 if len(block) > 65536:
|
jpayne@69
|
823 raise ValueError(f"{len(block)} Block length > 65536")
|
jpayne@69
|
824 # Giving a negative window bits means no gzip/zlib headers,
|
jpayne@69
|
825 # -15 used in samtools
|
jpayne@69
|
826 c = zlib.compressobj(
|
jpayne@69
|
827 self.compresslevel, zlib.DEFLATED, -15, zlib.DEF_MEM_LEVEL, 0
|
jpayne@69
|
828 )
|
jpayne@69
|
829 compressed = c.compress(block) + c.flush()
|
jpayne@69
|
830 del c
|
jpayne@69
|
831 if len(compressed) > 65536:
|
jpayne@69
|
832 raise RuntimeError(
|
jpayne@69
|
833 "TODO - Didn't compress enough, try less data in this block"
|
jpayne@69
|
834 )
|
jpayne@69
|
835 crc = zlib.crc32(block)
|
jpayne@69
|
836 # Should cope with a mix of Python platforms...
|
jpayne@69
|
837 if crc < 0:
|
jpayne@69
|
838 crc = struct.pack("<i", crc)
|
jpayne@69
|
839 else:
|
jpayne@69
|
840 crc = struct.pack("<I", crc)
|
jpayne@69
|
841 bsize = struct.pack("<H", len(compressed) + 25) # includes -1
|
jpayne@69
|
842 crc = struct.pack("<I", zlib.crc32(block) & 0xFFFFFFFF)
|
jpayne@69
|
843 uncompressed_length = struct.pack("<I", len(block))
|
jpayne@69
|
844 # Fixed 16 bytes,
|
jpayne@69
|
845 # gzip magic bytes (4) mod time (4),
|
jpayne@69
|
846 # gzip flag (1), os (1), extra length which is six (2),
|
jpayne@69
|
847 # sub field which is BC (2), sub field length of two (2),
|
jpayne@69
|
848 # Variable data,
|
jpayne@69
|
849 # 2 bytes: block length as BC sub field (2)
|
jpayne@69
|
850 # X bytes: the data
|
jpayne@69
|
851 # 8 bytes: crc (4), uncompressed data length (4)
|
jpayne@69
|
852 data = _bgzf_header + bsize + compressed + crc + uncompressed_length
|
jpayne@69
|
853 self._handle.write(data)
|
jpayne@69
|
854
|
jpayne@69
|
855 def write(self, data):
|
jpayne@69
|
856 """Write method for the class."""
|
jpayne@69
|
857 # TODO - Check bytes vs unicode
|
jpayne@69
|
858 if isinstance(data, str):
|
jpayne@69
|
859 # When reading we can't cope with multi-byte characters
|
jpayne@69
|
860 # being split between BGZF blocks, so we restrict to a
|
jpayne@69
|
861 # single byte encoding - like ASCII or latin-1.
|
jpayne@69
|
862 # On output we could probably allow any encoding, as we
|
jpayne@69
|
863 # don't care about splitting unicode characters between blocks
|
jpayne@69
|
864 data = data.encode("latin-1")
|
jpayne@69
|
865 # block_size = 2**16 = 65536
|
jpayne@69
|
866 data_len = len(data)
|
jpayne@69
|
867 if len(self._buffer) + data_len < 65536:
|
jpayne@69
|
868 # print("Cached %r" % data)
|
jpayne@69
|
869 self._buffer += data
|
jpayne@69
|
870 else:
|
jpayne@69
|
871 # print("Got %r, writing out some data..." % data)
|
jpayne@69
|
872 self._buffer += data
|
jpayne@69
|
873 while len(self._buffer) >= 65536:
|
jpayne@69
|
874 self._write_block(self._buffer[:65536])
|
jpayne@69
|
875 self._buffer = self._buffer[65536:]
|
jpayne@69
|
876
|
jpayne@69
|
877 def flush(self):
|
jpayne@69
|
878 """Flush data explicitally."""
|
jpayne@69
|
879 while len(self._buffer) >= 65536:
|
jpayne@69
|
880 self._write_block(self._buffer[:65535])
|
jpayne@69
|
881 self._buffer = self._buffer[65535:]
|
jpayne@69
|
882 self._write_block(self._buffer)
|
jpayne@69
|
883 self._buffer = b""
|
jpayne@69
|
884 self._handle.flush()
|
jpayne@69
|
885
|
jpayne@69
|
886 def close(self):
|
jpayne@69
|
887 """Flush data, write 28 bytes BGZF EOF marker, and close BGZF file.
|
jpayne@69
|
888
|
jpayne@69
|
889 samtools will look for a magic EOF marker, just a 28 byte empty BGZF
|
jpayne@69
|
890 block, and if it is missing warns the BAM file may be truncated. In
|
jpayne@69
|
891 addition to samtools writing this block, so too does bgzip - so this
|
jpayne@69
|
892 implementation does too.
|
jpayne@69
|
893 """
|
jpayne@69
|
894 if self._buffer:
|
jpayne@69
|
895 self.flush()
|
jpayne@69
|
896 self._handle.write(_bgzf_eof)
|
jpayne@69
|
897 self._handle.flush()
|
jpayne@69
|
898 self._handle.close()
|
jpayne@69
|
899
|
jpayne@69
|
900 def tell(self):
|
jpayne@69
|
901 """Return a BGZF 64-bit virtual offset."""
|
jpayne@69
|
902 return make_virtual_offset(self._handle.tell(), len(self._buffer))
|
jpayne@69
|
903
|
jpayne@69
|
904 def seekable(self):
|
jpayne@69
|
905 """Return True indicating the BGZF supports random access."""
|
jpayne@69
|
906 # Not seekable, but we do support tell...
|
jpayne@69
|
907 return False
|
jpayne@69
|
908
|
jpayne@69
|
909 def isatty(self):
|
jpayne@69
|
910 """Return True if connected to a TTY device."""
|
jpayne@69
|
911 return False
|
jpayne@69
|
912
|
jpayne@69
|
913 def fileno(self):
|
jpayne@69
|
914 """Return integer file descriptor."""
|
jpayne@69
|
915 return self._handle.fileno()
|
jpayne@69
|
916
|
jpayne@69
|
917 def __enter__(self):
|
jpayne@69
|
918 """Open a file operable with WITH statement."""
|
jpayne@69
|
919 return self
|
jpayne@69
|
920
|
jpayne@69
|
921 def __exit__(self, type, value, traceback):
|
jpayne@69
|
922 """Close a file with WITH statement."""
|
jpayne@69
|
923 self.close()
|
jpayne@69
|
924
|
jpayne@69
|
925
|
jpayne@69
|
926 if __name__ == "__main__":
|
jpayne@69
|
927 if len(sys.argv) > 1:
|
jpayne@69
|
928 print("Call this with no arguments and pipe uncompressed data in on stdin")
|
jpayne@69
|
929 print("and it will produce BGZF compressed data on stdout. e.g.")
|
jpayne@69
|
930 print("")
|
jpayne@69
|
931 print("./bgzf.py < example.fastq > example.fastq.bgz")
|
jpayne@69
|
932 print("")
|
jpayne@69
|
933 print("The extension convention of *.bgz is to distinugish these from *.gz")
|
jpayne@69
|
934 print("used for standard gzipped files without the block structure of BGZF.")
|
jpayne@69
|
935 print("You can use the standard gunzip command to decompress BGZF files,")
|
jpayne@69
|
936 print("if it complains about the extension try something like this:")
|
jpayne@69
|
937 print("")
|
jpayne@69
|
938 print("cat example.fastq.bgz | gunzip > example.fastq")
|
jpayne@69
|
939 print("")
|
jpayne@69
|
940 print("See also the tool bgzip that comes with samtools")
|
jpayne@69
|
941 sys.exit(0)
|
jpayne@69
|
942
|
jpayne@69
|
943 # Ensure we have binary mode handles
|
jpayne@69
|
944 # (leave stderr as default text mode)
|
jpayne@69
|
945 stdin = sys.stdin.buffer
|
jpayne@69
|
946 stdout = sys.stdout.buffer
|
jpayne@69
|
947
|
jpayne@69
|
948 sys.stderr.write("Producing BGZF output from stdin...\n")
|
jpayne@69
|
949 w = BgzfWriter(fileobj=stdout)
|
jpayne@69
|
950 while True:
|
jpayne@69
|
951 data = stdin.read(65536)
|
jpayne@69
|
952 w.write(data)
|
jpayne@69
|
953 if not data:
|
jpayne@69
|
954 break
|
jpayne@69
|
955 # Doing close will write an empty BGZF block as EOF marker:
|
jpayne@69
|
956 w.close()
|
jpayne@69
|
957 sys.stderr.write("BGZF data produced\n")
|