1
2
3
4
5
6
7 r"""Read and write BGZF compressed files (the GZIP variant used in BAM).
8
9 The SAM/BAM file format (Sequence Alignment/Map) comes in a plain text
10 format (SAM), and a compressed binary format (BAM). The latter uses a
11 modified form of gzip compression called BGZF (Blocked GNU Zip Format),
12 which can be applied to any file format to provide compression with
13 efficient random access. BGZF is described together with the SAM/BAM
14 file format at http://samtools.sourceforge.net/SAM1.pdf
15
16 Please read the text below about 'virtual offsets' before using BGZF
17 files for random access.
18
19
20 Aim of this module
21 ------------------
22
23 The Python gzip library can be used to read BGZF files, since for
24 decompression they are just (specialised) gzip files. What this
25 module aims to facilitate is random access to BGZF files (using the
26 'virtual offset' idea), and writing BGZF files (which means using
27 suitably sized gzip blocks and writing the extra 'BC' field in the
28 gzip headers). As in the gzip library, the zlib library is used
29 internally.
30
31 In addition to being required for random access to and writing of
32 BAM files, the BGZF format can also be used on other sequential
33 data (in the sense of one record after another), such as most of
34 the sequence data formats supported in Bio.SeqIO (like FASTA,
35 FASTQ, GenBank, etc) or large MAF alignments.
36
37 The Bio.SeqIO indexing functions use this module to support BGZF files.
38
39
40 Technical Introduction to BGZF
41 ------------------------------
42
43 The gzip file format allows multiple compressed blocks, each of which
44 could be a stand alone gzip file. As an interesting bonus, this means
45 you can use Unix "cat" to combined to gzip files into one by
46 concatenating them. Also, each block can have one of several compression
47 levels (including uncompressed, which actually takes up a little bit
48 more space due to the gzip header).
49
50 What the BAM designers realised was that while random access to data
51 stored in traditional gzip files was slow, breaking the file into
52 gzip blocks would allow fast random access to each block. To access
53 a particular piece of the decompressed data, you just need to know
54 which block it starts in (the offset of the gzip block start), and
55 how far into the (decompressed) contents of the block you need to
56 read.
57
58 One problem with this is finding the gzip block sizes efficiently.
59 You can do it with a standard gzip file, but it requires every block
60 to be decompressed -- and that would be rather slow. Additionally
61 typical gzip files may use very large blocks.
62
63 All that differs in BGZF is that compressed size of each gzip block
64 is limited to 2^16 bytes, and an extra 'BC' field in the gzip header
65 records this size. Traditional decompression tools can ignore this,
66 and unzip the file just like any other gzip file.
67
68 The point of this is you can look at the first BGZF block, find out
69 how big it is from this 'BC' header, and thus seek immediately to
70 the second block, and so on.
71
72 The BAM indexing scheme records read positions using a 64 bit
73 'virtual offset', comprising coffset<<16|uoffset, where coffset is
74 the file offset of the BGZF block containing the start of the read
75 (unsigned integer using up to 64-16 = 48 bits), and uoffset is the
76 offset within the (decompressed) block (unsigned 16 bit integer).
77
78 This limits you to BAM files where the last block starts by 2^48
79 bytes, or 256 petabytes, and the decompressed size of each block
80 is at most 2^16 bytes, or 64kb. Note that this matches the BGZF
81 'BC' field size which limits the compressed size of each block to
82 2^16 bytes, allowing for BAM files to use BGZF with no gzip
83 compression (useful for intermediate files in memory to reduced
84 CPU load).
85
86
87 Warning about namespaces
88 ------------------------
89
90 It is considered a bad idea to use "from XXX import *" in Python, because
91 it pollutes the namespace. This is a real issue with Bio.bgzf (and the
92 standard Python library gzip) because they contain a function called open
93 i.e. Suppose you do this:
94
95 >>> from Bio.bgzf import *
96 >>> print open.__module__
97 Bio.bgzf
98
99 Or,
100
101 >>> from gzip import *
102 >>> print open.__module__
103 gzip
104
105 Notice that the open function has been replaced. You can "fix" this if you
106 need to by importing the built-in open function:
107
108 >>> from __builtin__ import open
109
110 However, what we recommend instead is to use the explicit namespace, e.g.
111
112 >>> from Bio import bgzf
113 >>> print bgzf.open.__module__
114 Bio.bgzf
115
116
117 Example
118 -------
119
120 This is an ordinary GenBank file compressed using BGZF, so it can
121 be decompressed using gzip,
122
123 >>> import gzip
124 >>> handle = gzip.open("GenBank/NC_000932.gb.bgz", "r")
125 >>> assert 0 == handle.tell()
126 >>> line = handle.readline()
127 >>> assert 80 == handle.tell()
128 >>> line = handle.readline()
129 >>> assert 143 == handle.tell()
130 >>> data = handle.read(70000)
131 >>> assert 70143 == handle.tell()
132 >>> handle.close()
133
134 We can also access the file using the BGZF reader - but pay
135 attention to the file offsets which will be explained below:
136
137 >>> handle = BgzfReader("GenBank/NC_000932.gb.bgz", "r")
138 >>> assert 0 == handle.tell()
139 >>> print handle.readline().rstrip()
140 LOCUS NC_000932 154478 bp DNA circular PLN 15-APR-2009
141 >>> assert 80 == handle.tell()
142 >>> print handle.readline().rstrip()
143 DEFINITION Arabidopsis thaliana chloroplast, complete genome.
144 >>> assert 143 == handle.tell()
145 >>> data = handle.read(70000)
146 >>> assert 987828735 == handle.tell()
147 >>> print handle.readline().rstrip()
148 f="GeneID:844718"
149 >>> print handle.readline().rstrip()
150 CDS complement(join(84337..84771,85454..85843))
151 >>> offset = handle.seek(make_virtual_offset(55074, 126))
152 >>> print handle.readline().rstrip()
153 68521 tatgtcattc gaaattgtat aaagacaact cctatttaat agagctattt gtgcaagtat
154 >>> handle.close()
155
156 Notice the handle's offset looks different as a BGZF file. This
157 brings us to the key point about BGZF, which is the block structure:
158
159 >>> handle = open("GenBank/NC_000932.gb.bgz", "rb")
160 >>> for values in BgzfBlocks(handle):
161 ... print "Raw start %i, raw length %i; data start %i, data length %i" % values
162 Raw start 0, raw length 15073; data start 0, data length 65536
163 Raw start 15073, raw length 17857; data start 65536, data length 65536
164 Raw start 32930, raw length 22144; data start 131072, data length 65536
165 Raw start 55074, raw length 22230; data start 196608, data length 65536
166 Raw start 77304, raw length 14939; data start 262144, data length 43478
167 Raw start 92243, raw length 28; data start 305622, data length 0
168 >>> handle.close()
169
170 In this example the first three blocks are 'full' and hold 65536 bytes
171 of uncompressed data. The fourth block isn't full and holds 43478 bytes.
172 Finally there is a special empty fifth block which takes 28 bytes on
173 disk and serves as an 'end of file' (EOF) marker. If this is missing,
174 it is possible your BGZF file is incomplete.
175
176 By reading ahead 70,000 bytes we moved into the second BGZF block,
177 and at that point the BGZF virtual offsets start to look different
178 a simple offset into the decompressed data as exposed by the gzip
179 library.
180
181 As an example, consider seeking to the decompressed position 196734.
182 Since 196734 = 65536 + 65536 + 65536 + 126 = 65536*3 + 126, this
183 is equivalent to jumping the first three blocks (which in this
184 specific example are all size 65536 after decompression - which
185 does not always hold) and starting at byte 126 of the fourth block
186 (after decompression). For BGZF, we need to know the fourth block's
187 offset of 55074 and the offset within the block of 126 to get the
188 BGZF virtual offset.
189
190 >>> print 55074 << 16 | 126
191 3609329790
192 >>> print bgzf.make_virtual_offset(55074, 126)
193 3609329790
194
195 Thus for this BGZF file, decompressed position 196734 corresponds
196 to the virtual offset 3609329790. However, another BGZF file with
197 different contents would have compressed more or less efficiently,
198 so the compressed blocks would be different sizes. What this means
199 is the mapping between the uncompressed offset and the compressed
200 virtual offset depends on the BGZF file you are using.
201
202 If you are accessing a BGZF file via this module, just use the
203 handle.tell() method to note the virtual offset of a position you
204 may later want to return to using handle.seek().
205
206 The catch with BGZF virtual offsets is while they can be compared
207 (which offset comes first in the file), you cannot safely subtract
208 them to get the size of the data between them, nor add/subtract
209 a relative offset.
210
211 Of course you can parse this file with Bio.SeqIO using BgzfReader,
212 although there isn't any benefit over using gzip.open(...), unless
213 you want to index BGZF compressed sequence files:
214
215 >>> from Bio import SeqIO
216 >>> handle = BgzfReader("GenBank/NC_000932.gb.bgz")
217 >>> record = SeqIO.read(handle, "genbank")
218 >>> handle.close()
219 >>> print record.id
220 NC_000932.1
221
222 """
223
224 import zlib
225 import struct
226 import __builtin__
227
228 from Bio._py3k import _as_bytes, _as_string
229
230
231
232 _bgzf_magic = _as_bytes("\x1f\x8b\x08\x04")
233 _bgzf_header = _as_bytes("\x1f\x8b\x08\x04\x00\x00\x00\x00"
234 "\x00\xff\x06\x00\x42\x43\x02\x00")
235 _bgzf_eof = _as_bytes("\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00BC" +
236 "\x02\x00\x1b\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00")
237 _bytes_BC = _as_bytes("BC")
238 _empty_bytes_string = _as_bytes("")
239 _bytes_newline = _as_bytes("\n")
240
241
242 -def open(filename, mode="rb"):
243 """Open a BGZF file for reading, writing or appending."""
244 if "r" in mode.lower():
245 return BgzfReader(filename, mode)
246 elif "w" in mode.lower() or "a" in mode.lower():
247 return BgzfWriter(filename, mode)
248 else:
249 raise ValueError("Bad mode %r" % mode)
250
251
253 """Compute a BGZF virtual offset from block start and within block offsets.
254
255 The BAM indexing scheme records read positions using a 64 bit
256 'virtual offset', comprising in C terms:
257
258 block_start_offset<<16 | within_block_offset
259
260 Here block_start_offset is the file offset of the BGZF block
261 start (unsigned integer using up to 64-16 = 48 bits), and
262 within_block_offset within the (decompressed) block (unsigned
263 16 bit integer).
264
265 >>> make_virtual_offset(0,0)
266 0
267 >>> make_virtual_offset(0,1)
268 1
269 >>> make_virtual_offset(0, 2**16 - 1)
270 65535
271 >>> make_virtual_offset(0, 2**16)
272 Traceback (most recent call last):
273 ...
274 ValueError: Require 0 <= within_block_offset < 2**16, got 65536
275
276 >>> 65536 == make_virtual_offset(1,0)
277 True
278 >>> 65537 == make_virtual_offset(1,1)
279 True
280 >>> 131071 == make_virtual_offset(1, 2**16 - 1)
281 True
282
283 >>> 6553600000 == make_virtual_offset(100000,0)
284 True
285 >>> 6553600001 == make_virtual_offset(100000,1)
286 True
287 >>> 6553600010 == make_virtual_offset(100000,10)
288 True
289
290 >>> make_virtual_offset(2**48,0)
291 Traceback (most recent call last):
292 ...
293 ValueError: Require 0 <= block_start_offset < 2**48, got 281474976710656
294
295 """
296 if within_block_offset < 0 or within_block_offset >= 65536:
297 raise ValueError("Require 0 <= within_block_offset < 2**16, got %i" % within_block_offset)
298 if block_start_offset < 0 or block_start_offset >= 281474976710656:
299 raise ValueError("Require 0 <= block_start_offset < 2**48, got %i" % block_start_offset)
300 return (block_start_offset<<16) | within_block_offset
301
302
304 """Divides a 64-bit BGZF virtual offset into block start & within block offsets.
305
306 >>> (100000, 0) == split_virtual_offset(6553600000)
307 True
308 >>> (100000, 10) == split_virtual_offset(6553600010)
309 True
310
311 """
312 start = virtual_offset>>16
313 return start, virtual_offset ^ (start<<16)
314
315
317 """Low level debugging function to inspect BGZF blocks.
318
319 Returns the block start offset (see virtual offsets), the block
320 length (add these for the start of the next block), and the
321 decompressed length of the blocks contents (limited to 65536 in
322 BGZF).
323
324 >>> from __builtin__ import open
325 >>> handle = open("SamBam/ex1.bam", "rb")
326 >>> for values in BgzfBlocks(handle):
327 ... print "Raw start %i, raw length %i; data start %i, data length %i" % values
328 Raw start 0, raw length 18239; data start 0, data length 65536
329 Raw start 18239, raw length 18223; data start 65536, data length 65536
330 Raw start 36462, raw length 18017; data start 131072, data length 65536
331 Raw start 54479, raw length 17342; data start 196608, data length 65536
332 Raw start 71821, raw length 17715; data start 262144, data length 65536
333 Raw start 89536, raw length 17728; data start 327680, data length 65536
334 Raw start 107264, raw length 17292; data start 393216, data length 63398
335 Raw start 124556, raw length 28; data start 456614, data length 0
336 >>> handle.close()
337
338 Indirectly we can tell this file came from an old version of
339 samtools because all the blocks (except the final one and the
340 dummy empty EOF marker block) are 65536 bytes. Later versions
341 avoid splitting a read between two blocks, and give the header
342 its own block (useful to speed up replacing the header). You
343 can see this in ex1_refresh.bam created using samtools 0.1.18:
344
345 samtools view -b ex1.bam > ex1_refresh.bam
346
347 >>> handle = open("SamBam/ex1_refresh.bam", "rb")
348 >>> for values in BgzfBlocks(handle):
349 ... print "Raw start %i, raw length %i; data start %i, data length %i" % values
350 Raw start 0, raw length 53; data start 0, data length 38
351 Raw start 53, raw length 18195; data start 38, data length 65434
352 Raw start 18248, raw length 18190; data start 65472, data length 65409
353 Raw start 36438, raw length 18004; data start 130881, data length 65483
354 Raw start 54442, raw length 17353; data start 196364, data length 65519
355 Raw start 71795, raw length 17708; data start 261883, data length 65411
356 Raw start 89503, raw length 17709; data start 327294, data length 65466
357 Raw start 107212, raw length 17390; data start 392760, data length 63854
358 Raw start 124602, raw length 28; data start 456614, data length 0
359 >>> handle.close()
360
361 The above example has no embedded SAM header (thus the first block
362 is very small at just 38 bytes of decompressed data), while the next
363 example does (a larger block of 103 bytes). Notice that the rest of
364 the blocks show the same sizes (they contain the same read data):
365
366 >>> handle = open("SamBam/ex1_header.bam", "rb")
367 >>> for values in BgzfBlocks(handle):
368 ... print "Raw start %i, raw length %i; data start %i, data length %i" % values
369 Raw start 0, raw length 104; data start 0, data length 103
370 Raw start 104, raw length 18195; data start 103, data length 65434
371 Raw start 18299, raw length 18190; data start 65537, data length 65409
372 Raw start 36489, raw length 18004; data start 130946, data length 65483
373 Raw start 54493, raw length 17353; data start 196429, data length 65519
374 Raw start 71846, raw length 17708; data start 261948, data length 65411
375 Raw start 89554, raw length 17709; data start 327359, data length 65466
376 Raw start 107263, raw length 17390; data start 392825, data length 63854
377 Raw start 124653, raw length 28; data start 456679, data length 0
378 >>> handle.close()
379
380 """
381 data_start = 0
382 while True:
383 start_offset = handle.tell()
384
385 block_length, data = _load_bgzf_block(handle)
386 data_len = len(data)
387 yield start_offset, block_length, data_start, data_len
388 data_start += data_len
389
390
392 """Internal function to load the next BGZF function (PRIVATE)."""
393 magic = handle.read(4)
394 if not magic:
395
396 raise StopIteration
397 if magic != _bgzf_magic:
398 raise ValueError(r"A BGZF (e.g. a BAM file) block should start with "
399 r"%r, not %r; handle.tell() now says %r"
400 % (_bgzf_magic, magic, handle.tell()))
401 gzip_mod_time, gzip_extra_flags, gzip_os, extra_len = \
402 struct.unpack("<LBBH", handle.read(8))
403
404 block_size = None
405 x_len = 0
406 while x_len < extra_len:
407 subfield_id = handle.read(2)
408 subfield_len = struct.unpack("<H", handle.read(2))[0]
409 subfield_data = handle.read(subfield_len)
410 x_len += subfield_len + 4
411 if subfield_id == _bytes_BC:
412 assert subfield_len == 2, "Wrong BC payload length"
413 assert block_size is None, "Two BC subfields?"
414 block_size = struct.unpack("<H", subfield_data)[0] + 1
415 assert x_len == extra_len, (x_len, extra_len)
416 assert block_size is not None, "Missing BC, this isn't a BGZF file!"
417
418 deflate_size = block_size - 1 - extra_len - 19
419 d = zlib.decompressobj(-15)
420 data = d.decompress(handle.read(deflate_size)) + d.flush()
421 expected_crc = handle.read(4)
422 expected_size = struct.unpack("<I", handle.read(4))[0]
423 assert expected_size == len(data), \
424 "Decompressed to %i, not %i" % (len(data), expected_size)
425
426 crc = zlib.crc32(data)
427 if crc < 0:
428 crc = struct.pack("<i", crc)
429 else:
430 crc = struct.pack("<I", crc)
431 assert expected_crc == crc, \
432 "CRC is %s, not %s" % (crc, expected_crc)
433 if text_mode:
434 return block_size, _as_string(data)
435 else:
436 return block_size, data
437
438
440 r"""BGZF reader, acts like a read only handle but seek/tell differ.
441
442 Let's use the BgzfBlocks function to have a peak at the BGZF blocks
443 in an example BAM file,
444
445 >>> from __builtin__ import open
446 >>> handle = open("SamBam/ex1.bam", "rb")
447 >>> for values in BgzfBlocks(handle):
448 ... print "Raw start %i, raw length %i; data start %i, data length %i" % values
449 Raw start 0, raw length 18239; data start 0, data length 65536
450 Raw start 18239, raw length 18223; data start 65536, data length 65536
451 Raw start 36462, raw length 18017; data start 131072, data length 65536
452 Raw start 54479, raw length 17342; data start 196608, data length 65536
453 Raw start 71821, raw length 17715; data start 262144, data length 65536
454 Raw start 89536, raw length 17728; data start 327680, data length 65536
455 Raw start 107264, raw length 17292; data start 393216, data length 63398
456 Raw start 124556, raw length 28; data start 456614, data length 0
457 >>> handle.close()
458
459 Now let's see how to use this block information to jump to
460 specific parts of the decompressed BAM file:
461
462 >>> handle = BgzfReader("SamBam/ex1.bam", "rb")
463 >>> assert 0 == handle.tell()
464 >>> magic = handle.read(4)
465 >>> assert 4 == handle.tell()
466
467 So far nothing so strange, we got the magic marker used at the
468 start of a decompressed BAM file, and the handle position makes
469 sense. Now however, let's jump to the end of this block and 4
470 bytes into the next block by reading 65536 bytes,
471
472 >>> data = handle.read(65536)
473 >>> len(data)
474 65536
475 >>> assert 1195311108 == handle.tell()
476
477 Expecting 4 + 65536 = 65540 were you? Well this is a BGZF 64-bit
478 virtual offset, which means:
479
480 >>> split_virtual_offset(1195311108)
481 (18239, 4)
482
483 You should spot 18239 as the start of the second BGZF block, while
484 the 4 is the offset into this block. See also make_virtual_offset,
485
486 >>> make_virtual_offset(18239, 4)
487 1195311108
488
489 Let's jump back to almost the start of the file,
490
491 >>> make_virtual_offset(0, 2)
492 2
493 >>> handle.seek(2)
494 2
495 >>> handle.close()
496
497 Note that you can use the max_cache argument to limit the number of
498 BGZF blocks cached in memory. The default is 100, and since each
499 block can be up to 64kb, the default cache could take up to 6MB of
500 RAM. The cache is not important for reading through the file in one
501 pass, but is important for improving performance of random access.
502 """
503
504 - def __init__(self, filename=None, mode="r", fileobj=None, max_cache=100):
505
506
507 if max_cache < 1:
508 raise ValueError("Use max_cache with a minimum of 1")
509
510
511
512 if fileobj:
513 assert filename is None
514 handle = fileobj
515 assert "b" in handle.mode.lower()
516 else:
517 if "w" in mode.lower() \
518 or "a" in mode.lower():
519 raise ValueError("Must use read mode (default), not write or append mode")
520 handle = __builtin__.open(filename, "rb")
521 self._text = "b" not in mode.lower()
522 if self._text:
523 self._newline = "\n"
524 else:
525 self._newline = _bytes_newline
526 self._handle = handle
527 self.max_cache = max_cache
528 self._buffers = {}
529 self._block_start_offset = None
530 self._block_raw_length = None
531 self._load_block(handle.tell())
532
534 if start_offset is None:
535
536
537
538 start_offset = self._block_start_offset + self._block_raw_length
539 if start_offset == self._block_start_offset:
540 self._within_block_offset = 0
541 return
542 elif start_offset in self._buffers:
543
544 self._buffer, self._block_raw_length = self._buffers[start_offset]
545 self._within_block_offset = 0
546 self._block_start_offset = start_offset
547 return
548
549 while len(self._buffers) >= self.max_cache:
550
551 self._buffers.popitem()
552
553 handle = self._handle
554 if start_offset is not None:
555 handle.seek(start_offset)
556 self._block_start_offset = handle.tell()
557 try:
558 block_size, self._buffer = _load_bgzf_block(handle, self._text)
559 except StopIteration:
560
561 block_size = 0
562 if self._text:
563 self._buffer = ""
564 else:
565 self._buffer = _empty_bytes_string
566 self._within_block_offset = 0
567 self._block_raw_length = block_size
568
569 self._buffers[self._block_start_offset] = self._buffer, block_size
570
572 """Returns a 64-bit unsigned BGZF virtual offset."""
573 if 0 < self._within_block_offset == len(self._buffer):
574
575
576
577
578
579 return (self._block_start_offset + self._block_raw_length) << 16
580 else:
581
582
583
584 return (self._block_start_offset<<16) | self._within_block_offset
585
586 - def seek(self, virtual_offset):
587 """Seek to a 64-bit unsigned BGZF virtual offset."""
588
589
590 start_offset = virtual_offset>>16
591 within_block = virtual_offset ^ (start_offset<<16)
592 if start_offset != self._block_start_offset:
593
594
595 self._load_block(start_offset)
596 assert start_offset == self._block_start_offset
597 if within_block >= len(self._buffer) \
598 and not (within_block == 0 and len(self._buffer)==0):
599 raise ValueError("Within offset %i but block size only %i"
600 % (within_block, len(self._buffer)))
601 self._within_block_offset = within_block
602
603
604
605
606 return virtual_offset
607
608 - def read(self, size=-1):
609 if size < 0:
610 raise NotImplementedError("Don't be greedy, that could be massive!")
611 elif size == 0:
612 if self._text:
613 return ""
614 else:
615 return _empty_bytes_string
616 elif self._within_block_offset + size <= len(self._buffer):
617
618
619 data = self._buffer[self._within_block_offset:self._within_block_offset + size]
620 self._within_block_offset += size
621 assert data
622 return data
623 else:
624 data = self._buffer[self._within_block_offset:]
625 size -= len(data)
626 self._load_block()
627
628
629 if not self._buffer:
630 return data
631 elif size:
632
633 return data + self.read(size)
634 else:
635
636 return data
637
639 i = self._buffer.find(self._newline, self._within_block_offset)
640
641 if i==-1:
642
643 data = self._buffer[self._within_block_offset:]
644 self._load_block()
645 if not self._buffer:
646 return data
647 else:
648
649 return data + self.readline()
650 elif i + 1 == len(self._buffer):
651
652 data = self._buffer[self._within_block_offset:]
653
654 self._load_block()
655 assert data
656 return data
657 else:
658
659 data = self._buffer[self._within_block_offset:i+1]
660 self._within_block_offset = i + 1
661
662 return data
663
665 line = self.readline()
666 if not line:
667 raise StopIteration
668 return line
669
672
674 self._handle.close()
675 self._buffer = None
676 self._block_start_offset = None
677 self._buffers = None
678
681
684
686 return self._handle.fileno()
687
688
690
691 - def __init__(self, filename=None, mode="w", fileobj=None, compresslevel=6):
692 if fileobj:
693 assert filename is None
694 handle = fileobj
695 else:
696 if "w" not in mode.lower() \
697 and "a" not in mode.lower():
698 raise ValueError("Must use write or append mode, not %r" % mode)
699 if "a" in mode.lower():
700 handle = __builtin__.open(filename, "ab")
701 else:
702 handle = __builtin__.open(filename, "wb")
703 self._text = "b" not in mode.lower()
704 self._handle = handle
705 self._buffer = _empty_bytes_string
706 self.compresslevel = compresslevel
707
709
710 start_offset = self._handle.tell()
711 assert len(block) <= 65536
712
713 c = zlib.compressobj(self.compresslevel,
714 zlib.DEFLATED,
715 -15,
716 zlib.DEF_MEM_LEVEL,
717 0)
718 compressed = c.compress(block) + c.flush()
719 del c
720 assert len(compressed) < 65536, "TODO - Didn't compress enough, try less data in this block"
721 crc = zlib.crc32(block)
722
723 if crc < 0:
724 crc = struct.pack("<i", crc)
725 else:
726 crc = struct.pack("<I", crc)
727 bsize = struct.pack("<H", len(compressed)+25)
728 crc = struct.pack("<I", zlib.crc32(block) & 0xffffffffL)
729 uncompressed_length = struct.pack("<I", len(block))
730
731
732
733
734
735
736
737
738 data = _bgzf_header + bsize + compressed + crc + uncompressed_length
739 self._handle.write(data)
740
742
743 data = _as_bytes(data)
744
745 data_len = len(data)
746 if len(self._buffer) + data_len < 65536:
747
748 self._buffer += data
749 return
750 else:
751
752 self._buffer += data
753 while len(self._buffer) >= 65536:
754 self._write_block(self._buffer[:65536])
755 self._buffer = self._buffer[65536:]
756
764
766 """Flush data, write 28 bytes empty BGZF EOF marker, and close the BGZF file."""
767 if self._buffer:
768 self.flush()
769
770
771
772 self._handle.write(_bgzf_eof)
773 self._handle.flush()
774 self._handle.close()
775
777 """Returns a BGZF 64-bit virtual offset."""
778 return make_virtual_offset(self._handle.tell(), len(self._buffer))
779
783
786
788 return self._handle.fileno()
789
790
791 if __name__ == "__main__":
792 import sys
793 if len(sys.argv) > 1:
794 print "Call this with no arguments and pipe uncompressed data in on stdin"
795 print "and it will produce BGZF compressed data on stdout. e.g."
796 print
797 print "./bgzf.py < example.fastq > example.fastq.bgz"
798 print
799 print "The extension convention of *.bgz is to distinugish these from *.gz"
800 print "used for standard gzipped files without the block structure of BGZF."
801 print "You can use the standard gunzip command to decompress BGZF files,"
802 print "if it complains about the extension try something like this:"
803 print
804 print "cat example.fastq.bgz | gunzip > example.fastq"
805 print
806 print "See also the tool bgzip that comes with samtools"
807 sys.exit(0)
808
809 sys.stderr.write("Producing BGZF output from stdin...\n")
810 w = BgzfWriter(fileobj=sys.stdout)
811 while True:
812 data = sys.stdin.read(65536)
813 w.write(data)
814 if not data:
815 break
816
817 w.close()
818 sys.stderr.write("BGZF data produced\n")
819