
Source Code for Module Bio.bgzf

#!/usr/bin/env python
# Copyright 2010-2013 by Peter Cock.
# All rights reserved.
# This code is part of the Biopython distribution and governed by its
# license.  Please see the LICENSE file that should have been included
# as part of this package.
r"""Read and write BGZF compressed files (the GZIP variant used in BAM).

The SAM/BAM file format (Sequence Alignment/Map) comes in a plain text
format (SAM), and a compressed binary format (BAM). The latter uses a
modified form of gzip compression called BGZF (Blocked GNU Zip Format),
which can be applied to any file format to provide compression with
efficient random access. BGZF is described together with the SAM/BAM
file format at http://samtools.sourceforge.net/SAM1.pdf

Please read the text below about 'virtual offsets' before using BGZF
files for random access.


Aim of this module
------------------

The Python gzip library can be used to read BGZF files, since for
decompression they are just (specialised) gzip files. What this
module aims to facilitate is random access to BGZF files (using the
'virtual offset' idea), and writing BGZF files (which means using
suitably sized gzip blocks and writing the extra 'BC' field in the
gzip headers). As in the gzip library, the zlib library is used
internally.

In addition to being required for random access to and writing of
BAM files, the BGZF format can also be used on other sequential
data (in the sense of one record after another), such as most of
the sequence data formats supported in Bio.SeqIO (like FASTA,
FASTQ, GenBank, etc) or large MAF alignments.

The Bio.SeqIO indexing functions use this module to support BGZF files.


Technical Introduction to BGZF
------------------------------

The gzip file format allows multiple compressed blocks, each of which
could be a stand alone gzip file. As an interesting bonus, this means
you can use Unix "cat" to combine two gzip files into one by
concatenating them. Also, each block can have one of several compression
levels (including uncompressed, which actually takes up a little bit
more space due to the gzip header).

What the BAM designers realised was that while random access to data
stored in traditional gzip files was slow, breaking the file into
gzip blocks would allow fast random access to each block. To access
a particular piece of the decompressed data, you just need to know
which block it starts in (the offset of the gzip block start), and
how far into the (decompressed) contents of the block you need to
read.

One problem with this is finding the gzip block sizes efficiently.
You can do it with a standard gzip file, but it requires every block
to be decompressed -- and that would be rather slow. Additionally,
typical gzip files may use very large blocks.

All that differs in BGZF is that the compressed size of each gzip block
is limited to 2^16 bytes, and an extra 'BC' field in the gzip header
records this size. Traditional decompression tools can ignore this,
and unzip the file just like any other gzip file.

The point of this is you can look at the first BGZF block, find out
how big it is from this 'BC' header, and thus seek immediately to
the second block, and so on.

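As a minimal sketch of this block-hopping (assuming ``handle`` is a BGZF
file opened in binary mode, and assuming the fixed 18 byte header layout
which this module's BgzfWriter produces; the helper name block_offsets
is hypothetical)::

    import struct

    def block_offsets(handle):
        # Yield the raw start offset of each BGZF block.
        offset = 0
        while True:
            header = handle.read(18)
            if len(header) < 18:
                break  # end of file (or a truncated header)
            yield offset
            # The last two header bytes are the 'BC' payload, holding
            # the total compressed block size minus one.
            offset += struct.unpack("<H", header[16:18])[0] + 1
            handle.seek(offset)

The real parser in this module (see _load_bgzf_block below) instead scans
the gzip extra subfields for 'BC' rather than assuming this fixed layout.
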
The BAM indexing scheme records read positions using a 64 bit
'virtual offset', comprising coffset << 16 | uoffset, where coffset
is the file offset of the BGZF block containing the start of the read
(unsigned integer using up to 64-16 = 48 bits), and uoffset is the
offset within the (decompressed) block (unsigned 16 bit integer).

This limits you to BAM files where the last block starts within the
first 2^48 bytes, or 256 petabytes, and the decompressed size of each
block is at most 2^16 bytes, or 64kb. Note that this matches the BGZF
'BC' field size which limits the compressed size of each block to
2^16 bytes, allowing for BAM files to use BGZF with no gzip
compression (useful for intermediate files in memory to reduce
CPU load).

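Packing and unpacking a virtual offset is just bit shifting and masking.
For example, using the block start 55074 and within block offset 126
from the worked example later in this docstring::

    coffset, uoffset = 55074, 126
    virtual_offset = (coffset << 16) | uoffset  # 3609329790
    assert virtual_offset >> 16 == coffset      # recover the block start
    assert virtual_offset & 0xFFFF == uoffset   # recover the within block offset
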

Warning about namespaces
------------------------

It is considered a bad idea to use "from XXX import ``*``" in Python,
because it pollutes the namespace. This is a real issue with Bio.bgzf (and
the standard Python library gzip) because they contain a function called
open, i.e. suppose you do this:

>>> from Bio.bgzf import *
>>> print(open.__module__)
Bio.bgzf

Or,

>>> from gzip import *
>>> print(open.__module__)
gzip

Notice that the open function has been replaced. You can "fix" this if you
need to by importing the built-in open function:

>>> try:
...     from __builtin__ import open # Python 2
... except ImportError:
...     from builtins import open # Python 3
...

However, what we recommend instead is to use the explicit namespace, e.g.

>>> from Bio import bgzf
>>> print(bgzf.open.__module__)
Bio.bgzf


Example
-------

This is an ordinary GenBank file compressed using BGZF, so it can
be decompressed using gzip,

>>> import gzip
>>> handle = gzip.open("GenBank/NC_000932.gb.bgz", "r")
>>> assert 0 == handle.tell()
>>> line = handle.readline()
>>> assert 80 == handle.tell()
>>> line = handle.readline()
>>> assert 143 == handle.tell()
>>> data = handle.read(70000)
>>> assert 70143 == handle.tell()
>>> handle.close()

We can also access the file using the BGZF reader - but pay
attention to the file offsets which will be explained below:

>>> handle = BgzfReader("GenBank/NC_000932.gb.bgz", "r")
>>> assert 0 == handle.tell()
>>> print(handle.readline().rstrip())
LOCUS       NC_000932             154478 bp    DNA     circular PLN 15-APR-2009
>>> assert 80 == handle.tell()
>>> print(handle.readline().rstrip())
DEFINITION  Arabidopsis thaliana chloroplast, complete genome.
>>> assert 143 == handle.tell()
>>> data = handle.read(70000)
>>> assert 987828735 == handle.tell()
>>> print(handle.readline().rstrip())
f="GeneID:844718"
>>> print(handle.readline().rstrip())
     CDS             complement(join(84337..84771,85454..85843))
>>> offset = handle.seek(make_virtual_offset(55074, 126))
>>> print(handle.readline().rstrip())
    68521 tatgtcattc gaaattgtat aaagacaact cctatttaat agagctattt gtgcaagtat
>>> handle.close()

Notice that the offsets reported by the BGZF handle look different. This
brings us to the key point about BGZF, which is the block structure:

>>> handle = open("GenBank/NC_000932.gb.bgz", "rb")
>>> for values in BgzfBlocks(handle):
...     print("Raw start %i, raw length %i; data start %i, data length %i" % values)
Raw start 0, raw length 15073; data start 0, data length 65536
Raw start 15073, raw length 17857; data start 65536, data length 65536
Raw start 32930, raw length 22144; data start 131072, data length 65536
Raw start 55074, raw length 22230; data start 196608, data length 65536
Raw start 77304, raw length 14939; data start 262144, data length 43478
Raw start 92243, raw length 28; data start 305622, data length 0
>>> handle.close()

In this example the first four blocks are 'full' and each hold 65536
bytes of uncompressed data. The fifth block isn't full and holds 43478
bytes. Finally there is a special empty sixth block which takes 28 bytes
on disk and serves as an 'end of file' (EOF) marker. If this is missing,
it is possible your BGZF file is incomplete.

By reading ahead 70,000 bytes we moved into the second BGZF block,
and at that point the BGZF virtual offsets start to look different
to a simple offset into the decompressed data as exposed by the gzip
library.

As an example, consider seeking to the decompressed position 196734.
Since 196734 = 65536 + 65536 + 65536 + 126 = 65536*3 + 126, this
is equivalent to jumping the first three blocks (which in this
specific example are all size 65536 after decompression - which
does not always hold) and starting at byte 126 of the fourth block
(after decompression). For BGZF, we need to know the fourth block's
offset of 55074 and the offset within the block of 126 to get the
BGZF virtual offset.

>>> print(55074 << 16 | 126)
3609329790
>>> print(bgzf.make_virtual_offset(55074, 126))
3609329790

Thus for this BGZF file, decompressed position 196734 corresponds
to the virtual offset 3609329790. However, another BGZF file with
different contents would have compressed more or less efficiently,
so the compressed blocks would be different sizes. What this means
is the mapping between the uncompressed offset and the compressed
virtual offset depends on the BGZF file you are using.

If you are accessing a BGZF file via this module, just use the
handle.tell() method to note the virtual offset of a position you
may later want to return to using handle.seek().

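For instance, a tell/seek round trip might look like this (a sketch
reusing the example file above; the exact virtual offset returned
depends on that file's block sizes)::

    handle = BgzfReader("GenBank/NC_000932.gb.bgz", "r")
    data = handle.read(70000)
    saved = handle.tell()  # a 64-bit virtual offset
    line = handle.readline()
    handle.seek(saved)  # jump straight back
    assert handle.readline() == line
    handle.close()
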
The catch with BGZF virtual offsets is while they can be compared
(which offset comes first in the file), you cannot safely subtract
them to get the size of the data between them, nor add/subtract
a relative offset.

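As a concrete illustration using the block listing above, the last byte
of the first block and the first byte of the second block are adjacent
in the decompressed data, yet their virtual offsets differ by far more
than one:

>>> make_virtual_offset(15073, 0) - make_virtual_offset(0, 65535)
987758593
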
Of course you can parse this file with Bio.SeqIO using BgzfReader,
although there isn't any benefit over using gzip.open(...), unless
you want to index BGZF compressed sequence files:

>>> from Bio import SeqIO
>>> handle = BgzfReader("GenBank/NC_000932.gb.bgz")
>>> record = SeqIO.read(handle, "genbank")
>>> handle.close()
>>> print(record.id)
NC_000932.1

"""

from __future__ import print_function

import sys  # to detect when under Python 2
import zlib
import struct

from Bio._py3k import _as_bytes, _as_string
from Bio._py3k import open as _open

__docformat__ = "restructuredtext en"

# For Python 2 can just use: _bgzf_magic = '\x1f\x8b\x08\x04'
# but need to use bytes on Python 3
_bgzf_magic = b"\x1f\x8b\x08\x04"
_bgzf_header = b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00\x42\x43\x02\x00"
_bgzf_eof = b"\x1f\x8b\x08\x04\x00\x00\x00\x00\x00\xff\x06\x00BC\x02\x00\x1b\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00"
_bytes_BC = b"BC"


def open(filename, mode="rb"):
    """Open a BGZF file for reading, writing or appending."""
    if "r" in mode.lower():
        return BgzfReader(filename, mode)
    elif "w" in mode.lower() or "a" in mode.lower():
        return BgzfWriter(filename, mode)
    else:
        raise ValueError("Bad mode %r" % mode)


def make_virtual_offset(block_start_offset, within_block_offset):
    """Compute a BGZF virtual offset from block start and within block offsets.

    The BAM indexing scheme records read positions using a 64 bit
    'virtual offset', comprising in C terms:

    block_start_offset << 16 | within_block_offset

    Here block_start_offset is the file offset of the BGZF block
    start (unsigned integer using up to 64-16 = 48 bits), and
    within_block_offset is the offset within the (decompressed)
    block (unsigned 16 bit integer).

    >>> make_virtual_offset(0, 0)
    0
    >>> make_virtual_offset(0, 1)
    1
    >>> make_virtual_offset(0, 2**16 - 1)
    65535
    >>> make_virtual_offset(0, 2**16)
    Traceback (most recent call last):
    ...
    ValueError: Require 0 <= within_block_offset < 2**16, got 65536

    >>> 65536 == make_virtual_offset(1, 0)
    True
    >>> 65537 == make_virtual_offset(1, 1)
    True
    >>> 131071 == make_virtual_offset(1, 2**16 - 1)
    True

    >>> 6553600000 == make_virtual_offset(100000, 0)
    True
    >>> 6553600001 == make_virtual_offset(100000, 1)
    True
    >>> 6553600010 == make_virtual_offset(100000, 10)
    True

    >>> make_virtual_offset(2**48, 0)
    Traceback (most recent call last):
    ...
    ValueError: Require 0 <= block_start_offset < 2**48, got 281474976710656

    """
    if within_block_offset < 0 or within_block_offset >= 65536:
        raise ValueError("Require 0 <= within_block_offset < 2**16, got %i"
                         % within_block_offset)
    if block_start_offset < 0 or block_start_offset >= 281474976710656:
        raise ValueError("Require 0 <= block_start_offset < 2**48, got %i"
                         % block_start_offset)
    return (block_start_offset << 16) | within_block_offset


def split_virtual_offset(virtual_offset):
    """Divides a 64-bit BGZF virtual offset into block start & within block offsets.

    >>> (100000, 0) == split_virtual_offset(6553600000)
    True
    >>> (100000, 10) == split_virtual_offset(6553600010)
    True

    """
    start = virtual_offset >> 16
    return start, virtual_offset ^ (start << 16)


def BgzfBlocks(handle):
    """Low level debugging function to inspect BGZF blocks.

    Expects a BGZF compressed file opened in binary read mode using
    the builtin open function. Do not use a handle from this bgzf
    module or the gzip module's open function which will decompress
    the file.

    Returns the block start offset (see virtual offsets), the block
    length (add these for the start of the next block), and the
    decompressed length of the block's contents (limited to 65536 in
    BGZF), as an iterator - one tuple per BGZF block.

    >>> try:
    ...     from __builtin__ import open # Python 2
    ... except ImportError:
    ...     from builtins import open # Python 3
    ...
    >>> handle = open("SamBam/ex1.bam", "rb")
    >>> for values in BgzfBlocks(handle):
    ...     print("Raw start %i, raw length %i; data start %i, data length %i" % values)
    Raw start 0, raw length 18239; data start 0, data length 65536
    Raw start 18239, raw length 18223; data start 65536, data length 65536
    Raw start 36462, raw length 18017; data start 131072, data length 65536
    Raw start 54479, raw length 17342; data start 196608, data length 65536
    Raw start 71821, raw length 17715; data start 262144, data length 65536
    Raw start 89536, raw length 17728; data start 327680, data length 65536
    Raw start 107264, raw length 17292; data start 393216, data length 63398
    Raw start 124556, raw length 28; data start 456614, data length 0
    >>> handle.close()

    Indirectly we can tell this file came from an old version of
    samtools because all the blocks (except the final one and the
    dummy empty EOF marker block) are 65536 bytes. Later versions
    avoid splitting a read between two blocks, and give the header
    its own block (useful to speed up replacing the header). You
    can see this in ex1_refresh.bam created using samtools 0.1.18:

    samtools view -b ex1.bam > ex1_refresh.bam

    >>> handle = open("SamBam/ex1_refresh.bam", "rb")
    >>> for values in BgzfBlocks(handle):
    ...     print("Raw start %i, raw length %i; data start %i, data length %i" % values)
    Raw start 0, raw length 53; data start 0, data length 38
    Raw start 53, raw length 18195; data start 38, data length 65434
    Raw start 18248, raw length 18190; data start 65472, data length 65409
    Raw start 36438, raw length 18004; data start 130881, data length 65483
    Raw start 54442, raw length 17353; data start 196364, data length 65519
    Raw start 71795, raw length 17708; data start 261883, data length 65411
    Raw start 89503, raw length 17709; data start 327294, data length 65466
    Raw start 107212, raw length 17390; data start 392760, data length 63854
    Raw start 124602, raw length 28; data start 456614, data length 0
    >>> handle.close()

    The above example has no embedded SAM header (thus the first block
    is very small at just 38 bytes of decompressed data), while the next
    example does (a larger block of 103 bytes). Notice that the rest of
    the blocks show the same sizes (they contain the same read data):

    >>> handle = open("SamBam/ex1_header.bam", "rb")
    >>> for values in BgzfBlocks(handle):
    ...     print("Raw start %i, raw length %i; data start %i, data length %i" % values)
    Raw start 0, raw length 104; data start 0, data length 103
    Raw start 104, raw length 18195; data start 103, data length 65434
    Raw start 18299, raw length 18190; data start 65537, data length 65409
    Raw start 36489, raw length 18004; data start 130946, data length 65483
    Raw start 54493, raw length 17353; data start 196429, data length 65519
    Raw start 71846, raw length 17708; data start 261948, data length 65411
    Raw start 89554, raw length 17709; data start 327359, data length 65466
    Raw start 107263, raw length 17390; data start 392825, data length 63854
    Raw start 124653, raw length 28; data start 456679, data length 0
    >>> handle.close()

    """
    data_start = 0
    while True:
        start_offset = handle.tell()
        # This may raise StopIteration which is perfect here
        block_length, data = _load_bgzf_block(handle)
        data_len = len(data)
        yield start_offset, block_length, data_start, data_len
        data_start += data_len


def _load_bgzf_block(handle, text_mode=False):
    """Internal function to load the next BGZF block (PRIVATE)."""
    magic = handle.read(4)
    if not magic:
        # End of file
        raise StopIteration
    if magic != _bgzf_magic:
        raise ValueError(r"A BGZF (e.g. a BAM file) block should start with "
                         r"%r, not %r; handle.tell() now says %r"
                         % (_bgzf_magic, magic, handle.tell()))
    gzip_mod_time, gzip_extra_flags, gzip_os, extra_len = \
        struct.unpack("<LBBH", handle.read(8))

    block_size = None
    x_len = 0
    while x_len < extra_len:
        subfield_id = handle.read(2)
        subfield_len = struct.unpack("<H", handle.read(2))[0]  # uint16_t
        subfield_data = handle.read(subfield_len)
        x_len += subfield_len + 4
        if subfield_id == _bytes_BC:
            assert subfield_len == 2, "Wrong BC payload length"
            assert block_size is None, "Two BC subfields?"
            block_size = struct.unpack("<H", subfield_data)[0] + 1  # uint16_t
    assert x_len == extra_len, (x_len, extra_len)
    assert block_size is not None, "Missing BC, this isn't a BGZF file!"
    # Now comes the compressed data, CRC, and length of uncompressed data.
    deflate_size = block_size - 1 - extra_len - 19
    d = zlib.decompressobj(-15)  # Negative window size means no headers
    data = d.decompress(handle.read(deflate_size)) + d.flush()
    expected_crc = handle.read(4)
    expected_size = struct.unpack("<I", handle.read(4))[0]
    assert expected_size == len(data), \
        "Decompressed to %i, not %i" % (len(data), expected_size)
    # Should cope with a mix of Python platforms...
    crc = zlib.crc32(data)
    if crc < 0:
        crc = struct.pack("<i", crc)
    else:
        crc = struct.pack("<I", crc)
    assert expected_crc == crc, \
        "CRC is %s, not %s" % (crc, expected_crc)
    if text_mode:
        return block_size, _as_string(data)
    else:
        return block_size, data


class BgzfReader(object):
    r"""BGZF reader, acts like a read only handle but seek/tell differ.

    Let's use the BgzfBlocks function to have a peek at the BGZF blocks
    in an example BAM file,

    >>> try:
    ...     from __builtin__ import open # Python 2
    ... except ImportError:
    ...     from builtins import open # Python 3
    ...
    >>> handle = open("SamBam/ex1.bam", "rb")
    >>> for values in BgzfBlocks(handle):
    ...     print("Raw start %i, raw length %i; data start %i, data length %i" % values)
    Raw start 0, raw length 18239; data start 0, data length 65536
    Raw start 18239, raw length 18223; data start 65536, data length 65536
    Raw start 36462, raw length 18017; data start 131072, data length 65536
    Raw start 54479, raw length 17342; data start 196608, data length 65536
    Raw start 71821, raw length 17715; data start 262144, data length 65536
    Raw start 89536, raw length 17728; data start 327680, data length 65536
    Raw start 107264, raw length 17292; data start 393216, data length 63398
    Raw start 124556, raw length 28; data start 456614, data length 0
    >>> handle.close()

    Now let's see how to use this block information to jump to
    specific parts of the decompressed BAM file:

    >>> handle = BgzfReader("SamBam/ex1.bam", "rb")
    >>> assert 0 == handle.tell()
    >>> magic = handle.read(4)
    >>> assert 4 == handle.tell()

    So far nothing so strange, we got the magic marker used at the
    start of a decompressed BAM file, and the handle position makes
    sense. Now however, let's jump to the end of this block and 4
    bytes into the next block by reading 65536 bytes,

    >>> data = handle.read(65536)
    >>> len(data)
    65536
    >>> assert 1195311108 == handle.tell()

    Expecting 4 + 65536 = 65540 were you? Well this is a BGZF 64-bit
    virtual offset, which means:

    >>> split_virtual_offset(1195311108)
    (18239, 4)

    You should spot 18239 as the start of the second BGZF block, while
    the 4 is the offset into this block. See also make_virtual_offset,

    >>> make_virtual_offset(18239, 4)
    1195311108

    Let's jump back to almost the start of the file,

    >>> make_virtual_offset(0, 2)
    2
    >>> handle.seek(2)
    2
    >>> handle.close()

    Note that you can use the max_cache argument to limit the number of
    BGZF blocks cached in memory. The default is 100, and since each
    block can be up to 64kb, the default cache could take up to 6MB of
    RAM. The cache is not important for reading through the file in one
    pass, but is important for improving performance of random access.
    """

    def __init__(self, filename=None, mode="r", fileobj=None, max_cache=100):
        # TODO - Assuming we can seek, check for 28 bytes EOF empty block
        # and if missing warn about possible truncation (as in samtools)?
        if max_cache < 1:
            raise ValueError("Use max_cache with a minimum of 1")
        # Must open the BGZF file in binary mode, but we may want to
        # treat the contents as either text or binary (unicode or
        # bytes under Python 3)
        if fileobj:
            assert filename is None
            handle = fileobj
            assert "b" in handle.mode.lower()
        else:
            if "w" in mode.lower() or "a" in mode.lower():
                raise ValueError("Must use read mode (default), "
                                 "not write or append mode")
            handle = _open(filename, "rb")
        self._text = "b" not in mode.lower()
        if self._text:
            self._newline = "\n"
        else:
            self._newline = b"\n"
        self._handle = handle
        self.max_cache = max_cache
        self._buffers = {}
        self._block_start_offset = None
        self._block_raw_length = None
        self._load_block(handle.tell())

    def _load_block(self, start_offset=None):
        if start_offset is None:
            # If the file is being read sequentially, then _handle.tell()
            # should be pointing at the start of the next block.
            # However, if seek has been used, we can't assume that.
            start_offset = self._block_start_offset + self._block_raw_length
        if start_offset == self._block_start_offset:
            self._within_block_offset = 0
            return
        elif start_offset in self._buffers:
            # Already in cache
            self._buffer, self._block_raw_length = self._buffers[start_offset]
            self._within_block_offset = 0
            self._block_start_offset = start_offset
            return
        # Must hit the disk... first check cache limits,
        while len(self._buffers) >= self.max_cache:
            # TODO - Implement LRU cache removal?
            self._buffers.popitem()
        # Now load the block
        handle = self._handle
        if start_offset is not None:
            handle.seek(start_offset)
        self._block_start_offset = handle.tell()
        try:
            block_size, self._buffer = _load_bgzf_block(handle, self._text)
        except StopIteration:
            # EOF
            block_size = 0
            if self._text:
                self._buffer = ""
            else:
                self._buffer = b""
        self._within_block_offset = 0
        self._block_raw_length = block_size
        # Finally save the block in our cache,
        self._buffers[self._block_start_offset] = self._buffer, block_size

    def tell(self):
        """Returns a 64-bit unsigned BGZF virtual offset."""
        if 0 < self._within_block_offset == len(self._buffer):
            # Special case where we're right at the end of a (non empty) block.
            # For non-maximal blocks could give two possible virtual offsets,
            # but for a maximal block can't use 65536 as the within block
            # offset. Therefore for consistency, use the next block and a
            # within block offset of zero.
            return (self._block_start_offset + self._block_raw_length) << 16
        else:
            # return make_virtual_offset(self._block_start_offset,
            #                            self._within_block_offset)
            # TODO - Include bounds checking as in make_virtual_offset?
            return (self._block_start_offset << 16) | self._within_block_offset

    def seek(self, virtual_offset):
        """Seek to a 64-bit unsigned BGZF virtual offset."""
        # Do this inline to avoid a function call,
        # start_offset, within_block = split_virtual_offset(virtual_offset)
        start_offset = virtual_offset >> 16
        within_block = virtual_offset ^ (start_offset << 16)
        if start_offset != self._block_start_offset:
            # Don't need to load the block if already there
            # (this avoids a function call since _load_block would do nothing)
            self._load_block(start_offset)
            assert start_offset == self._block_start_offset
        if within_block > len(self._buffer) \
                and not (within_block == 0 and len(self._buffer) == 0):
            raise ValueError("Within offset %i but block size only %i"
                             % (within_block, len(self._buffer)))
        self._within_block_offset = within_block
        # assert virtual_offset == self.tell(), \
        #     "Did seek to %i (%i, %i), but tell says %i (%i, %i)" \
        #     % (virtual_offset, start_offset, within_block,
        #        self.tell(), self._block_start_offset, self._within_block_offset)
        return virtual_offset

    def read(self, size=-1):
        if size < 0:
            raise NotImplementedError("Don't be greedy, that could be massive!")
        elif size == 0:
            if self._text:
                return ""
            else:
                return b""
        elif self._within_block_offset + size <= len(self._buffer):
            # This may leave us right at the end of a block
            # (lazy loading, don't load the next block unless we have to)
            data = self._buffer[self._within_block_offset:self._within_block_offset + size]
            self._within_block_offset += size
            assert data  # Must be at least 1 byte
            return data
        else:
            data = self._buffer[self._within_block_offset:]
            size -= len(data)
            self._load_block()  # will reset offsets
            # TODO - Test with corner case of an empty block followed by
            # a non-empty block
            if not self._buffer:
                return data  # EOF
            elif size:
                # TODO - Avoid recursion
                return data + self.read(size)
            else:
                # Only needed the end of the last block
                return data

    def readline(self):
        i = self._buffer.find(self._newline, self._within_block_offset)
        # Three cases to consider,
        if i == -1:
            # No newline, need to read in more data
            data = self._buffer[self._within_block_offset:]
            self._load_block()  # will reset offsets
            if not self._buffer:
                return data  # EOF
            else:
                # TODO - Avoid recursion
                return data + self.readline()
        elif i + 1 == len(self._buffer):
            # Found new line, but right at end of block (SPECIAL)
            data = self._buffer[self._within_block_offset:]
            # Must now load the next block to ensure tell() works
            self._load_block()  # will reset offsets
            assert data
            return data
        else:
            # Found new line, not at end of block (easy case, no IO)
            data = self._buffer[self._within_block_offset:i + 1]
            self._within_block_offset = i + 1
            # assert data.endswith(self._newline)
            return data

    def __next__(self):
        line = self.readline()
        if not line:
            raise StopIteration
        return line

    if sys.version_info[0] < 3:
        def next(self):
            """Python 2 style alias for Python 3 style __next__ method."""
            return self.__next__()

    def __iter__(self):
        return self

    def close(self):
        self._handle.close()
        self._buffer = None
        self._block_start_offset = None
        self._buffers = None

    def seekable(self):
        return True

    def isatty(self):
        return False

    def fileno(self):
        return self._handle.fileno()

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        self.close()

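
# A minimal sketch (not part of the original module) of the truncation
# check mentioned in the BgzfReader TODO above: complete BGZF files end
# with the 28 byte empty EOF block stored in _bgzf_eof. The helper name
# _has_bgzf_eof is hypothetical.
def _has_bgzf_eof(handle):
    """Check a seekable binary handle ends with the BGZF EOF marker (sketch)."""
    handle.seek(-28, 2)  # whence=2 means relative to the end of the file
    return handle.read(28) == _bgzf_eof
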
class BgzfWriter(object):

    def __init__(self, filename=None, mode="w", fileobj=None, compresslevel=6):
        if fileobj:
            assert filename is None
            handle = fileobj
        else:
            if "w" not in mode.lower() and "a" not in mode.lower():
                raise ValueError("Must use write or append mode, not %r" % mode)
            if "a" in mode.lower():
                handle = _open(filename, "ab")
            else:
                handle = _open(filename, "wb")
        self._text = "b" not in mode.lower()
        self._handle = handle
        self._buffer = b""
        self.compresslevel = compresslevel

    def _write_block(self, block):
        # print("Saving %i bytes" % len(block))
        start_offset = self._handle.tell()
        assert len(block) <= 65536
        # Giving a negative window bits means no gzip/zlib headers, -15 used in samtools
        c = zlib.compressobj(self.compresslevel,
                             zlib.DEFLATED,
                             -15,
                             zlib.DEF_MEM_LEVEL,
                             0)
        compressed = c.compress(block) + c.flush()
        del c
        assert len(compressed) < 65536, \
            "TODO - Didn't compress enough, try less data in this block"
        bsize = struct.pack("<H", len(compressed) + 25)  # includes -1
        # Should cope with a mix of Python platforms (mask to unsigned 32 bit)...
        crc = struct.pack("<I", zlib.crc32(block) & 0xffffffff)
        uncompressed_length = struct.pack("<I", len(block))
        # Fixed 16 bytes,
        # gzip magic bytes (4) mod time (4),
        # gzip flag (1), os (1), extra length which is six (2),
        # sub field which is BC (2), sub field length of two (2),
        # Variable data,
        # 2 bytes: block length as BC sub field (2)
        # X bytes: the data
        # 8 bytes: crc (4), uncompressed data length (4)
        data = _bgzf_header + bsize + compressed + crc + uncompressed_length
        self._handle.write(data)

    def write(self, data):
        # TODO - Check bytes vs unicode
        data = _as_bytes(data)
        # block_size = 2**16 = 65536
        data_len = len(data)
        if len(self._buffer) + data_len < 65536:
            # print("Cached %r" % data)
            self._buffer += data
            return
        else:
            # print("Got %r, writing out some data..." % data)
            self._buffer += data
            while len(self._buffer) >= 65536:
                self._write_block(self._buffer[:65536])
                self._buffer = self._buffer[65536:]

    def flush(self):
        while len(self._buffer) >= 65536:
            self._write_block(self._buffer[:65535])
            self._buffer = self._buffer[65535:]
        self._write_block(self._buffer)
        self._buffer = b""
        self._handle.flush()

    def close(self):
        """Flush data, write 28 bytes empty BGZF EOF marker, and close the BGZF file."""
        if self._buffer:
            self.flush()
        # samtools will look for a magic EOF marker, just a 28 byte empty BGZF
        # block, and if it is missing warns the BAM file may be truncated. In
        # addition to samtools writing this block, so too does bgzip - so we
        # should too.
        self._handle.write(_bgzf_eof)
        self._handle.flush()
        self._handle.close()

    def tell(self):
        """Returns a BGZF 64-bit virtual offset."""
        return make_virtual_offset(self._handle.tell(), len(self._buffer))

    def seekable(self):
        # Not seekable, but we do support tell...
        return False

    def isatty(self):
        return False

    def fileno(self):
        return self._handle.fileno()

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        self.close()
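
# Example usage sketch for BgzfWriter (left as comments so nothing runs on
# import; the file name example.fastq.bgz is illustrative):
#
#     with BgzfWriter("example.fastq.bgz", "wb") as writer:
#         writer.write(b"@read1\nACGTACGT\n+\nIIIIIIII\n")
#     # close() (here via the with block) appends the 28 byte EOF marker.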

if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        print("Call this with no arguments and pipe uncompressed data in on stdin")
        print("and it will produce BGZF compressed data on stdout. e.g.")
        print("")
        print("./bgzf.py < example.fastq > example.fastq.bgz")
        print("")
        print("The extension convention of *.bgz is to distinguish these from *.gz")
        print("used for standard gzipped files without the block structure of BGZF.")
        print("You can use the standard gunzip command to decompress BGZF files,")
        print("if it complains about the extension try something like this:")
        print("")
        print("cat example.fastq.bgz | gunzip > example.fastq")
        print("")
        print("See also the tool bgzip that comes with samtools")
        sys.exit(0)

    sys.stderr.write("Producing BGZF output from stdin...\n")
    w = BgzfWriter(fileobj=sys.stdout)
    while True:
        data = sys.stdin.read(65536)
        w.write(data)
        if not data:
            break
    # Doing close will write an empty BGZF block as EOF marker:
    w.close()
    sys.stderr.write("BGZF data produced\n")