
Source Code for Module Bio.SeqIO.SffIO

   1  # Copyright 2009-2010 by Peter Cock.  All rights reserved. 
   2  # Based on code contributed and copyright 2009 by Jose Blanca (COMAV-UPV). 
   3  # 
   4  # This code is part of the Biopython distribution and governed by its 
   5  # license.  Please see the LICENSE file that should have been included 
   6  # as part of this package. 
   7  """Bio.SeqIO support for the binary Standard Flowgram Format (SFF) file format. 
   8   
   9  SFF was designed by 454 Life Sciences (Roche), the Whitehead Institute for 
   10  Biomedical Research and the Wellcome Trust Sanger Institute. SFF was also 
   11  used as the native output format by early versions of Ion Torrent's PGM 
   12  platform. You are expected to use this module via the Bio.SeqIO functions 
   13  under the format name "sff" (or "sff-trim" as described below). 
  14   
  15  For example, to iterate over the records in an SFF file, 
  16   
  17      >>> from Bio import SeqIO 
  18      >>> for record in SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff"): 
  19      ...     print("%s %i %s..." % (record.id, len(record), record.seq[:20])) 
  20      ... 
  21      E3MFGYR02JWQ7T 265 tcagGGTCTACATGTTGGTT... 
  22      E3MFGYR02JA6IL 271 tcagTTTTTTTTGGAAAGGA... 
  23      E3MFGYR02JHD4H 310 tcagAAAGACAAGTGGTATC... 
  24      E3MFGYR02GFKUC 299 tcagCGGCCGGGCCTCTCAT... 
  25      E3MFGYR02FTGED 281 tcagTGGTAATGGGGGGAAA... 
  26      E3MFGYR02FR9G7 261 tcagCTCCGTAAGAAGGTGC... 
  27      E3MFGYR02GAZMS 278 tcagAAAGAAGTAAGGTAAA... 
  28      E3MFGYR02HHZ8O 221 tcagACTTTCTTCTTTACCG... 
  29      E3MFGYR02GPGB1 269 tcagAAGCAGTGGTATCAAC... 
  30      E3MFGYR02F7Z7G 219 tcagAATCATCCACTTTTTA... 
  31   
  32  Each SeqRecord object will contain all the annotation from the SFF file, 
  33  including the PHRED quality scores. 
  34   
  35      >>> print("%s %i" % (record.id, len(record))) 
  36      E3MFGYR02F7Z7G 219 
  37      >>> print("%s..." % record.seq[:10]) 
  38      tcagAATCAT... 
  39      >>> print("%r..." % (record.letter_annotations["phred_quality"][:10])) 
  40      [22, 21, 23, 28, 26, 15, 12, 21, 28, 21]... 
  41   
   42  Notice that the sequence is given in mixed case; the central upper case 
   43  region corresponds to the trimmed sequence. This matches the output of the 
   44  Roche tools (and the 3rd party tool sff_extract) for SFF to FASTA conversion. 
  45   
  46      >>> print(record.annotations["clip_qual_left"]) 
  47      4 
  48      >>> print(record.annotations["clip_qual_right"]) 
  49      134 
  50      >>> print(record.seq[:4]) 
  51      tcag 
  52      >>> print("%s...%s" % (record.seq[4:20], record.seq[120:134])) 
  53      AATCATCCACTTTTTA...CAAAACACAAACAG 
  54      >>> print(record.seq[134:]) 
  55      atcttatcaacaaaactcaaagttcctaactgagacacgcaacaggggataagacaaggcacacaggggataggnnnnnnnnnnn 
  56   
  57  The annotations dictionary also contains any adapter clip positions 
  58  (usually zero), and information about the flows. e.g. 
  59   
  60      >>> len(record.annotations) 
  61      11 
  62      >>> print(record.annotations["flow_key"]) 
  63      TCAG 
  64      >>> print(record.annotations["flow_values"][:10]) 
  65      (83, 1, 128, 7, 4, 84, 6, 106, 3, 172) 
  66      >>> print(len(record.annotations["flow_values"])) 
  67      400 
  68      >>> print(record.annotations["flow_index"][:10]) 
  69      (1, 2, 3, 2, 2, 0, 3, 2, 3, 3) 
  70      >>> print(len(record.annotations["flow_index"])) 
  71      219 
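The flow_index values are per-base increments rather than absolute flow positions, so a cumulative sum recovers which flow each called base came from. This is a hedged sketch based on the SFF format description, not code from this module; the `flow_chars` value is assumed here to be the usual repeated "TACG" dispensation cycle (the doctest above does not show it):

```python
from itertools import accumulate

# Assumed 454-style flow order (not shown in the doctest above):
flow_chars = "TACG" * 100  # 400 flows

# The first ten flow_index increments from the doctest above:
flow_index = (1, 2, 3, 2, 2, 0, 3, 2, 3, 3)

# Cumulative sum gives 1-based flow numbers; subtract one for 0-based:
flows = [i - 1 for i in accumulate(flow_index)]
print(flows)
# [0, 2, 5, 7, 9, 9, 12, 14, 17, 20]

# Reading flow_chars at those positions reproduces the start of the read
# (note flows 9 and 9 repeat - two bases called from one long 'A' flow):
print("".join(flow_chars[i] for i in flows))
# TCAGAATCAT
```

This matches the start of the example read's sequence, "tcagAATCAT...", uppercased.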
  72   
  73  Note that to convert from a raw reading in flow_values to the corresponding 
  74  homopolymer stretch estimate, the value should be rounded to the nearest 100: 
  75   
  76      >>> print("%r..." % [int(round(value, -2)) // 100 
  77      ...                  for value in record.annotations["flow_values"][:10]]) 
  78      ... 
  79      [1, 0, 1, 0, 0, 1, 0, 1, 0, 2]... 
  80   
  81  If a read name is exactly 14 alphanumeric characters, the annotations 
  82  dictionary will also contain meta-data about the read extracted by 
   83  interpreting the name as a 454 Sequencing System "Universal" Accession 
  84  Number. Note that if a read name happens to be exactly 14 alphanumeric 
  85  characters but was not generated automatically, these annotation records 
  86  will contain nonsense information. 
  87   
  88      >>> print(record.annotations["region"]) 
  89      2 
  90      >>> print(record.annotations["time"]) 
  91      [2008, 1, 9, 16, 16, 0] 
  92      >>> print(record.annotations["coords"]) 
  93      (2434, 1658) 
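The decoding behind these annotations can be sketched as a stand-alone function (a hypothetical re-implementation of this module's private helpers, using the base-36 scheme from the 454 manual in which letters map to 0-25 and digits to 26-35; `decode_uan` is an illustrative name, not part of the API):

```python
import re


def _base36(string):
    """Interpret a string as a 454-style base-36 number.

    'A'-'Z' and 'a'-'z' map to 0-25, while '0'-'9' map to 26-35.
    """
    total = 0
    for power, c in enumerate(reversed(string)):
        if c.isdigit():
            val = ord(c) - 22  # '0'-'9' -> 26-35
        else:
            val = ord(c.upper()) - 65  # letters -> 0-25
        total += val * 36 ** power
    return total


def decode_uan(name):
    """Decode a 14 character read name into (time, region, (x, y))."""
    if not re.match(r"^[a-zA-Z0-9]{14}$", name):
        raise ValueError("Not a 14 character alphanumeric read name")
    # First six characters encode the run start time:
    remainder = _base36(name[:6])
    time_list = []
    for denom in (13 * 32 * 24 * 3600, 32 * 24 * 3600, 24 * 3600, 3600, 60):
        value, remainder = divmod(remainder, denom)
        time_list.append(value)
    time_list.append(remainder)
    time_list[0] += 2000  # years are stored relative to 2000
    region = int(name[8])  # ninth character is the region
    coords = divmod(_base36(name[9:]), 4096)  # last five encode X and Y
    return time_list, region, coords


print(decode_uan("E3MFGYR02F7Z7G"))
# ([2008, 1, 9, 16, 16, 0], 2, (2434, 1658))
```

This reproduces the "time", "region" and "coords" values shown in the doctest above.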
  94   
   95  As a convenience, you can read the file with SeqIO format name "sff-trim" 
  96  instead of "sff" to get just the trimmed sequences (without any annotation 
  97  except for the PHRED quality scores and anything encoded in the read names): 
  98   
  99      >>> from Bio import SeqIO 
 100      >>> for record in SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff-trim"): 
 101      ...     print("%s %i %s..." % (record.id, len(record), record.seq[:20])) 
 102      ... 
 103      E3MFGYR02JWQ7T 260 GGTCTACATGTTGGTTAACC... 
 104      E3MFGYR02JA6IL 265 TTTTTTTTGGAAAGGAAAAC... 
 105      E3MFGYR02JHD4H 292 AAAGACAAGTGGTATCAACG... 
 106      E3MFGYR02GFKUC 295 CGGCCGGGCCTCTCATCGGT... 
 107      E3MFGYR02FTGED 277 TGGTAATGGGGGGAAATTTA... 
 108      E3MFGYR02FR9G7 256 CTCCGTAAGAAGGTGCTGCC... 
 109      E3MFGYR02GAZMS 271 AAAGAAGTAAGGTAAATAAC... 
 110      E3MFGYR02HHZ8O 150 ACTTTCTTCTTTACCGTAAC... 
 111      E3MFGYR02GPGB1 221 AAGCAGTGGTATCAACGCAG... 
 112      E3MFGYR02F7Z7G 130 AATCATCCACTTTTTAACGT... 
 113   
  114  Looking at the final record in more detail, note how this differs from the 
 115  example above: 
 116   
 117      >>> print("%s %i" % (record.id, len(record))) 
 118      E3MFGYR02F7Z7G 130 
 119      >>> print("%s..." % record.seq[:10]) 
 120      AATCATCCAC... 
 121      >>> print("%r..." % record.letter_annotations["phred_quality"][:10]) 
 122      [26, 15, 12, 21, 28, 21, 36, 28, 27, 27]... 
 123      >>> len(record.annotations) 
 124      3 
 125      >>> print(record.annotations["region"]) 
 126      2 
 127      >>> print(record.annotations["coords"]) 
 128      (2434, 1658) 
 129      >>> print(record.annotations["time"]) 
 130      [2008, 1, 9, 16, 16, 0] 
 131   
 132  You might use the Bio.SeqIO.convert() function to convert the (trimmed) SFF 
 133  reads into a FASTQ file (or a FASTA file and a QUAL file), e.g. 
 134   
 135      >>> from Bio import SeqIO 
 136      >>> try: 
 137      ...     from StringIO import StringIO # Python 2 
 138      ... except ImportError: 
 139      ...     from io import StringIO # Python 3 
 140      ... 
 141      >>> out_handle = StringIO() 
 142      >>> count = SeqIO.convert("Roche/E3MFGYR02_random_10_reads.sff", "sff", 
 143      ...                       out_handle, "fastq") 
 144      ... 
 145      >>> print("Converted %i records" % count) 
 146      Converted 10 records 
 147   
 148  The output FASTQ file would start like this: 
 149   
 150      >>> print("%s..." % out_handle.getvalue()[:50]) 
 151      @E3MFGYR02JWQ7T 
 152      tcagGGTCTACATGTTGGTTAACCCGTACTGATT... 
 153   
 154  Bio.SeqIO.index() provides memory efficient random access to the reads in an 
 155  SFF file by name. SFF files can include an index within the file, which can 
  156  be read in, making this very fast. If the index is missing (or in a format not 
 157  yet supported in Biopython) the file is indexed by scanning all the reads - 
 158  which is a little slower. For example, 
 159   
 160      >>> from Bio import SeqIO 
 161      >>> reads = SeqIO.index("Roche/E3MFGYR02_random_10_reads.sff", "sff") 
 162      >>> record = reads["E3MFGYR02JHD4H"] 
 163      >>> print("%s %i %s..." % (record.id, len(record), record.seq[:20])) 
 164      E3MFGYR02JHD4H 310 tcagAAAGACAAGTGGTATC... 
 165      >>> reads.close() 
 166   
 167  Or, using the trimmed reads: 
 168   
 169      >>> from Bio import SeqIO 
 170      >>> reads = SeqIO.index("Roche/E3MFGYR02_random_10_reads.sff", "sff-trim") 
 171      >>> record = reads["E3MFGYR02JHD4H"] 
 172      >>> print("%s %i %s..." % (record.id, len(record), record.seq[:20])) 
 173      E3MFGYR02JHD4H 292 AAAGACAAGTGGTATCAACG... 
 174      >>> reads.close() 
 175   
 176  You can also use the Bio.SeqIO.write() function with the "sff" format. Note 
 177  that this requires all the flow information etc, and thus is probably only 
 178  useful for SeqRecord objects originally from reading another SFF file (and 
 179  not the trimmed SeqRecord objects from parsing an SFF file as "sff-trim"). 
 180   
 181  As an example, let's pretend this example SFF file represents some DNA which 
  182  was pre-amplified with the PCR primer AAAGANNNNN. The following script would 
 183  produce a sub-file containing all those reads whose post-quality clipping 
 184  region (i.e. the sequence after trimming) starts with AAAGA exactly (the non- 
 185  degenerate bit of this pretend primer): 
 186   
 187      >>> from Bio import SeqIO 
 188      >>> records = (record for record in 
 189      ...            SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff") 
 190      ...            if record.seq[record.annotations["clip_qual_left"]:].startswith("AAAGA")) 
 191      ... 
 192      >>> count = SeqIO.write(records, "temp_filtered.sff", "sff") 
 193      >>> print("Selected %i records" % count) 
 194      Selected 2 records 
 195   
 196  Of course, for an assembly you would probably want to remove these primers. 
 197  If you want FASTA or FASTQ output, you could just slice the SeqRecord. However, 
 198  if you want SFF output we have to preserve all the flow information - the trick 
 199  is just to adjust the left clip position! 
 200   
 201      >>> from Bio import SeqIO 
 202      >>> def filter_and_trim(records, primer): 
 203      ...     for record in records: 
 204      ...         if record.seq[record.annotations["clip_qual_left"]:].startswith(primer): 
 205      ...             record.annotations["clip_qual_left"] += len(primer) 
 206      ...             yield record 
 207      ... 
 208      >>> records = SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff") 
 209      >>> count = SeqIO.write(filter_and_trim(records, "AAAGA"), 
 210      ...                     "temp_filtered.sff", "sff") 
 211      ... 
 212      >>> print("Selected %i records" % count) 
 213      Selected 2 records 
 214   
  215  We can check the results; note that the lower case clipped region now 
  216  includes the "AAAGA" sequence: 
 217   
 218      >>> for record in SeqIO.parse("temp_filtered.sff", "sff"): 
 219      ...     print("%s %i %s..." % (record.id, len(record), record.seq[:20])) 
 220      ... 
 221      E3MFGYR02JHD4H 310 tcagaaagaCAAGTGGTATC... 
 222      E3MFGYR02GAZMS 278 tcagaaagaAGTAAGGTAAA... 
 223      >>> for record in SeqIO.parse("temp_filtered.sff", "sff-trim"): 
 224      ...     print("%s %i %s..." % (record.id, len(record), record.seq[:20])) 
 225      ... 
 226      E3MFGYR02JHD4H 287 CAAGTGGTATCAACGCAGAG... 
 227      E3MFGYR02GAZMS 266 AGTAAGGTAAATAACAAACG... 
 228      >>> import os 
 229      >>> os.remove("temp_filtered.sff") 
 230   
 231  For a description of the file format, please see the Roche manuals and: 
 232  http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=formats 
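For orientation, the fixed 31-byte common header that starts every SFF file can be unpacked with a single big-endian struct format. This is a minimal sketch matching the layout used by the parser below; `parse_common_header` is an illustrative name, not part of this module:

```python
import struct

# magic, 4-byte version, index offset/length, read count,
# header length, key length, flows per read, flowgram format code:
SFF_HEADER_FMT = ">4s4BQIIHHHB"
assert struct.calcsize(SFF_HEADER_FMT) == 31


def parse_common_header(data):
    """Unpack the fixed 31-byte part of an SFF file header."""
    (magic, v0, v1, v2, v3, index_offset, index_length, number_of_reads,
     header_length, key_length, flows_per_read,
     flowgram_format) = struct.unpack(SFF_HEADER_FMT, data[:31])
    if magic != b".sff" or (v0, v1, v2, v3) != (0, 0, 0, 1):
        raise ValueError("Not a version 0.0.0.1 SFF file")
    return number_of_reads, flows_per_read, key_length


# Round trip with a synthetic header (all values made up for illustration):
fake = struct.pack(SFF_HEADER_FMT, b".sff", 0, 0, 0, 1,
                   0, 0, 10, 40, 4, 400, 1)
print(parse_common_header(fake))
# (10, 400, 4)
```

The variable-length tail of the header (the flow characters and key sequence, plus null padding to a multiple of 8 bytes) follows these 31 bytes.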
 233   
 234  """ 
 235   
 236  from __future__ import print_function 
 237   
 238  from Bio.SeqIO.Interfaces import SequenceWriter 
 239  from Bio import Alphabet 
 240  from Bio.Seq import Seq 
 241  from Bio.SeqRecord import SeqRecord 
 242  import struct 
 243  import sys 
 244  import re 
 245   
 246  from Bio._py3k import _bytes_to_string, _as_bytes 
 247  _null = b"\0" 
 248  _sff = b".sff" 
 249  _hsh = b".hsh" 
 250  _srt = b".srt" 
 251  _mft = b".mft" 
 252  _flag = b"\xff" 
 253   
 254   
def _sff_file_header(handle):
    """Read in an SFF file header (PRIVATE).

    Assumes the handle is at the start of the file, will read forward
    through the header and leave the handle pointing at the first record.
    Returns a tuple of values from the header (header_length, index_offset,
    index_length, number_of_reads, flows_per_read, flow_chars, key_sequence)

    >>> with open("Roche/greek.sff", "rb") as handle:
    ...     values = _sff_file_header(handle)
    ...
    >>> print(values[0])
    840
    >>> print(values[1])
    65040
    >>> print(values[2])
    256
    >>> print(values[3])
    24
    >>> print(values[4])
    800
    >>> values[-1]
    'TCAG'

    """
    if hasattr(handle, "mode") and "U" in handle.mode.upper():
        raise ValueError("SFF files must NOT be opened in universal new "
                         "lines mode. Binary mode is recommended (although "
                         "on Unix the default mode is also fine).")
    elif hasattr(handle, "mode") and "B" not in handle.mode.upper() \
            and sys.platform == "win32":
        raise ValueError("SFF files must be opened in binary mode on Windows")
    #file header (part one)
    #use big endian encoding >
    #magic_number               I
    #version                    4B
    #index_offset               Q
    #index_length               I
    #number_of_reads            I
    #header_length              H
    #key_length                 H
    #number_of_flows_per_read   H
    #flowgram_format_code       B
    #[rest of file header depends on the number of flows and how many keys]
    fmt = '>4s4BQIIHHHB'
    assert 31 == struct.calcsize(fmt)
    data = handle.read(31)
    if not data:
        raise ValueError("Empty file.")
    elif len(data) < 31:
        raise ValueError("File too small to hold a valid SFF header.")
    magic_number, ver0, ver1, ver2, ver3, index_offset, index_length, \
        number_of_reads, header_length, key_length, number_of_flows_per_read, \
        flowgram_format = struct.unpack(fmt, data)
    if magic_number in [_hsh, _srt, _mft]:
        #Probably user error, calling Bio.SeqIO.parse() twice!
        raise ValueError("Handle seems to be at SFF index block, not start")
    if magic_number != _sff:  # 779314790
        raise ValueError("SFF file did not start '.sff', but %s"
                         % repr(magic_number))
    if (ver0, ver1, ver2, ver3) != (0, 0, 0, 1):
        raise ValueError("Unsupported SFF version in header, %i.%i.%i.%i"
                         % (ver0, ver1, ver2, ver3))
    if flowgram_format != 1:
        raise ValueError("Flowgram format code %i not supported"
                         % flowgram_format)
    if (index_offset != 0) ^ (index_length != 0):
        raise ValueError("Index offset %i but index length %i"
                         % (index_offset, index_length))
    flow_chars = _bytes_to_string(handle.read(number_of_flows_per_read))
    key_sequence = _bytes_to_string(handle.read(key_length))
    #According to the spec, the header_length field should be the total number
    #of bytes required by this set of header fields, and should be equal to
    #"31 + number_of_flows_per_read + key_length" rounded up to the next value
    #divisible by 8.
    assert header_length % 8 == 0
    padding = header_length - number_of_flows_per_read - key_length - 31
    assert 0 <= padding < 8, padding
    if handle.read(padding).count(_null) != padding:
        import warnings
        from Bio import BiopythonParserWarning
        warnings.warn("Your SFF file is invalid, post header %i byte "
                      "null padding region contained data." % padding,
                      BiopythonParserWarning)
    return header_length, index_offset, index_length, \
        number_of_reads, number_of_flows_per_read, \
        flow_chars, key_sequence


def _sff_do_slow_index(handle):
    """Generates an index by scanning through all the reads in an SFF file (PRIVATE).

    This is a slow but generic approach if we can't parse the provided index
    (if present).

    Will use the handle seek/tell functions.
    """
    handle.seek(0)
    header_length, index_offset, index_length, number_of_reads, \
        number_of_flows_per_read, flow_chars, key_sequence \
        = _sff_file_header(handle)
    #Now on to the reads...
    read_header_fmt = '>2HI4H'
    read_header_size = struct.calcsize(read_header_fmt)
    #NOTE - assuming flowgram_format==1, which means struct type H
    read_flow_fmt = ">%iH" % number_of_flows_per_read
    read_flow_size = struct.calcsize(read_flow_fmt)
    assert 1 == struct.calcsize(">B")
    assert 1 == struct.calcsize(">s")
    assert 1 == struct.calcsize(">c")
    assert read_header_size % 8 == 0  # Important for padding calc later!
    for read in range(number_of_reads):
        record_offset = handle.tell()
        if record_offset == index_offset:
            #Found index block within reads, ignore it:
            offset = index_offset + index_length
            if offset % 8:
                offset += 8 - (offset % 8)
            assert offset % 8 == 0
            handle.seek(offset)
            record_offset = offset
        #assert record_offset%8 == 0  # Worth checking, but slow
        #First the fixed header
        data = handle.read(read_header_size)
        read_header_length, name_length, seq_len, clip_qual_left, \
            clip_qual_right, clip_adapter_left, clip_adapter_right \
            = struct.unpack(read_header_fmt, data)
        if read_header_length < 10 or read_header_length % 8 != 0:
            raise ValueError("Malformed read header, says length is %i:\n%s"
                             % (read_header_length, repr(data)))
        #now the name and any padding (remainder of header)
        name = _bytes_to_string(handle.read(name_length))
        padding = read_header_length - read_header_size - name_length
        if handle.read(padding).count(_null) != padding:
            import warnings
            from Bio import BiopythonParserWarning
            warnings.warn("Your SFF file is invalid, post name %i byte "
                          "padding region contained data" % padding,
                          BiopythonParserWarning)
        assert record_offset + read_header_length == handle.tell()
        #now the flowgram values, flowgram index, bases and qualities
        size = read_flow_size + 3 * seq_len
        handle.seek(size, 1)
        #now any padding...
        padding = size % 8
        if padding:
            padding = 8 - padding
            if handle.read(padding).count(_null) != padding:
                import warnings
                from Bio import BiopythonParserWarning
                warnings.warn("Your SFF file is invalid, post quality %i "
                              "byte padding region contained data" % padding,
                              BiopythonParserWarning)
        #print("%s %s %i" % (read, name, record_offset))
        yield name, record_offset
    if handle.tell() % 8 != 0:
        raise ValueError(
            "After scanning reads, did not end on a multiple of 8")


def _sff_find_roche_index(handle):
    """Locate any existing Roche style XML meta data and read index (PRIVATE).

    Makes a number of hard coded assumptions based on reverse engineered SFF
    files from Roche 454 machines.

    Returns a tuple of read count, SFF "index" offset and size, XML offset
    and size, and the actual read index offset and size.

    Raises a ValueError for unsupported or non-Roche index blocks.
    """
    handle.seek(0)
    header_length, index_offset, index_length, number_of_reads, \
        number_of_flows_per_read, flow_chars, key_sequence \
        = _sff_file_header(handle)
    assert handle.tell() == header_length
    if not index_offset or not index_length:
        raise ValueError("No index present in this SFF file")
    #Now jump to the header...
    handle.seek(index_offset)
    fmt = ">4s4B"
    fmt_size = struct.calcsize(fmt)
    data = handle.read(fmt_size)
    if not data:
        raise ValueError("Premature end of file? Expected index of size %i at offset %i, found nothing"
                         % (index_length, index_offset))
    if len(data) < fmt_size:
        raise ValueError("Premature end of file? Expected index of size %i at offset %i, found %s"
                         % (index_length, index_offset, repr(data)))
    magic_number, ver0, ver1, ver2, ver3 = struct.unpack(fmt, data)
    if magic_number == _mft:  # 778921588
        #Roche 454 manifest index
        #This is typical from raw Roche 454 SFF files (2009), and includes
        #both an XML manifest and the sorted index.
        if (ver0, ver1, ver2, ver3) != (49, 46, 48, 48):
            #This is "1.00" as a string
            raise ValueError("Unsupported version in .mft index header, %i.%i.%i.%i"
                             % (ver0, ver1, ver2, ver3))
        fmt2 = ">LL"
        fmt2_size = struct.calcsize(fmt2)
        xml_size, data_size = struct.unpack(fmt2, handle.read(fmt2_size))
        if index_length != fmt_size + fmt2_size + xml_size + data_size:
            raise ValueError("Problem understanding .mft index header, %i != %i + %i + %i + %i"
                             % (index_length, fmt_size, fmt2_size, xml_size, data_size))
        return number_of_reads, header_length, \
            index_offset, index_length, \
            index_offset + fmt_size + fmt2_size, xml_size, \
            index_offset + fmt_size + fmt2_size + xml_size, data_size
    elif magic_number == _srt:  # 779317876
        #Roche 454 sorted index
        #I've had this from Roche tool sfffile when the read identifiers
        #had nonstandard lengths and there was no XML manifest.
        if (ver0, ver1, ver2, ver3) != (49, 46, 48, 48):
            #This is "1.00" as a string
            raise ValueError("Unsupported version in .srt index header, %i.%i.%i.%i"
                             % (ver0, ver1, ver2, ver3))
        data = handle.read(4)
        if data != _null * 4:
            raise ValueError(
                "Did not find expected null four bytes in .srt index")
        return number_of_reads, header_length, \
            index_offset, index_length, \
            0, 0, \
            index_offset + fmt_size + 4, index_length - fmt_size - 4
    elif magic_number == _hsh:
        raise ValueError("Hash table style indexes (.hsh) in SFF files are "
                         "not (yet) supported")
    else:
        raise ValueError("Unknown magic number %s in SFF index header:\n%s"
                         % (repr(magic_number), repr(data)))


def ReadRocheXmlManifest(handle):
    """Reads any Roche style XML manifest data in the SFF "index".

    The SFF file format allows for multiple different index blocks, and Roche
    took advantage of this to define their own index block which also embeds
    an XML manifest string. This is not a publicly documented extension to
    the SFF file format; it was reverse engineered.

    The handle should be to an SFF file opened in binary mode. This function
    will use the handle seek/tell functions and leave the handle in an
    arbitrary location.

    Any XML manifest found is returned as a Python string, which you can then
    parse as appropriate, or reuse when writing out SFF files with the
    SffWriter class.

    Returns a string, or raises a ValueError if a Roche manifest could not be
    found.
    """
    number_of_reads, header_length, index_offset, index_length, xml_offset, \
        xml_size, read_index_offset, read_index_size = _sff_find_roche_index(
            handle)
    if not xml_offset or not xml_size:
        raise ValueError("No XML manifest found")
    handle.seek(xml_offset)
    return _bytes_to_string(handle.read(xml_size))


#This is a generator function!
def _sff_read_roche_index(handle):
    """Reads any existing Roche style read index provided in the SFF file (PRIVATE).

    Will use the handle seek/tell functions.

    This works on ".srt1.00" and ".mft1.00" style Roche SFF index blocks.

    Roche SFF indices use base 255 not 256, meaning we see bytes in the
    range 0 to 254 only. This appears to be so that byte 0xFF (character 255)
    can be used as a marker character to separate entries (required if the
    read name lengths vary).

    Note that since only four bytes are used for the read offset, this is
    limited to 255^4 bytes (nearly 4GB). If you try to use the Roche sfffile
    tool to combine SFF files beyond this limit, they issue a warning and
    omit the index (and manifest).
    """
    number_of_reads, header_length, index_offset, index_length, xml_offset, \
        xml_size, read_index_offset, read_index_size = _sff_find_roche_index(
            handle)
    #Now parse the read index...
    handle.seek(read_index_offset)
    fmt = ">5B"
    for read in range(number_of_reads):
        #TODO - Be more aware of when the index should end?
        data = handle.read(6)
        while True:
            more = handle.read(1)
            if not more:
                raise ValueError("Premature end of file!")
            data += more
            if more == _flag:
                break
        assert data[-1:] == _flag, data[-1:]
        name = _bytes_to_string(data[:-6])
        off4, off3, off2, off1, off0 = struct.unpack(fmt, data[-6:-1])
        offset = off0 + 255 * off1 + 65025 * off2 + 16581375 * off3
        if off4:
            #Could in theory be used as a fifth piece of offset information,
            #i.e. offset += 4228250625L*off4, but testing the Roche tools this
            #is not the case. They simply don't support such large indexes.
            raise ValueError("Expected a null terminator to the read name.")
        yield name, offset
    if handle.tell() != read_index_offset + read_index_size:
        raise ValueError("Problem with index length? %i vs %i"
                         % (handle.tell(), read_index_offset + read_index_size))

_valid_UAN_read_name = re.compile(r'^[a-zA-Z0-9]{14}$')


def _sff_read_seq_record(handle, number_of_flows_per_read, flow_chars,
                         key_sequence, alphabet, trim=False):
    """Parse the next read in the file, return data as a SeqRecord (PRIVATE)."""
    #Now on to the reads...
    #the read header format (fixed part):
    #read_header_length     H
    #name_length            H
    #seq_len                I
    #clip_qual_left         H
    #clip_qual_right        H
    #clip_adapter_left      H
    #clip_adapter_right     H
    #[rest of read header depends on the name length etc]
    read_header_fmt = '>2HI4H'
    read_header_size = struct.calcsize(read_header_fmt)
    read_flow_fmt = ">%iH" % number_of_flows_per_read
    read_flow_size = struct.calcsize(read_flow_fmt)

    read_header_length, name_length, seq_len, clip_qual_left, \
        clip_qual_right, clip_adapter_left, clip_adapter_right \
        = struct.unpack(read_header_fmt, handle.read(read_header_size))
    if clip_qual_left:
        clip_qual_left -= 1  # python counting
    if clip_adapter_left:
        clip_adapter_left -= 1  # python counting
    if read_header_length < 10 or read_header_length % 8 != 0:
        raise ValueError("Malformed read header, says length is %i"
                         % read_header_length)
    #now the name and any padding (remainder of header)
    name = _bytes_to_string(handle.read(name_length))
    padding = read_header_length - read_header_size - name_length
    if handle.read(padding).count(_null) != padding:
        import warnings
        from Bio import BiopythonParserWarning
        warnings.warn("Your SFF file is invalid, post name %i "
                      "byte padding region contained data" % padding,
                      BiopythonParserWarning)
    #now the flowgram values, flowgram index, bases and qualities
    #NOTE - assuming flowgram_format==1, which means struct type H
    flow_values = handle.read(read_flow_size)  # unpack later if needed
    temp_fmt = ">%iB" % seq_len  # used for flow index and quals
    flow_index = handle.read(seq_len)  # unpack later if needed
    seq = _bytes_to_string(handle.read(seq_len))  # TODO - Use bytes in Seq?
    quals = list(struct.unpack(temp_fmt, handle.read(seq_len)))
    #now any padding...
    padding = (read_flow_size + seq_len * 3) % 8
    if padding:
        padding = 8 - padding
        if handle.read(padding).count(_null) != padding:
            import warnings
            from Bio import BiopythonParserWarning
            warnings.warn("Your SFF file is invalid, post quality %i "
                          "byte padding region contained data" % padding,
                          BiopythonParserWarning)
    #Follow Roche and apply most aggressive of qual and adapter clipping.
    #Note Roche seems to ignore adapter clip fields when writing SFF,
    #and uses just the quality clipping values for any clipping.
    clip_left = max(clip_qual_left, clip_adapter_left)
    #Right clipping of zero means no clipping
    if clip_qual_right:
        if clip_adapter_right:
            clip_right = min(clip_qual_right, clip_adapter_right)
        else:
            #Typical case with Roche SFF files
            clip_right = clip_qual_right
    elif clip_adapter_right:
        clip_right = clip_adapter_right
    else:
        clip_right = seq_len
    #Now build a SeqRecord
    if trim:
        if clip_left >= clip_right:
            # Raise an error?
            import warnings
            from Bio import BiopythonParserWarning
            warnings.warn("Overlapping clip values in SFF record, trimmed to nothing",
                          BiopythonParserWarning)
            seq = ""
            quals = []
        else:
            seq = seq[clip_left:clip_right].upper()
            quals = quals[clip_left:clip_right]
        #Don't record the clipping values, flow etc, they make no sense now:
        annotations = {}
    else:
        if clip_left >= clip_right:
            import warnings
            from Bio import BiopythonParserWarning
            warnings.warn("Overlapping clip values in SFF record",
                          BiopythonParserWarning)
            seq = seq.lower()
        else:
            #This use of mixed case mimics the Roche SFF tool's FASTA output
            seq = seq[:clip_left].lower() + \
                seq[clip_left:clip_right].upper() + \
                seq[clip_right:].lower()
        annotations = {"flow_values": struct.unpack(read_flow_fmt, flow_values),
                       "flow_index": struct.unpack(temp_fmt, flow_index),
                       "flow_chars": flow_chars,
                       "flow_key": key_sequence,
                       "clip_qual_left": clip_qual_left,
                       "clip_qual_right": clip_qual_right,
                       "clip_adapter_left": clip_adapter_left,
                       "clip_adapter_right": clip_adapter_right}
    if re.match(_valid_UAN_read_name, name):
        annotations["time"] = _get_read_time(name)
        annotations["region"] = _get_read_region(name)
        annotations["coords"] = _get_read_xy(name)
    record = SeqRecord(Seq(seq, alphabet),
                       id=name,
                       name=name,
                       description="",
                       annotations=annotations)
    #Dirty trick to speed up this line:
    #record.letter_annotations["phred_quality"] = quals
    dict.__setitem__(record._per_letter_annotations,
                     "phred_quality", quals)
    #Return the record and then continue...
    return record

_powers_of_36 = [36 ** i for i in range(6)]


def _string_as_base_36(string):
    """Interpret a string as a base-36 number as per 454 manual."""
    total = 0
    for c, power in zip(string[::-1], _powers_of_36):
        # For reference: ord('0') = 48, ord('9') = 57
        # For reference: ord('A') = 65, ord('Z') = 90
        # For reference: ord('a') = 97, ord('z') = 122
        if 48 <= ord(c) <= 57:
            val = ord(c) - 22  # equivalent to: - ord('0') + 26
        elif 65 <= ord(c) <= 90:
            val = ord(c) - 65
        elif 97 <= ord(c) <= 122:
            val = ord(c) - 97
        else:
            # Invalid character
            val = 0
        total += val * power
    return total


def _get_read_xy(read_name):
    """Extract coordinates from last 5 characters of read name."""
    number = _string_as_base_36(read_name[9:])
    return divmod(number, 4096)

_time_denominators = [13 * 32 * 24 * 60 * 60,
                      32 * 24 * 60 * 60,
                      24 * 60 * 60,
                      60 * 60,
                      60]


def _get_read_time(read_name):
    """Extract time from first 6 characters of read name."""
    time_list = []
    remainder = _string_as_base_36(read_name[:6])
    for denominator in _time_denominators:
        this_term, remainder = divmod(remainder, denominator)
        time_list.append(this_term)
    time_list.append(remainder)
    time_list[0] += 2000
    return time_list


def _get_read_region(read_name):
    """Extract region from read name."""
    return int(read_name[8])


def _sff_read_raw_record(handle, number_of_flows_per_read):
    """Extract the next read in the file as a raw (bytes) string (PRIVATE)."""
    read_header_fmt = '>2HI'
    read_header_size = struct.calcsize(read_header_fmt)
    read_flow_fmt = ">%iH" % number_of_flows_per_read
    read_flow_size = struct.calcsize(read_flow_fmt)

    raw = handle.read(read_header_size)
    read_header_length, name_length, seq_len \
        = struct.unpack(read_header_fmt, raw)
    if read_header_length < 10 or read_header_length % 8 != 0:
        raise ValueError("Malformed read header, says length is %i"
                         % read_header_length)
    #now the four clip values (4H = 8 bytes), and read name
    raw += handle.read(8 + name_length)
    #and any padding (remainder of header)
    padding = read_header_length - read_header_size - 8 - name_length
    pad = handle.read(padding)
    if pad.count(_null) != padding:
        import warnings
        from Bio import BiopythonParserWarning
        warnings.warn("Your SFF file is invalid, post name %i "
                      "byte padding region contained data" % padding,
                      BiopythonParserWarning)
    raw += pad
    #now the flowgram values, flowgram index, bases and qualities
    raw += handle.read(read_flow_size + seq_len * 3)
    padding = (read_flow_size + seq_len * 3) % 8
    #now any padding...
    if padding:
        padding = 8 - padding
        pad = handle.read(padding)
        if pad.count(_null) != padding:
            import warnings
            from Bio import BiopythonParserWarning
            warnings.warn("Your SFF file is invalid, post quality %i "
                          "byte padding region contained data" % padding,
                          BiopythonParserWarning)
        raw += pad
    #Return the raw bytes
    return raw
778 779
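As a minimal illustration of the fixed read-header layout `_sff_read_raw_record` parses (`'>2HI'` = read_header_length, name_length, seq_len, all big-endian), the header can be round-tripped with `struct`; the field values here are arbitrary examples, not taken from a real file:

```python
import struct

# Fixed portion of an SFF read header, big-endian:
#   read_header_length (H), name_length (H), seq_len (I)
read_header_fmt = ">2HI"
assert struct.calcsize(read_header_fmt) == 8

# Arbitrary illustrative values (not from a real file).
raw = struct.pack(read_header_fmt, 32, 14, 265)
read_header_length, name_length, seq_len = struct.unpack(read_header_fmt, raw)

# The same sanity check the parser applies: at least 10 bytes,
# and a multiple of 8 (the format's alignment unit).
assert read_header_length >= 10 and read_header_length % 8 == 0
```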
780 -class _AddTellHandle(object):
781 """Wrapper for handles which do not support the tell method (PRIVATE). 782 783 Intended for use with things like network handles where tell (and reverse 784 seek) are not supported. The SFF file needs to track the current offset in 785 order to deal with the index block. 786 """
787 - def __init__(self, handle):
788 self._handle = handle 789 self._offset = 0
790
791 - def read(self, length):
792 data = self._handle.read(length) 793 self._offset += len(data) 794 return data
795
796 - def tell(self):
797 return self._offset
798
799 - def seek(self, offset):
800 if offset < self._offset: 801 raise RuntimeError("Can't seek backwards") 802 self._handle.read(offset - self._offset) 803 self._offset = offset
804 - def close(self):
805 return self._handle.close()
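A condensed, standalone rendering of the wrapper above, with `seek` also updating the tracked offset so that `tell` stays correct after a forward skip, exercised against an in-memory stream:

```python
import io

class AddTellHandle:
    """Condensed standalone copy of the wrapper above, for illustration.

    Tracks the current offset by counting bytes read, so tell() works on
    streams that do not support it; a forward seek is emulated by reading.
    """
    def __init__(self, handle):
        self._handle = handle
        self._offset = 0

    def read(self, length):
        data = self._handle.read(length)
        self._offset += len(data)
        return data

    def tell(self):
        return self._offset

    def seek(self, offset):
        if offset < self._offset:
            raise RuntimeError("Can't seek backwards")
        # Emulate a forward seek by discarding bytes, and keep the
        # tracked offset in step.
        self._handle.read(offset - self._offset)
        self._offset = offset

    def close(self):
        return self._handle.close()

handle = AddTellHandle(io.BytesIO(b"0123456789"))
assert handle.read(4) == b"0123"
assert handle.tell() == 4
handle.seek(8)  # forward "seek" implemented as a read
assert handle.tell() == 8
```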
806 807 808 #This is a generator function!
809 -def SffIterator(handle, alphabet=Alphabet.generic_dna, trim=False):
810 """Iterate over Standard Flowgram Format (SFF) reads (as SeqRecord objects). 811 812 handle - input file, an SFF file, e.g. from Roche 454 sequencing. 813 This must NOT be opened in universal read lines mode! 814 alphabet - optional alphabet, defaults to generic DNA. 815 trim - should the sequences be trimmed? 816 817 The resulting SeqRecord objects should match those from a paired FASTA 818 and QUAL file converted from the SFF file using the Roche 454 tool 819 ssfinfo. i.e. The sequence will be mixed case, with the trim regions 820 shown in lower case. 821 822 This function is used internally via the Bio.SeqIO functions: 823 824 >>> from Bio import SeqIO 825 >>> for record in SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff"): 826 ... print("%s %i" % (record.id, len(record))) 827 ... 828 E3MFGYR02JWQ7T 265 829 E3MFGYR02JA6IL 271 830 E3MFGYR02JHD4H 310 831 E3MFGYR02GFKUC 299 832 E3MFGYR02FTGED 281 833 E3MFGYR02FR9G7 261 834 E3MFGYR02GAZMS 278 835 E3MFGYR02HHZ8O 221 836 E3MFGYR02GPGB1 269 837 E3MFGYR02F7Z7G 219 838 839 You can also call it directly: 840 841 >>> with open("Roche/E3MFGYR02_random_10_reads.sff", "rb") as handle: 842 ... for record in SffIterator(handle): 843 ... print("%s %i" % (record.id, len(record))) 844 ... 845 E3MFGYR02JWQ7T 265 846 E3MFGYR02JA6IL 271 847 E3MFGYR02JHD4H 310 848 E3MFGYR02GFKUC 299 849 E3MFGYR02FTGED 281 850 E3MFGYR02FR9G7 261 851 E3MFGYR02GAZMS 278 852 E3MFGYR02HHZ8O 221 853 E3MFGYR02GPGB1 269 854 E3MFGYR02F7Z7G 219 855 856 Or, with the trim option: 857 858 >>> with open("Roche/E3MFGYR02_random_10_reads.sff", "rb") as handle: 859 ... for record in SffIterator(handle, trim=True): 860 ... print("%s %i" % (record.id, len(record))) 861 ... 
862 E3MFGYR02JWQ7T 260 863 E3MFGYR02JA6IL 265 864 E3MFGYR02JHD4H 292 865 E3MFGYR02GFKUC 295 866 E3MFGYR02FTGED 277 867 E3MFGYR02FR9G7 256 868 E3MFGYR02GAZMS 271 869 E3MFGYR02HHZ8O 150 870 E3MFGYR02GPGB1 221 871 E3MFGYR02F7Z7G 130 872 873 """ 874 if isinstance(Alphabet._get_base_alphabet(alphabet), 875 Alphabet.ProteinAlphabet): 876 raise ValueError("Invalid alphabet, SFF files do not hold proteins.") 877 if isinstance(Alphabet._get_base_alphabet(alphabet), 878 Alphabet.RNAAlphabet): 879 raise ValueError("Invalid alphabet, SFF files do not hold RNA.") 880 try: 881 assert 0 == handle.tell(), "Not at start of file, offset %i" % handle.tell() 882 except AttributeError: 883 #Probably a network handle or something like that 884 handle = _AddTellHandle(handle) 885 header_length, index_offset, index_length, number_of_reads, \ 886 number_of_flows_per_read, flow_chars, key_sequence \ 887 = _sff_file_header(handle) 888 #Now on to the reads... 889 #the read header format (fixed part): 890 #read_header_length H 891 #name_length H 892 #seq_len I 893 #clip_qual_left H 894 #clip_qual_right H 895 #clip_adapter_left H 896 #clip_adapter_right H 897 #[rest of read header depends on the name length etc] 898 read_header_fmt = '>2HI4H' 899 read_header_size = struct.calcsize(read_header_fmt) 900 read_flow_fmt = ">%iH" % number_of_flows_per_read 901 read_flow_size = struct.calcsize(read_flow_fmt) 902 assert 1 == struct.calcsize(">B") 903 assert 1 == struct.calcsize(">s") 904 assert 1 == struct.calcsize(">c") 905 assert read_header_size % 8 == 0 # Important for padding calc later! 906 #The spec allows for the index block to be before or even in the middle 907 #of the reads. We can check that if we keep track of our position 908 #in the file... 
909 for read in range(number_of_reads): 910 if index_offset and handle.tell() == index_offset: 911 offset = index_offset + index_length 912 if offset % 8: 913 offset += 8 - (offset % 8) 914 assert offset % 8 == 0 915 handle.seek(offset) 916 #Now that we've done this, we don't need to do it again. Clear 917 #the index_offset so we can skip extra handle.tell() calls: 918 index_offset = 0 919 yield _sff_read_seq_record(handle, 920 number_of_flows_per_read, 921 flow_chars, 922 key_sequence, 923 alphabet, 924 trim) 925 _check_eof(handle, index_offset, index_length)
926 927
928 -def _check_eof(handle, index_offset, index_length):
929 """Check final padding is OK (8 byte alignment) and file ends (PRIVATE). 930 931 Will attempt to spot apparent SFF file concatenation and give an error. 932 933 Will not attempt to seek, only moves the handle forward. 934 """ 935 offset = handle.tell() 936 extra = b"" 937 padding = 0 938 939 if index_offset and offset <= index_offset: 940 # Index block then end of file... 941 if offset < index_offset: 942 raise ValueError("Gap of %i bytes after final record end %i, " 943 "before %i where index starts?" 944 % (index_offset - offset, offset, index_offset)) 945 # Doing read to jump the index rather than a seek 946 # in case this is a network handle or similar 947 handle.read(index_offset + index_length - offset) 948 offset = index_offset + index_length 949 assert offset == handle.tell(), \ 950 "Wanted %i, got %i, index is %i to %i" \ 951 % (offset, handle.tell(), index_offset, index_offset + index_length) 952 953 if offset % 8: 954 padding = 8 - (offset % 8) 955 extra = handle.read(padding) 956 957 if padding >= 4 and extra[-4:] == _sff: 958 #Seen this in one user supplied file, should have been 959 #four bytes of null padding but was actually .sff and 960 #the start of a new concatenated SFF file! 961 raise ValueError("Your SFF file is invalid, post index %i byte " 962 "null padding region ended '.sff' which could " 963 "be the start of a concatenated SFF file? " 964 "See offset %i" % (padding, offset)) 965 if padding and not extra: 966 #TODO - Is this error harmless enough to just ignore? 967 import warnings 968 from Bio import BiopythonParserWarning 969 warnings.warn("Your SFF file is technically invalid as it is missing " 970 "a terminal %i byte null padding region." 
% padding, 971 BiopythonParserWarning) 972 return 973 if extra.count(_null) != padding: 974 import warnings 975 from Bio import BiopythonParserWarning 976 warnings.warn("Your SFF file is invalid, post index %i byte " 977 "null padding region contained data: %r" 978 % (padding, extra), BiopythonParserWarning) 979 980 offset = handle.tell() 981 assert offset % 8 == 0, \ 982 "Wanted offset %i %% 8 = %i to be zero" % (offset, offset % 8) 983 # Should now be at the end of the file... 984 extra = handle.read(4) 985 if extra == _sff: 986 raise ValueError("Additional data at end of SFF file, " 987 "perhaps multiple SFF files concatenated? " 988 "See offset %i" % offset) 989 elif extra: 990 raise ValueError("Additional data at end of SFF file, " 991 "see offset %i" % offset)
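The 8-byte alignment rule `_check_eof` enforces, which recurs throughout the SFF format (post-header, post-read, and post-index padding), reduces to a one-line helper; a small sketch:

```python
def padding_to_8(offset):
    """Bytes of null padding needed to reach the next 8-byte boundary."""
    return (8 - offset % 8) % 8

# Already-aligned offsets need no padding; others need 1-7 bytes.
assert padding_to_8(16) == 0
assert padding_to_8(17) == 7
assert padding_to_8(23) == 1
```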
992 993 994 #This is a generator function!
995 -def _SffTrimIterator(handle, alphabet=Alphabet.generic_dna):
996 """Iterate over SFF reads (as SeqRecord objects) with trimming (PRIVATE).""" 997 return SffIterator(handle, alphabet, trim=True)
998 999
1000 -class SffWriter(SequenceWriter):
1001 """SFF file writer.""" 1002
1003 - def __init__(self, handle, index=True, xml=None):
1004 """Creates the writer object. 1005 1006 handle - Output handle, ideally in binary write mode. 1007 index - Boolean argument, should we try and write an index? 1008 xml - Optional string argument, xml manifest to be recorded in the index 1009 block (see function ReadRocheXmlManifest for reading this data). 1010 """ 1011 if hasattr(handle, "mode") and "U" in handle.mode.upper(): 1012 raise ValueError("SFF files must NOT be opened in universal new " 1013 "lines mode. Binary mode is required") 1014 elif hasattr(handle, "mode") and "B" not in handle.mode.upper(): 1015 raise ValueError("SFF files must be opened in binary mode") 1016 self.handle = handle 1017 self._xml = xml 1018 if index: 1019 self._index = [] 1020 else: 1021 self._index = None
1022
1023 - def write_file(self, records):
1024 """Use this to write an entire file containing the given records.""" 1025 try: 1026 self._number_of_reads = len(records) 1027 except TypeError: 1028 self._number_of_reads = 0 # dummy value 1029 if not hasattr(self.handle, "seek") \ 1030 or not hasattr(self.handle, "tell"): 1031 raise ValueError("A handle with a seek/tell methods is " 1032 "required in order to record the total " 1033 "record count in the file header (once it " 1034 "is known at the end).") 1035 if self._index is not None and \ 1036 not (hasattr(self.handle, "seek") and hasattr(self.handle, "tell")): 1037 import warnings 1038 warnings.warn("A handle with a seek/tell methods is required in " 1039 "order to record an SFF index.") 1040 self._index = None 1041 self._index_start = 0 1042 self._index_length = 0 1043 if not hasattr(records, "next"): 1044 records = iter(records) 1045 #Get the first record in order to find the flow information 1046 #we will need for the header. 1047 try: 1048 record = next(records) 1049 except StopIteration: 1050 record = None 1051 if record is None: 1052 #No records -> empty SFF file (or an error)? 1053 #We can't write a header without the flow information. 1054 #return 0 1055 raise ValueError("Must have at least one sequence") 1056 try: 1057 self._key_sequence = _as_bytes(record.annotations["flow_key"]) 1058 self._flow_chars = _as_bytes(record.annotations["flow_chars"]) 1059 self._number_of_flows_per_read = len(self._flow_chars) 1060 except KeyError: 1061 raise ValueError("Missing SFF flow information") 1062 self.write_header() 1063 self.write_record(record) 1064 count = 1 1065 for record in records: 1066 self.write_record(record) 1067 count += 1 1068 if self._number_of_reads == 0: 1069 #Must go back and record the record count... 1070 offset = self.handle.tell() 1071 self.handle.seek(0) 1072 self._number_of_reads = count 1073 self.write_header() 1074 self.handle.seek(offset) # not essential? 
1075 else: 1076 assert count == self._number_of_reads 1077 if self._index is not None: 1078 self._write_index() 1079 return count
1080
1081 - def _write_index(self):
1082 assert len(self._index) == self._number_of_reads 1083 handle = self.handle 1084 self._index.sort() 1085 self._index_start = handle.tell() # need for header 1086 #XML... 1087 if self._xml is not None: 1088 xml = _as_bytes(self._xml) 1089 else: 1090 from Bio import __version__ 1091 xml = "<!-- This file was output with Biopython %s -->\n" % __version__ 1092 xml += "<!-- This XML and index block attempts to mimic Roche SFF files -->\n" 1093 xml += "<!-- This file may be a combination of multiple SFF files etc -->\n" 1094 xml = _as_bytes(xml) 1095 xml_len = len(xml) 1096 #Write to the file... 1097 fmt = ">I4BLL" 1098 fmt_size = struct.calcsize(fmt) 1099 handle.write(_null * fmt_size + xml) # fill this later 1100 fmt2 = ">6B" 1101 assert 6 == struct.calcsize(fmt2) 1102 self._index.sort() 1103 index_len = 0 # don't know yet! 1104 for name, offset in self._index: 1105 #Roche files record the offsets using base 255 not 256. 1106 #See comments for parsing the index block. There may be a faster 1107 #way to code this, but we can't easily use shifts due to odd base 1108 off3 = offset 1109 off0 = off3 % 255 1110 off3 -= off0 1111 off1 = off3 % 65025 1112 off3 -= off1 1113 off2 = off3 % 16581375 1114 off3 -= off2 1115 assert offset == off0 + off1 + off2 + off3, \ 1116 "%i -> %i %i %i %i" % (offset, off0, off1, off2, off3) 1117 off3, off2, off1, off0 = off3 // 16581375, off2 // 65025, \ 1118 off1 // 255, off0 1119 assert off0 < 255 and off1 < 255 and off2 < 255 and off3 < 255, \ 1120 "%i -> %i %i %i %i" % (offset, off0, off1, off2, off3) 1121 handle.write(name + struct.pack(fmt2, 0, 1122 off3, off2, off1, off0, 255)) 1123 index_len += len(name) + 6 1124 #Note any padding is not included: 1125 self._index_length = fmt_size + xml_len + index_len # need for header 1126 #Pad out to an 8 byte boundary (although I have noticed some 1127 #real Roche SFF files neglect to do this despite their manual 1128 #suggesting this padding should be there): 1129 if self._index_length % 8: 
1130 padding = 8 - (self._index_length % 8) 1131 handle.write(_null * padding) 1132 else: 1133 padding = 0 1134 offset = handle.tell() 1135 assert offset == self._index_start + self._index_length + padding, \ 1136 "%i vs %i + %i + %i" % (offset, self._index_start, 1137 self._index_length, padding) 1138 #Must now go back and update the index header with index size... 1139 handle.seek(self._index_start) 1140 handle.write(struct.pack(fmt, 778921588, # magic number 1141 49, 46, 48, 48, # Roche index version, "1.00" 1142 xml_len, index_len) + xml) 1143 #Must now go back and update the header... 1144 handle.seek(0) 1145 self.write_header() 1146 handle.seek(offset) # not essential?
1147
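The base-255 digit extraction in the index loop above (base 255 rather than 256, so the byte value 255 can serve as a record terminator) can be written more compactly with divmod; an equivalent sketch with a decoding round trip:

```python
def encode_offset_base_255(offset):
    """Split an offset into four base-255 digits, most significant first."""
    digits = []
    for _ in range(4):
        offset, digit = divmod(offset, 255)
        digits.insert(0, digit)
    assert offset == 0, "Offset too large for four base-255 digits"
    return digits

def decode_offset_base_255(digits):
    """Reassemble an offset from base-255 digits, most significant first."""
    value = 0
    for digit in digits:
        value = value * 255 + digit
    return value

# The largest representable offset is 255**4 - 1 = 4228250624,
# matching the overflow check in write_record below.
assert encode_offset_base_255(4228250624) == [254, 254, 254, 254]
for offset in (0, 1, 255, 65025, 123456789):
    assert decode_offset_base_255(encode_offset_base_255(offset)) == offset
```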
1148 - def write_header(self):
1149 #Do header... 1150 key_length = len(self._key_sequence) 1151 #file header (part one) 1152 #use big endian encoding > 1153 #magic_number I 1154 #version 4B 1155 #index_offset Q 1156 #index_length I 1157 #number_of_reads I 1158 #header_length H 1159 #key_length H 1160 #number_of_flows_per_read H 1161 #flowgram_format_code B 1162 #[rest of file header depends on the number of flows and how many keys] 1163 fmt = '>I4BQIIHHHB%is%is' % ( 1164 self._number_of_flows_per_read, key_length) 1165 #According to the spec, the header_length field should be the total 1166 #number of bytes required by this set of header fields, and should be 1167 #equal to "31 + number_of_flows_per_read + key_length" rounded up to 1168 #the next value divisible by 8. 1169 if struct.calcsize(fmt) % 8 == 0: 1170 padding = 0 1171 else: 1172 padding = 8 - (struct.calcsize(fmt) % 8) 1173 header_length = struct.calcsize(fmt) + padding 1174 assert header_length % 8 == 0 1175 header = struct.pack(fmt, 779314790, # magic number 0x2E736666 1176 0, 0, 0, 1, # version 1177 self._index_start, self._index_length, 1178 self._number_of_reads, 1179 header_length, key_length, 1180 self._number_of_flows_per_read, 1181 1, # the only flowgram format code we support 1182 self._flow_chars, self._key_sequence) 1183 self.handle.write(header + _null * padding)
1184
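The header-length rule quoted in the comments of `write_header` ("31 + number_of_flows_per_read + key_length" rounded up to the next multiple of 8) can be cross-checked against the struct layout; the 400 flows and 4-base key below are merely illustrative values, not taken from a real file:

```python
import struct

def sff_header_length(number_of_flows_per_read, key_length):
    """Header length per the SFF spec: 31 + flows + key, rounded up to 8."""
    raw = 31 + number_of_flows_per_read + key_length
    return raw + (-raw) % 8

# Cross-check against the struct layout used by write_header above.
# Illustrative values only: 400 flows, a 4-base key sequence.
number_of_flows_per_read, key_length = 400, 4
fmt = ">I4BQIIHHHB%is%is" % (number_of_flows_per_read, key_length)
raw_size = struct.calcsize(fmt)
padded = raw_size + (-raw_size) % 8
assert padded == sff_header_length(number_of_flows_per_read, key_length)
```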
1185 - def write_record(self, record):
1186 """Write a single additional record to the output file. 1187 1188 This assumes the header has been done. 1189 """ 1190 #Basics 1191 name = _as_bytes(record.id) 1192 name_len = len(name) 1193 seq = _as_bytes(str(record.seq).upper()) 1194 seq_len = len(seq) 1195 #Qualities 1196 try: 1197 quals = record.letter_annotations["phred_quality"] 1198 except KeyError: 1199 raise ValueError("Missing PHRED qualities information for %s" % record.id) 1200 #Flow 1201 try: 1202 flow_values = record.annotations["flow_values"] 1203 flow_index = record.annotations["flow_index"] 1204 if self._key_sequence != _as_bytes(record.annotations["flow_key"]) \ 1205 or self._flow_chars != _as_bytes(record.annotations["flow_chars"]): 1206 raise ValueError("Records have inconsistent SFF flow data") 1207 except KeyError: 1208 raise ValueError("Missing SFF flow information for %s" % record.id) 1209 except AttributeError: 1210 raise ValueError("Header not written yet?") 1211 #Clipping 1212 try: 1213 clip_qual_left = record.annotations["clip_qual_left"] 1214 if clip_qual_left < 0: 1215 raise ValueError("Negative SFF clip_qual_left value for %s" % record.id) 1216 if clip_qual_left: 1217 clip_qual_left += 1 1218 clip_qual_right = record.annotations["clip_qual_right"] 1219 if clip_qual_right < 0: 1220 raise ValueError("Negative SFF clip_qual_right value for %s" % record.id) 1221 clip_adapter_left = record.annotations["clip_adapter_left"] 1222 if clip_adapter_left < 0: 1223 raise ValueError("Negative SFF clip_adapter_left value for %s" % record.id) 1224 if clip_adapter_left: 1225 clip_adapter_left += 1 1226 clip_adapter_right = record.annotations["clip_adapter_right"] 1227 if clip_adapter_right < 0: 1228 raise ValueError("Negative SFF clip_adapter_right value for %s" % record.id) 1229 except KeyError: 1230 raise ValueError("Missing SFF clipping information for %s" % record.id) 1231 1232 #Capture information for index 1233 if self._index is not None: 1234 offset = self.handle.tell() 1235 #Check the 
position of the final record (before sort by name) 1236 #Using a four-digit base 255 number, so the upper bound is 1237 #254*(1)+254*(255)+254*(255**2)+254*(255**3) = 4228250624 1238 #or equivalently it overflows at 255**4 = 4228250625 1239 if offset > 4228250624: 1240 import warnings 1241 warnings.warn("Read %s has file offset %i, which is too large " 1242 "to store in the Roche SFF index structure. No " 1243 "index block will be recorded." % (name, offset)) 1244 #No point recording the offsets now 1245 self._index = None 1246 else: 1247 self._index.append((name, self.handle.tell())) 1248 1249 #the read header format (fixed part): 1250 #read_header_length H 1251 #name_length H 1252 #seq_len I 1253 #clip_qual_left H 1254 #clip_qual_right H 1255 #clip_adapter_left H 1256 #clip_adapter_right H 1257 #[rest of read header depends on the name length etc] 1258 #name 1259 #flow values 1260 #flow index 1261 #sequence 1262 #padding 1263 read_header_fmt = '>2HI4H%is' % name_len 1264 if struct.calcsize(read_header_fmt) % 8 == 0: 1265 padding = 0 1266 else: 1267 padding = 8 - (struct.calcsize(read_header_fmt) % 8) 1268 read_header_length = struct.calcsize(read_header_fmt) + padding 1269 assert read_header_length % 8 == 0 1270 data = struct.pack(read_header_fmt, 1271 read_header_length, 1272 name_len, seq_len, 1273 clip_qual_left, clip_qual_right, 1274 clip_adapter_left, clip_adapter_right, 1275 name) + _null * padding 1276 assert len(data) == read_header_length 1277 #now the flowgram values, flowgram index, bases and qualities 1278 #NOTE - assuming flowgram_format==1, which means struct type H 1279 read_flow_fmt = ">%iH" % self._number_of_flows_per_read 1280 read_flow_size = struct.calcsize(read_flow_fmt) 1281 temp_fmt = ">%iB" % seq_len # used for flow index and quals 1282 data += struct.pack(read_flow_fmt, *flow_values) \ 1283 + struct.pack(temp_fmt, *flow_index) \ 1284 + seq \ 1285 + struct.pack(temp_fmt, *quals) 1286 #now any final padding... 
1287 padding = (read_flow_size + seq_len * 3) % 8 1288 if padding: 1289 padding = 8 - padding 1290 self.handle.write(data + _null * padding)
1291 1292 1293 if __name__ == "__main__": 1294 print("Running quick self test") 1295 filename = "../../Tests/Roche/E3MFGYR02_random_10_reads.sff" 1296 with open(filename, "rb") as handle: 1297 metadata = ReadRocheXmlManifest(handle) 1298 with open(filename, "rb") as handle: 1299 index1 = sorted(_sff_read_roche_index(handle)) 1300 with open(filename, "rb") as handle: 1301 index2 = sorted(_sff_do_slow_index(handle)) 1302 assert index1 == index2 1303 with open(filename, "rb") as handle: 1304 assert len(index1) == len(list(SffIterator(handle))) 1305 from Bio._py3k import StringIO 1306 from io import BytesIO 1307 with open(filename, "rb") as handle: 1308 assert len(index1) == len(list(SffIterator(BytesIO(handle.read())))) 1309 1310 if sys.platform != "win32" and sys.version_info[0] < 3: 1311 #Can be lazy and treat as binary... 1312 with open(filename, "r") as handle: 1313 assert len(index1) == len(list(SffIterator(handle))) 1314 with open(filename) as handle: 1315 index2 = sorted(_sff_read_roche_index(handle)) 1316 assert index1 == index2 1317 with open(filename, "r") as handle: 1318 index2 = sorted(_sff_do_slow_index(handle)) 1319 assert index1 == index2 1320 with open(filename, "r") as handle: 1321 assert len(index1) == len(list(SffIterator(handle))) 1322 with open(filename, "r") as handle: 1323 assert len(index1) == len(list(SffIterator(BytesIO(handle.read())))) 1324 1325 with open(filename, "rb") as handle: 1326 sff = list(SffIterator(handle)) 1327 1328 with open("../../Tests/Roche/E3MFGYR02_alt_index_at_end.sff", "rb") as handle: 1329 sff2 = list(SffIterator(handle)) 1330 assert len(sff) == len(sff2) 1331 for old, new in zip(sff, sff2): 1332 assert old.id == new.id 1333 assert str(old.seq) == str(new.seq) 1334 1335 with open("../../Tests/Roche/E3MFGYR02_alt_index_at_start.sff", "rb") as handle: 1336 sff2 = list(SffIterator(handle)) 1337 assert len(sff) == len(sff2) 1338 for old, new in zip(sff, sff2): 1339 assert old.id == new.id 1340 assert str(old.seq) == 
str(new.seq) 1341 1342 with open("../../Tests/Roche/E3MFGYR02_alt_index_in_middle.sff", "rb") as handle: 1343 sff2 = list(SffIterator(handle)) 1344 assert len(sff) == len(sff2) 1345 for old, new in zip(sff, sff2): 1346 assert old.id == new.id 1347 assert str(old.seq) == str(new.seq) 1348 1349 with open("../../Tests/Roche/E3MFGYR02_index_at_start.sff", "rb") as handle: 1350 sff2 = list(SffIterator(handle)) 1351 assert len(sff) == len(sff2) 1352 for old, new in zip(sff, sff2): 1353 assert old.id == new.id 1354 assert str(old.seq) == str(new.seq) 1355 1356 with open("../../Tests/Roche/E3MFGYR02_index_in_middle.sff", "rb") as handle: 1357 sff2 = list(SffIterator(handle)) 1358 assert len(sff) == len(sff2) 1359 for old, new in zip(sff, sff2): 1360 assert old.id == new.id 1361 assert str(old.seq) == str(new.seq) 1362 1363 with open(filename, "rb") as handle: 1364 sff_trim = list(SffIterator(handle, trim=True)) 1365 1366 with open(filename, "rb") as handle: 1367 print(ReadRocheXmlManifest(handle)) 1368 1369 from Bio import SeqIO 1370 filename = "../../Tests/Roche/E3MFGYR02_random_10_reads_no_trim.fasta" 1371 fasta_no_trim = list(SeqIO.parse(filename, "fasta")) 1372 filename = "../../Tests/Roche/E3MFGYR02_random_10_reads_no_trim.qual" 1373 qual_no_trim = list(SeqIO.parse(filename, "qual")) 1374 1375 filename = "../../Tests/Roche/E3MFGYR02_random_10_reads.fasta" 1376 fasta_trim = list(SeqIO.parse(filename, "fasta")) 1377 filename = "../../Tests/Roche/E3MFGYR02_random_10_reads.qual" 1378 qual_trim = list(SeqIO.parse(filename, "qual")) 1379 1380 for s, sT, f, q, fT, qT in zip(sff, sff_trim, fasta_no_trim, 1381 qual_no_trim, fasta_trim, qual_trim): 1382 #print("") 1383 print(s.id) 1384 #print(s.seq) 1385 #print(s.letter_annotations["phred_quality"]) 1386 1387 assert s.id == f.id == q.id 1388 assert str(s.seq) == str(f.seq) 1389 assert s.letter_annotations[ 1390 "phred_quality"] == q.letter_annotations["phred_quality"] 1391 1392 assert s.id == sT.id == fT.id == qT.id 1393 assert 
str(sT.seq) == str(fT.seq) 1394 assert sT.letter_annotations[ 1395 "phred_quality"] == qT.letter_annotations["phred_quality"] 1396 1397 print("Writing with a list of SeqRecords...") 1398 handle = BytesIO() 1399 w = SffWriter(handle, xml=metadata) 1400 w.write_file(sff) # list 1401 data = handle.getvalue() 1402 print("And again with an iterator...") 1403 handle = BytesIO() 1404 w = SffWriter(handle, xml=metadata) 1405 w.write_file(iter(sff)) 1406 assert data == handle.getvalue() 1407 #Check 100% identical to the original: 1408 filename = "../../Tests/Roche/E3MFGYR02_random_10_reads.sff" 1409 with open(filename, "rb") as handle: 1410 original = handle.read() 1411 assert len(data) == len(original) 1412 assert data == original 1413 del data 1414 1415 print("-" * 50) 1416 filename = "../../Tests/Roche/greek.sff" 1417 with open(filename, "rb") as handle: 1418 for record in SffIterator(handle): 1419 print(record.id) 1420 with open(filename, "rb") as handle: 1421 index1 = sorted(_sff_read_roche_index(handle)) 1422 with open(filename, "rb") as handle: 1423 index2 = sorted(_sff_do_slow_index(handle)) 1424 assert index1 == index2 1425 try: 1426 with open(filename, "rb") as handle: 1427 print(ReadRocheXmlManifest(handle)) 1428 assert False, "Should fail!" 1429 except ValueError: 1430 pass 1431 1432 with open(filename, "rb") as handle: 1433 for record in SffIterator(handle): 1434 pass 1435 try: 1436 for record in SffIterator(handle): 1437 print(record.id) 1438 assert False, "Should have failed" 1439 except ValueError as err: 1440 print("Checking what happens on re-reading a handle:") 1441 print(err) 1442 1443 """ 1444 #Ugly code to make test files... 
1445 index = ".diy1.00This is a fake index block (DIY = Do It Yourself), which is allowed under the SFF standard.\0" 1446 padding = len(index)%8 1447 if padding: 1448 padding = 8 - padding 1449 index += chr(0)*padding 1450 assert len(index)%8 == 0 1451 1452 #Ugly bit of code to make a fake index at start 1453 records = list(SffIterator( 1454 open("../../Tests/Roche/E3MFGYR02_random_10_reads.sff", "rb"))) 1455 out_handle = open( 1456 "../../Tests/Roche/E3MFGYR02_alt_index_at_start.sff", "w") 1457 index = ".diy1.00This is a fake index block (DIY = Do It Yourself), which is allowed under the SFF standard.\0" 1458 padding = len(index)%8 1459 if padding: 1460 padding = 8 - padding 1461 index += chr(0)*padding 1462 w = SffWriter(out_handle, index=False, xml=None) 1463 #Fake the header... 1464 w._number_of_reads = len(records) 1465 w._index_start = 0 1466 w._index_length = 0 1467 w._key_sequence = records[0].annotations["flow_key"] 1468 w._flow_chars = records[0].annotations["flow_chars"] 1469 w._number_of_flows_per_read = len(w._flow_chars) 1470 w.write_header() 1471 w._index_start = out_handle.tell() 1472 w._index_length = len(index) 1473 out_handle.seek(0) 1474 w.write_header() #this time with index info 1475 w.handle.write(index) 1476 for record in records: 1477 w.write_record(record) 1478 out_handle.close() 1479 records2 = list(SffIterator( 1480 open("../../Tests/Roche/E3MFGYR02_alt_index_at_start.sff", "rb"))) 1481 for old, new in zip(records, records2): 1482 assert str(old.seq)==str(new.seq) 1483 i = list(_sff_do_slow_index( 1484 open("../../Tests/Roche/E3MFGYR02_alt_index_at_start.sff", "rb"))) 1485 1486 #Ugly bit of code to make a fake index in middle 1487 records = list(SffIterator( 1488 open("../../Tests/Roche/E3MFGYR02_random_10_reads.sff", "rb"))) 1489 out_handle = open( 1490 "../../Tests/Roche/E3MFGYR02_alt_index_in_middle.sff", "w") 1491 index = ".diy1.00This is a fake index block (DIY = Do It Yourself), which is allowed under the SFF standard.\0" 1492 
padding = len(index)%8 1493 if padding: 1494 padding = 8 - padding 1495 index += chr(0)*padding 1496 w = SffWriter(out_handle, index=False, xml=None) 1497 #Fake the header... 1498 w._number_of_reads = len(records) 1499 w._index_start = 0 1500 w._index_length = 0 1501 w._key_sequence = records[0].annotations["flow_key"] 1502 w._flow_chars = records[0].annotations["flow_chars"] 1503 w._number_of_flows_per_read = len(w._flow_chars) 1504 w.write_header() 1505 for record in records[:5]: 1506 w.write_record(record) 1507 w._index_start = out_handle.tell() 1508 w._index_length = len(index) 1509 w.handle.write(index) 1510 for record in records[5:]: 1511 w.write_record(record) 1512 out_handle.seek(0) 1513 w.write_header() #this time with index info 1514 out_handle.close() 1515 records2 = list(SffIterator( 1516 open("../../Tests/Roche/E3MFGYR02_alt_index_in_middle.sff", "rb"))) 1517 for old, new in zip(records, records2): 1518 assert str(old.seq)==str(new.seq) 1519 j = list(_sff_do_slow_index( 1520 open("../../Tests/Roche/E3MFGYR02_alt_index_in_middle.sff", "rb"))) 1521 1522 #Ugly bit of code to make a fake index at end 1523 records = list(SffIterator( 1524 open("../../Tests/Roche/E3MFGYR02_random_10_reads.sff", "rb"))) 1525 with open("../../Tests/Roche/E3MFGYR02_alt_index_at_end.sff", "w") as out_handle: 1526 w = SffWriter(out_handle, index=False, xml=None) 1527 #Fake the header... 
1528 w._number_of_reads = len(records) 1529 w._index_start = 0 1530 w._index_length = 0 1531 w._key_sequence = records[0].annotations["flow_key"] 1532 w._flow_chars = records[0].annotations["flow_chars"] 1533 w._number_of_flows_per_read = len(w._flow_chars) 1534 w.write_header() 1535 for record in records: 1536 w.write_record(record) 1537 w._index_start = out_handle.tell() 1538 w._index_length = len(index) 1539 out_handle.write(index) 1540 out_handle.seek(0) 1541 w.write_header() #this time with index info 1542 records2 = list(SffIterator( 1543 open("../../Tests/Roche/E3MFGYR02_alt_index_at_end.sff", "rb"))) 1544 for old, new in zip(records, records2): 1545 assert str(old.seq)==str(new.seq) 1546 try: 1547 print(ReadRocheXmlManifest( 1548 open("../../Tests/Roche/E3MFGYR02_alt_index_at_end.sff", "rb"))) 1549 assert False, "Should fail!" 1550 except ValueError: 1551 pass 1552 k = list(_sff_do_slow_index( 1553 open("../../Tests/Roche/E3MFGYR02_alt_index_at_end.sff", "rb"))) 1554 """ 1555 1556 print("Done") 1557