
Source Code for Module Bio.SeqIO.SffIO

   1  # Copyright 2009-2010 by Peter Cock.  All rights reserved. 
   2  # Based on code contributed and copyright 2009 by Jose Blanca (COMAV-UPV). 
   3  # 
   4  # This code is part of the Biopython distribution and governed by its 
   5  # license.  Please see the LICENSE file that should have been included 
   6  # as part of this package. 
   7  """Bio.SeqIO support for the binary Standard Flowgram Format (SFF) file format. 
   8   
   9  SFF was designed by 454 Life Sciences (Roche), the Whitehead Institute for 
  10  Biomedical Research and the Wellcome Trust Sanger Institute. SFF was also used 
  11  as the native output format from early versions of Ion Torrent's PGM platform. 
  12  You are expected to use this module via the Bio.SeqIO functions under 
  13  the format name "sff" (or "sff-trim" as described below). 
  14   
  15  For example, to iterate over the records in an SFF file, 
  16   
  17      >>> from Bio import SeqIO 
  18      >>> for record in SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff"): 
  19      ...     print("%s %i %s..." % (record.id, len(record), record.seq[:20])) 
  20      E3MFGYR02JWQ7T 265 tcagGGTCTACATGTTGGTT... 
  21      E3MFGYR02JA6IL 271 tcagTTTTTTTTGGAAAGGA... 
  22      E3MFGYR02JHD4H 310 tcagAAAGACAAGTGGTATC... 
  23      E3MFGYR02GFKUC 299 tcagCGGCCGGGCCTCTCAT... 
  24      E3MFGYR02FTGED 281 tcagTGGTAATGGGGGGAAA... 
  25      E3MFGYR02FR9G7 261 tcagCTCCGTAAGAAGGTGC... 
  26      E3MFGYR02GAZMS 278 tcagAAAGAAGTAAGGTAAA... 
  27      E3MFGYR02HHZ8O 221 tcagACTTTCTTCTTTACCG... 
  28      E3MFGYR02GPGB1 269 tcagAAGCAGTGGTATCAAC... 
  29      E3MFGYR02F7Z7G 219 tcagAATCATCCACTTTTTA... 
  30   
  31  Each SeqRecord object will contain all the annotation from the SFF file, 
  32  including the PHRED quality scores. 
  33   
  34      >>> print("%s %i" % (record.id, len(record))) 
  35      E3MFGYR02F7Z7G 219 
  36      >>> print("%s..." % record.seq[:10]) 
  37      tcagAATCAT... 
  38      >>> print("%r..." % (record.letter_annotations["phred_quality"][:10])) 
  39      [22, 21, 23, 28, 26, 15, 12, 21, 28, 21]... 
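Recall that a PHRED quality Q encodes a base-call error probability P via Q = -10 * log10(P). If you wanted the error probabilities themselves, a minimal sketch (the quality values below are copied from the example output above):

```python
# PHRED scale: Q = -10 * log10(P), so P = 10 ** (-Q / 10.0).
quals = [22, 21, 23, 28, 26, 15, 12, 21, 28, 21]
error_probs = [10 ** (-q / 10.0) for q in quals]
print("%.4f" % error_probs[0])  # Q=22 -> roughly 0.0063
```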
  40   
  41  Notice that the sequence is given in mixed case; the central upper case region 
  42  corresponds to the trimmed sequence. This matches the output of the Roche 
  43  tools (and the 3rd party tool sff_extract) for SFF to FASTA. 
  44   
  45      >>> print(record.annotations["clip_qual_left"]) 
  46      4 
  47      >>> print(record.annotations["clip_qual_right"]) 
  48      134 
  49      >>> print(record.seq[:4]) 
  50      tcag 
  51      >>> print("%s...%s" % (record.seq[4:20], record.seq[120:134])) 
  52      AATCATCCACTTTTTA...CAAAACACAAACAG 
  53      >>> print(record.seq[134:]) 
  54      atcttatcaacaaaactcaaagttcctaactgagacacgcaacaggggataagacaaggcacacaggggataggnnnnnnnnnnn 
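The mixed-case convention can be reproduced directly from the clip values; a minimal sketch using plain strings rather than SeqRecord objects (the helper name is invented for illustration):

```python
def roche_mixed_case(seq, clip_left, clip_right):
    # Lower case the clipped flanks, upper case the retained region,
    # mimicking the Roche tools' SFF to FASTA output described above.
    return (seq[:clip_left].lower()
            + seq[clip_left:clip_right].upper()
            + seq[clip_right:].lower())

print(roche_mixed_case("TCAGAATCATCC", 4, 10))  # -> tcagAATCATcc
```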
  55   
  56  The annotations dictionary also contains any adapter clip positions 
  57  (usually zero), and information about the flows. e.g. 
  58   
  59      >>> len(record.annotations) 
  60      11 
  61      >>> print(record.annotations["flow_key"]) 
  62      TCAG 
  63      >>> print(record.annotations["flow_values"][:10]) 
  64      (83, 1, 128, 7, 4, 84, 6, 106, 3, 172) 
  65      >>> print(len(record.annotations["flow_values"])) 
  66      400 
  67      >>> print(record.annotations["flow_index"][:10]) 
  68      (1, 2, 3, 2, 2, 0, 3, 2, 3, 3) 
  69      >>> print(len(record.annotations["flow_index"])) 
  70      219 
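The flow_index values are increments: each entry says how many flows to advance from the previous base's flow. A running total therefore maps every base to its 1-based position in flow_chars; a sketch under that interpretation, using the values shown above (TACG is the typical 454 flow cycle):

```python
flow_chars = "TACG" * 100  # typical 454 flow cycle, 400 flows
flow_index = (1, 2, 3, 2, 2, 0, 3, 2, 3, 3)  # per-base increments from above

# Running total converts the increments into absolute 1-based flow positions.
positions = []
total = 0
for increment in flow_index:
    total += increment
    positions.append(total)

bases = "".join(flow_chars[p - 1] for p in positions)
print(positions)  # [1, 3, 6, 8, 10, 10, 13, 15, 18, 21]
print(bases)      # TCAGAATCAT - the start of the read, upper cased
```

Note how the repeated position 10 corresponds to the homopolymer "AA" - two bases called from the same flow.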
  71   
  72  Note that to convert from a raw reading in flow_values to the corresponding 
  73  homopolymer stretch estimate, the value should be rounded to the nearest 100: 
  74   
  75      >>> print("%r..." % [int(round(value, -2)) // 100 
  76      ...                  for value in record.annotations["flow_values"][:10]]) 
  77      [1, 0, 1, 0, 0, 1, 0, 1, 0, 2]... 
  78   
  79  If a read name is exactly 14 alphanumeric characters, the annotations 
  80  dictionary will also contain meta-data about the read extracted by 
  81  interpreting the name as a 454 Sequencing System "Universal" Accession 
  82  Number. Note that if a read name happens to be exactly 14 alphanumeric 
  83  characters but was not generated automatically, these annotation records 
  84  will contain nonsense information. 
  85   
  86      >>> print(record.annotations["region"]) 
  87      2 
  88      >>> print(record.annotations["time"]) 
  89      [2008, 1, 9, 16, 16, 0] 
  90      >>> print(record.annotations["coords"]) 
  91      (2434, 1658) 
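For illustration, the name decoding can be sketched in a few lines; this mirrors what the module's private helpers do (in the 454 base-36 scheme the letters A-Z and a-z map to 0-25 and the digits 0-9 to 26-35; the first six characters encode the timestamp, character nine the region, and the last five the well coordinates):

```python
def base36(text):
    # 454 base-36: 'A'-'Z' (or 'a'-'z') -> 0-25, '0'-'9' -> 26-35.
    total = 0
    for c in text:
        if c.isdigit():
            total = total * 36 + (ord(c) - ord('0') + 26)
        else:
            total = total * 36 + (ord(c.upper()) - ord('A'))
    return total

name = "E3MFGYR02F7Z7G"

region = int(name[8])  # a plain decimal digit
x, y = divmod(base36(name[9:]), 4096)  # well coordinates

# Timestamp: base-36 number decomposed into year, month, day, hour, min, sec.
timestamp = []
remainder = base36(name[:6])
for unit in (13 * 32 * 24 * 3600, 32 * 24 * 3600, 24 * 3600, 3600, 60):
    value, remainder = divmod(remainder, unit)
    timestamp.append(value)
timestamp.append(remainder)
timestamp[0] += 2000

print(region, (x, y), timestamp)  # -> 2 (2434, 1658) [2008, 1, 9, 16, 16, 0]
```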
  92   
  93  As a convenience, you can read the file with SeqIO format name "sff-trim" 
  94  instead of "sff" to get just the trimmed sequences (without any annotation 
  95  except for the PHRED quality scores and anything encoded in the read names): 
  96   
  97      >>> from Bio import SeqIO 
  98      >>> for record in SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff-trim"): 
  99      ...     print("%s %i %s..." % (record.id, len(record), record.seq[:20])) 
 100      E3MFGYR02JWQ7T 260 GGTCTACATGTTGGTTAACC... 
 101      E3MFGYR02JA6IL 265 TTTTTTTTGGAAAGGAAAAC... 
 102      E3MFGYR02JHD4H 292 AAAGACAAGTGGTATCAACG... 
 103      E3MFGYR02GFKUC 295 CGGCCGGGCCTCTCATCGGT... 
 104      E3MFGYR02FTGED 277 TGGTAATGGGGGGAAATTTA... 
 105      E3MFGYR02FR9G7 256 CTCCGTAAGAAGGTGCTGCC... 
 106      E3MFGYR02GAZMS 271 AAAGAAGTAAGGTAAATAAC... 
 107      E3MFGYR02HHZ8O 150 ACTTTCTTCTTTACCGTAAC... 
 108      E3MFGYR02GPGB1 221 AAGCAGTGGTATCAACGCAG... 
 109      E3MFGYR02F7Z7G 130 AATCATCCACTTTTTAACGT... 
 110   
 111  Looking at the final record in more detail, note how this differs from the 
 112  example above: 
 113   
 114      >>> print("%s %i" % (record.id, len(record))) 
 115      E3MFGYR02F7Z7G 130 
 116      >>> print("%s..." % record.seq[:10]) 
 117      AATCATCCAC... 
 118      >>> print("%r..." % record.letter_annotations["phred_quality"][:10]) 
 119      [26, 15, 12, 21, 28, 21, 36, 28, 27, 27]... 
 120      >>> len(record.annotations) 
 121      3 
 122      >>> print(record.annotations["region"]) 
 123      2 
 124      >>> print(record.annotations["coords"]) 
 125      (2434, 1658) 
 126      >>> print(record.annotations["time"]) 
 127      [2008, 1, 9, 16, 16, 0] 
 128   
 129  You might use the Bio.SeqIO.convert() function to convert the (trimmed) SFF 
 130  reads into a FASTQ file (or a FASTA file and a QUAL file), e.g. 
 131   
 132      >>> from Bio import SeqIO 
 133      >>> try: 
 134      ...     from StringIO import StringIO # Python 2 
 135      ... except ImportError: 
 136      ...     from io import StringIO # Python 3 
 137      ... 
 138      >>> out_handle = StringIO() 
 139      >>> count = SeqIO.convert("Roche/E3MFGYR02_random_10_reads.sff", "sff", 
 140      ...                       out_handle, "fastq") 
 141      >>> print("Converted %i records" % count) 
 142      Converted 10 records 
 143   
 144  The output FASTQ file would start like this: 
 145   
 146      >>> print("%s..." % out_handle.getvalue()[:50]) 
 147      @E3MFGYR02JWQ7T 
 148      tcagGGTCTACATGTTGGTTAACCCGTACTGATT... 
 149   
 150  Bio.SeqIO.index() provides memory efficient random access to the reads in an 
 151  SFF file by name. SFF files can include an index within the file, which can 
 152  be read in, making this very fast. If the index is missing (or in a format not 
 153  yet supported in Biopython) the file is indexed by scanning all the reads - 
 154  which is a little slower. For example, 
 155   
 156      >>> from Bio import SeqIO 
 157      >>> reads = SeqIO.index("Roche/E3MFGYR02_random_10_reads.sff", "sff") 
 158      >>> record = reads["E3MFGYR02JHD4H"] 
 159      >>> print("%s %i %s..." % (record.id, len(record), record.seq[:20])) 
 160      E3MFGYR02JHD4H 310 tcagAAAGACAAGTGGTATC... 
 161   
 162  Or, using the trimmed reads: 
 163   
 164      >>> from Bio import SeqIO 
 165      >>> reads = SeqIO.index("Roche/E3MFGYR02_random_10_reads.sff", "sff-trim") 
 166      >>> record = reads["E3MFGYR02JHD4H"] 
 167      >>> print("%s %i %s..." % (record.id, len(record), record.seq[:20])) 
 168      E3MFGYR02JHD4H 292 AAAGACAAGTGGTATCAACG... 
 169      >>> reads.close() 
 170   
 171  You can also use the Bio.SeqIO.write() function with the "sff" format. Note 
 172  that this requires all the flow information etc, and thus is probably only 
 173  useful for SeqRecord objects originally from reading another SFF file (and 
 174  not the trimmed SeqRecord objects from parsing an SFF file as "sff-trim"). 
 175   
 176  As an example, let's pretend this example SFF file represents some DNA which 
 177  was pre-amplified with a PCR primer AAAGANNNNN. The following script would 
 178  produce a sub-file containing all those reads whose post-quality clipping 
 179  region (i.e. the sequence after trimming) starts with AAAGA exactly (the non- 
 180  degenerate bit of this pretend primer): 
 181   
 182      >>> from Bio import SeqIO 
 183      >>> records = (record for record in 
 184      ...            SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff") 
 185      ...            if record.seq[record.annotations["clip_qual_left"]:].startswith("AAAGA")) 
 186      >>> count = SeqIO.write(records, "temp_filtered.sff", "sff") 
 187      >>> print("Selected %i records" % count) 
 188      Selected 2 records 
 189   
 190  Of course, for an assembly you would probably want to remove these primers. 
 191  If you want FASTA or FASTQ output, you could just slice the SeqRecord. However, 
 192  if you want SFF output we have to preserve all the flow information - the trick 
 193  is just to adjust the left clip position! 
 194   
 195      >>> from Bio import SeqIO 
 196      >>> def filter_and_trim(records, primer): 
 197      ...     for record in records: 
 198      ...         if record.seq[record.annotations["clip_qual_left"]:].startswith(primer): 
 199      ...             record.annotations["clip_qual_left"] += len(primer) 
 200      ...             yield record 
 201      >>> records = SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff") 
 202      >>> count = SeqIO.write(filter_and_trim(records, "AAAGA"), 
 203      ...                     "temp_filtered.sff", "sff") 
 204      >>> print("Selected %i records" % count) 
 205      Selected 2 records 
 206   
 207  We can check the results; note the lower case clipped region now includes the "AAAGA" 
 208  sequence: 
 209   
 210      >>> for record in SeqIO.parse("temp_filtered.sff", "sff"): 
 211      ...     print("%s %i %s..." % (record.id, len(record), record.seq[:20])) 
 212      E3MFGYR02JHD4H 310 tcagaaagaCAAGTGGTATC... 
 213      E3MFGYR02GAZMS 278 tcagaaagaAGTAAGGTAAA... 
 214      >>> for record in SeqIO.parse("temp_filtered.sff", "sff-trim"): 
 215      ...     print("%s %i %s..." % (record.id, len(record), record.seq[:20])) 
 216      E3MFGYR02JHD4H 287 CAAGTGGTATCAACGCAGAG... 
 217      E3MFGYR02GAZMS 266 AGTAAGGTAAATAACAAACG... 
 218      >>> import os 
 219      >>> os.remove("temp_filtered.sff") 
 220   
 221  For a description of the file format, please see the Roche manuals and: 
 222  http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=formats 
 223   
 224  """ 
 225   
 226  from __future__ import print_function 
 227   
 228  from Bio.SeqIO.Interfaces import SequenceWriter 
 229  from Bio import Alphabet 
 230  from Bio.Seq import Seq 
 231  from Bio.SeqRecord import SeqRecord 
 232  import struct 
 233  import sys 
 234  import re 
 235   
 236  from Bio._py3k import _bytes_to_string, _as_bytes 
 237  _null = b"\0" 
 238  _sff = b".sff" 
 239  _hsh = b".hsh" 
 240  _srt = b".srt" 
 241  _mft = b".mft" 
 242  _flag = b"\xff" 
 243   
 244   
 245  def _sff_file_header(handle):
 246      """Read in an SFF file header (PRIVATE).
 247  
 248      Assumes the handle is at the start of the file, will read forwards
 249      through the header and leave the handle pointing at the first record.
 250      Returns a tuple of values from the header (header_length, index_offset,
 251      index_length, number_of_reads, flows_per_read, flow_chars, key_sequence)
 252  
 253      >>> with open("Roche/greek.sff", "rb") as handle:
 254      ...     values = _sff_file_header(handle)
 255      >>> print(values[0])
 256      840
 257      >>> print(values[1])
 258      65040
 259      >>> print(values[2])
 260      256
 261      >>> print(values[3])
 262      24
 263      >>> print(values[4])
 264      800
 265      >>> values[-1]
 266      'TCAG'
 267  
 268      """
 269      if hasattr(handle, "mode") and "U" in handle.mode.upper():
 270          raise ValueError("SFF files must NOT be opened in universal new "
 271                           "lines mode. Binary mode is recommended (although "
 272                           "on Unix the default mode is also fine).")
 273      elif hasattr(handle, "mode") and "B" not in handle.mode.upper() \
 274              and sys.platform == "win32":
 275          raise ValueError("SFF files must be opened in binary mode on Windows")
 276      # file header (part one)
 277      # use big endian encoding >
 278      # magic_number               I
 279      # version                    4B
 280      # index_offset               Q
 281      # index_length               I
 282      # number_of_reads            I
 283      # header_length              H
 284      # key_length                 H
 285      # number_of_flows_per_read   H
 286      # flowgram_format_code       B
 287      # [rest of file header depends on the number of flows and how many keys]
 288      fmt = '>4s4BQIIHHHB'
 289      assert 31 == struct.calcsize(fmt)
 290      data = handle.read(31)
 291      if not data:
 292          raise ValueError("Empty file.")
 293      elif len(data) < 13:
 294          raise ValueError("File too small to hold a valid SFF header.")
 295      magic_number, ver0, ver1, ver2, ver3, index_offset, index_length, \
 296          number_of_reads, header_length, key_length, number_of_flows_per_read, \
 297          flowgram_format = struct.unpack(fmt, data)
 298      if magic_number in [_hsh, _srt, _mft]:
 299          # Probably user error, calling Bio.SeqIO.parse() twice!
 300          raise ValueError("Handle seems to be at SFF index block, not start")
 301      if magic_number != _sff:  # 779314790
 302          raise ValueError("SFF file did not start '.sff', but %s"
 303                           % repr(magic_number))
 304      if (ver0, ver1, ver2, ver3) != (0, 0, 0, 1):
 305          raise ValueError("Unsupported SFF version in header, %i.%i.%i.%i"
 306                           % (ver0, ver1, ver2, ver3))
 307      if flowgram_format != 1:
 308          raise ValueError("Flowgram format code %i not supported"
 309                           % flowgram_format)
 310      if (index_offset != 0) ^ (index_length != 0):
 311          raise ValueError("Index offset %i but index length %i"
 312                           % (index_offset, index_length))
 313      flow_chars = _bytes_to_string(handle.read(number_of_flows_per_read))
 314      key_sequence = _bytes_to_string(handle.read(key_length))
 315      # According to the spec, the header_length field should be the total number
 316      # of bytes required by this set of header fields, and should be equal to
 317      # "31 + number_of_flows_per_read + key_length" rounded up to the next value
 318      # divisible by 8.
 319      assert header_length % 8 == 0
 320      padding = header_length - number_of_flows_per_read - key_length - 31
 321      assert 0 <= padding < 8, padding
 322      if handle.read(padding).count(_null) != padding:
 323          import warnings
 324          from Bio import BiopythonParserWarning
 325          warnings.warn("Your SFF file is invalid, post header %i byte "
 326                        "null padding region contained data." % padding,
 327                        BiopythonParserWarning)
 328      return header_length, index_offset, index_length, \
 329          number_of_reads, number_of_flows_per_read, \
 330          flow_chars, key_sequence
 331  
 332  
 333  def _sff_do_slow_index(handle):
 334      """Generates an index by scanning through all the reads in an SFF file (PRIVATE).
 335  
 336      This is a slow but generic approach if we can't parse the provided index
 337      (if present).
 338  
 339      Will use the handle seek/tell functions.
 340      """
 341      handle.seek(0)
 342      header_length, index_offset, index_length, number_of_reads, \
 343          number_of_flows_per_read, flow_chars, key_sequence \
 344          = _sff_file_header(handle)
 345      # Now on to the reads...
 346      read_header_fmt = '>2HI4H'
 347      read_header_size = struct.calcsize(read_header_fmt)
 348      # NOTE - assuming flowgram_format==1, which means struct type H
 349      read_flow_fmt = ">%iH" % number_of_flows_per_read
 350      read_flow_size = struct.calcsize(read_flow_fmt)
 351      assert 1 == struct.calcsize(">B")
 352      assert 1 == struct.calcsize(">s")
 353      assert 1 == struct.calcsize(">c")
 354      assert read_header_size % 8 == 0  # Important for padding calc later!
 355      for read in range(number_of_reads):
 356          record_offset = handle.tell()
 357          if record_offset == index_offset:
 358              # Found index block within reads, ignore it:
 359              offset = index_offset + index_length
 360              if offset % 8:
 361                  offset += 8 - (offset % 8)
 362              assert offset % 8 == 0
 363              handle.seek(offset)
 364              record_offset = offset
 365          # assert record_offset % 8 == 0  # Worth checking, but slow
 366          # First the fixed header
 367          data = handle.read(read_header_size)
 368          read_header_length, name_length, seq_len, clip_qual_left, \
 369              clip_qual_right, clip_adapter_left, clip_adapter_right \
 370              = struct.unpack(read_header_fmt, data)
 371          if read_header_length < 10 or read_header_length % 8 != 0:
 372              raise ValueError("Malformed read header, says length is %i:\n%s"
 373                               % (read_header_length, repr(data)))
 374          # now the name and any padding (remainder of header)
 375          name = _bytes_to_string(handle.read(name_length))
 376          padding = read_header_length - read_header_size - name_length
 377          if handle.read(padding).count(_null) != padding:
 378              import warnings
 379              from Bio import BiopythonParserWarning
 380              warnings.warn("Your SFF file is invalid, post name %i byte "
 381                            "padding region contained data" % padding,
 382                            BiopythonParserWarning)
 383          assert record_offset + read_header_length == handle.tell()
 384          # now the flowgram values, flowgram index, bases and qualities
 385          size = read_flow_size + 3 * seq_len
 386          handle.seek(size, 1)
 387          # now any padding...
 388          padding = size % 8
 389          if padding:
 390              padding = 8 - padding
 391              if handle.read(padding).count(_null) != padding:
 392                  import warnings
 393                  from Bio import BiopythonParserWarning
 394                  warnings.warn("Your SFF file is invalid, post quality %i "
 395                                "byte padding region contained data" % padding,
 396                                BiopythonParserWarning)
 397          # print("%s %s %i" % (read, name, record_offset))
 398          yield name, record_offset
 399      if handle.tell() % 8 != 0:
 400          raise ValueError(
 401              "After scanning reads, did not end on a multiple of 8")
 402  
 403  
 404  def _sff_find_roche_index(handle):
 405      """Locate any existing Roche style XML meta data and read index (PRIVATE).
 406  
 407      Makes a number of hard coded assumptions based on reverse engineered SFF
 408      files from Roche 454 machines.
 409  
 410      Returns a tuple of read count, SFF "index" offset and size, XML offset
 411      and size, and the actual read index offset and size.
 412  
 413      Raises a ValueError for unsupported or non-Roche index blocks.
 414      """
 415      handle.seek(0)
 416      header_length, index_offset, index_length, number_of_reads, \
 417          number_of_flows_per_read, flow_chars, key_sequence \
 418          = _sff_file_header(handle)
 419      assert handle.tell() == header_length
 420      if not index_offset or not index_length:
 421          raise ValueError("No index present in this SFF file")
 422      # Now jump to the header...
 423      handle.seek(index_offset)
 424      fmt = ">4s4B"
 425      fmt_size = struct.calcsize(fmt)
 426      data = handle.read(fmt_size)
 427      if not data:
 428          raise ValueError("Premature end of file? Expected index of size %i at offset %i, found nothing"
 429                           % (index_length, index_offset))
 430      if len(data) < fmt_size:
 431          raise ValueError("Premature end of file? Expected index of size %i at offset %i, found %s"
 432                           % (index_length, index_offset, repr(data)))
 433      magic_number, ver0, ver1, ver2, ver3 = struct.unpack(fmt, data)
 434      if magic_number == _mft:  # 778921588
 435          # Roche 454 manifest index
 436          # This is typical from raw Roche 454 SFF files (2009), and includes
 437          # both an XML manifest and the sorted index.
 438          if (ver0, ver1, ver2, ver3) != (49, 46, 48, 48):
 439              # This is "1.00" as a string
 440              raise ValueError("Unsupported version in .mft index header, %i.%i.%i.%i"
 441                               % (ver0, ver1, ver2, ver3))
 442          fmt2 = ">LL"
 443          fmt2_size = struct.calcsize(fmt2)
 444          xml_size, data_size = struct.unpack(fmt2, handle.read(fmt2_size))
 445          if index_length != fmt_size + fmt2_size + xml_size + data_size:
 446              raise ValueError("Problem understanding .mft index header, %i != %i + %i + %i + %i"
 447                               % (index_length, fmt_size, fmt2_size, xml_size, data_size))
 448          return number_of_reads, header_length, \
 449              index_offset, index_length, \
 450              index_offset + fmt_size + fmt2_size, xml_size, \
 451              index_offset + fmt_size + fmt2_size + xml_size, data_size
 452      elif magic_number == _srt:  # 779317876
 453          # Roche 454 sorted index
 454          # I've had this from Roche tool sfffile when the read identifiers
 455          # had nonstandard lengths and there was no XML manifest.
 456          if (ver0, ver1, ver2, ver3) != (49, 46, 48, 48):
 457              # This is "1.00" as a string
 458              raise ValueError("Unsupported version in .srt index header, %i.%i.%i.%i"
 459                               % (ver0, ver1, ver2, ver3))
 460          data = handle.read(4)
 461          if data != _null * 4:
 462              raise ValueError(
 463                  "Did not find expected null four bytes in .srt index")
 464          return number_of_reads, header_length, \
 465              index_offset, index_length, \
 466              0, 0, \
 467              index_offset + fmt_size + 4, index_length - fmt_size - 4
 468      elif magic_number == _hsh:
 469          raise ValueError("Hash table style indexes (.hsh) in SFF files are "
 470                           "not (yet) supported")
 471      else:
 472          raise ValueError("Unknown magic number %s in SFF index header:\n%s"
 473                           % (repr(magic_number), repr(data)))
 474  
 475  
 476  def ReadRocheXmlManifest(handle):
 477      """Reads any Roche style XML manifest data in the SFF "index".
 478  
 479      The SFF file format allows for multiple different index blocks, and Roche
 480      took advantage of this to define their own index block which also embeds
 481      an XML manifest string. This is not a publicly documented extension to
 482      the SFF file format; it was reverse engineered.
 483  
 484      The handle should be to an SFF file opened in binary mode. This function
 485      will use the handle seek/tell functions and leave the handle in an
 486      arbitrary location.
 487  
 488      Any XML manifest found is returned as a Python string, which you can then
 489      parse as appropriate, or reuse when writing out SFF files with the
 490      SffWriter class.
 491  
 492      Returns a string, or raises a ValueError if a Roche manifest could not be
 493      found.
 494      """
 495      number_of_reads, header_length, index_offset, index_length, xml_offset, \
 496          xml_size, read_index_offset, read_index_size = _sff_find_roche_index(
 497              handle)
 498      if not xml_offset or not xml_size:
 499          raise ValueError("No XML manifest found")
 500      handle.seek(xml_offset)
 501      return _bytes_to_string(handle.read(xml_size))
 502  
 503  
 504  # This is a generator function!
 505  def _sff_read_roche_index(handle):
 506      """Reads any existing Roche style read index provided in the SFF file (PRIVATE).
 507  
 508      Will use the handle seek/tell functions.
 509  
 510      This works on ".srt1.00" and ".mft1.00" style Roche SFF index blocks.
 511  
 512      Roche SFF indices use base 255 not 256, meaning we see bytes in the
 513      range 0 to 254 only. This appears to be so that byte 0xFF (character 255)
 514      can be used as a marker character to separate entries (required if the
 515      read name lengths vary).
 516  
 517      Note that since only four bytes are used for the read offset, this is
 518      limited to 255^4 bytes (nearly 4GB). If you try to use the Roche sfffile
 519      tool to combine SFF files beyond this limit, they issue a warning and
 520      omit the index (and manifest).
 521      """
 522      number_of_reads, header_length, index_offset, index_length, xml_offset, \
 523          xml_size, read_index_offset, read_index_size = _sff_find_roche_index(
 524              handle)
 525      # Now parse the read index...
 526      handle.seek(read_index_offset)
 527      fmt = ">5B"
 528      for read in range(number_of_reads):
 529          # TODO - Be more aware of when the index should end?
 530          data = handle.read(6)
 531          while True:
 532              more = handle.read(1)
 533              if not more:
 534                  raise ValueError("Premature end of file!")
 535              data += more
 536              if more == _flag:
 537                  break
 538          assert data[-1:] == _flag, data[-1:]
 539          name = _bytes_to_string(data[:-6])
 540          off4, off3, off2, off1, off0 = struct.unpack(fmt, data[-6:-1])
 541          offset = off0 + 255 * off1 + 65025 * off2 + 16581375 * off3
 542          if off4:
 543              # Could in theory be used as a fifth piece of offset information,
 544              # i.e. offset += 4228250625 * off4, but testing the Roche tools this
 545              # is not the case. They simply don't support such large indexes.
 546              raise ValueError("Expected a null terminator to the read name.")
 547          yield name, offset
 548      if handle.tell() != read_index_offset + read_index_size:
 549          raise ValueError("Problem with index length? %i vs %i"
 550                           % (handle.tell(), read_index_offset + read_index_size))
 551  
 552  _valid_UAN_read_name = re.compile(r'^[a-zA-Z0-9]{14}$')
 553  
 554  
 555  def _sff_read_seq_record(handle, number_of_flows_per_read, flow_chars,
 556                           key_sequence, alphabet, trim=False):
 557      """Parse the next read in the file, return data as a SeqRecord (PRIVATE)."""
 558      # Now on to the reads...
 559      # the read header format (fixed part):
 560      # read_header_length     H
 561      # name_length            H
 562      # seq_len                I
 563      # clip_qual_left         H
 564      # clip_qual_right        H
 565      # clip_adapter_left      H
 566      # clip_adapter_right     H
 567      # [rest of read header depends on the name length etc]
 568      read_header_fmt = '>2HI4H'
 569      read_header_size = struct.calcsize(read_header_fmt)
 570      read_flow_fmt = ">%iH" % number_of_flows_per_read
 571      read_flow_size = struct.calcsize(read_flow_fmt)
 572  
 573      read_header_length, name_length, seq_len, clip_qual_left, \
 574          clip_qual_right, clip_adapter_left, clip_adapter_right \
 575          = struct.unpack(read_header_fmt, handle.read(read_header_size))
 576      if clip_qual_left:
 577          clip_qual_left -= 1  # python counting
 578      if clip_adapter_left:
 579          clip_adapter_left -= 1  # python counting
 580      if read_header_length < 10 or read_header_length % 8 != 0:
 581          raise ValueError("Malformed read header, says length is %i"
 582                           % read_header_length)
 583      # now the name and any padding (remainder of header)
 584      name = _bytes_to_string(handle.read(name_length))
 585      padding = read_header_length - read_header_size - name_length
 586      if handle.read(padding).count(_null) != padding:
 587          import warnings
 588          from Bio import BiopythonParserWarning
 589          warnings.warn("Your SFF file is invalid, post name %i "
 590                        "byte padding region contained data" % padding,
 591                        BiopythonParserWarning)
 592      # now the flowgram values, flowgram index, bases and qualities
 593      # NOTE - assuming flowgram_format==1, which means struct type H
 594      flow_values = handle.read(read_flow_size)  # unpack later if needed
 595      temp_fmt = ">%iB" % seq_len  # used for flow index and quals
 596      flow_index = handle.read(seq_len)  # unpack later if needed
 597      seq = _bytes_to_string(handle.read(seq_len))  # TODO - Use bytes in Seq?
 598      quals = list(struct.unpack(temp_fmt, handle.read(seq_len)))
 599      # now any padding...
 600      padding = (read_flow_size + seq_len * 3) % 8
 601      if padding:
 602          padding = 8 - padding
 603          if handle.read(padding).count(_null) != padding:
 604              import warnings
 605              from Bio import BiopythonParserWarning
 606              warnings.warn("Your SFF file is invalid, post quality %i "
 607                            "byte padding region contained data" % padding,
 608                            BiopythonParserWarning)
 609      # Follow Roche and apply most aggressive of qual and adapter clipping.
 610      # Note Roche seems to ignore adapter clip fields when writing SFF,
 611      # and uses just the quality clipping values for any clipping.
 612      clip_left = max(clip_qual_left, clip_adapter_left)
 613      # Right clipping of zero means no clipping
 614      if clip_qual_right:
 615          if clip_adapter_right:
 616              clip_right = min(clip_qual_right, clip_adapter_right)
 617          else:
 618              # Typical case with Roche SFF files
 619              clip_right = clip_qual_right
 620      elif clip_adapter_right:
 621          clip_right = clip_adapter_right
 622      else:
 623          clip_right = seq_len
 624      # Now build a SeqRecord
 625      if trim:
 626          seq = seq[clip_left:clip_right].upper()
 627          quals = quals[clip_left:clip_right]
 628          # Don't record the clipping values, flow etc, they make no sense now:
 629          annotations = {}
 630      else:
 631          # This use of mixed case mimics the Roche SFF tool's FASTA output
 632          seq = seq[:clip_left].lower() + \
 633              seq[clip_left:clip_right].upper() + \
 634              seq[clip_right:].lower()
 635          annotations = {"flow_values": struct.unpack(read_flow_fmt, flow_values),
 636                         "flow_index": struct.unpack(temp_fmt, flow_index),
 637                         "flow_chars": flow_chars,
 638                         "flow_key": key_sequence,
 639                         "clip_qual_left": clip_qual_left,
 640                         "clip_qual_right": clip_qual_right,
 641                         "clip_adapter_left": clip_adapter_left,
 642                         "clip_adapter_right": clip_adapter_right}
 643      if re.match(_valid_UAN_read_name, name):
 644          annotations["time"] = _get_read_time(name)
 645          annotations["region"] = _get_read_region(name)
 646          annotations["coords"] = _get_read_xy(name)
 647      record = SeqRecord(Seq(seq, alphabet),
 648                         id=name,
 649                         name=name,
 650                         description="",
 651                         annotations=annotations)
 652      # Dirty trick to speed up this line:
 653      # record.letter_annotations["phred_quality"] = quals
 654      dict.__setitem__(record._per_letter_annotations,
 655                       "phred_quality", quals)
 656      # Return the record and then continue...
 657      return record
 658  
 659  _powers_of_36 = [36 ** i for i in range(6)]
 660  
 661  
 662  def _string_as_base_36(string):
 663      """Interpret a string as a base-36 number as per 454 manual."""
 664      total = 0
 665      for c, power in zip(string[::-1], _powers_of_36):
 666          # For reference: ord('0') = 48, ord('9') = 57
 667          # For reference: ord('A') = 65, ord('Z') = 90
 668          # For reference: ord('a') = 97, ord('z') = 122
 669          if 48 <= ord(c) <= 57:
 670              val = ord(c) - 22  # equivalent to: - ord('0') + 26
 671          elif 65 <= ord(c) <= 90:
 672              val = ord(c) - 65
 673          elif 97 <= ord(c) <= 122:
 674              val = ord(c) - 97
 675          else:
 676              # Invalid character
 677              val = 0
 678          total += val * power
 679      return total
 680  
 681  
 682  def _get_read_xy(read_name):
 683      """Extract coordinates from last 5 characters of read name."""
 684      number = _string_as_base_36(read_name[9:])
 685      return divmod(number, 4096)
 686  
 687  _time_denominators = [13 * 32 * 24 * 60 * 60,
 688                        32 * 24 * 60 * 60,
 689                        24 * 60 * 60,
 690                        60 * 60,
 691                        60]
 692  
 693  
 694  def _get_read_time(read_name):
 695      """Extract time from first 6 characters of read name."""
 696      time_list = []
 697      remainder = _string_as_base_36(read_name[:6])
 698      for denominator in _time_denominators:
 699          this_term, remainder = divmod(remainder, denominator)
 700          time_list.append(this_term)
 701      time_list.append(remainder)
 702      time_list[0] += 2000
 703      return time_list
 704  
 705  
 706  def _get_read_region(read_name):
 707      """Extract region from read name."""
 708      return int(read_name[8])
 709  
 710  
711 -def _sff_read_raw_record(handle, number_of_flows_per_read):
712 """Extract the next read in the file as a raw (bytes) string (PRIVATE).""" 713 read_header_fmt = '>2HI' 714 read_header_size = struct.calcsize(read_header_fmt) 715 read_flow_fmt = ">%iH" % number_of_flows_per_read 716 read_flow_size = struct.calcsize(read_flow_fmt) 717 718 raw = handle.read(read_header_size) 719 read_header_length, name_length, seq_len \ 720 = struct.unpack(read_header_fmt, raw) 721 if read_header_length < 10 or read_header_length % 8 != 0: 722 raise ValueError("Malformed read header, says length is %i" 723 % read_header_length) 724 #now the four clip values (4H = 8 bytes), and read name 725 raw += handle.read(8 + name_length) 726 #and any padding (remainder of header) 727 padding = read_header_length - read_header_size - 8 - name_length 728 pad = handle.read(padding) 729 if pad.count(_null) != padding: 730 import warnings 731 from Bio import BiopythonParserWarning 732 warnings.warn("Your SFF file is invalid, post name %i " 733 "byte padding region contained data" % padding, 734 BiopythonParserWarning) 735 raw += pad 736 #now the flowgram values, flowgram index, bases and qualities 737 raw += handle.read(read_flow_size + seq_len * 3) 738 padding = (read_flow_size + seq_len * 3) % 8 739 #now any padding... 740 if padding: 741 padding = 8 - padding 742 pad = handle.read(padding) 743 if pad.count(_null) != padding: 744 import warnings 745 from Bio import BiopythonParserWarning 746 warnings.warn("Your SFF file is invalid, post quality %i " 747 "byte padding region contained data" % padding, 748 BiopythonParserWarning) 749 raw += pad 750 #Return the raw bytes 751 return raw
752 753
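Every SFF read record is padded to an 8-byte boundary in two places: once after the read name, and once after the qualities. A minimal sketch of the size arithmetic the raw-record reader above performs (the helper name here is made up for illustration):

```python
import struct


def padded_record_size(name_length, seq_len, number_of_flows_per_read):
    """On-disk size of one SFF read record, both pads included.

    Mirrors the reader above: a fixed 16-byte header part (">2HI4H")
    plus the four clip values and the name, padded to 8 bytes; then
    flow values (2 bytes per flow) plus flow index, bases and
    qualities (1 byte per base each), padded to 8 bytes again.
    """
    header = struct.calcsize(">2HI4H") + name_length
    header += -header % 8  # post-name null padding
    data = struct.calcsize(">%iH" % number_of_flows_per_read) + seq_len * 3
    data += -data % 8      # post-quality null padding
    return header + data


# e.g. a 14-character name, 265 bases, 800 flows (typical 454 FLX run):
print(padded_record_size(14, 265, 800))
```

The `-x % 8` idiom rounds up to the next multiple of 8 without the explicit `if padding:` branch used in the listing; both give the same result.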
754 class _AddTellHandle(object):
755 """Wrapper for handles which do not support the tell method (PRIVATE). 756 757 Intended for use with things like network handles where tell (and reverse 758 seek) are not supported. The SFF file needs to track the current offset in 759 order to deal with the index block. 760 """
761 def __init__(self, handle):
762 self._handle = handle 763 self._offset = 0
764
765 def read(self, length):
766 data = self._handle.read(length) 767 self._offset += len(data) 768 return data
769
770 def tell(self):
771 return self._offset
772
773 def seek(self, offset):
774 if offset < self._offset: 775 raise RuntimeError("Can't seek backwards") 776 self._handle.read(offset - self._offset) 777 self._offset = offset
778 def close(self):
779 return self._handle.close()
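The wrapper pattern above can be exercised without a real network handle. This sketch pairs the same offset-tracking logic with a toy read-only stream; both class names are invented for the example, and the forward-only `seek` is implemented as a `read`, exactly as in the private wrapper:

```python
import io


class TellWrapper:
    """Track the offset of a forward-only, read-only handle."""

    def __init__(self, handle):
        self._handle = handle
        self._offset = 0

    def read(self, length):
        data = self._handle.read(length)
        self._offset += len(data)
        return data

    def tell(self):
        return self._offset

    def seek(self, offset):
        # Forward-only: skipping ahead is implemented as a read.
        if offset < self._offset:
            raise RuntimeError("Can't seek backwards")
        self._handle.read(offset - self._offset)
        self._offset = offset


class ForwardOnly:
    """Stand-in for a network handle: read() only, no tell()/seek()."""

    def __init__(self, data):
        self._stream = io.BytesIO(data)

    def read(self, length):
        return self._stream.read(length)


handle = TellWrapper(ForwardOnly(b".sff" + b"\0" * 28))
assert handle.read(4) == b".sff"
handle.seek(16)
assert handle.tell() == 16
```

Note that `seek` must update the recorded offset itself, since it reads from the underlying handle directly rather than through the wrapper's own `read`.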
780 781 782 #This is a generator function!
783 def SffIterator(handle, alphabet=Alphabet.generic_dna, trim=False):
784 """Iterate over Standard Flowgram Format (SFF) reads (as SeqRecord objects). 785 786 handle - input file, an SFF file, e.g. from Roche 454 sequencing. 787 This must NOT be opened in universal read lines mode! 788 alphabet - optional alphabet, defaults to generic DNA. 789 trim - should the sequences be trimmed? 790 791 The resulting SeqRecord objects should match those from a paired FASTA 792 and QUAL file converted from the SFF file using the Roche 454 tool 793 ssfinfo. i.e. The sequence will be mixed case, with the trim regions 794 shown in lower case. 795 796 This function is used internally via the Bio.SeqIO functions: 797 798 >>> from Bio import SeqIO 799 >>> for record in SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff"): 800 ... print("%s %i" % (record.id, len(record))) 801 E3MFGYR02JWQ7T 265 802 E3MFGYR02JA6IL 271 803 E3MFGYR02JHD4H 310 804 E3MFGYR02GFKUC 299 805 E3MFGYR02FTGED 281 806 E3MFGYR02FR9G7 261 807 E3MFGYR02GAZMS 278 808 E3MFGYR02HHZ8O 221 809 E3MFGYR02GPGB1 269 810 E3MFGYR02F7Z7G 219 811 812 You can also call it directly: 813 814 >>> with open("Roche/E3MFGYR02_random_10_reads.sff", "rb") as handle: 815 ... for record in SffIterator(handle): 816 ... print("%s %i" % (record.id, len(record))) 817 E3MFGYR02JWQ7T 265 818 E3MFGYR02JA6IL 271 819 E3MFGYR02JHD4H 310 820 E3MFGYR02GFKUC 299 821 E3MFGYR02FTGED 281 822 E3MFGYR02FR9G7 261 823 E3MFGYR02GAZMS 278 824 E3MFGYR02HHZ8O 221 825 E3MFGYR02GPGB1 269 826 E3MFGYR02F7Z7G 219 827 828 Or, with the trim option: 829 830 >>> with open("Roche/E3MFGYR02_random_10_reads.sff", "rb") as handle: 831 ... for record in SffIterator(handle, trim=True): 832 ... 
print("%s %i" % (record.id, len(record))) 833 E3MFGYR02JWQ7T 260 834 E3MFGYR02JA6IL 265 835 E3MFGYR02JHD4H 292 836 E3MFGYR02GFKUC 295 837 E3MFGYR02FTGED 277 838 E3MFGYR02FR9G7 256 839 E3MFGYR02GAZMS 271 840 E3MFGYR02HHZ8O 150 841 E3MFGYR02GPGB1 221 842 E3MFGYR02F7Z7G 130 843 844 """ 845 if isinstance(Alphabet._get_base_alphabet(alphabet), 846 Alphabet.ProteinAlphabet): 847 raise ValueError("Invalid alphabet, SFF files do not hold proteins.") 848 if isinstance(Alphabet._get_base_alphabet(alphabet), 849 Alphabet.RNAAlphabet): 850 raise ValueError("Invalid alphabet, SFF files do not hold RNA.") 851 try: 852 assert 0 == handle.tell() 853 except AttributeError: 854 #Probably a network handle or something like that 855 handle = _AddTellHandle(handle) 856 header_length, index_offset, index_length, number_of_reads, \ 857 number_of_flows_per_read, flow_chars, key_sequence \ 858 = _sff_file_header(handle) 859 #Now on to the reads... 860 #the read header format (fixed part): 861 #read_header_length H 862 #name_length H 863 #seq_len I 864 #clip_qual_left H 865 #clip_qual_right H 866 #clip_adapter_left H 867 #clip_adapter_right H 868 #[rest of read header depends on the name length etc] 869 read_header_fmt = '>2HI4H' 870 read_header_size = struct.calcsize(read_header_fmt) 871 read_flow_fmt = ">%iH" % number_of_flows_per_read 872 read_flow_size = struct.calcsize(read_flow_fmt) 873 assert 1 == struct.calcsize(">B") 874 assert 1 == struct.calcsize(">s") 875 assert 1 == struct.calcsize(">c") 876 assert read_header_size % 8 == 0 # Important for padding calc later! 877 #The spec allows for the index block to be before or even in the middle 878 #of the reads. We can check that if we keep track of our position 879 #in the file... 
880 for read in range(number_of_reads): 881 if index_offset and handle.tell() == index_offset: 882 offset = index_offset + index_length 883 if offset % 8: 884 offset += 8 - (offset % 8) 885 assert offset % 8 == 0 886 handle.seek(offset) 887 #Now that we've done this, we don't need to do it again. Clear 888 #the index_offset so we can skip extra handle.tell() calls: 889 index_offset = 0 890 yield _sff_read_seq_record(handle, 891 number_of_flows_per_read, 892 flow_chars, 893 key_sequence, 894 alphabet, 895 trim) 896 _check_eof(handle, index_offset, index_length)
897 898
899 def _check_eof(handle, index_offset, index_length):
900 """Check final padding is OK (8 byte alignment) and file ends (PRIVATE). 901 902 Will attempt to spot apparent SFF file concatenation and give an error. 903 904 Will not attempt to seek, only moves the handle forward. 905 """ 906 offset = handle.tell() 907 extra = b"" 908 padding = 0 909 910 if index_offset and offset <= index_offset: 911 # Index block then end of file... 912 if offset < index_offset: 913 raise ValueError("Gap of %i bytes after final record end %i, " 914 "before %i where index starts?" 915 % (index_offset - offset, offset, index_offset)) 916 # Doing read to jump the index rather than a seek 917 # in case this is a network handle or similar 918 handle.read(index_offset + index_length - offset) 919 offset = index_offset + index_length 920 assert offset == handle.tell(), \ 921 "Wanted %i, got %i, index is %i to %i" \ 922 % (offset, handle.tell(), index_offset, index_offset + index_length) 923 924 if offset % 8: 925 padding = 8 - (offset % 8) 926 extra = handle.read(padding) 927 928 if padding >= 4 and extra[-4:] == _sff: 929 #Seen this in one user supplied file, should have been 930 #four bytes of null padding but was actually .sff and 931 #the start of a new concatenated SFF file! 932 raise ValueError("Your SFF file is invalid, post index %i byte " 933 "null padding region ended '.sff' which could " 934 "be the start of a concatenated SFF file? " 935 "See offset %i" % (padding, offset)) 936 if extra.count(_null) != padding: 937 import warnings 938 from Bio import BiopythonParserWarning 939 warnings.warn("Your SFF file is invalid, post index %i byte " 940 "null padding region contained data." % padding, 941 BiopythonParserWarning) 942 943 offset = handle.tell() 944 assert offset % 8 == 0 945 # Should now be at the end of the file... 946 extra = handle.read(4) 947 if extra == _sff: 948 raise ValueError("Additional data at end of SFF file, " 949 "perhaps multiple SFF files concatenated? 
" 950 "See offset %i" % offset) 951 elif extra: 952 raise ValueError("Additional data at end of SFF file, " 953 "see offset %i" % offset)
954 955 956 #This is a generator function!
957 def _SffTrimIterator(handle, alphabet=Alphabet.generic_dna):
958 """Iterate over SFF reads (as SeqRecord objects) with trimming (PRIVATE).""" 959 return SffIterator(handle, alphabet, trim=True)
960 961
962 class SffWriter(SequenceWriter):
963 """SFF file writer.""" 964
965 def __init__(self, handle, index=True, xml=None):
966 """Creates the writer object. 967 968 handle - Output handle, ideally in binary write mode. 969 index - Boolean argument, should we try and write an index? 970 xml - Optional string argument, xml manifest to be recorded in the index 971 block (see function ReadRocheXmlManifest for reading this data). 972 """ 973 if hasattr(handle, "mode") and "U" in handle.mode.upper(): 974 raise ValueError("SFF files must NOT be opened in universal new " 975 "lines mode. Binary mode is required") 976 elif hasattr(handle, "mode") and "B" not in handle.mode.upper(): 977 raise ValueError("SFF files must be opened in binary mode") 978 self.handle = handle 979 self._xml = xml 980 if index: 981 self._index = [] 982 else: 983 self._index = None
984
985 def write_file(self, records):
986 """Use this to write an entire file containing the given records.""" 987 try: 988 self._number_of_reads = len(records) 989 except TypeError: 990 self._number_of_reads = 0 # dummy value 991 if not hasattr(self.handle, "seek") \ 992 or not hasattr(self.handle, "tell"): 993 raise ValueError("A handle with a seek/tell methods is " 994 "required in order to record the total " 995 "record count in the file header (once it " 996 "is known at the end).") 997 if self._index is not None and \ 998 not (hasattr(self.handle, "seek") and hasattr(self.handle, "tell")): 999 import warnings 1000 warnings.warn("A handle with a seek/tell methods is required in " 1001 "order to record an SFF index.") 1002 self._index = None 1003 self._index_start = 0 1004 self._index_length = 0 1005 if not hasattr(records, "next"): 1006 records = iter(records) 1007 #Get the first record in order to find the flow information 1008 #we will need for the header. 1009 try: 1010 record = next(records) 1011 except StopIteration: 1012 record = None 1013 if record is None: 1014 #No records -> empty SFF file (or an error)? 1015 #We can't write a header without the flow information. 1016 #return 0 1017 raise ValueError("Must have at least one sequence") 1018 try: 1019 self._key_sequence = _as_bytes(record.annotations["flow_key"]) 1020 self._flow_chars = _as_bytes(record.annotations["flow_chars"]) 1021 self._number_of_flows_per_read = len(self._flow_chars) 1022 except KeyError: 1023 raise ValueError("Missing SFF flow information") 1024 self.write_header() 1025 self.write_record(record) 1026 count = 1 1027 for record in records: 1028 self.write_record(record) 1029 count += 1 1030 if self._number_of_reads == 0: 1031 #Must go back and record the record count... 1032 offset = self.handle.tell() 1033 self.handle.seek(0) 1034 self._number_of_reads = count 1035 self.write_header() 1036 self.handle.seek(offset) # not essential? 
1037 else: 1038 assert count == self._number_of_reads 1039 if self._index is not None: 1040 self._write_index() 1041 return count
1042
1043 def _write_index(self):
1044 assert len(self._index) == self._number_of_reads 1045 handle = self.handle 1046 self._index.sort() 1047 self._index_start = handle.tell() # need for header 1048 #XML... 1049 if self._xml is not None: 1050 xml = _as_bytes(self._xml) 1051 else: 1052 from Bio import __version__ 1053 xml = "<!-- This file was output with Biopython %s -->\n" % __version__ 1054 xml += "<!-- This XML and index block attempts to mimic Roche SFF files -->\n" 1055 xml += "<!-- This file may be a combination of multiple SFF files etc -->\n" 1056 xml = _as_bytes(xml) 1057 xml_len = len(xml) 1058 #Write to the file... 1059 fmt = ">I4BLL" 1060 fmt_size = struct.calcsize(fmt) 1061 handle.write(_null * fmt_size + xml) # fill this later 1062 fmt2 = ">6B" 1063 assert 6 == struct.calcsize(fmt2) 1064 self._index.sort() 1065 index_len = 0 # don't know yet! 1066 for name, offset in self._index: 1067 #Roche files record the offsets using base 255 not 256. 1068 #See comments for parsing the index block. There may be a faster 1069 #way to code this, but we can't easily use shifts due to odd base 1070 off3 = offset 1071 off0 = off3 % 255 1072 off3 -= off0 1073 off1 = off3 % 65025 1074 off3 -= off1 1075 off2 = off3 % 16581375 1076 off3 -= off2 1077 assert offset == off0 + off1 + off2 + off3, \ 1078 "%i -> %i %i %i %i" % (offset, off0, off1, off2, off3) 1079 off3, off2, off1, off0 = off3 // 16581375, off2 // 65025, \ 1080 off1 // 255, off0 1081 assert off0 < 255 and off1 < 255 and off2 < 255 and off3 < 255, \ 1082 "%i -> %i %i %i %i" % (offset, off0, off1, off2, off3) 1083 handle.write(name + struct.pack(fmt2, 0, 1084 off3, off2, off1, off0, 255)) 1085 index_len += len(name) + 6 1086 #Note any padding is not included: 1087 self._index_length = fmt_size + xml_len + index_len # need for header 1088 #Pad out to an 8 byte boundary (although I have noticed some 1089 #real Roche SFF files neglect to do this despite their manual 1090 #suggesting this padding should be there): 1091 if self._index_length % 8: 
1092 padding = 8 - (self._index_length % 8) 1093 handle.write(_null * padding) 1094 else: 1095 padding = 0 1096 offset = handle.tell() 1097 assert offset == self._index_start + self._index_length + padding, \ 1098 "%i vs %i + %i + %i" % (offset, self._index_start, 1099 self._index_length, padding) 1100 #Must now go back and update the index header with index size... 1101 handle.seek(self._index_start) 1102 handle.write(struct.pack(fmt, 778921588, # magic number 1103 49, 46, 48, 48, # Roche index version, "1.00" 1104 xml_len, index_len) + xml) 1105 #Must now go back and update the header... 1106 handle.seek(0) 1107 self.write_header() 1108 handle.seek(offset) # not essential?
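The base-255 offset packing in `_write_index` can be written as a pair of small helpers. This sketch (hypothetical names, same arithmetic as the loop above) round-trips an offset through the four digits stored after each read name; as the writer's comments note, offsets at or beyond 255**4 = 4228250625 do not fit:

```python
def encode_offset(offset):
    """Split an offset into four base-255 digits, most significant first."""
    if not 0 <= offset < 255 ** 4:
        raise ValueError("Offset %i out of range for a Roche index" % offset)
    digits = []
    for power in (255 ** 3, 255 ** 2, 255, 1):
        digit, offset = divmod(offset, power)
        digits.append(digit)
    return digits


def decode_offset(digits):
    """Inverse of encode_offset: combine base-255 digits into an offset."""
    value = 0
    for digit in digits:
        value = value * 255 + digit
    return value


print(encode_offset(123456789))
```

Because the base is 255 rather than 256, each digit fits in a byte with 255 itself left free, which the writer uses as a record terminator (`struct.pack(fmt2, 0, off3, off2, off1, off0, 255)`).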
1109
1110 def write_header(self):
1111 #Do header... 1112 key_length = len(self._key_sequence) 1113 #file header (part one) 1114 #use big endian encoding > 1115 #magic_number I 1116 #version 4B 1117 #index_offset Q 1118 #index_length I 1119 #number_of_reads I 1120 #header_length H 1121 #key_length H 1122 #number_of_flows_per_read H 1123 #flowgram_format_code B 1124 #[rest of file header depends on the number of flows and how many keys] 1125 fmt = '>I4BQIIHHHB%is%is' % ( 1126 self._number_of_flows_per_read, key_length) 1127 #According to the spec, the header_length field should be the total 1128 #number of bytes required by this set of header fields, and should be 1129 #equal to "31 + number_of_flows_per_read + key_length" rounded up to 1130 #the next value divisible by 8. 1131 if struct.calcsize(fmt) % 8 == 0: 1132 padding = 0 1133 else: 1134 padding = 8 - (struct.calcsize(fmt) % 8) 1135 header_length = struct.calcsize(fmt) + padding 1136 assert header_length % 8 == 0 1137 header = struct.pack(fmt, 779314790, # magic number 0x2E736666 1138 0, 0, 0, 1, # version 1139 self._index_start, self._index_length, 1140 self._number_of_reads, 1141 header_length, key_length, 1142 self._number_of_flows_per_read, 1143 1, # the only flowgram format code we support 1144 self._flow_chars, self._key_sequence) 1145 self.handle.write(header + _null * padding)
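The header-length rule quoted in `write_header` ("31 + number_of_flows_per_read + key_length" rounded up to a multiple of 8) is easy to check with `struct`. A small sketch using the same fixed-field format string as the writer (the helper name is invented for the example):

```python
import struct

# Fixed fields of the SFF common header, exactly as listed above:
# magic I, version 4B, index_offset Q, index_length I, number_of_reads I,
# header_length H, key_length H, number_of_flows_per_read H, format_code B
FIXED_FMT = ">I4BQIIHHHB"
assert struct.calcsize(FIXED_FMT) == 31  # the 31 in the spec's formula


def header_length(number_of_flows_per_read, key_length):
    """Total header length: fixed part + flow chars + key, 8-byte aligned."""
    unpadded = 31 + number_of_flows_per_read + key_length
    return unpadded + (-unpadded % 8)


# A typical Roche 454 file: 800 flows and the 4-byte key "TCAG":
print(header_length(800, 4))
```

Big-endian `struct` formats have no alignment padding of their own, which is why `calcsize` gives exactly the byte-for-byte size of the fixed fields.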
1146
1147 def write_record(self, record):
1148 """Write a single additional record to the output file. 1149 1150 This assumes the header has been done. 1151 """ 1152 #Basics 1153 name = _as_bytes(record.id) 1154 name_len = len(name) 1155 seq = _as_bytes(str(record.seq).upper()) 1156 seq_len = len(seq) 1157 #Qualities 1158 try: 1159 quals = record.letter_annotations["phred_quality"] 1160 except KeyError: 1161 raise ValueError("Missing PHRED qualities information") 1162 #Flow 1163 try: 1164 flow_values = record.annotations["flow_values"] 1165 flow_index = record.annotations["flow_index"] 1166 if self._key_sequence != _as_bytes(record.annotations["flow_key"]) \ 1167 or self._flow_chars != _as_bytes(record.annotations["flow_chars"]): 1168 raise ValueError("Records have inconsistent SFF flow data") 1169 except KeyError: 1170 raise ValueError("Missing SFF flow information") 1171 except AttributeError: 1172 raise ValueError("Header not written yet?") 1173 #Clipping 1174 try: 1175 clip_qual_left = record.annotations["clip_qual_left"] 1176 if clip_qual_left: 1177 clip_qual_left += 1 1178 clip_qual_right = record.annotations["clip_qual_right"] 1179 clip_adapter_left = record.annotations["clip_adapter_left"] 1180 if clip_adapter_left: 1181 clip_adapter_left += 1 1182 clip_adapter_right = record.annotations["clip_adapter_right"] 1183 except KeyError: 1184 raise ValueError("Missing SFF clipping information") 1185 1186 #Capture information for index 1187 if self._index is not None: 1188 offset = self.handle.tell() 1189 #Check the position of the final record (before sort by name) 1190 #Using a four-digit base 255 number, so the upper bound is 1191 #254*(1)+254*(255)+254*(255**2)+254*(255**3) = 4228250624 1192 #or equivalently it overflows at 255**4 = 4228250625 1193 if offset > 4228250624: 1194 import warnings 1195 warnings.warn("Read %s has file offset %i, which is too large " 1196 "to store in the Roche SFF index structure. No " 1197 "index block will be recorded." 
% (name, offset)) 1198 #No point recording the offsets now 1199 self._index = None 1200 else: 1201 self._index.append((name, self.handle.tell())) 1202 1203 #the read header format (fixed part): 1204 #read_header_length H 1205 #name_length H 1206 #seq_len I 1207 #clip_qual_left H 1208 #clip_qual_right H 1209 #clip_adapter_left H 1210 #clip_adapter_right H 1211 #[rest of read header depends on the name length etc] 1212 #name 1213 #flow values 1214 #flow index 1215 #sequence 1216 #padding 1217 read_header_fmt = '>2HI4H%is' % name_len 1218 if struct.calcsize(read_header_fmt) % 8 == 0: 1219 padding = 0 1220 else: 1221 padding = 8 - (struct.calcsize(read_header_fmt) % 8) 1222 read_header_length = struct.calcsize(read_header_fmt) + padding 1223 assert read_header_length % 8 == 0 1224 data = struct.pack(read_header_fmt, 1225 read_header_length, 1226 name_len, seq_len, 1227 clip_qual_left, clip_qual_right, 1228 clip_adapter_left, clip_adapter_right, 1229 name) + _null * padding 1230 assert len(data) == read_header_length 1231 #now the flowgram values, flowgram index, bases and qualities 1232 #NOTE - assuming flowgram_format==1, which means struct type H 1233 read_flow_fmt = ">%iH" % self._number_of_flows_per_read 1234 read_flow_size = struct.calcsize(read_flow_fmt) 1235 temp_fmt = ">%iB" % seq_len # used for flow index and quals 1236 data += struct.pack(read_flow_fmt, *flow_values) \ 1237 + struct.pack(temp_fmt, *flow_index) \ 1238 + seq \ 1239 + struct.pack(temp_fmt, *quals) 1240 #now any final padding... 1241 padding = (read_flow_size + seq_len * 3) % 8 1242 if padding: 1243 padding = 8 - padding 1244 self.handle.write(data + _null * padding)
1245 1246 1247 if __name__ == "__main__": 1248 print("Running quick self test") 1249 filename = "../../Tests/Roche/E3MFGYR02_random_10_reads.sff" 1250 metadata = ReadRocheXmlManifest(open(filename, "rb")) 1251 index1 = sorted(_sff_read_roche_index(open(filename, "rb"))) 1252 index2 = sorted(_sff_do_slow_index(open(filename, "rb"))) 1253 assert index1 == index2 1254 assert len(index1) == len(list(SffIterator(open(filename, "rb")))) 1255 from Bio._py3k import StringIO 1256 from io import BytesIO 1257 assert len(index1) == len( 1258 list(SffIterator(BytesIO(open(filename, "rb").read())))) 1259 1260 if sys.platform != "win32": 1261 assert len(index1) == len(list(SffIterator(open(filename, "r")))) 1262 index2 = sorted(_sff_read_roche_index(open(filename))) 1263 assert index1 == index2 1264 index2 = sorted(_sff_do_slow_index(open(filename))) 1265 assert index1 == index2 1266 assert len(index1) == len(list(SffIterator(open(filename)))) 1267 assert len(index1) == len( 1268 list(SffIterator(BytesIO(open(filename, "r").read())))) 1269 assert len( 1270 index1) == len(list(SffIterator(BytesIO(open(filename).read())))) 1271 1272 sff = list(SffIterator(open(filename, "rb"))) 1273 1274 sff2 = list(SffIterator( 1275 open("../../Tests/Roche/E3MFGYR02_alt_index_at_end.sff", "rb"))) 1276 assert len(sff) == len(sff2) 1277 for old, new in zip(sff, sff2): 1278 assert old.id == new.id 1279 assert str(old.seq) == str(new.seq) 1280 1281 sff2 = list(SffIterator( 1282 open("../../Tests/Roche/E3MFGYR02_alt_index_at_start.sff", "rb"))) 1283 assert len(sff) == len(sff2) 1284 for old, new in zip(sff, sff2): 1285 assert old.id == new.id 1286 assert str(old.seq) == str(new.seq) 1287 1288 sff2 = list(SffIterator( 1289 open("../../Tests/Roche/E3MFGYR02_alt_index_in_middle.sff", "rb"))) 1290 assert len(sff) == len(sff2) 1291 for old, new in zip(sff, sff2): 1292 assert old.id == new.id 1293 assert str(old.seq) == str(new.seq) 1294 1295 sff2 = list(SffIterator( 1296 
open("../../Tests/Roche/E3MFGYR02_index_at_start.sff", "rb"))) 1297 assert len(sff) == len(sff2) 1298 for old, new in zip(sff, sff2): 1299 assert old.id == new.id 1300 assert str(old.seq) == str(new.seq) 1301 1302 sff2 = list(SffIterator( 1303 open("../../Tests/Roche/E3MFGYR02_index_in_middle.sff", "rb"))) 1304 assert len(sff) == len(sff2) 1305 for old, new in zip(sff, sff2): 1306 assert old.id == new.id 1307 assert str(old.seq) == str(new.seq) 1308 1309 sff_trim = list(SffIterator(open(filename, "rb"), trim=True)) 1310 1311 print(ReadRocheXmlManifest(open(filename, "rb"))) 1312 1313 from Bio import SeqIO 1314 filename = "../../Tests/Roche/E3MFGYR02_random_10_reads_no_trim.fasta" 1315 fasta_no_trim = list(SeqIO.parse(open(filename, "rU"), "fasta")) 1316 filename = "../../Tests/Roche/E3MFGYR02_random_10_reads_no_trim.qual" 1317 qual_no_trim = list(SeqIO.parse(open(filename, "rU"), "qual")) 1318 1319 filename = "../../Tests/Roche/E3MFGYR02_random_10_reads.fasta" 1320 fasta_trim = list(SeqIO.parse(open(filename, "rU"), "fasta")) 1321 filename = "../../Tests/Roche/E3MFGYR02_random_10_reads.qual" 1322 qual_trim = list(SeqIO.parse(open(filename, "rU"), "qual")) 1323 1324 for s, sT, f, q, fT, qT in zip(sff, sff_trim, fasta_no_trim, 1325 qual_no_trim, fasta_trim, qual_trim): 1326 #print("") 1327 print(s.id) 1328 #print(s.seq) 1329 #print(s.letter_annotations["phred_quality"]) 1330 1331 assert s.id == f.id == q.id 1332 assert str(s.seq) == str(f.seq) 1333 assert s.letter_annotations[ 1334 "phred_quality"] == q.letter_annotations["phred_quality"] 1335 1336 assert s.id == sT.id == fT.id == qT.id 1337 assert str(sT.seq) == str(fT.seq) 1338 assert sT.letter_annotations[ 1339 "phred_quality"] == qT.letter_annotations["phred_quality"] 1340 1341 print("Writing with a list of SeqRecords...") 1342 handle = StringIO() 1343 w = SffWriter(handle, xml=metadata) 1344 w.write_file(sff) # list 1345 data = handle.getvalue() 1346 print("And again with an iterator...") 1347 handle = 
StringIO() 1348 w = SffWriter(handle, xml=metadata) 1349 w.write_file(iter(sff)) 1350 assert data == handle.getvalue() 1351 #Check 100% identical to the original: 1352 filename = "../../Tests/Roche/E3MFGYR02_random_10_reads.sff" 1353 original = open(filename, "rb").read() 1354 assert len(data) == len(original) 1355 assert data == original 1356 del data 1357 1358 print("-" * 50) 1359 filename = "../../Tests/Roche/greek.sff" 1360 for record in SffIterator(open(filename, "rb")): 1361 print(record.id) 1362 index1 = sorted(_sff_read_roche_index(open(filename, "rb"))) 1363 index2 = sorted(_sff_do_slow_index(open(filename, "rb"))) 1364 assert index1 == index2 1365 try: 1366 print(ReadRocheXmlManifest(open(filename, "rb"))) 1367 assert False, "Should fail!" 1368 except ValueError: 1369 pass 1370 1371 with open(filename, "rb") as handle: 1372 for record in SffIterator(handle): 1373 pass 1374 try: 1375 for record in SffIterator(handle): 1376 print(record.id) 1377 assert False, "Should have failed" 1378 except ValueError as err: 1379 print("Checking what happens on re-reading a handle:") 1380 print(err) 1381 1382 """ 1383 #Ugly code to make test files... 1384 index = ".diy1.00This is a fake index block (DIY = Do It Yourself), which is allowed under the SFF standard.\0" 1385 padding = len(index)%8 1386 if padding: 1387 padding = 8 - padding 1388 index += chr(0)*padding 1389 assert len(index)%8 == 0 1390 1391 #Ugly bit of code to make a fake index at start 1392 records = list(SffIterator( 1393 open("../../Tests/Roche/E3MFGYR02_random_10_reads.sff", "rb"))) 1394 out_handle = open( 1395 "../../Tests/Roche/E3MFGYR02_alt_index_at_start.sff", "w") 1396 index = ".diy1.00This is a fake index block (DIY = Do It Yourself), which is allowed under the SFF standard.\0" 1397 padding = len(index)%8 1398 if padding: 1399 padding = 8 - padding 1400 index += chr(0)*padding 1401 w = SffWriter(out_handle, index=False, xml=None) 1402 #Fake the header... 
1403 w._number_of_reads = len(records) 1404 w._index_start = 0 1405 w._index_length = 0 1406 w._key_sequence = records[0].annotations["flow_key"] 1407 w._flow_chars = records[0].annotations["flow_chars"] 1408 w._number_of_flows_per_read = len(w._flow_chars) 1409 w.write_header() 1410 w._index_start = out_handle.tell() 1411 w._index_length = len(index) 1412 out_handle.seek(0) 1413 w.write_header() #this time with index info 1414 w.handle.write(index) 1415 for record in records: 1416 w.write_record(record) 1417 out_handle.close() 1418 records2 = list(SffIterator( 1419 open("../../Tests/Roche/E3MFGYR02_alt_index_at_start.sff", "rb"))) 1420 for old, new in zip(records, records2): 1421 assert str(old.seq)==str(new.seq) 1422 i = list(_sff_do_slow_index( 1423 open("../../Tests/Roche/E3MFGYR02_alt_index_at_start.sff", "rb"))) 1424 1425 #Ugly bit of code to make a fake index in middle 1426 records = list(SffIterator( 1427 open("../../Tests/Roche/E3MFGYR02_random_10_reads.sff", "rb"))) 1428 out_handle = open( 1429 "../../Tests/Roche/E3MFGYR02_alt_index_in_middle.sff", "w") 1430 index = ".diy1.00This is a fake index block (DIY = Do It Yourself), which is allowed under the SFF standard.\0" 1431 padding = len(index)%8 1432 if padding: 1433 padding = 8 - padding 1434 index += chr(0)*padding 1435 w = SffWriter(out_handle, index=False, xml=None) 1436 #Fake the header... 
1437 w._number_of_reads = len(records) 1438 w._index_start = 0 1439 w._index_length = 0 1440 w._key_sequence = records[0].annotations["flow_key"] 1441 w._flow_chars = records[0].annotations["flow_chars"] 1442 w._number_of_flows_per_read = len(w._flow_chars) 1443 w.write_header() 1444 for record in records[:5]: 1445 w.write_record(record) 1446 w._index_start = out_handle.tell() 1447 w._index_length = len(index) 1448 w.handle.write(index) 1449 for record in records[5:]: 1450 w.write_record(record) 1451 out_handle.seek(0) 1452 w.write_header() #this time with index info 1453 out_handle.close() 1454 records2 = list(SffIterator( 1455 open("../../Tests/Roche/E3MFGYR02_alt_index_in_middle.sff", "rb"))) 1456 for old, new in zip(records, records2): 1457 assert str(old.seq)==str(new.seq) 1458 j = list(_sff_do_slow_index( 1459 open("../../Tests/Roche/E3MFGYR02_alt_index_in_middle.sff", "rb"))) 1460 1461 #Ugly bit of code to make a fake index at end 1462 records = list(SffIterator( 1463 open("../../Tests/Roche/E3MFGYR02_random_10_reads.sff", "rb"))) 1464 with open("../../Tests/Roche/E3MFGYR02_alt_index_at_end.sff", "w") as out_handle: 1465 w = SffWriter(out_handle, index=False, xml=None) 1466 #Fake the header... 
1467 w._number_of_reads = len(records) 1468 w._index_start = 0 1469 w._index_length = 0 1470 w._key_sequence = records[0].annotations["flow_key"] 1471 w._flow_chars = records[0].annotations["flow_chars"] 1472 w._number_of_flows_per_read = len(w._flow_chars) 1473 w.write_header() 1474 for record in records: 1475 w.write_record(record) 1476 w._index_start = out_handle.tell() 1477 w._index_length = len(index) 1478 out_handle.write(index) 1479 out_handle.seek(0) 1480 w.write_header() #this time with index info 1481 records2 = list(SffIterator( 1482 open("../../Tests/Roche/E3MFGYR02_alt_index_at_end.sff", "rb"))) 1483 for old, new in zip(records, records2): 1484 assert str(old.seq)==str(new.seq) 1485 try: 1486 print(ReadRocheXmlManifest( 1487 open("../../Tests/Roche/E3MFGYR02_alt_index_at_end.sff", "rb"))) 1488 assert False, "Should fail!" 1489 except ValueError: 1490 pass 1491 k = list(_sff_do_slow_index( 1492 open("../../Tests/Roche/E3MFGYR02_alt_index_at_end.sff", "rb"))) 1493 """ 1494 1495 print("Done") 1496