Package Bio :: Package SeqIO :: Module QualityIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.SeqIO.QualityIO

   1  # Copyright 2009-2010 by Peter Cock.  All rights reserved. 
   2  # This code is part of the Biopython distribution and governed by its 
   3  # license.  Please see the LICENSE file that should have been included 
   4  # as part of this package. 
   5  # 
   6  # This module is for reading and writing FASTQ and QUAL format files as 
   7  # SeqRecord objects, and is expected to be used via the Bio.SeqIO API. 
   8   
   9  """Bio.SeqIO support for the FASTQ and QUAL file formats. 
  10   
  11  Note that you are expected to use this code via the Bio.SeqIO interface, as 
  12  shown below. 
  13   
  14  The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute 
  15  to bundle a FASTA sequence and its PHRED quality data (integers between 0 and 
  16  90).  Rather than using a single FASTQ file, often paired FASTA and QUAL files 
  17  are used containing the sequence and the quality information separately. 
  18   
  19  The PHRED software reads DNA sequencing trace files, calls bases, and 
  20  assigns a non-negative quality value to each called base using a logged 
  21  transformation of the error probability, Q = -10 log10( Pe ), for example:: 
  22   
  23      Pe = 1.0,         Q =  0 
  24      Pe = 0.1,         Q = 10 
  25      Pe = 0.01,        Q = 20 
  26      ... 
  27      Pe = 0.00000001,  Q = 80 
  28      Pe = 0.000000001, Q = 90 
  29   
  30  In typical raw sequence reads, the PHRED quality valuea will be from 0 to 40. 
  31  In the QUAL format these quality values are held as space separated text in 
  32  a FASTA like file format.  In the FASTQ format, each quality values is encoded 
  33  with a single ASCI character using chr(Q+33), meaning zero maps to the 
  34  character "!" and for example 80 maps to "q".  For the Sanger FASTQ standard 
  35  the allowed range of PHRED scores is 0 to 93 inclusive. The sequences and 
  36  quality are then stored in pairs in a FASTA like format. 
  37   
  38  Unfortunately there is no official document describing the FASTQ file format, 
  39  and worse, several related but different variants exist. For more details, 
  40  please read this open access publication:: 
  41   
  42      The Sanger FASTQ file format for sequences with quality scores, and the 
  43      Solexa/Illumina FASTQ variants. 
  44      P.J.A.Cock (Biopython), C.J.Fields (BioPerl), N.Goto (BioRuby), 
  45      M.L.Heuer (BioJava) and P.M. Rice (EMBOSS). 
  46      Nucleic Acids Research 2010 38(6):1767-1771 
  47      http://dx.doi.org/10.1093/nar/gkp1137 
  48   
  49  The good news is that Roche 454 sequencers can output files in the QUAL format, 
  50  and sensibly they use PHREP style scores like Sanger.  Converting a pair of 
  51  FASTA and QUAL files into a Sanger style FASTQ file is easy. To extract QUAL 
  52  files from a Roche 454 SFF binary file, use the Roche off instrument command 
  53  line tool "sffinfo" with the -q or -qual argument.  You can extract a matching 
  54  FASTA file using the -s or -seq argument instead. 
  55   
  56  The bad news is that Solexa/Illumina did things differently - they have their 
  57  own scoring system AND their own incompatible versions of the FASTQ format. 
  58  Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can 
  59  be negative.  PHRED scores and Solexa scores are NOT interchangeable (but a 
  60  reasonable mapping can be achieved between them, and they are approximately 
  61  equal for higher quality reads). 
  62   
  63  Confusingly early Solexa pipelines produced a FASTQ like file but using their 
  64  own score mapping and an ASCII offset of 64. To make things worse, for the 
  65  Solexa/Illumina pipeline 1.3 onwards, they introduced a third variant of the 
  66  FASTQ file format, this time using PHRED scores (which is more consistent) but 
  67  with an ASCII offset of 64. 
  68   
  69  i.e. There are at least THREE different and INCOMPATIBLE variants of the FASTQ 
  70  file format: The original Sanger PHRED standard, and two from Solexa/Illumina. 
  71   
  72  The good news is that as of CASAVA version 1.8, Illumina sequencers will 
  73  produce FASTQ files using the standard Sanger encoding. 
  74   
  75  You are expected to use this module via the Bio.SeqIO functions, with the 
  76  following format names: 
  77   
  78   - "qual" means simple quality files using PHRED scores (e.g. from Roche 454) 
  79   - "fastq" means Sanger style FASTQ files using PHRED scores and an ASCII 
  80      offset of 33 (e.g. from the NCBI Short Read Archive and Illumina 1.8+). 
  81      These can potentially hold PHRED scores from 0 to 93. 
  82   - "fastq-sanger" is an alias for "fastq". 
  83   - "fastq-solexa" means old Solexa (and also very early Illumina) style FASTQ 
  84      files, using Solexa scores with an ASCII offset 64. These can hold Solexa 
  85      scores from -5 to 62. 
  86   - "fastq-illumina" means newer Illumina 1.3 to 1.7 style FASTQ files, using 
  87      PHRED scores but with an ASCII offset 64, allowing PHRED scores from 0 
  88      to 62. 
  89   
  90  We could potentially add support for "qual-solexa" meaning QUAL files which 
  91  contain Solexa scores, but thus far there isn't any reason to use such files. 
  92   
  93  For example, consider the following short FASTQ file:: 
  94   
  95      @EAS54_6_R1_2_1_413_324 
  96      CCCTTCTTGTCTTCAGCGTTTCTCC 
  97      + 
  98      ;;3;;;;;;;;;;;;7;;;;;;;88 
  99      @EAS54_6_R1_2_1_540_792 
 100      TTGGCAGGCCAAGGCCGATGGATCA 
 101      + 
 102      ;;;;;;;;;;;7;;;;;-;;;3;83 
 103      @EAS54_6_R1_2_1_443_348 
 104      GTTGCTTCTGGCGTGGGTGGGGGGG 
 105      + 
 106      ;;;;;;;;;;;9;7;;.7;393333 
 107   
 108  This contains three reads of length 25.  From the read length these were 
 109  probably originally from an early Solexa/Illumina sequencer but this file 
 110  follows the Sanger FASTQ convention (PHRED style qualities with an ASCII 
 111  offet of 33).  This means we can parse this file using Bio.SeqIO using 
 112  "fastq" as the format name: 
 113   
 114      >>> from Bio import SeqIO 
 115      >>> for record in SeqIO.parse("Quality/example.fastq", "fastq"): 
 116      ...     print("%s %s" % (record.id, record.seq)) 
 117      EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 
 118      EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 
 119      EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 
 120   
 121  The qualities are held as a list of integers in each record's annotation: 
 122   
 123      >>> print(record) 
 124      ID: EAS54_6_R1_2_1_443_348 
 125      Name: EAS54_6_R1_2_1_443_348 
 126      Description: EAS54_6_R1_2_1_443_348 
 127      Number of features: 0 
 128      Per letter annotation for: phred_quality 
 129      Seq('GTTGCTTCTGGCGTGGGTGGGGGGG', SingleLetterAlphabet()) 
 130      >>> print(record.letter_annotations["phred_quality"]) 
 131      [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 
 132   
 133  You can use the SeqRecord format method to show this in the QUAL format: 
 134   
 135      >>> print(record.format("qual")) 
 136      >EAS54_6_R1_2_1_443_348 
 137      26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 
 138      24 18 18 18 18 
 139      <BLANKLINE> 
 140   
 141  Or go back to the FASTQ format, use "fastq" (or "fastq-sanger"): 
 142   
 143      >>> print(record.format("fastq")) 
 144      @EAS54_6_R1_2_1_443_348 
 145      GTTGCTTCTGGCGTGGGTGGGGGGG 
 146      + 
 147      ;;;;;;;;;;;9;7;;.7;393333 
 148      <BLANKLINE> 
 149   
 150  Or, using the Illumina 1.3+ FASTQ encoding (PHRED values with an ASCII offset 
 151  of 64): 
 152   
 153      >>> print(record.format("fastq-illumina")) 
 154      @EAS54_6_R1_2_1_443_348 
 155      GTTGCTTCTGGCGTGGGTGGGGGGG 
 156      + 
 157      ZZZZZZZZZZZXZVZZMVZRXRRRR 
 158      <BLANKLINE> 
 159   
 160  You can also get Biopython to convert the scores and show a Solexa style 
 161  FASTQ file: 
 162   
 163      >>> print(record.format("fastq-solexa")) 
 164      @EAS54_6_R1_2_1_443_348 
 165      GTTGCTTCTGGCGTGGGTGGGGGGG 
 166      + 
 167      ZZZZZZZZZZZXZVZZMVZRXRRRR 
 168      <BLANKLINE> 
 169   
 170  Notice that this is actually the same output as above using "fastq-illumina" 
 171  as the format! The reason for this is all these scores are high enough that 
 172  the PHRED and Solexa scores are almost equal. The differences become apparent 
 173  for poor quality reads. See the functions solexa_quality_from_phred and 
 174  phred_quality_from_solexa for more details. 
 175   
 176  If you wanted to trim your sequences (perhaps to remove low quality regions, 
 177  or to remove a primer sequence), try slicing the SeqRecord objects.  e.g. 
 178   
 179      >>> sub_rec = record[5:15] 
 180      >>> print(sub_rec) 
 181      ID: EAS54_6_R1_2_1_443_348 
 182      Name: EAS54_6_R1_2_1_443_348 
 183      Description: EAS54_6_R1_2_1_443_348 
 184      Number of features: 0 
 185      Per letter annotation for: phred_quality 
 186      Seq('TTCTGGCGTG', SingleLetterAlphabet()) 
 187      >>> print(sub_rec.letter_annotations["phred_quality"]) 
 188      [26, 26, 26, 26, 26, 26, 24, 26, 22, 26] 
 189      >>> print(sub_rec.format("fastq")) 
 190      @EAS54_6_R1_2_1_443_348 
 191      TTCTGGCGTG 
 192      + 
 193      ;;;;;;9;7; 
 194      <BLANKLINE> 
 195   
 196  If you wanted to, you could read in this FASTQ file, and save it as a QUAL file: 
 197   
 198      >>> from Bio import SeqIO 
 199      >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") 
 200      >>> with open("Quality/temp.qual", "w") as out_handle: 
 201      ...     SeqIO.write(record_iterator, out_handle, "qual") 
 202      3 
 203   
 204  You can of course read in a QUAL file, such as the one we just created: 
 205   
 206      >>> from Bio import SeqIO 
 207      >>> for record in SeqIO.parse("Quality/temp.qual", "qual"): 
 208      ...     print("%s %s" % (record.id, record.seq)) 
 209      EAS54_6_R1_2_1_413_324 ????????????????????????? 
 210      EAS54_6_R1_2_1_540_792 ????????????????????????? 
 211      EAS54_6_R1_2_1_443_348 ????????????????????????? 
 212   
 213  Notice that QUAL files don't have a proper sequence present!  But the quality 
 214  information is there: 
 215   
 216      >>> print(record) 
 217      ID: EAS54_6_R1_2_1_443_348 
 218      Name: EAS54_6_R1_2_1_443_348 
 219      Description: EAS54_6_R1_2_1_443_348 
 220      Number of features: 0 
 221      Per letter annotation for: phred_quality 
 222      UnknownSeq(25, alphabet = SingleLetterAlphabet(), character = '?') 
 223      >>> print(record.letter_annotations["phred_quality"]) 
 224      [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 
 225   
 226  Just to keep things tidy, if you are following this example yourself, you can 
 227  delete this temporary file now: 
 228   
 229      >>> import os 
 230      >>> os.remove("Quality/temp.qual") 
 231   
 232  Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL 
 233  files.  Because the Bio.SeqIO system is designed for reading single files, you 
 234  would have to read the two in separately and then combine the data.  However, 
 235  since this is such a common thing to want to do, there is a helper iterator 
 236  defined in this module that does this for you - PairedFastaQualIterator. 
 237   
 238  Alternatively, if you have enough RAM to hold all the records in memory at once, 
 239  then a simple dictionary approach would work: 
 240   
 241      >>> from Bio import SeqIO 
 242      >>> reads = SeqIO.to_dict(SeqIO.parse(open("Quality/example.fasta"), "fasta")) 
 243      >>> for rec in SeqIO.parse(open("Quality/example.qual"), "qual"): 
 244      ...     reads[rec.id].letter_annotations["phred_quality"]=rec.letter_annotations["phred_quality"] 
 245   
 246  You can then access any record by its key, and get both the sequence and the 
 247  quality scores. 
 248   
 249      >>> print(reads["EAS54_6_R1_2_1_540_792"].format("fastq")) 
 250      @EAS54_6_R1_2_1_540_792 
 251      TTGGCAGGCCAAGGCCGATGGATCA 
 252      + 
 253      ;;;;;;;;;;;7;;;;;-;;;3;83 
 254      <BLANKLINE> 
 255   
 256  It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are 
 257  using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values, 
 258  "fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina" 
 259  for the more recent variant), as this cannot be detected reliably 
 260  automatically. 
 261   
 262  To illustrate this problem, let's consider an artifical example: 
 263   
 264      >>> from Bio.Seq import Seq 
 265      >>> from Bio.Alphabet import generic_dna 
 266      >>> from Bio.SeqRecord import SeqRecord 
 267      >>> test = SeqRecord(Seq("NACGTACGTA", generic_dna), id="Test", 
 268      ... description="Made up!") 
 269      >>> print(test.format("fasta")) 
 270      >Test Made up! 
 271      NACGTACGTA 
 272      <BLANKLINE> 
 273      >>> print(test.format("fastq")) 
 274      Traceback (most recent call last): 
 275       ... 
 276      ValueError: No suitable quality scores found in letter_annotations of SeqRecord (id=Test). 
 277   
 278  We created a sample SeqRecord, and can show it in FASTA format - but for QUAL 
 279  or FASTQ format we need to provide some quality scores. These are held as a 
 280  list of integers (one for each base) in the letter_annotations dictionary: 
 281   
 282      >>> test.letter_annotations["phred_quality"] = [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 283      >>> print(test.format("qual")) 
 284      >Test Made up! 
 285      0 1 2 3 4 5 10 20 30 40 
 286      <BLANKLINE> 
 287      >>> print(test.format("fastq")) 
 288      @Test Made up! 
 289      NACGTACGTA 
 290      + 
 291      !"#$%&+5?I 
 292      <BLANKLINE> 
 293   
 294  We can check this FASTQ encoding - the first PHRED quality was zero, and this 
 295  mapped to a exclamation mark, while the final score was 40 and this mapped to 
 296  the letter "I": 
 297   
 298      >>> ord('!') - 33 
 299      0 
 300      >>> ord('I') - 33 
 301      40 
 302      >>> [ord(letter)-33 for letter in '!"#$%&+5?I'] 
 303      [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 304   
 305  Similarly, we could produce an Illumina 1.3 to 1.7 style FASTQ file using PHRED 
 306  scores with an offset of 64: 
 307   
 308      >>> print(test.format("fastq-illumina")) 
 309      @Test Made up! 
 310      NACGTACGTA 
 311      + 
 312      @ABCDEJT^h 
 313      <BLANKLINE> 
 314   
 315  And we can check this too - the first PHRED score was zero, and this mapped to 
 316  "@", while the final score was 40 and this mapped to "h": 
 317   
 318      >>> ord("@") - 64 
 319      0 
 320      >>> ord("h") - 64 
 321      40 
 322      >>> [ord(letter)-64 for letter in "@ABCDEJT^h"] 
 323      [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 324   
 325  Notice how different the standard Sanger FASTQ and the Illumina 1.3 to 1.7 style 
 326  FASTQ files look for the same data! Then we have the older Solexa/Illumina 
 327  format to consider which encodes Solexa scores instead of PHRED scores. 
 328   
 329  First let's see what Biopython says if we convert the PHRED scores into Solexa 
 330  scores (rounding to one decimal place): 
 331   
 332      >>> for q in [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]: 
 333      ...     print("PHRED %i maps to Solexa %0.1f" % (q, solexa_quality_from_phred(q))) 
 334      PHRED 0 maps to Solexa -5.0 
 335      PHRED 1 maps to Solexa -5.0 
 336      PHRED 2 maps to Solexa -2.3 
 337      PHRED 3 maps to Solexa -0.0 
 338      PHRED 4 maps to Solexa 1.8 
 339      PHRED 5 maps to Solexa 3.3 
 340      PHRED 10 maps to Solexa 9.5 
 341      PHRED 20 maps to Solexa 20.0 
 342      PHRED 30 maps to Solexa 30.0 
 343      PHRED 40 maps to Solexa 40.0 
 344   
 345  Now here is the record using the old Solexa style FASTQ file: 
 346   
 347      >>> print(test.format("fastq-solexa")) 
 348      @Test Made up! 
 349      NACGTACGTA 
 350      + 
 351      ;;>@BCJT^h 
 352      <BLANKLINE> 
 353   
 354  Again, this is using an ASCII offset of 64, so we can check the Solexa scores: 
 355   
 356      >>> [ord(letter)-64 for letter in ";;>@BCJT^h"] 
 357      [-5, -5, -2, 0, 2, 3, 10, 20, 30, 40] 
 358   
 359  This explains why the last few letters of this FASTQ output matched that using 
 360  the Illumina 1.3 to 1.7 format - high quality PHRED scores and Solexa scores 
 361  are approximately equal. 
 362   
 363  """ 
 364  from __future__ import print_function 
 365   
 366  __docformat__ = "epytext en"  # Don't just use plain text in epydoc API pages! 
 367   
 368  from Bio.Alphabet import single_letter_alphabet 
 369  from Bio.Seq import Seq, UnknownSeq 
 370  from Bio.SeqRecord import SeqRecord 
 371  from Bio.SeqIO.Interfaces import SequentialSequenceWriter 
 372  from math import log 
 373  import warnings 
 374  from Bio import BiopythonWarning, BiopythonParserWarning 
 375   
 376   
 377  # define score offsets. See discussion for differences between Sanger and 
 378  # Solexa offsets. 
 379  SANGER_SCORE_OFFSET = 33 
 380  SOLEXA_SCORE_OFFSET = 64 
 381   
 382   
383 -def solexa_quality_from_phred(phred_quality):
384 """Covert a PHRED quality (range 0 to about 90) to a Solexa quality. 385 386 PHRED and Solexa quality scores are both log transformations of a 387 probality of error (high score = low probability of error). This function 388 takes a PHRED score, transforms it back to a probability of error, and 389 then re-expresses it as a Solexa score. This assumes the error estimates 390 are equivalent. 391 392 How does this work exactly? Well the PHRED quality is minus ten times the 393 base ten logarithm of the probability of error:: 394 395 phred_quality = -10*log(error,10) 396 397 Therefore, turning this round:: 398 399 error = 10 ** (- phred_quality / 10) 400 401 Now, Solexa qualities use a different log transformation:: 402 403 solexa_quality = -10*log(error/(1-error),10) 404 405 After substitution and a little manipulation we get:: 406 407 solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10) 408 409 However, real Solexa files use a minimum quality of -5. This does have a 410 good reason - a random base call would be correct 25% of the time, 411 and thus have a probability of error of 0.75, which gives 1.25 as the PHRED 412 quality, or -4.77 as the Solexa quality. Thus (after rounding), a random 413 nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5. 414 415 Taken literally, this logarithic formula would map a PHRED quality of zero 416 to a Solexa quality of minus infinity. Of course, taken literally, a PHRED 417 score of zero means a probability of error of one (i.e. the base call is 418 definitely wrong), which is worse than random! In practice, a PHRED quality 419 of zero usually means a default value, or perhaps random - and therefore 420 mapping it to the minimum Solexa score of -5 is reasonable. 421 422 In conclusion, we follow EMBOSS, and take this logarithmic formula but also 423 apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED 424 quality of zero to -5.0 as well. 425 426 Note this function will return a floating point number, it is up to you to 427 round this to the nearest integer if appropriate. e.g. 428 429 >>> print("%0.2f" % round(solexa_quality_from_phred(80), 2)) 430 80.00 431 >>> print("%0.2f" % round(solexa_quality_from_phred(50), 2)) 432 50.00 433 >>> print("%0.2f" % round(solexa_quality_from_phred(20), 2)) 434 19.96 435 >>> print("%0.2f" % round(solexa_quality_from_phred(10), 2)) 436 9.54 437 >>> print("%0.2f" % round(solexa_quality_from_phred(5), 2)) 438 3.35 439 >>> print("%0.2f" % round(solexa_quality_from_phred(4), 2)) 440 1.80 441 >>> print("%0.2f" % round(solexa_quality_from_phred(3), 2)) 442 -0.02 443 >>> print("%0.2f" % round(solexa_quality_from_phred(2), 2)) 444 -2.33 445 >>> print("%0.2f" % round(solexa_quality_from_phred(1), 2)) 446 -5.00 447 >>> print("%0.2f" % round(solexa_quality_from_phred(0), 2)) 448 -5.00 449 450 Notice that for high quality reads PHRED and Solexa scores are numerically 451 equal. The differences are important for poor quality reads, where PHRED 452 has a minimum of zero but Solexa scores can be negative. 453 454 Finally, as a special case where None is used for a "missing value", None 455 is returned: 456 457 >>> print(solexa_quality_from_phred(None)) 458 None 459 """ 460 if phred_quality is None: 461 #Assume None is used as some kind of NULL or NA value; return None 462 #e.g. Bio.SeqIO gives Ace contig gaps a quality of None. 463 return None 464 elif phred_quality > 0: 465 #Solexa uses a minimum value of -5, which after rounding matches a 466 #random nucleotide base call. 467 return max(-5.0, 10 * log(10 ** (phred_quality / 10.0) - 1, 10)) 468 elif phred_quality == 0: 469 #Special case, map to -5 as discussed in the docstring 470 return -5.0 471 else: 472 raise ValueError("PHRED qualities must be positive (or zero), not %s" 473 % repr(phred_quality))
474 475
476 -def phred_quality_from_solexa(solexa_quality):
477 """Convert a Solexa quality (which can be negative) to a PHRED quality. 478 479 PHRED and Solexa quality scores are both log transformations of a 480 probality of error (high score = low probability of error). This function 481 takes a Solexa score, transforms it back to a probability of error, and 482 then re-expresses it as a PHRED score. This assumes the error estimates 483 are equivalent. 484 485 The underlying formulas are given in the documentation for the sister 486 function solexa_quality_from_phred, in this case the operation is:: 487 488 phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10) 489 490 This will return a floating point number, it is up to you to round this to 491 the nearest integer if appropriate. e.g. 492 493 >>> print("%0.2f" % round(phred_quality_from_solexa(80), 2)) 494 80.00 495 >>> print("%0.2f" % round(phred_quality_from_solexa(20), 2)) 496 20.04 497 >>> print("%0.2f" % round(phred_quality_from_solexa(10), 2)) 498 10.41 499 >>> print("%0.2f" % round(phred_quality_from_solexa(0), 2)) 500 3.01 501 >>> print("%0.2f" % round(phred_quality_from_solexa(-5), 2)) 502 1.19 503 504 Note that a solexa_quality less then -5 is not expected, will trigger a 505 warning, but will still be converted as per the logarithmic mapping 506 (giving a number between 0 and 1.19 back). 507 508 As a special case where None is used for a "missing value", None is 509 returned: 510 511 >>> print(phred_quality_from_solexa(None)) 512 None 513 """ 514 if solexa_quality is None: 515 #Assume None is used as some kind of NULL or NA value; return None 516 return None 517 if solexa_quality < -5: 518 warnings.warn("Solexa quality less than -5 passed, %s" 519 % repr(solexa_quality), BiopythonWarning) 520 return 10 * log(10 ** (solexa_quality / 10.0) + 1, 10)
521 522
523 -def _get_phred_quality(record):
524 """Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE). 525 526 If there are no PHRED qualities, but there are Solexa qualities, those are 527 used instead after conversion. 528 """ 529 try: 530 return record.letter_annotations["phred_quality"] 531 except KeyError: 532 pass 533 try: 534 return [phred_quality_from_solexa(q) for 535 q in record.letter_annotations["solexa_quality"]] 536 except KeyError: 537 raise ValueError("No suitable quality scores found in " 538 "letter_annotations of SeqRecord (id=%s)." 539 % record.id)
540 541 #Only map 0 to 93, we need to give a warning on truncating at 93 542 _phred_to_sanger_quality_str = dict((qp, chr(min(126, qp + SANGER_SCORE_OFFSET))) 543 for qp in range(0, 93 + 1)) 544 #Only map -5 to 93, we need to give a warning on truncating at 93 545 _solexa_to_sanger_quality_str = dict( 546 (qs, chr(min(126, int(round(phred_quality_from_solexa(qs))) + 547 SANGER_SCORE_OFFSET))) 548 for qs in range(-5, 93 + 1)) 549 550
551 -def _get_sanger_quality_str(record):
552 """Returns a Sanger FASTQ encoded quality string (PRIVATE). 553 554 >>> from Bio.Seq import Seq 555 >>> from Bio.SeqRecord import SeqRecord 556 >>> r = SeqRecord(Seq("ACGTAN"), id="Test", 557 ... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, 0]}) 558 >>> _get_sanger_quality_str(r) 559 'SI?5+!' 560 561 If as in the above example (or indeed a SeqRecord parser with Bio.SeqIO), 562 the PHRED qualities are integers, this function is able to use a very fast 563 pre-cached mapping. However, if they are floats which differ slightly, then 564 it has to do the appropriate rounding - which is slower: 565 566 >>> r2 = SeqRecord(Seq("ACGTAN"), id="Test2", 567 ... letter_annotations = {"phred_quality":[50.0, 40.05, 29.99, 20, 9.55, 0.01]}) 568 >>> _get_sanger_quality_str(r2) 569 'SI?5+!' 570 571 If your scores include a None value, this raises an exception: 572 573 >>> r3 = SeqRecord(Seq("ACGTAN"), id="Test3", 574 ... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, None]}) 575 >>> _get_sanger_quality_str(r3) 576 Traceback (most recent call last): 577 ... 578 TypeError: A quality value of None was found 579 580 If (strangely) your record has both PHRED and Solexa scores, then the PHRED 581 scores are used in preference: 582 583 >>> r4 = SeqRecord(Seq("ACGTAN"), id="Test4", 584 ... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, 0], 585 ... "solexa_quality":[-5, -4, 0, None, 0, 40]}) 586 >>> _get_sanger_quality_str(r4) 587 'SI?5+!' 588 589 If there are no PHRED scores, but there are Solexa scores, these are used 590 instead (after the approriate conversion): 591 592 >>> r5 = SeqRecord(Seq("ACGTAN"), id="Test5", 593 ... letter_annotations = {"solexa_quality":[40, 30, 20, 10, 0, -5]}) 594 >>> _get_sanger_quality_str(r5) 595 'I?5+$"' 596 597 Again, integer Solexa scores can be looked up in a pre-cached mapping making 598 this very fast. You can still use approximate floating point scores: 599 600 >>> r6 = SeqRecord(Seq("ACGTAN"), id="Test6", 601 ... letter_annotations = {"solexa_quality":[40.1, 29.7, 20.01, 10, 0.0, -4.9]}) 602 >>> _get_sanger_quality_str(r6) 603 'I?5+$"' 604 605 Notice that due to the limited range of printable ASCII characters, a 606 PHRED quality of 93 is the maximum that can be held in an Illumina FASTQ 607 file (using ASCII 126, the tilde). This function will issue a warning 608 in this situation. 609 """ 610 #TODO - This functions works and is fast, but it is also ugly 611 #and there is considerable repetition of code for the other 612 #two FASTQ variants. 613 try: 614 #These take priority (in case both Solexa and PHRED scores found) 615 qualities = record.letter_annotations["phred_quality"] 616 except KeyError: 617 #Fall back on solexa scores... 618 pass 619 else: 620 #Try and use the precomputed mapping: 621 try: 622 return "".join(_phred_to_sanger_quality_str[qp] 623 for qp in qualities) 624 except KeyError: 625 #Could be a float, or a None in the list, or a high value. 626 pass 627 if None in qualities: 628 raise TypeError("A quality value of None was found") 629 if max(qualities) >= 93.5: 630 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ", 631 BiopythonWarning) 632 #This will apply the truncation at 93, giving max ASCII 126 633 return "".join(chr(min(126, int(round(qp)) + SANGER_SCORE_OFFSET)) 634 for qp in qualities) 635 #Fall back on the Solexa scores... 636 try: 637 qualities = record.letter_annotations["solexa_quality"] 638 except KeyError: 639 raise ValueError("No suitable quality scores found in " 640 "letter_annotations of SeqRecord (id=%s)." 641 % record.id) 642 #Try and use the precomputed mapping: 643 try: 644 return "".join(_solexa_to_sanger_quality_str[qs] 645 for qs in qualities) 646 except KeyError: 647 #Either no PHRED scores, or something odd like a float or None 648 pass 649 if None in qualities: 650 raise TypeError("A quality value of None was found") 651 #Must do this the slow way, first converting the PHRED scores into 652 #Solexa scores: 653 if max(qualities) >= 93.5: 654 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ", 655 BiopythonWarning) 656 #This will apply the truncation at 93, giving max ASCII 126 657 return "".join(chr(min(126, int(round(phred_quality_from_solexa(qs))) + SANGER_SCORE_OFFSET)) 658 for qs in qualities)
659 660 #Only map 0 to 62, we need to give a warning on truncating at 62 661 assert 62 + SOLEXA_SCORE_OFFSET == 126 662 _phred_to_illumina_quality_str = dict((qp, chr(qp + SOLEXA_SCORE_OFFSET)) 663 for qp in range(0, 62 + 1)) 664 #Only map -5 to 62, we need to give a warning on truncating at 62 665 _solexa_to_illumina_quality_str = dict( 666 (qs, chr(int(round(phred_quality_from_solexa(qs))) + SOLEXA_SCORE_OFFSET)) 667 for qs in range(-5, 62 + 1)) 668 669
670 -def _get_illumina_quality_str(record):
671 """Returns an Illumina 1.3 to 1.7 FASTQ encoded quality string (PRIVATE). 672 673 Notice that due to the limited range of printable ASCII characters, a 674 PHRED quality of 62 is the maximum that can be held in an Illumina FASTQ 675 file (using ASCII 126, the tilde). This function will issue a warning 676 in this situation. 677 """ 678 #TODO - This functions works and is fast, but it is also ugly 679 #and there is considerable repetition of code for the other 680 #two FASTQ variants. 681 try: 682 #These take priority (in case both Solexa and PHRED scores found) 683 qualities = record.letter_annotations["phred_quality"] 684 except KeyError: 685 #Fall back on solexa scores... 686 pass 687 else: 688 #Try and use the precomputed mapping: 689 try: 690 return "".join(_phred_to_illumina_quality_str[qp] 691 for qp in qualities) 692 except KeyError: 693 #Could be a float, or a None in the list, or a high value. 694 pass 695 if None in qualities: 696 raise TypeError("A quality value of None was found") 697 if max(qualities) >= 62.5: 698 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ", 699 BiopythonWarning) 700 #This will apply the truncation at 62, giving max ASCII 126 701 return "".join(chr(min(126, int(round(qp)) + SOLEXA_SCORE_OFFSET)) 702 for qp in qualities) 703 #Fall back on the Solexa scores... 704 try: 705 qualities = record.letter_annotations["solexa_quality"] 706 except KeyError: 707 raise ValueError("No suitable quality scores found in " 708 "letter_annotations of SeqRecord (id=%s)." 709 % record.id) 710 #Try and use the precomputed mapping: 711 try: 712 return "".join(_solexa_to_illumina_quality_str[qs] 713 for qs in qualities) 714 except KeyError: 715 #Either no PHRED scores, or something odd like a float or None 716 pass 717 if None in qualities: 718 raise TypeError("A quality value of None was found") 719 #Must do this the slow way, first converting the PHRED scores into 720 #Solexa scores: 721 if max(qualities) >= 62.5: 722 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ", 723 BiopythonWarning) 724 #This will apply the truncation at 62, giving max ASCII 126 725 return "".join(chr(min(126, int(round(phred_quality_from_solexa(qs))) + SOLEXA_SCORE_OFFSET)) 726 for qs in qualities)
727 728 #Only map 0 to 62, we need to give a warning on truncating at 62 729 assert 62 + SOLEXA_SCORE_OFFSET == 126 730 _solexa_to_solexa_quality_str = dict((qs, chr(min(126, qs + SOLEXA_SCORE_OFFSET))) 731 for qs in range(-5, 62 + 1)) 732 #Only map -5 to 62, we need to give a warning on truncating at 62 733 _phred_to_solexa_quality_str = dict( 734 (qp, chr(min(126, int(round(solexa_quality_from_phred(qp))) + 735 SOLEXA_SCORE_OFFSET))) 736 for qp in range(0, 62 + 1)) 737 738
739 -def _get_solexa_quality_str(record):
740 """Returns a Solexa FASTQ encoded quality string (PRIVATE). 741 742 Notice that due to the limited range of printable ASCII characters, a 743 Solexa quality of 62 is the maximum that can be held in a Solexa FASTQ 744 file (using ASCII 126, the tilde). This function will issue a warning 745 in this situation. 746 """ 747 #TODO - This functions works and is fast, but it is also ugly 748 #and there is considerable repetition of code for the other 749 #two FASTQ variants. 750 try: 751 #These take priority (in case both Solexa and PHRED scores found) 752 qualities = record.letter_annotations["solexa_quality"] 753 except KeyError: 754 #Fall back on PHRED scores... 755 pass 756 else: 757 #Try and use the precomputed mapping: 758 try: 759 return "".join(_solexa_to_solexa_quality_str[qs] 760 for qs in qualities) 761 except KeyError: 762 #Could be a float, or a None in the list, or a high value. 763 pass 764 if None in qualities: 765 raise TypeError("A quality value of None was found") 766 if max(qualities) >= 62.5: 767 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ", 768 BiopythonWarning) 769 #This will apply the truncation at 62, giving max ASCII 126 770 return "".join(chr(min(126, int(round(qs)) + SOLEXA_SCORE_OFFSET)) 771 for qs in qualities) 772 #Fall back on the PHRED scores... 773 try: 774 qualities = record.letter_annotations["phred_quality"] 775 except KeyError: 776 raise ValueError("No suitable quality scores found in " 777 "letter_annotations of SeqRecord (id=%s)." 778 % record.id) 779 #Try and use the precomputed mapping: 780 try: 781 return "".join(_phred_to_solexa_quality_str[qp] 782 for qp in qualities) 783 except KeyError: 784 #Either no PHRED scores, or something odd like a float or None 785 #or too big to be in the cache 786 pass 787 if None in qualities: 788 raise TypeError("A quality value of None was found") 789 #Must do this the slow way, first converting the PHRED scores into 790 #Solexa scores: 791 if max(qualities) >= 62.5: 792 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ", 793 BiopythonWarning) 794 return "".join(chr(min(126, int(round(solexa_quality_from_phred(qp))) + SOLEXA_SCORE_OFFSET)) 795 for qp in qualities)
796 797 798 #TODO - Default to nucleotide or even DNA?
799 -def FastqGeneralIterator(handle):
800 """Iterate over Fastq records as string tuples (not as SeqRecord objects). 801 802 This code does not try to interpret the quality string numerically. It 803 just returns tuples of the title, sequence and quality as strings. For 804 the sequence and quality, any whitespace (such as new lines) is removed. 805 806 Our SeqRecord based FASTQ iterators call this function internally, and then 807 turn the strings into a SeqRecord objects, mapping the quality string into 808 a list of numerical scores. If you want to do a custom quality mapping, 809 then you might consider calling this function directly. 810 811 For parsing FASTQ files, the title string from the "@" line at the start 812 of each record can optionally be omitted on the "+" lines. If it is 813 repeated, it must be identical. 814 815 The sequence string and the quality string can optionally be split over 816 multiple lines, although several sources discourage this. In comparison, 817 for the FASTA file format line breaks between 60 and 80 characters are 818 the norm. 819 820 WARNING - Because the "@" character can appear in the quality string, 821 this can cause problems as this is also the marker for the start of 822 a new sequence. In fact, the "+" sign can also appear as well. Some 823 sources recommended having no line breaks in the quality to avoid this, 824 but even that is not enough, consider this example:: 825 826 @071113_EAS56_0053:1:1:998:236 827 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA 828 +071113_EAS56_0053:1:1:998:236 829 IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III 830 @071113_EAS56_0053:1:1:182:712 831 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG 832 + 833 @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ 834 @071113_EAS56_0053:1:1:153:10 835 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT 836 + 837 IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 838 @071113_EAS56_0053:1:3:990:501 839 TGGGAGGTTTTATGTGGA 840 AAGCAGCAATGTACAAGA 841 + 842 IIIIIII.IIIIII1@44 843 @-7.%<&+/$/%4(++(% 844 845 This is four PHRED encoded FASTQ entries originally from an NCBI source 846 (given the read length of 36, these are probably Solexa Illumna reads where 847 the quality has been mapped onto the PHRED values). 848 849 This example has been edited to illustrate some of the nasty things allowed 850 in the FASTQ format. Firstly, on the "+" lines most but not all of the 851 (redundant) identifiers are omitted. In real files it is likely that all or 852 none of these extra identifiers will be present. 853 854 Secondly, while the first three sequences have been shown without line 855 breaks, the last has been split over multiple lines. In real files any line 856 breaks are likely to be consistent. 857 858 Thirdly, some of the quality string lines start with an "@" character. For 859 the second record this is unavoidable. However for the fourth sequence this 860 only happens because its quality string is split over two lines. A naive 861 parser could wrongly treat any line starting with an "@" as the beginning of 862 a new sequence! This code copes with this possible ambiguity by keeping 863 track of the length of the sequence which gives the expected length of the 864 quality string. 865 866 Using this tricky example file as input, this short bit of code demonstrates 867 what this parsing function would return: 868 869 >>> with open("Quality/tricky.fastq", "rU") as handle: 870 ... for (title, sequence, quality) in FastqGeneralIterator(handle): 871 ... print(title) 872 ... print("%s %s" % (sequence, quality)) 873 ... 874 071113_EAS56_0053:1:1:998:236 875 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III 876 071113_EAS56_0053:1:1:182:712 877 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ 878 071113_EAS56_0053:1:1:153:10 879 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 880 071113_EAS56_0053:1:3:990:501 881 TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(% 882 883 Finally we note that some sources state that the quality string should 884 start with "!" (which using the PHRED mapping means the first letter always 885 has a quality score of zero). This rather restrictive rule is not widely 886 observed, so is therefore ignored here. One plus point about this "!" rule 887 is that (provided there are no line breaks in the quality sequence) it 888 would prevent the above problem with the "@" character. 889 """ 890 #We need to call handle.readline() at least four times per record, 891 #so we'll save a property look up each time: 892 handle_readline = handle.readline 893 894 #Skip any text before the first record (e.g. blank lines, comments?) 895 while True: 896 line = handle_readline() 897 if not line: 898 return # Premature end of file, or just empty? 899 if line[0] == "@": 900 break 901 if isinstance(line[0], int): 902 raise ValueError("Is this handle in binary mode not text mode?") 903 904 while line: 905 if line[0] != "@": 906 raise ValueError( 907 "Records in Fastq files should start with '@' character") 908 title_line = line[1:].rstrip() 909 #Will now be at least one line of quality data - in most FASTQ files 910 #just one line! We therefore use string concatenation (if needed) 911 #rather using than the "".join(...) trick just in case it is multiline: 912 seq_string = handle_readline().rstrip() 913 #There may now be more sequence lines, or the "+" quality marker line: 914 while True: 915 line = handle_readline() 916 if not line: 917 raise ValueError("End of file without quality information.") 918 if line[0] == "+": 919 #The title here is optional, but if present must match! 920 second_title = line[1:].rstrip() 921 if second_title and second_title != title_line: 922 raise ValueError("Sequence and quality captions differ.") 923 break 924 seq_string += line.rstrip() # removes trailing newlines 925 #This is going to slow things down a little, but assuming 926 #this isn't allowed we should try and catch it here: 927 if " " in seq_string or "\t" in seq_string: 928 raise ValueError("Whitespace is not allowed in the sequence.") 929 seq_len = len(seq_string) 930 931 #Will now be at least one line of quality data... 932 quality_string = handle_readline().rstrip() 933 #There may now be more quality data, or another sequence, or EOF 934 while True: 935 line = handle_readline() 936 if not line: 937 break # end of file 938 if line[0] == "@": 939 #This COULD be the start of a new sequence. However, it MAY just 940 #be a line of quality data which starts with a "@" character. We 941 #should be able to check this by looking at the sequence length 942 #and the amount of quality data found so far. 943 if len(quality_string) >= seq_len: 944 #We expect it to be equal if this is the start of a new record. 945 #If the quality data is longer, we'll raise an error below. 946 break 947 #Continue - its just some (more) quality data. 948 quality_string += line.rstrip() 949 950 if seq_len != len(quality_string): 951 raise ValueError("Lengths of sequence and quality values differs " 952 " for %s (%i and %i)." 953 % (title_line, seq_len, len(quality_string))) 954 955 #Return the record and then continue... 956 yield (title_line, seq_string, quality_string) 957 raise StopIteration
958 959
960 -def FastqPhredIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
961 """Generator function to iterate over FASTQ records (as SeqRecord objects). 962 963 - handle - input file 964 - alphabet - optional alphabet 965 - title2ids - A function that, when given the title line from the FASTQ 966 file (without the beginning >), will return the id, name and 967 description (in that order) for the record as a tuple of 968 strings. If this is not given, then the entire title line 969 will be used as the description, and the first word as the 970 id and name. 971 972 Note that use of title2ids matches that of Bio.SeqIO.FastaIO. 973 974 For each sequence in a (Sanger style) FASTQ file there is a matching string 975 encoding the PHRED qualities (integers between 0 and about 90) using ASCII 976 values with an offset of 33. 977 978 For example, consider a file containing three short reads:: 979 980 @EAS54_6_R1_2_1_413_324 981 CCCTTCTTGTCTTCAGCGTTTCTCC 982 + 983 ;;3;;;;;;;;;;;;7;;;;;;;88 984 @EAS54_6_R1_2_1_540_792 985 TTGGCAGGCCAAGGCCGATGGATCA 986 + 987 ;;;;;;;;;;;7;;;;;-;;;3;83 988 @EAS54_6_R1_2_1_443_348 989 GTTGCTTCTGGCGTGGGTGGGGGGG 990 + 991 ;;;;;;;;;;;9;7;;.7;393333 992 993 For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching 994 string encoding the PHRED qualities using a ASCI values with an offset of 995 33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88"). 996 997 Using this module directly you might run: 998 999 >>> with open("Quality/example.fastq", "rU") as handle: 1000 ... for record in FastqPhredIterator(handle): 1001 ... print("%s %s" % (record.id, record.seq)) 1002 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1003 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1004 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1005 1006 Typically however, you would call this via Bio.SeqIO instead with "fastq" 1007 (or "fastq-sanger") as the format: 1008 1009 >>> from Bio import SeqIO 1010 >>> with open("Quality/example.fastq", "rU") as handle: 1011 ... for record in SeqIO.parse(handle, "fastq"): 1012 ... print("%s %s" % (record.id, record.seq)) 1013 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1014 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1015 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1016 1017 If you want to look at the qualities, they are record in each record's 1018 per-letter-annotation dictionary as a simple list of integers: 1019 1020 >>> print(record.letter_annotations["phred_quality"]) 1021 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1022 1023 """ 1024 assert SANGER_SCORE_OFFSET == ord("!") 1025 #Originally, I used a list expression for each record: 1026 # 1027 # qualities = [ord(letter)-SANGER_SCORE_OFFSET for letter in quality_string] 1028 # 1029 #Precomputing is faster, perhaps partly by avoiding the subtractions. 1030 q_mapping = dict() 1031 for letter in range(0, 255): 1032 q_mapping[chr(letter)] = letter - SANGER_SCORE_OFFSET 1033 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1034 if title2ids: 1035 id, name, descr = title2ids(title_line) 1036 else: 1037 descr = title_line 1038 id = descr.split()[0] 1039 name = id 1040 record = SeqRecord(Seq(seq_string, alphabet), 1041 id=id, name=name, description=descr) 1042 qualities = [q_mapping[letter] for letter in quality_string] 1043 if qualities and (min(qualities) < 0 or max(qualities) > 93): 1044 raise ValueError("Invalid character in quality string") 1045 #For speed, will now use a dirty trick to speed up assigning the 1046 #qualities. We do this to bypass the length check imposed by the 1047 #per-letter-annotations restricted dict (as this has already been 1048 #checked by FastqGeneralIterator). This is equivalent to: 1049 #record.letter_annotations["phred_quality"] = qualities 1050 dict.__setitem__(record._per_letter_annotations, 1051 "phred_quality", qualities) 1052 yield record
1053 1054
1055 -def FastqSolexaIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
1056 r"""Parsing old Solexa/Illumina FASTQ like files (which differ in the quality mapping). 1057 1058 The optional arguments are the same as those for the FastqPhredIterator. 1059 1060 For each sequence in Solexa/Illumina FASTQ files there is a matching string 1061 encoding the Solexa integer qualities using ASCII values with an offset 1062 of 64. Solexa scores are scaled differently to PHRED scores, and Biopython 1063 will NOT perform any automatic conversion when loading. 1064 1065 NOTE - This file format is used by the OLD versions of the Solexa/Illumina 1066 pipeline. See also the FastqIlluminaIterator function for the NEW version. 1067 1068 For example, consider a file containing these five records:: 1069 1070 @SLXA-B3_649_FC8437_R1_1_1_610_79 1071 GATGTGCAATACCTTTGTAGAGGAA 1072 +SLXA-B3_649_FC8437_R1_1_1_610_79 1073 YYYYYYYYYYYYYYYYYYWYWYYSU 1074 @SLXA-B3_649_FC8437_R1_1_1_397_389 1075 GGTTTGAGAAAGAGAAATGAGATAA 1076 +SLXA-B3_649_FC8437_R1_1_1_397_389 1077 YYYYYYYYYWYYYYWWYYYWYWYWW 1078 @SLXA-B3_649_FC8437_R1_1_1_850_123 1079 GAGGGTGTTGATCATGATGATGGCG 1080 +SLXA-B3_649_FC8437_R1_1_1_850_123 1081 YYYYYYYYYYYYYWYYWYYSYYYSY 1082 @SLXA-B3_649_FC8437_R1_1_1_362_549 1083 GGAAACAAAGTTTTTCTCAACATAG 1084 +SLXA-B3_649_FC8437_R1_1_1_362_549 1085 YYYYYYYYYYYYYYYYYYWWWWYWY 1086 @SLXA-B3_649_FC8437_R1_1_1_183_714 1087 GTATTATTTAATGGCATACACTCAA 1088 +SLXA-B3_649_FC8437_R1_1_1_183_714 1089 YYYYYYYYYYWYYYYWYWWUWWWQQ 1090 1091 Using this module directly you might run: 1092 1093 >>> with open("Quality/solexa_example.fastq", "rU") as handle: 1094 ... for record in FastqSolexaIterator(handle): 1095 ... print("%s %s" % (record.id, record.seq)) 1096 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA 1097 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA 1098 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG 1099 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG 1100 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA 1101 1102 Typically however, you would call this via Bio.SeqIO instead with 1103 "fastq-solexa" as the format: 1104 1105 >>> from Bio import SeqIO 1106 >>> with open("Quality/solexa_example.fastq", "rU") as handle: 1107 ... for record in SeqIO.parse(handle, "fastq-solexa"): 1108 ... print("%s %s" % (record.id, record.seq)) 1109 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA 1110 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA 1111 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG 1112 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG 1113 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA 1114 1115 If you want to look at the qualities, they are recorded in each record's 1116 per-letter-annotation dictionary as a simple list of integers: 1117 1118 >>> print(record.letter_annotations["solexa_quality"]) 1119 [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17] 1120 1121 These scores aren't very good, but they are high enough that they map 1122 almost exactly onto PHRED scores: 1123 1124 >>> print("%0.2f" % phred_quality_from_solexa(25)) 1125 25.01 1126 1127 Let's look at faked example read which is even worse, where there are 1128 more noticeable differences between the Solexa and PHRED scores:: 1129 1130 @slxa_0001_1_0001_01 1131 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1132 +slxa_0001_1_0001_01 1133 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; 1134 1135 Again, you would typically use Bio.SeqIO to read this file in (rather than 1136 calling the Bio.SeqIO.QualtityIO module directly). Most FASTQ files will 1137 contain thousands of reads, so you would normally use Bio.SeqIO.parse() 1138 as shown above. This example has only as one entry, so instead we can 1139 use the Bio.SeqIO.read() function: 1140 1141 >>> from Bio import SeqIO 1142 >>> with open("Quality/solexa_faked.fastq", "rU") as handle: 1143 ... record = SeqIO.read(handle, "fastq-solexa") 1144 >>> print("%s %s" % (record.id, record.seq)) 1145 slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1146 >>> print(record.letter_annotations["solexa_quality"]) 1147 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5] 1148 1149 These quality scores are so low that when converted from the Solexa scheme 1150 into PHRED scores they look quite different: 1151 1152 >>> print("%0.2f" % phred_quality_from_solexa(-1)) 1153 2.54 1154 >>> print("%0.2f" % phred_quality_from_solexa(-5)) 1155 1.19 1156 1157 Note you can use the Bio.SeqIO.write() function or the SeqRecord's format 1158 method to output the record(s): 1159 1160 >>> print(record.format("fastq-solexa")) 1161 @slxa_0001_1_0001_01 1162 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1163 + 1164 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; 1165 <BLANKLINE> 1166 1167 Note this output is slightly different from the input file as Biopython 1168 has left out the optional repetition of the sequence identifier on the "+" 1169 line. If you want the to use PHRED scores, use "fastq" or "qual" as the 1170 output format instead, and Biopython will do the conversion for you: 1171 1172 >>> print(record.format("fastq")) 1173 @slxa_0001_1_0001_01 1174 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1175 + 1176 IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##"" 1177 <BLANKLINE> 1178 1179 >>> print(record.format("qual")) 1180 >slxa_0001_1_0001_01 1181 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 1182 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2 1183 1 1 1184 <BLANKLINE> 1185 1186 As shown above, the poor quality Solexa reads have been mapped to the 1187 equivalent PHRED score (e.g. -5 to 1 as shown earlier). 1188 """ 1189 q_mapping = dict() 1190 for letter in range(0, 255): 1191 q_mapping[chr(letter)] = letter - SOLEXA_SCORE_OFFSET 1192 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1193 if title2ids: 1194 id, name, descr = title_line 1195 else: 1196 descr = title_line 1197 id = descr.split()[0] 1198 name = id 1199 record = SeqRecord(Seq(seq_string, alphabet), 1200 id=id, name=name, description=descr) 1201 qualities = [q_mapping[letter] for letter in quality_string] 1202 #DO NOT convert these into PHRED qualities automatically! 1203 if qualities and (min(qualities) < -5 or max(qualities) > 62): 1204 raise ValueError("Invalid character in quality string") 1205 #Dirty trick to speed up this line: 1206 #record.letter_annotations["solexa_quality"] = qualities 1207 dict.__setitem__(record._per_letter_annotations, 1208 "solexa_quality", qualities) 1209 yield record
1210 1211
1212 -def FastqIlluminaIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
1213 """Parse Illumina 1.3 to 1.7 FASTQ like files (which differ in the quality mapping). 1214 1215 The optional arguments are the same as those for the FastqPhredIterator. 1216 1217 For each sequence in Illumina 1.3+ FASTQ files there is a matching string 1218 encoding PHRED integer qualities using ASCII values with an offset of 64. 1219 1220 >>> from Bio import SeqIO 1221 >>> record = SeqIO.read(open("Quality/illumina_faked.fastq"), "fastq-illumina") 1222 >>> print("%s %s" % (record.id, record.seq)) 1223 Test ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1224 >>> max(record.letter_annotations["phred_quality"]) 1225 40 1226 >>> min(record.letter_annotations["phred_quality"]) 1227 0 1228 1229 NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores 1230 with an ASCII offset of 64. They are approximately equal but only for high 1231 quality reads. If you have an old Solexa/Illumina file with negative 1232 Solexa scores, and try and read this as an Illumina 1.3+ file it will fail: 1233 1234 >>> record2 = SeqIO.read(open("Quality/solexa_faked.fastq"), "fastq-illumina") 1235 Traceback (most recent call last): 1236 ... 1237 ValueError: Invalid character in quality string 1238 1239 NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33. 1240 """ 1241 q_mapping = dict() 1242 for letter in range(0, 255): 1243 q_mapping[chr(letter)] = letter - SOLEXA_SCORE_OFFSET 1244 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1245 if title2ids: 1246 id, name, descr = title2ids(title_line) 1247 else: 1248 descr = title_line 1249 id = descr.split()[0] 1250 name = id 1251 record = SeqRecord(Seq(seq_string, alphabet), 1252 id=id, name=name, description=descr) 1253 qualities = [q_mapping[letter] for letter in quality_string] 1254 if qualities and (min(qualities) < 0 or max(qualities) > 62): 1255 raise ValueError("Invalid character in quality string") 1256 #Dirty trick to speed up this line: 1257 #record.letter_annotations["phred_quality"] = qualities 1258 dict.__setitem__(record._per_letter_annotations, 1259 "phred_quality", qualities) 1260 yield record
1261 1262
1263 -def QualPhredIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
1264 """For QUAL files which include PHRED quality scores, but no sequence. 1265 1266 For example, consider this short QUAL file:: 1267 1268 >EAS54_6_R1_2_1_413_324 1269 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 1270 26 26 26 23 23 1271 >EAS54_6_R1_2_1_540_792 1272 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 1273 26 18 26 23 18 1274 >EAS54_6_R1_2_1_443_348 1275 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 1276 24 18 18 18 18 1277 1278 Using this module directly you might run: 1279 1280 >>> with open("Quality/example.qual", "rU") as handle: 1281 ... for record in QualPhredIterator(handle): 1282 ... print("%s %s" % (record.id, record.seq)) 1283 EAS54_6_R1_2_1_413_324 ????????????????????????? 1284 EAS54_6_R1_2_1_540_792 ????????????????????????? 1285 EAS54_6_R1_2_1_443_348 ????????????????????????? 1286 1287 Typically however, you would call this via Bio.SeqIO instead with "qual" 1288 as the format: 1289 1290 >>> from Bio import SeqIO 1291 >>> with open("Quality/example.qual", "rU") as handle: 1292 ... for record in SeqIO.parse(handle, "qual"): 1293 ... print("%s %s" % (record.id, record.seq)) 1294 EAS54_6_R1_2_1_413_324 ????????????????????????? 1295 EAS54_6_R1_2_1_540_792 ????????????????????????? 1296 EAS54_6_R1_2_1_443_348 ????????????????????????? 1297 1298 Becase QUAL files don't contain the sequence string itself, the seq 1299 property is set to an UnknownSeq object. As no alphabet was given, this 1300 has defaulted to a generic single letter alphabet and the character "?" 1301 used. 1302 1303 By specifying a nucleotide alphabet, "N" is used instead: 1304 1305 >>> from Bio import SeqIO 1306 >>> from Bio.Alphabet import generic_dna 1307 >>> with open("Quality/example.qual", "rU") as handle: 1308 ... for record in SeqIO.parse(handle, "qual", alphabet=generic_dna): 1309 ... print("%s %s" % (record.id, record.seq)) 1310 EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN 1311 EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN 1312 EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN 1313 1314 However, the quality scores themselves are available as a list of integers 1315 in each record's per-letter-annotation: 1316 1317 >>> print(record.letter_annotations["phred_quality"]) 1318 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1319 1320 You can still slice one of these SeqRecord objects with an UnknownSeq: 1321 1322 >>> sub_record = record[5:10] 1323 >>> print("%s %s" % (sub_record.id, sub_record.letter_annotations["phred_quality"])) 1324 EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26] 1325 1326 As of Biopython 1.59, this parser will accept files with negatives quality 1327 scores but will replace them with the lowest possible PHRED score of zero. 1328 This will trigger a warning, previously it raised a ValueError exception. 1329 """ 1330 #Skip any text before the first record (e.g. blank lines, comments) 1331 while True: 1332 line = handle.readline() 1333 if line == "": 1334 return # Premature end of file, or just empty? 1335 if line[0] == ">": 1336 break 1337 1338 while True: 1339 if line[0] != ">": 1340 raise ValueError( 1341 "Records in Fasta files should start with '>' character") 1342 if title2ids: 1343 id, name, descr = title2ids(line[1:].rstrip()) 1344 else: 1345 descr = line[1:].rstrip() 1346 id = descr.split()[0] 1347 name = id 1348 1349 qualities = [] 1350 line = handle.readline() 1351 while True: 1352 if not line: 1353 break 1354 if line[0] == ">": 1355 break 1356 qualities.extend(int(word) for word in line.split()) 1357 line = handle.readline() 1358 1359 if qualities and min(qualities) < 0: 1360 warnings.warn(("Negative quality score %i found, " + 1361 "substituting PHRED zero instead.") 1362 % min(qualities), BiopythonParserWarning) 1363 qualities = [max(0, q) for q in qualities] 1364 1365 #Return the record and then continue... 1366 record = SeqRecord(UnknownSeq(len(qualities), alphabet), 1367 id=id, name=name, description=descr) 1368 #Dirty trick to speed up this line: 1369 #record.letter_annotations["phred_quality"] = qualities 1370 dict.__setitem__(record._per_letter_annotations, 1371 "phred_quality", qualities) 1372 yield record 1373 1374 if not line: 1375 return # StopIteration 1376 assert False, "Should not reach this line"
1377 1378
1379 -class FastqPhredWriter(SequentialSequenceWriter):
1380 """Class to write standard FASTQ format files (using PHRED quality scores). 1381 1382 Although you can use this class directly, you are strongly encouraged 1383 to use the Bio.SeqIO.write() function instead via the format name "fastq" 1384 or the alias "fastq-sanger". For example, this code reads in a standard 1385 Sanger style FASTQ file (using PHRED scores) and re-saves it as another 1386 Sanger style FASTQ file: 1387 1388 >>> from Bio import SeqIO 1389 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq") 1390 >>> with open("Quality/temp.fastq", "w") as out_handle: 1391 ... SeqIO.write(record_iterator, out_handle, "fastq") 1392 3 1393 1394 You might want to do this if the original file included extra line breaks, 1395 which while valid may not be supported by all tools. The output file from 1396 Biopython will have each sequence on a single line, and each quality 1397 string on a single line (which is considered desirable for maximum 1398 compatibility). 1399 1400 In this next example, an old style Solexa/Illumina FASTQ file (using Solexa 1401 quality scores) is converted into a standard Sanger style FASTQ file using 1402 PHRED qualities: 1403 1404 >>> from Bio import SeqIO 1405 >>> record_iterator = SeqIO.parse(open("Quality/solexa_example.fastq"), "fastq-solexa") 1406 >>> with open("Quality/temp.fastq", "w") as out_handle: 1407 ... SeqIO.write(record_iterator, out_handle, "fastq") 1408 5 1409 1410 This code is also called if you use the .format("fastq") method of a 1411 SeqRecord, or .format("fastq-sanger") if you prefer that alias. 1412 1413 Note that Sanger FASTQ files have an upper limit of PHRED quality 93, which is 1414 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a 1415 warning is issued. 1416 1417 P.S. To avoid cluttering up your working directory, you can delete this 1418 temporary file now: 1419 1420 >>> import os 1421 >>> os.remove("Quality/temp.fastq") 1422 """ 1423 assert SANGER_SCORE_OFFSET == ord("!") 1424
1425 - def write_record(self, record):
1426 """Write a single FASTQ record to the file.""" 1427 assert self._header_written 1428 assert not self._footer_written 1429 self._record_written = True 1430 #TODO - Is an empty sequence allowed in FASTQ format? 1431 if record.seq is None: 1432 raise ValueError("No sequence for record %s" % record.id) 1433 seq_str = str(record.seq) 1434 qualities_str = _get_sanger_quality_str(record) 1435 if len(qualities_str) != len(seq_str): 1436 raise ValueError("Record %s has sequence length %i but %i quality scores" 1437 % (record.id, len(seq_str), len(qualities_str))) 1438 1439 #FASTQ files can include a description, just like FASTA files 1440 #(at least, this is what the NCBI Short Read Archive does) 1441 id = self.clean(record.id) 1442 description = self.clean(record.description) 1443 if description and description.split(None, 1)[0] == id: 1444 #The description includes the id at the start 1445 title = description 1446 elif description: 1447 title = "%s %s" % (id, description) 1448 else: 1449 title = id 1450 1451 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1452 1453
1454 -class QualPhredWriter(SequentialSequenceWriter):
1455 """Class to write QUAL format files (using PHRED quality scores). 1456 1457 Although you can use this class directly, you are strongly encouraged 1458 to use the Bio.SeqIO.write() function instead. For example, this code 1459 reads in a FASTQ file and saves the quality scores into a QUAL file: 1460 1461 >>> from Bio import SeqIO 1462 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq") 1463 >>> with open("Quality/temp.qual", "w") as out_handle: 1464 ... SeqIO.write(record_iterator, out_handle, "qual") 1465 3 1466 1467 This code is also called if you use the .format("qual") method of a 1468 SeqRecord. 1469 1470 P.S. Don't forget to clean up the temp file if you don't need it anymore: 1471 1472 >>> import os 1473 >>> os.remove("Quality/temp.qual") 1474 """
1475 - def __init__(self, handle, wrap=60, record2title=None):
1476 """Create a QUAL writer. 1477 1478 Arguments: 1479 - handle - Handle to an output file, e.g. as returned 1480 by open(filename, "w") 1481 - wrap - Optional line length used to wrap sequence lines. 1482 Defaults to wrapping the sequence at 60 characters 1483 Use zero (or None) for no wrapping, giving a single 1484 long line for the sequence. 1485 - record2title - Optional function to return the text to be 1486 used for the title line of each record. By default 1487 a combination of the record.id and record.description 1488 is used. If the record.description starts with the 1489 record.id, then just the record.description is used. 1490 1491 The record2title argument is present for consistency with the 1492 Bio.SeqIO.FastaIO writer class. 1493 """ 1494 SequentialSequenceWriter.__init__(self, handle) 1495 #self.handle = handle 1496 self.wrap = None 1497 if wrap: 1498 if wrap < 1: 1499 raise ValueError 1500 self.wrap = wrap 1501 self.record2title = record2title
1502
1503 - def write_record(self, record):
1504 """Write a single QUAL record to the file.""" 1505 assert self._header_written 1506 assert not self._footer_written 1507 self._record_written = True 1508 1509 handle = self.handle 1510 wrap = self.wrap 1511 1512 if self.record2title: 1513 title = self.clean(self.record2title(record)) 1514 else: 1515 id = self.clean(record.id) 1516 description = self.clean(record.description) 1517 if description and description.split(None, 1)[0] == id: 1518 #The description includes the id at the start 1519 title = description 1520 elif description: 1521 title = "%s %s" % (id, description) 1522 else: 1523 title = id 1524 handle.write(">%s\n" % title) 1525 1526 qualities = _get_phred_quality(record) 1527 try: 1528 #This rounds to the nearest integer. 1529 #TODO - can we record a float in a qual file? 1530 qualities_strs = [("%i" % round(q, 0)) for q in qualities] 1531 except TypeError as e: 1532 if None in qualities: 1533 raise TypeError("A quality value of None was found") 1534 else: 1535 raise e 1536 1537 if wrap > 5: 1538 #Fast wrapping 1539 data = " ".join(qualities_strs) 1540 while True: 1541 if len(data) <= wrap: 1542 self.handle.write(data + "\n") 1543 break 1544 else: 1545 #By construction there must be spaces in the first X chars 1546 #(unless we have X digit or higher quality scores!) 1547 i = data.rfind(" ", 0, wrap) 1548 handle.write(data[:i] + "\n") 1549 data = data[i + 1:] 1550 elif wrap: 1551 #Safe wrapping 1552 while qualities_strs: 1553 line = qualities_strs.pop(0) 1554 while qualities_strs \ 1555 and len(line) + 1 + len(qualities_strs[0]) < wrap: 1556 line += " " + qualities_strs.pop(0) 1557 handle.write(line + "\n") 1558 else: 1559 #No wrapping 1560 data = " ".join(qualities_strs) 1561 handle.write(data + "\n")
1562 1563
1564 -class FastqSolexaWriter(SequentialSequenceWriter):
1565 r"""Write old style Solexa/Illumina FASTQ format files (with Solexa qualities). 1566 1567 This outputs FASTQ files like those from the early Solexa/Illumina 1568 pipeline, using Solexa scores and an ASCII offset of 64. These are 1569 NOT compatible with the standard Sanger style PHRED FASTQ files. 1570 1571 If your records contain a "solexa_quality" entry under letter_annotations, 1572 this is used, otherwise any "phred_quality" entry will be used after 1573 conversion using the solexa_quality_from_phred function. If neither style 1574 of quality scores are present, an exception is raised. 1575 1576 Although you can use this class directly, you are strongly encouraged 1577 to use the Bio.SeqIO.write() function instead. For example, this code 1578 reads in a FASTQ file and re-saves it as another FASTQ file: 1579 1580 >>> from Bio import SeqIO 1581 >>> record_iterator = SeqIO.parse(open("Quality/solexa_example.fastq"), "fastq-solexa") 1582 >>> with open("Quality/temp.fastq", "w") as out_handle: 1583 ... SeqIO.write(record_iterator, out_handle, "fastq-solexa") 1584 5 1585 1586 You might want to do this if the original file included extra line breaks, 1587 which (while valid) may not be supported by all tools. The output file 1588 from Biopython will have each sequence on a single line, and each quality 1589 string on a single line (which is considered desirable for maximum 1590 compatibility). 1591 1592 This code is also called if you use the .format("fastq-solexa") method of 1593 a SeqRecord. For example, 1594 1595 >>> record = SeqIO.read(open("Quality/sanger_faked.fastq"), "fastq-sanger") 1596 >>> print(record.format("fastq-solexa")) 1597 @Test PHRED qualities from 40 to 0 inclusive 1598 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1599 + 1600 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;; 1601 <BLANKLINE> 1602 1603 Note that Solexa FASTQ files have an upper limit of Solexa quality 62, which is 1604 encoded as ASCII 126, the tilde. If your quality scores must be truncated to fit, 1605 a warning is issued. 1606 1607 P.S. Don't forget to delete the temp file if you don't need it anymore: 1608 1609 >>> import os 1610 >>> os.remove("Quality/temp.fastq") 1611 """
1612 - def write_record(self, record):
1613 """Write a single FASTQ record to the file.""" 1614 assert self._header_written 1615 assert not self._footer_written 1616 self._record_written = True 1617 1618 #TODO - Is an empty sequence allowed in FASTQ format? 1619 if record.seq is None: 1620 raise ValueError("No sequence for record %s" % record.id) 1621 seq_str = str(record.seq) 1622 qualities_str = _get_solexa_quality_str(record) 1623 if len(qualities_str) != len(seq_str): 1624 raise ValueError("Record %s has sequence length %i but %i quality scores" 1625 % (record.id, len(seq_str), len(qualities_str))) 1626 1627 #FASTQ files can include a description, just like FASTA files 1628 #(at least, this is what the NCBI Short Read Archive does) 1629 id = self.clean(record.id) 1630 description = self.clean(record.description) 1631 if description and description.split(None, 1)[0] == id: 1632 #The description includes the id at the start 1633 title = description 1634 elif description: 1635 title = "%s %s" % (id, description) 1636 else: 1637 title = id 1638 1639 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1640 1641
1642 -class FastqIlluminaWriter(SequentialSequenceWriter):
1643 r"""Write Illumina 1.3+ FASTQ format files (with PHRED quality scores). 1644 1645 This outputs FASTQ files like those from the Solexa/Illumina 1.3+ pipeline, 1646 using PHRED scores and an ASCII offset of 64. Note these files are NOT 1647 compatible with the standard Sanger style PHRED FASTQ files which use an 1648 ASCII offset of 32. 1649 1650 Although you can use this class directly, you are strongly encouraged to 1651 use the Bio.SeqIO.write() function with format name "fastq-illumina" 1652 instead. This code is also called if you use the .format("fastq-illumina") 1653 method of a SeqRecord. For example, 1654 1655 >>> from Bio import SeqIO 1656 >>> record = SeqIO.read(open("Quality/sanger_faked.fastq"), "fastq-sanger") 1657 >>> print(record.format("fastq-illumina")) 1658 @Test PHRED qualities from 40 to 0 inclusive 1659 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1660 + 1661 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@ 1662 <BLANKLINE> 1663 1664 Note that Illumina FASTQ files have an upper limit of PHRED quality 62, which is 1665 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a 1666 warning is issued. 1667 """
1668 - def write_record(self, record):
1669 """Write a single FASTQ record to the file.""" 1670 assert self._header_written 1671 assert not self._footer_written 1672 self._record_written = True 1673 1674 #TODO - Is an empty sequence allowed in FASTQ format? 1675 if record.seq is None: 1676 raise ValueError("No sequence for record %s" % record.id) 1677 seq_str = str(record.seq) 1678 qualities_str = _get_illumina_quality_str(record) 1679 if len(qualities_str) != len(seq_str): 1680 raise ValueError("Record %s has sequence length %i but %i quality scores" 1681 % (record.id, len(seq_str), len(qualities_str))) 1682 1683 #FASTQ files can include a description, just like FASTA files 1684 #(at least, this is what the NCBI Short Read Archive does) 1685 id = self.clean(record.id) 1686 description = self.clean(record.description) 1687 if description and description.split(None, 1)[0] == id: 1688 #The description includes the id at the start 1689 title = description 1690 elif description: 1691 title = "%s %s" % (id, description) 1692 else: 1693 title = id 1694 1695 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1696 1697
1698 -def PairedFastaQualIterator(fasta_handle, qual_handle, alphabet=single_letter_alphabet, title2ids=None):
1699 """Iterate over matched FASTA and QUAL files as SeqRecord objects. 1700 1701 For example, consider this short QUAL file with PHRED quality scores:: 1702 1703 >EAS54_6_R1_2_1_413_324 1704 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 1705 26 26 26 23 23 1706 >EAS54_6_R1_2_1_540_792 1707 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 1708 26 18 26 23 18 1709 >EAS54_6_R1_2_1_443_348 1710 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 1711 24 18 18 18 18 1712 1713 And a matching FASTA file:: 1714 1715 >EAS54_6_R1_2_1_413_324 1716 CCCTTCTTGTCTTCAGCGTTTCTCC 1717 >EAS54_6_R1_2_1_540_792 1718 TTGGCAGGCCAAGGCCGATGGATCA 1719 >EAS54_6_R1_2_1_443_348 1720 GTTGCTTCTGGCGTGGGTGGGGGGG 1721 1722 You can parse these separately using Bio.SeqIO with the "qual" and 1723 "fasta" formats, but then you'll get a group of SeqRecord objects with 1724 no sequence, and a matching group with the sequence but not the 1725 qualities. Because it only deals with one input file handle, Bio.SeqIO 1726 can't be used to read the two files together - but this function can! 1727 For example, 1728 1729 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"), 1730 ... open("Quality/example.qual", "rU")) 1731 >>> for record in rec_iter: 1732 ... print("%s %s" % (record.id, record.seq)) 1733 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1734 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1735 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1736 1737 As with the FASTQ or QUAL parsers, if you want to look at the qualities, 1738 they are in each record's per-letter-annotation dictionary as a simple 1739 list of integers: 1740 1741 >>> print(record.letter_annotations["phred_quality"]) 1742 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1743 1744 If you have access to data as a FASTQ format file, using that directly 1745 would be simpler and more straight forward. Note that you can easily use 1746 this function to convert paired FASTA and QUAL files into FASTQ files: 1747 1748 >>> from Bio import SeqIO 1749 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"), 1750 ... open("Quality/example.qual", "rU")) 1751 >>> with open("Quality/temp.fastq", "w") as out_handle: 1752 ... SeqIO.write(rec_iter, out_handle, "fastq") 1753 3 1754 1755 And don't forget to clean up the temp file if you don't need it anymore: 1756 1757 >>> import os 1758 >>> os.remove("Quality/temp.fastq") 1759 """ 1760 from Bio.SeqIO.FastaIO import FastaIterator 1761 fasta_iter = FastaIterator(fasta_handle, alphabet=alphabet, 1762 title2ids=title2ids) 1763 qual_iter = QualPhredIterator(qual_handle, alphabet=alphabet, 1764 title2ids=title2ids) 1765 1766 #Using (Python 3 style) zip wouldn't load everything into memory, 1767 #but also would not catch any extra records found in only one file. 1768 while True: 1769 try: 1770 f_rec = next(fasta_iter) 1771 except StopIteration: 1772 f_rec = None 1773 try: 1774 q_rec = next(qual_iter) 1775 except StopIteration: 1776 q_rec = None 1777 if f_rec is None and q_rec is None: 1778 #End of both files 1779 break 1780 if f_rec is None: 1781 raise ValueError("FASTA file has more entries than the QUAL file.") 1782 if q_rec is None: 1783 raise ValueError("QUAL file has more entries than the FASTA file.") 1784 if f_rec.id != q_rec.id: 1785 raise ValueError("FASTA and QUAL entries do not match (%s vs %s)." 1786 % (f_rec.id, q_rec.id)) 1787 if len(f_rec) != len(q_rec.letter_annotations["phred_quality"]): 1788 raise ValueError("Sequence length and number of quality scores disagree for %s" 1789 % f_rec.id) 1790 #Merge the data.... 1791 f_rec.letter_annotations[ 1792 "phred_quality"] = q_rec.letter_annotations["phred_quality"] 1793 yield f_rec
1794 #Done 1795 1796 1797 if __name__ == "__main__": 1798 from Bio._utils import run_doctest 1799 run_doctest(verbose=0) 1800