Package Bio :: Package SeqIO
[hide private]
[frames] | no frames]

Source Code for Package Bio.SeqIO

  1  # Copyright 2006-2010 by Peter Cock.  All rights reserved. 
  2  # This code is part of the Biopython distribution and governed by its 
  3  # license.  Please see the LICENSE file that should have been included 
  4  # as part of this package. 
  5  # 
  6  #Nice link: 
  7  # http://www.ebi.ac.uk/help/formats_frame.html 
  8   
  9  """Sequence input/output as SeqRecord objects. 
 10   
 11  Bio.SeqIO is also documented at U{http://biopython.org/wiki/SeqIO} and by 
 12  a whole chapter in our tutorial: 
 13   - U{http://biopython.org/DIST/docs/tutorial/Tutorial.html} 
 14   - U{http://biopython.org/DIST/docs/tutorial/Tutorial.pdf} 
 15   
 16  Input 
 17  ===== 
 18  The main function is Bio.SeqIO.parse(...) which takes an input file handle 
 19  (or in recent versions of Biopython alternatively a filename as a string), 
 20  and format string.  This returns an iterator giving SeqRecord objects: 
 21   
 22      >>> from Bio import SeqIO 
 23      >>> for record in SeqIO.parse("Fasta/f002", "fasta"): 
 24      ...     print record.id, len(record) 
 25      gi|1348912|gb|G26680|G26680 633 
 26      gi|1348917|gb|G26685|G26685 413 
 27      gi|1592936|gb|G29385|G29385 471 
 28   
 29  Note that the parse() function will invoke the relevant parser for the 
 30  format with its default settings.  You may want more control, in which case 
 31  you need to create a format specific sequence iterator directly. 
 32   
 33  Input - Single Records 
 34  ====================== 
 35  If you expect your file to contain one-and-only-one record, then we provide 
 36  the following 'helper' function which will return a single SeqRecord, or 
 37  raise an exception if there are no records or more than one record: 
 38   
 39      >>> from Bio import SeqIO 
 40      >>> record = SeqIO.read("Fasta/f001", "fasta") 
 41      >>> print record.id, len(record) 
 42      gi|3318709|pdb|1A91| 79 
 43   
 44  This style is useful when you expect a single record only (and would 
 45  consider multiple records an error).  For example, when dealing with GenBank 
 46  files for bacterial genomes or chromosomes, there is normally only a single 
 47  record.  Alternatively, use this with a handle when downloading a single 
 48  record from the internet. 
 49   
 50  However, if you just want the first record from a file containing multiple 
 51  record, use the iterator's next() method: 
 52   
 53      >>> from Bio import SeqIO 
 54      >>> record = SeqIO.parse("Fasta/f002", "fasta").next() 
 55      >>> print record.id, len(record) 
 56      gi|1348912|gb|G26680|G26680 633 
 57   
 58  The above code will work as long as the file contains at least one record. 
 59  Note that if there is more than one record, the remaining records will be 
 60  silently ignored. 
 61   
 62   
 63  Input - Multiple Records 
 64  ======================== 
 65  For non-interlaced files (e.g. Fasta, GenBank, EMBL) with multiple records 
 66  using a sequence iterator can save you a lot of memory (RAM).  There is 
 67  less benefit for interlaced file formats (e.g. most multiple alignment file 
 68  formats).  However, an iterator only lets you access the records one by one. 
 69   
 70  If you want random access to the records by number, turn this into a list: 
 71   
 72      >>> from Bio import SeqIO 
 73      >>> records = list(SeqIO.parse("Fasta/f002", "fasta")) 
 74      >>> len(records) 
 75      3 
 76      >>> print records[1].id 
 77      gi|1348917|gb|G26685|G26685 
 78   
 79  If you want random access to the records by a key such as the record id, 
 80  turn the iterator into a dictionary: 
 81   
 82      >>> from Bio import SeqIO 
 83      >>> record_dict = SeqIO.to_dict(SeqIO.parse("Fasta/f002", "fasta")) 
 84      >>> len(record_dict) 
 85      3 
 86      >>> print len(record_dict["gi|1348917|gb|G26685|G26685"]) 
 87      413 
 88   
 89  However, using list() or the to_dict() function will load all the records 
 90  into memory at once, and therefore is not possible on very large files. 
 91  Instead, for *some* file formats Bio.SeqIO provides an indexing approach 
 92  providing dictionary like access to any record. For example, 
 93   
 94      >>> from Bio import SeqIO 
 95      >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
 96      >>> len(record_dict) 
 97      3 
 98      >>> print len(record_dict["gi|1348917|gb|G26685|G26685"]) 
 99      413 
100   
101  Many but not all of the supported input file formats can be indexed like 
102  this. For example "fasta", "fastq", "qual" and even the binary format "sff" 
103  work, but alignment formats like "phylip", "clustalw" and "nexus" will not. 
104   
105  In most cases you can also use SeqIO.index to get the record from the file 
106  as a raw string (not a SeqRecord). This can be useful for example to extract 
107  a sub-set of records from a file where SeqIO cannot output the file format 
108  (e.g. the plain text SwissProt format, "swiss") or where it is important to 
109  keep the output 100% identical to the input). For example, 
110   
111      >>> from Bio import SeqIO 
112      >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
113      >>> len(record_dict) 
114      3 
115      >>> print record_dict.get_raw("gi|1348917|gb|G26685|G26685").decode() 
116      >gi|1348917|gb|G26685|G26685 human STS STS_D11734. 
117      CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATCTTACTCTTTC 
118      ATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTTTACAGATGTGAAACTTTCAA 
119      GAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAAACCTGATGCTTTTATAAGCCATTGTGATTA 
120      GGATGACTGTTACAGGCTTAGCTTTGTGTGAAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGT 
121      TCATATTACTNTAAGTTCTATAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGG 
122      AATTGNTTTGCCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA 
123      <BLANKLINE> 
124      >>> print record_dict["gi|1348917|gb|G26685|G26685"].format("fasta") 
125      >gi|1348917|gb|G26685|G26685 human STS STS_D11734. 
126      CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATC 
127      TTACTCTTTCATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTT 
128      TACAGATGTGAAACTTTCAAGAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAA 
129      ACCTGATGCTTTTATAAGCCATTGTGATTAGGATGACTGTTACAGGCTTAGCTTTGTGTG 
130      AAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGTTCATATTACTNTAAGTTCTA 
131      TAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGGAATTGNTTTG 
132      CCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA 
133      <BLANKLINE> 
134   
135  Here the original file and what Biopython would output differ in the line 
136  wrapping. Also note that under Python 3, the get_raw method will return a 
137  bytes string, hence the use of decode to turn it into a (unicode) string. 
138  This is uncessary on Python 2. 
139   
140   
141  Input - Alignments 
142  ================== 
143  You can read in alignment files as alignment objects using Bio.AlignIO. 
144  Alternatively, reading in an alignment file format via Bio.SeqIO will give 
145  you a SeqRecord for each row of each alignment: 
146   
147      >>> from Bio import SeqIO 
148      >>> for record in SeqIO.parse("Clustalw/hedgehog.aln", "clustal"): 
149      ...     print record.id, len(record) 
150      gi|167877390|gb|EDS40773.1| 447 
151      gi|167234445|ref|NP_001107837. 447 
152      gi|74100009|gb|AAZ99217.1| 447 
153      gi|13990994|dbj|BAA33523.2| 447 
154      gi|56122354|gb|AAV74328.1| 447 
155   
156  Output 
157  ====== 
158  Use the function Bio.SeqIO.write(...), which takes a complete set of 
159  SeqRecord objects (either as a list, or an iterator), an output file handle 
160  (or in recent versions of Biopython an output filename as a string) and of 
161  course the file format:: 
162   
163      from Bio import SeqIO 
164      records = ... 
165      SeqIO.write(records, "example.faa", "fasta") 
166   
167  Or, using a handle:: 
168   
169      from Bio import SeqIO 
170      records = ... 
171      handle = open("example.faa", "w") 
172      SeqIO.write(records, handle, "fasta") 
173      handle.close() 
174   
175  You are expected to call this function once (with all your records) and if 
176  using a handle, make sure you close it to flush the data to the hard disk. 
177   
178   
179  Output - Advanced 
180  ================= 
181  The effect of calling write() multiple times on a single file will vary 
182  depending on the file format, and is best avoided unless you have a strong 
183  reason to do so. 
184   
185  If you give a filename, then each time you call write() the existing file 
186  will be overwriten. For sequential files formats (e.g. fasta, genbank) each 
187  "record block" holds a single sequence.  For these files it would probably 
188  be safe to call write() multiple times by re-using the same handle. 
189   
190   
191  However, trying this for certain alignment formats (e.g. phylip, clustal, 
192  stockholm) would have the effect of concatenating several multiple sequence 
193  alignments together.  Such files are created by the PHYLIP suite of programs 
194  for bootstrap analysis, but it is clearer to do this via Bio.AlignIO instead. 
195   
196   
197  Conversion 
198  ========== 
199  The Bio.SeqIO.convert(...) function allows an easy interface for simple 
200  file format conversions. Additionally, it may use file format specific 
201  optimisations so this should be the fastest way too. 
202   
203  In general however, you can combine the Bio.SeqIO.parse(...) function with 
204  the Bio.SeqIO.write(...) function for sequence file conversion. Using 
205  generator expressions or generator functions provides a memory efficient way 
206  to perform filtering or other extra operations as part of the process. 
207   
208   
209  File Formats 
210  ============ 
211  When specifying the file format, use lowercase strings.  The same format 
212  names are also used in Bio.AlignIO and include the following: 
213   
214   - abif    - Applied Biosystem's sequencing trace format 
215   - ace     - Reads the contig sequences from an ACE assembly file. 
216   - embl    - The EMBL flat file format. Uses Bio.GenBank internally. 
217   - fasta   - The generic sequence file format where each record starts with 
218               an identifer line starting with a ">" character, followed by 
219               lines of sequence. 
220   - fastq   - A "FASTA like" format used by Sanger which also stores PHRED 
221               sequence quality values (with an ASCII offset of 33). 
222   - fastq-sanger - An alias for "fastq" for consistency with BioPerl and EMBOSS 
223   - fastq-solexa - Original Solexa/Illumnia variant of the FASTQ format which 
224               encodes Solexa quality scores (not PHRED quality scores) with an 
225               ASCII offset of 64. 
226   - fastq-illumina - Solexa/Illumina 1.3 to 1.7 variant of the FASTQ format 
227               which encodes PHRED quality scores with an ASCII offset of 64 
228               (not 33). Note as of version 1.8 of the CASAVA pipeline Illumina 
229               will produce FASTQ files using the standard Sanger encoding. 
230   - genbank - The GenBank or GenPept flat file format. 
231   - gb      - An alias for "genbank", for consistency with NCBI Entrez Utilities 
232   - ig      - The IntelliGenetics file format, apparently the same as the 
233               MASE alignment format. 
234   - imgt    - An EMBL like format from IMGT where the feature tables are more 
235               indented to allow for longer feature types. 
236   - phd     - Output from PHRED, used by PHRAP and CONSED for input. 
237   - pir     - A "FASTA like" format introduced by the National Biomedical 
238               Research Foundation (NBRF) for the Protein Information Resource 
239               (PIR) database, now part of UniProt. 
240   - seqxml  - SeqXML, simple XML format described in Schmitt et al (2011). 
241   - sff     - Standard Flowgram Format (SFF), typical output from Roche 454. 
242   - sff-trim - Standard Flowgram Format (SFF) with given trimming applied. 
243   - swiss   - Plain text Swiss-Prot aka UniProt format. 
244   - tab     - Simple two column tab separated sequence files, where each 
245               line holds a record's identifier and sequence. For example, 
246               this is used as by Aligent's eArray software when saving 
247               microarray probes in a minimal tab delimited text file. 
248   - qual    - A "FASTA like" format holding PHRED quality values from 
249               sequencing DNA, but no actual sequences (usually provided 
250               in separate FASTA files). 
251   - uniprot-xml - The UniProt XML format (replacement for the SwissProt plain 
252               text format which we call "swiss") 
253   
254  Note that while Bio.SeqIO can read all the above file formats, it cannot 
255  write to all of them. 
256   
257  You can also use any file format supported by Bio.AlignIO, such as "nexus", 
258  "phlip" and "stockholm", which gives you access to the individual sequences 
259  making up each alignment as SeqRecords. 
260  """ 
261   
262  # For using with statement in Python 2.5 or Jython 
263  from __future__ import with_statement 
264   
265  __docformat__ = "epytext en"  # not just plaintext 
266   
267  #TODO 
268  # - define policy on reading aligned sequences with gaps in 
269  #   (e.g. - and . characters) including how the alphabet interacts 
270  # 
271  # - How best to handle unique/non unique record.id when writing. 
272  #   For most file formats reading such files is fine; The stockholm 
273  #   parser would fail. 
274  # 
275  # - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf) 
276  #   http://www.bioperl.org/wiki/MSF_multiple_alignment_format 
277   
278  """ 
279  FAO BioPython Developers 
280  ======================== 
281  The way I envision this SeqIO system working as that for any sequence file 
282  format we have an iterator that returns SeqRecord objects. 
283   
284  This also applies to interlaced fileformats (like clustal - although that 
285  is now handled via Bio.AlignIO instead) where the file cannot be read record 
286  by record.  You should still return an iterator, even if the implementation 
287  could just as easily return a list. 
288   
289  These file format specific sequence iterators may be implemented as: 
290  * Classes which take a handle for __init__ and provide the __iter__ method 
291  * Functions that take a handle, and return an iterator object 
292  * Generator functions that take a handle, and yield SeqRecord objects 
293   
294  It is then trivial to turn this iterator into a list of SeqRecord objects, 
295  an in memory dictionary, or a multiple sequence alignment object. 
296   
297  For building the dictionary by default the id propery of each SeqRecord is 
298  used as the key.  You should always populate the id property, and it should 
299  be unique in most cases. For some file formats the accession number is a good 
300  choice.  If the file itself contains ambiguous identifiers, don't try and 
301  dis-ambiguate them - return them as is. 
302   
303  When adding a new file format, please use the same lower case format name 
304  as BioPerl, or if they have not defined one, try the names used by EMBOSS. 
305   
306  See also http://biopython.org/wiki/SeqIO_dev 
307   
308  --Peter 
309  """ 
310   
311   
312  from Bio.File import as_handle 
313  from Bio.SeqRecord import SeqRecord 
314  from Bio.Align import MultipleSeqAlignment 
315  from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet 
316   
317  import AbiIO 
318  import AceIO 
319  import FastaIO 
320  import IgIO  # IntelliGenetics or MASE format 
321  import InsdcIO  # EMBL and GenBank 
322  import PdbIO 
323  import PhdIO 
324  import PirIO 
325  import SeqXmlIO 
326  import SffIO 
327  import SwissIO 
328  import TabIO 
329  import QualityIO  # FastQ and qual files 
330  import UniprotIO 
331   
332   
333  #Convention for format names is "mainname-subtype" in lower case. 
334  #Please use the same names as BioPerl or EMBOSS where possible. 
335  # 
336  #Note that this simple system copes with defining 
337  #multiple possible iterators for a given format/extension 
338  #with the -subtype suffix 
339  # 
340  #Most alignment file formats will be handled via Bio.AlignIO 
341   
342  _FormatToIterator = {"fasta": FastaIO.FastaIterator, 
343                       "gb": InsdcIO.GenBankIterator, 
344                       "genbank": InsdcIO.GenBankIterator, 
345                       "genbank-cds": InsdcIO.GenBankCdsFeatureIterator, 
346                       "embl": InsdcIO.EmblIterator, 
347                       "embl-cds": InsdcIO.EmblCdsFeatureIterator, 
348                       "imgt": InsdcIO.ImgtIterator, 
349                       "ig": IgIO.IgIterator, 
350                       "swiss": SwissIO.SwissIterator, 
351                       "pdb-atom": PdbIO.PdbAtomIterator, 
352                       "pdb-seqres": PdbIO.PdbSeqresIterator, 
353                       "phd": PhdIO.PhdIterator, 
354                       "ace": AceIO.AceIterator, 
355                       "tab": TabIO.TabIterator, 
356                       "pir": PirIO.PirIterator, 
357                       "fastq": QualityIO.FastqPhredIterator, 
358                       "fastq-sanger": QualityIO.FastqPhredIterator, 
359                       "fastq-solexa": QualityIO.FastqSolexaIterator, 
360                       "fastq-illumina": QualityIO.FastqIlluminaIterator, 
361                       "qual": QualityIO.QualPhredIterator, 
362                       "sff": SffIO.SffIterator, 
363                       #Not sure about this in the long run: 
364                       "sff-trim": SffIO._SffTrimIterator, 
365                       "uniprot-xml": UniprotIO.UniprotIterator, 
366                       "seqxml": SeqXmlIO.SeqXmlIterator, 
367                       "abi": AbiIO.AbiIterator, 
368                       "abi-trim": AbiIO._AbiTrimIterator, 
369                       } 
370   
371  _FormatToWriter = {"fasta": FastaIO.FastaWriter, 
372                     "gb": InsdcIO.GenBankWriter, 
373                     "genbank": InsdcIO.GenBankWriter, 
374                     "embl": InsdcIO.EmblWriter, 
375                     "imgt": InsdcIO.ImgtWriter, 
376                     "tab": TabIO.TabWriter, 
377                     "fastq": QualityIO.FastqPhredWriter, 
378                     "fastq-sanger": QualityIO.FastqPhredWriter, 
379                     "fastq-solexa": QualityIO.FastqSolexaWriter, 
380                     "fastq-illumina": QualityIO.FastqIlluminaWriter, 
381                     "phd": PhdIO.PhdWriter, 
382                     "qual": QualityIO.QualPhredWriter, 
383                     "sff": SffIO.SffWriter, 
384                     "seqxml": SeqXmlIO.SeqXmlWriter, 
385                     } 
386   
387  _BinaryFormats = ["sff", "sff-trim", "abi", "abi-trim"] 
388   
389   
390 -def write(sequences, handle, format):
391 """Write complete set of sequences to a file. 392 393 - sequences - A list (or iterator) of SeqRecord objects, or (if using 394 Biopython 1.54 or later) a single SeqRecord. 395 - handle - File handle object to write to, or filename as string 396 (note older versions of Biopython only took a handle). 397 - format - lower case string describing the file format to write. 398 399 You should close the handle after calling this function. 400 401 Returns the number of records written (as an integer). 402 """ 403 from Bio import AlignIO 404 405 #Try and give helpful error messages: 406 if not isinstance(format, basestring): 407 raise TypeError("Need a string for the file format (lower case)") 408 if not format: 409 raise ValueError("Format required (lower case string)") 410 if format != format.lower(): 411 raise ValueError("Format string '%s' should be lower case" % format) 412 413 if isinstance(sequences, SeqRecord): 414 #This raised an exception in order version of Biopython 415 sequences = [sequences] 416 417 if format in _BinaryFormats: 418 mode = 'wb' 419 else: 420 mode = 'w' 421 422 with as_handle(handle, mode) as fp: 423 #Map the file format to a writer class 424 if format in _FormatToWriter: 425 writer_class = _FormatToWriter[format] 426 count = writer_class(fp).write_file(sequences) 427 elif format in AlignIO._FormatToWriter: 428 #Try and turn all the records into a single alignment, 429 #and write that using Bio.AlignIO 430 alignment = MultipleSeqAlignment(sequences) 431 alignment_count = AlignIO.write([alignment], fp, format) 432 assert alignment_count == 1, \ 433 "Internal error - the underlying writer " \ 434 " should have returned 1, not %s" % repr(alignment_count) 435 count = len(alignment) 436 del alignment_count, alignment 437 elif format in _FormatToIterator or format in AlignIO._FormatToIterator: 438 raise ValueError("Reading format '%s' is supported, but not writing" 439 % format) 440 else: 441 raise ValueError("Unknown format '%s'" % format) 442 443 assert isinstance(count, int), "Internal error - the underlying %s " \ 444 "writer should have returned the record count, not %s" \ 445 % (format, repr(count)) 446 447 return count
448 449
450 -def parse(handle, format, alphabet=None):
451 r"""Turns a sequence file into an iterator returning SeqRecords. 452 453 - handle - handle to the file, or the filename as a string 454 (note older versions of Biopython only took a handle). 455 - format - lower case string describing the file format. 456 - alphabet - optional Alphabet object, useful when the sequence type 457 cannot be automatically inferred from the file itself 458 (e.g. format="fasta" or "tab") 459 460 Typical usage, opening a file to read in, and looping over the record(s): 461 462 >>> from Bio import SeqIO 463 >>> filename = "Fasta/sweetpea.nu" 464 >>> for record in SeqIO.parse(filename, "fasta"): 465 ... print "ID", record.id 466 ... print "Sequence length", len(record) 467 ... print "Sequence alphabet", record.seq.alphabet 468 ID gi|3176602|gb|U78617.1|LOU78617 469 Sequence length 309 470 Sequence alphabet SingleLetterAlphabet() 471 472 For file formats like FASTA where the alphabet cannot be determined, it 473 may be useful to specify the alphabet explicitly: 474 475 >>> from Bio import SeqIO 476 >>> from Bio.Alphabet import generic_dna 477 >>> filename = "Fasta/sweetpea.nu" 478 >>> for record in SeqIO.parse(filename, "fasta", generic_dna): 479 ... print "ID", record.id 480 ... print "Sequence length", len(record) 481 ... print "Sequence alphabet", record.seq.alphabet 482 ID gi|3176602|gb|U78617.1|LOU78617 483 Sequence length 309 484 Sequence alphabet DNAAlphabet() 485 486 If you have a string 'data' containing the file contents, you must 487 first turn this into a handle in order to parse it: 488 489 >>> data = ">Alpha\nACCGGATGTA\n>Beta\nAGGCTCGGTTA\n" 490 >>> from Bio import SeqIO 491 >>> from StringIO import StringIO 492 >>> for record in SeqIO.parse(StringIO(data), "fasta"): 493 ... print record.id, record.seq 494 Alpha ACCGGATGTA 495 Beta AGGCTCGGTTA 496 497 Use the Bio.SeqIO.read(...) function when you expect a single record 498 only. 499 """ 500 #NOTE - The above docstring has some raw \n characters needed 501 #for the StringIO example, hense the whole docstring is in raw 502 #string mode (see the leading r before the opening quote). 503 from Bio import AlignIO 504 505 #Hack for SFF, will need to make this more general in future 506 if format in _BinaryFormats: 507 mode = 'rb' 508 else: 509 mode = 'rU' 510 511 #Try and give helpful error messages: 512 if not isinstance(format, basestring): 513 raise TypeError("Need a string for the file format (lower case)") 514 if not format: 515 raise ValueError("Format required (lower case string)") 516 if format != format.lower(): 517 raise ValueError("Format string '%s' should be lower case" % format) 518 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 519 isinstance(alphabet, AlphabetEncoder)): 520 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 521 522 with as_handle(handle, mode) as fp: 523 #Map the file format to a sequence iterator: 524 if format in _FormatToIterator: 525 iterator_generator = _FormatToIterator[format] 526 if alphabet is None: 527 i = iterator_generator(fp) 528 else: 529 try: 530 i = iterator_generator(fp, alphabet=alphabet) 531 except TypeError: 532 i = _force_alphabet(iterator_generator(fp), alphabet) 533 elif format in AlignIO._FormatToIterator: 534 #Use Bio.AlignIO to read in the alignments 535 i = (r for alignment in AlignIO.parse(fp, format, 536 alphabet=alphabet) 537 for r in alignment) 538 else: 539 raise ValueError("Unknown format '%s'" % format) 540 #This imposes some overhead... wait until we drop Python 2.4 to fix it 541 for r in i: 542 yield r
543 544
545 -def _force_alphabet(record_iterator, alphabet):
546 """Iterate over records, over-riding the alphabet (PRIVATE).""" 547 #Assume the alphabet argument has been pre-validated 548 given_base_class = _get_base_alphabet(alphabet).__class__ 549 for record in record_iterator: 550 if isinstance(_get_base_alphabet(record.seq.alphabet), 551 given_base_class): 552 record.seq.alphabet = alphabet 553 yield record 554 else: 555 raise ValueError("Specified alphabet %s clashes with " 556 "that determined from the file, %s" 557 % (repr(alphabet), repr(record.seq.alphabet)))
558 559
560 -def read(handle, format, alphabet=None):
561 """Turns a sequence file into a single SeqRecord. 562 563 - handle - handle to the file, or the filename as a string 564 (note older versions of Biopython only took a handle). 565 - format - string describing the file format. 566 - alphabet - optional Alphabet object, useful when the sequence type 567 cannot be automatically inferred from the file itself 568 (e.g. format="fasta" or "tab") 569 570 This function is for use parsing sequence files containing 571 exactly one record. For example, reading a GenBank file: 572 573 >>> from Bio import SeqIO 574 >>> record = SeqIO.read("GenBank/arab1.gb", "genbank") 575 >>> print "ID", record.id 576 ID AC007323.5 577 >>> print "Sequence length", len(record) 578 Sequence length 86436 579 >>> print "Sequence alphabet", record.seq.alphabet 580 Sequence alphabet IUPACAmbiguousDNA() 581 582 If the handle contains no records, or more than one record, 583 an exception is raised. For example: 584 585 >>> from Bio import SeqIO 586 >>> record = SeqIO.read("GenBank/cor6_6.gb", "genbank") 587 Traceback (most recent call last): 588 ... 589 ValueError: More than one record found in handle 590 591 If however you want the first record from a file containing 592 multiple records this function would raise an exception (as 593 shown in the example above). Instead use: 594 595 >>> from Bio import SeqIO 596 >>> record = SeqIO.parse("GenBank/cor6_6.gb", "genbank").next() 597 >>> print "First record's ID", record.id 598 First record's ID X55053.1 599 600 Use the Bio.SeqIO.parse(handle, format) function if you want 601 to read multiple records from the handle. 602 """ 603 iterator = parse(handle, format, alphabet) 604 try: 605 first = iterator.next() 606 except StopIteration: 607 first = None 608 if first is None: 609 raise ValueError("No records found in handle") 610 try: 611 second = iterator.next() 612 except StopIteration: 613 second = None 614 if second is not None: 615 raise ValueError("More than one record found in handle") 616 return first
617 618
619 -def to_dict(sequences, key_function=None):
620 """Turns a sequence iterator or list into a dictionary. 621 622 - sequences - An iterator that returns SeqRecord objects, 623 or simply a list of SeqRecord objects. 624 - key_function - Optional callback function which when given a 625 SeqRecord should return a unique key for the dictionary. 626 627 e.g. key_function = lambda rec : rec.name 628 or, key_function = lambda rec : rec.description.split()[0] 629 630 If key_function is omitted then record.id is used, on the assumption 631 that the records objects returned are SeqRecords with a unique id. 632 633 If there are duplicate keys, an error is raised. 634 635 Example usage, defaulting to using the record.id as key: 636 637 >>> from Bio import SeqIO 638 >>> filename = "GenBank/cor6_6.gb" 639 >>> format = "genbank" 640 >>> id_dict = SeqIO.to_dict(SeqIO.parse(filename, format)) 641 >>> print sorted(id_dict) 642 ['AF297471.1', 'AJ237582.1', 'L31939.1', 'M81224.1', 'X55053.1', 'X62281.1'] 643 >>> print id_dict["L31939.1"].description 644 Brassica rapa (clone bif72) kin mRNA, complete cds. 645 646 A more complex example, using the key_function argument in order to 647 use a sequence checksum as the dictionary key: 648 649 >>> from Bio import SeqIO 650 >>> from Bio.SeqUtils.CheckSum import seguid 651 >>> filename = "GenBank/cor6_6.gb" 652 >>> format = "genbank" 653 >>> seguid_dict = SeqIO.to_dict(SeqIO.parse(filename, format), 654 ... key_function = lambda rec : seguid(rec.seq)) 655 >>> for key, record in sorted(seguid_dict.iteritems()): 656 ... print key, record.id 657 /wQvmrl87QWcm9llO4/efg23Vgg AJ237582.1 658 BUg6YxXSKWEcFFH0L08JzaLGhQs L31939.1 659 SabZaA4V2eLE9/2Fm5FnyYy07J4 X55053.1 660 TtWsXo45S3ZclIBy4X/WJc39+CY M81224.1 661 l7gjJFE6W/S1jJn5+1ASrUKW/FA X62281.1 662 uVEYeAQSV5EDQOnFoeMmVea+Oow AF297471.1 663 664 This approach is not suitable for very large sets of sequences, as all 665 the SeqRecord objects are held in memory. Instead, consider using the 666 Bio.SeqIO.index() function (if it supports your particular file format). 667 """ 668 if key_function is None: 669 key_function = lambda rec: rec.id 670 671 d = dict() 672 for record in sequences: 673 key = key_function(record) 674 if key in d: 675 raise ValueError("Duplicate key '%s'" % key) 676 d[key] = record 677 return d
678 679
680 -def index(filename, format, alphabet=None, key_function=None):
681 """Indexes a sequence file and returns a dictionary like object. 682 683 - filename - string giving name of file to be indexed 684 - format - lower case string describing the file format 685 - alphabet - optional Alphabet object, useful when the sequence type 686 cannot be automatically inferred from the file itself 687 (e.g. format="fasta" or "tab") 688 - key_function - Optional callback function which when given a 689 SeqRecord identifier string should return a unique 690 key for the dictionary. 691 692 This indexing function will return a dictionary like object, giving the 693 SeqRecord objects as values: 694 695 >>> from Bio import SeqIO 696 >>> records = SeqIO.index("Quality/example.fastq", "fastq") 697 >>> len(records) 698 3 699 >>> sorted(records) 700 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 701 >>> print records["EAS54_6_R1_2_1_540_792"].format("fasta") 702 >EAS54_6_R1_2_1_540_792 703 TTGGCAGGCCAAGGCCGATGGATCA 704 <BLANKLINE> 705 >>> "EAS54_6_R1_2_1_540_792" in records 706 True 707 >>> print records.get("Missing", None) 708 None 709 710 If the file is BGZF compressed, this is detected automatically. Ordinary 711 GZIP files are not supported: 712 713 >>> from Bio import SeqIO 714 >>> records = SeqIO.index("Quality/example.fastq.bgz", "fastq") 715 >>> len(records) 716 3 717 >>> print records["EAS54_6_R1_2_1_540_792"].seq 718 TTGGCAGGCCAAGGCCGATGGATCA 719 720 Note that this pseudo dictionary will not support all the methods of a 721 true Python dictionary, for example values() is not defined since this 722 would require loading all of the records into memory at once. 723 724 When you call the index function, it will scan through the file, noting 725 the location of each record. When you access a particular record via the 726 dictionary methods, the code will jump to the appropriate part of the 727 file and then parse that section into a SeqRecord. 728 729 Note that not all the input formats supported by Bio.SeqIO can be used 730 with this index function. It is designed to work only with sequential 731 file formats (e.g. "fasta", "gb", "fastq") and is not suitable for any 732 interlaced file format (e.g. alignment formats such as "clustal"). 733 734 For small files, it may be more efficient to use an in memory Python 735 dictionary, e.g. 736 737 >>> from Bio import SeqIO 738 >>> records = SeqIO.to_dict(SeqIO.parse(open("Quality/example.fastq"), "fastq")) 739 >>> len(records) 740 3 741 >>> sorted(records) 742 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 743 >>> print records["EAS54_6_R1_2_1_540_792"].format("fasta") 744 >EAS54_6_R1_2_1_540_792 745 TTGGCAGGCCAAGGCCGATGGATCA 746 <BLANKLINE> 747 748 As with the to_dict() function, by default the id string of each record 749 is used as the key. You can specify a callback function to transform 750 this (the record identifier string) into your preferred key. For example: 751 752 >>> from Bio import SeqIO 753 >>> def make_tuple(identifier): 754 ... parts = identifier.split("_") 755 ... return int(parts[-2]), int(parts[-1]) 756 >>> records = SeqIO.index("Quality/example.fastq", "fastq", 757 ... key_function=make_tuple) 758 >>> len(records) 759 3 760 >>> sorted(records) 761 [(413, 324), (443, 348), (540, 792)] 762 >>> print records[(540, 792)].format("fasta") 763 >EAS54_6_R1_2_1_540_792 764 TTGGCAGGCCAAGGCCGATGGATCA 765 <BLANKLINE> 766 >>> (540, 792) in records 767 True 768 >>> "EAS54_6_R1_2_1_540_792" in records 769 False 770 >>> print records.get("Missing", None) 771 None 772 773 Another common use case would be indexing an NCBI style FASTA file, 774 where you might want to extract the GI number from the FASTA identifer 775 to use as the dictionary key. 776 777 Notice that unlike the to_dict() function, here the key_function does 778 not get given the full SeqRecord to use to generate the key. Doing so 779 would impose a severe performance penalty as it would require the file 780 to be completely parsed while building the index. Right now this is 781 usually avoided. 782 783 See also: Bio.SeqIO.index_db() and Bio.SeqIO.to_dict() 784 """ 785 #Try and give helpful error messages: 786 if not isinstance(filename, basestring): 787 raise TypeError("Need a filename (not a handle)") 788 if not isinstance(format, basestring): 789 raise TypeError("Need a string for the file format (lower case)") 790 if not format: 791 raise ValueError("Format required (lower case string)") 792 if format != format.lower(): 793 raise ValueError("Format string '%s' should be lower case" % format) 794 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 795 isinstance(alphabet, AlphabetEncoder)): 796 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 797 798 #Map the file format to a sequence iterator: 799 from _index import _FormatToRandomAccess # Lazy import 800 from Bio.File import _IndexedSeqFileDict 801 try: 802 proxy_class = _FormatToRandomAccess[format] 803 except KeyError: 804 raise ValueError("Unsupported format %r" % format) 805 repr = "SeqIO.index(%r, %r, alphabet=%r, key_function=%r)" \ 806 % (filename, format, alphabet, key_function) 807 return _IndexedSeqFileDict(proxy_class(filename, format, alphabet), 808 key_function, repr, "SeqRecord")
809 810
811 -def index_db(index_filename, filenames=None, format=None, alphabet=None, 812 key_function=None):
813 """Index several sequence files and return a dictionary like object. 814 815 The index is stored in an SQLite database rather than in memory (as in the 816 Bio.SeqIO.index(...) function). 817 818 - index_filename - Where to store the SQLite index 819 - filenames - list of strings specifying file(s) to be indexed, or when 820 indexing a single file this can be given as a string. 821 (optional if reloading an existing index, but must match) 822 - format - lower case string describing the file format 823 (optional if reloading an existing index, but must match) 824 - alphabet - optional Alphabet object, useful when the sequence type 825 cannot be automatically inferred from the file itself 826 (e.g. format="fasta" or "tab") 827 - key_function - Optional callback function which when given a 828 SeqRecord identifier string should return a unique 829 key for the dictionary. 830 831 This indexing function will return a dictionary like object, giving the 832 SeqRecord objects as values: 833 834 >>> from Bio.Alphabet import generic_protein 835 >>> from Bio import SeqIO 836 >>> files = ["GenBank/NC_000932.faa", "GenBank/NC_005816.faa"] 837 >>> def get_gi(name): 838 ... parts = name.split("|") 839 ... i = parts.index("gi") 840 ... assert i != -1 841 ... return parts[i+1] 842 >>> idx_name = ":memory:" #use an in memory SQLite DB for this test 843 >>> records = SeqIO.index_db(idx_name, files, "fasta", generic_protein, get_gi) 844 >>> len(records) 845 95 846 >>> records["7525076"].description 847 'gi|7525076|ref|NP_051101.1| Ycf2 [Arabidopsis thaliana]' 848 >>> records["45478717"].description 849 'gi|45478717|ref|NP_995572.1| pesticin [Yersinia pestis biovar Microtus str. 91001]' 850 851 In this example the two files contain 85 and 10 records respectively. 852 853 BGZF compressed files are supported, and detected automatically. Ordinary 854 GZIP compressed files are not supported. 855 856 See also: Bio.SeqIO.index() and Bio.SeqIO.to_dict() 857 """ 858 #Try and give helpful error messages: 859 if not isinstance(index_filename, basestring): 860 raise TypeError("Need a string for the index filename") 861 if isinstance(filenames, basestring): 862 #Make the API a little more friendly, and more similar 863 #to Bio.SeqIO.index(...) for indexing just one file. 864 filenames = [filenames] 865 if filenames is not None and not isinstance(filenames, list): 866 raise TypeError( 867 "Need a list of filenames (as strings), or one filename") 868 if format is not None and not isinstance(format, basestring): 869 raise TypeError("Need a string for the file format (lower case)") 870 if format and format != format.lower(): 871 raise ValueError("Format string '%s' should be lower case" % format) 872 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 873 isinstance(alphabet, AlphabetEncoder)): 874 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 875 876 #Map the file format to a sequence iterator: 877 from _index import _FormatToRandomAccess # Lazy import 878 from Bio.File import _SQLiteManySeqFilesDict 879 repr = "SeqIO.index_db(%r, filenames=%r, format=%r, alphabet=%r, key_function=%r)" \ 880 % (index_filename, filenames, format, alphabet, key_function) 881 882 def proxy_factory(format, filename=None): 883 """Given a filename returns proxy object, else boolean if format OK.""" 884 if filename: 885 return _FormatToRandomAccess[format](filename, format, alphabet) 886 else: 887 return format in _FormatToRandomAccess
888 889 return _SQLiteManySeqFilesDict(index_filename, filenames, 890 proxy_factory, format, 891 key_function, repr) 892 893
894 -def convert(in_file, in_format, out_file, out_format, alphabet=None):
895 """Convert between two sequence file formats, return number of records. 896 897 - in_file - an input handle or filename 898 - in_format - input file format, lower case string 899 - out_file - an output handle or filename 900 - out_format - output file format, lower case string 901 - alphabet - optional alphabet to assume 902 903 NOTE - If you provide an output filename, it will be opened which will 904 overwrite any existing file without warning. This may happen if even 905 the conversion is aborted (e.g. an invalid out_format name is given). 906 907 For example, going from a filename to a handle: 908 909 >>> from Bio import SeqIO 910 >>> from StringIO import StringIO 911 >>> handle = StringIO("") 912 >>> SeqIO.convert("Quality/example.fastq", "fastq", handle, "fasta") 913 3 914 >>> print handle.getvalue() 915 >EAS54_6_R1_2_1_413_324 916 CCCTTCTTGTCTTCAGCGTTTCTCC 917 >EAS54_6_R1_2_1_540_792 918 TTGGCAGGCCAAGGCCGATGGATCA 919 >EAS54_6_R1_2_1_443_348 920 GTTGCTTCTGGCGTGGGTGGGGGGG 921 <BLANKLINE> 922 """ 923 #Hack for SFF, will need to make this more general in future 924 if in_format in _BinaryFormats: 925 in_mode = 'rb' 926 else: 927 in_mode = 'rU' 928 929 #Don't open the output file until we've checked the input is OK? 930 if out_format in ["sff", "sff_trim"]: 931 out_mode = 'wb' 932 else: 933 out_mode = 'w' 934 935 #This will check the arguments and issue error messages, 936 #after we have opened the file which is a shame. 937 from _convert import _handle_convert # Lazy import 938 with as_handle(in_file, in_mode) as in_handle: 939 with as_handle(out_file, out_mode) as out_handle: 940 count = _handle_convert(in_handle, in_format, 941 out_handle, out_format, 942 alphabet) 943 return count
944 945 946 if __name__ == "__main__": 947 from Bio._utils import run_doctest 948 run_doctest() 949