Package Bio :: Package SeqIO
[hide private]
[frames] | no frames]

Source Code for Package Bio.SeqIO

  1  # Copyright 2006-2010 by Peter Cock.  All rights reserved. 
  2  # This code is part of the Biopython distribution and governed by its 
  3  # license.  Please see the LICENSE file that should have been included 
  4  # as part of this package. 
  5  # 
  6  #Nice link: 
  7  # http://www.ebi.ac.uk/help/formats_frame.html 
  8   
  9  r"""Sequence input/output as SeqRecord objects. 
 10   
 11  Bio.SeqIO is also documented at U{http://biopython.org/wiki/SeqIO} and by 
 12  a whole chapter in our tutorial: 
 13   - U{http://biopython.org/DIST/docs/tutorial/Tutorial.html} 
 14   - U{http://biopython.org/DIST/docs/tutorial/Tutorial.pdf} 
 15   
 16  Input 
 17  ===== 
 18  The main function is Bio.SeqIO.parse(...) which takes an input file handle 
 19  (or in recent versions of Biopython alternatively a filename as a string), 
 20  and format string.  This returns an iterator giving SeqRecord objects: 
 21   
 22      >>> from Bio import SeqIO 
 23      >>> for record in SeqIO.parse("Fasta/f002", "fasta"): 
 24      ...     print("%s %i" % (record.id, len(record))) 
 25      gi|1348912|gb|G26680|G26680 633 
 26      gi|1348917|gb|G26685|G26685 413 
 27      gi|1592936|gb|G29385|G29385 471 
 28   
 29  Note that the parse() function will invoke the relevant parser for the 
 30  format with its default settings.  You may want more control, in which case 
 31  you need to create a format specific sequence iterator directly. 
 32   
 33  Input - Single Records 
 34  ====================== 
 35  If you expect your file to contain one-and-only-one record, then we provide 
 36  the following 'helper' function which will return a single SeqRecord, or 
 37  raise an exception if there are no records or more than one record: 
 38   
 39      >>> from Bio import SeqIO 
 40      >>> record = SeqIO.read("Fasta/f001", "fasta") 
 41      >>> print("%s %i" % (record.id, len(record))) 
 42      gi|3318709|pdb|1A91| 79 
 43   
 44  This style is useful when you expect a single record only (and would 
 45  consider multiple records an error).  For example, when dealing with GenBank 
 46  files for bacterial genomes or chromosomes, there is normally only a single 
 47  record.  Alternatively, use this with a handle when downloading a single 
 48  record from the internet. 
 49   
 50  However, if you just want the first record from a file containing multiple 
 51  record, use the next() function on the iterator (or under Python 2, the 
 52  iterator's next() method): 
 53   
 54      >>> from Bio import SeqIO 
 55      >>> record = next(SeqIO.parse("Fasta/f002", "fasta")) 
 56      >>> print("%s %i" % (record.id, len(record))) 
 57      gi|1348912|gb|G26680|G26680 633 
 58   
 59  The above code will work as long as the file contains at least one record. 
 60  Note that if there is more than one record, the remaining records will be 
 61  silently ignored. 
 62   
 63   
 64  Input - Multiple Records 
 65  ======================== 
 66  For non-interlaced files (e.g. Fasta, GenBank, EMBL) with multiple records 
 67  using a sequence iterator can save you a lot of memory (RAM).  There is 
 68  less benefit for interlaced file formats (e.g. most multiple alignment file 
 69  formats).  However, an iterator only lets you access the records one by one. 
 70   
 71  If you want random access to the records by number, turn this into a list: 
 72   
 73      >>> from Bio import SeqIO 
 74      >>> records = list(SeqIO.parse("Fasta/f002", "fasta")) 
 75      >>> len(records) 
 76      3 
 77      >>> print(records[1].id) 
 78      gi|1348917|gb|G26685|G26685 
 79   
 80  If you want random access to the records by a key such as the record id, 
 81  turn the iterator into a dictionary: 
 82   
 83      >>> from Bio import SeqIO 
 84      >>> record_dict = SeqIO.to_dict(SeqIO.parse("Fasta/f002", "fasta")) 
 85      >>> len(record_dict) 
 86      3 
 87      >>> print(len(record_dict["gi|1348917|gb|G26685|G26685"])) 
 88      413 
 89   
 90  However, using list() or the to_dict() function will load all the records 
 91  into memory at once, and therefore is not possible on very large files. 
 92  Instead, for *some* file formats Bio.SeqIO provides an indexing approach 
 93  providing dictionary like access to any record. For example, 
 94   
 95      >>> from Bio import SeqIO 
 96      >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
 97      >>> len(record_dict) 
 98      3 
 99      >>> print(len(record_dict["gi|1348917|gb|G26685|G26685"])) 
100      413 
101   
102  Many but not all of the supported input file formats can be indexed like 
103  this. For example "fasta", "fastq", "qual" and even the binary format "sff" 
104  work, but alignment formats like "phylip", "clustalw" and "nexus" will not. 
105   
106  In most cases you can also use SeqIO.index to get the record from the file 
107  as a raw string (not a SeqRecord). This can be useful for example to extract 
108  a sub-set of records from a file where SeqIO cannot output the file format 
109  (e.g. the plain text SwissProt format, "swiss") or where it is important to 
110  keep the output 100% identical to the input). For example, 
111   
112      >>> from Bio import SeqIO 
113      >>> record_dict = SeqIO.index("Fasta/f002", "fasta") 
114      >>> len(record_dict) 
115      3 
116      >>> print(record_dict.get_raw("gi|1348917|gb|G26685|G26685").decode()) 
117      >gi|1348917|gb|G26685|G26685 human STS STS_D11734. 
118      CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATCTTACTCTTTC 
119      ATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTTTACAGATGTGAAACTTTCAA 
120      GAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAAACCTGATGCTTTTATAAGCCATTGTGATTA 
121      GGATGACTGTTACAGGCTTAGCTTTGTGTGAAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGT 
122      TCATATTACTNTAAGTTCTATAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGG 
123      AATTGNTTTGCCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA 
124      <BLANKLINE> 
125      >>> print(record_dict["gi|1348917|gb|G26685|G26685"].format("fasta")) 
126      >gi|1348917|gb|G26685|G26685 human STS STS_D11734. 
127      CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATC 
128      TTACTCTTTCATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTT 
129      TACAGATGTGAAACTTTCAAGAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAA 
130      ACCTGATGCTTTTATAAGCCATTGTGATTAGGATGACTGTTACAGGCTTAGCTTTGTGTG 
131      AAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGTTCATATTACTNTAAGTTCTA 
132      TAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGGAATTGNTTTG 
133      CCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA 
134      <BLANKLINE> 
135   
136  Here the original file and what Biopython would output differ in the line 
137  wrapping. Also note that under Python 3, the get_raw method will return a 
138  bytes string, hence the use of decode to turn it into a (unicode) string. 
139  This is uncessary on Python 2. 
140   
141  Also note that the get_raw method will preserve the newline endings. This 
142  example FASTQ file uses Unix style endings (b"\n" only), 
143   
144      >>> from Bio import SeqIO 
145      >>> fastq_dict = SeqIO.index("Quality/example.fastq", "fastq") 
146      >>> len(fastq_dict) 
147      3 
148      >>> raw = fastq_dict.get_raw("EAS54_6_R1_2_1_540_792") 
149      >>> raw.count(b"\n") 
150      4 
151      >>> raw.count(b"\r\n") 
152      0 
153      >>> b"\r" in raw 
154      False 
155      >>> len(raw) 
156      78 
157   
158  Here is the same file but using DOS/Windows new lines (b"\r\n" instead), 
159   
160      >>> from Bio import SeqIO 
161      >>> fastq_dict = SeqIO.index("Quality/example_dos.fastq", "fastq") 
162      >>> len(fastq_dict) 
163      3 
164      >>> raw = fastq_dict.get_raw("EAS54_6_R1_2_1_540_792") 
165      >>> raw.count(b"\n") 
166      4 
167      >>> raw.count(b"\r\n") 
168      4 
169      >>> b"\r\n" in raw 
170      True 
171      >>> len(raw) 
172      82 
173   
174  Because this uses two bytes for each new line, the file is longer than 
175  the Unix equivalent with only one byte. 
176   
177   
178  Input - Alignments 
179  ================== 
180  You can read in alignment files as alignment objects using Bio.AlignIO. 
181  Alternatively, reading in an alignment file format via Bio.SeqIO will give 
182  you a SeqRecord for each row of each alignment: 
183   
184      >>> from Bio import SeqIO 
185      >>> for record in SeqIO.parse("Clustalw/hedgehog.aln", "clustal"): 
186      ...     print("%s %i" % (record.id, len(record))) 
187      gi|167877390|gb|EDS40773.1| 447 
188      gi|167234445|ref|NP_001107837. 447 
189      gi|74100009|gb|AAZ99217.1| 447 
190      gi|13990994|dbj|BAA33523.2| 447 
191      gi|56122354|gb|AAV74328.1| 447 
192   
193   
194  Output 
195  ====== 
196  Use the function Bio.SeqIO.write(...), which takes a complete set of 
197  SeqRecord objects (either as a list, or an iterator), an output file handle 
198  (or in recent versions of Biopython an output filename as a string) and of 
199  course the file format:: 
200   
201      from Bio import SeqIO 
202      records = ... 
203      SeqIO.write(records, "example.faa", "fasta") 
204   
205  Or, using a handle:: 
206   
207      from Bio import SeqIO 
208      records = ... 
209      with open("example.faa", "w") as handle: 
210          SeqIO.write(records, handle, "fasta") 
211   
212  You are expected to call this function once (with all your records) and if 
213  using a handle, make sure you close it to flush the data to the hard disk. 
214   
215   
216  Output - Advanced 
217  ================= 
218  The effect of calling write() multiple times on a single file will vary 
219  depending on the file format, and is best avoided unless you have a strong 
220  reason to do so. 
221   
222  If you give a filename, then each time you call write() the existing file 
223  will be overwriten. For sequential files formats (e.g. fasta, genbank) each 
224  "record block" holds a single sequence.  For these files it would probably 
225  be safe to call write() multiple times by re-using the same handle. 
226   
227   
228  However, trying this for certain alignment formats (e.g. phylip, clustal, 
229  stockholm) would have the effect of concatenating several multiple sequence 
230  alignments together.  Such files are created by the PHYLIP suite of programs 
231  for bootstrap analysis, but it is clearer to do this via Bio.AlignIO instead. 
232   
233   
234  Conversion 
235  ========== 
236  The Bio.SeqIO.convert(...) function allows an easy interface for simple 
237  file format conversions. Additionally, it may use file format specific 
238  optimisations so this should be the fastest way too. 
239   
240  In general however, you can combine the Bio.SeqIO.parse(...) function with 
241  the Bio.SeqIO.write(...) function for sequence file conversion. Using 
242  generator expressions or generator functions provides a memory efficient way 
243  to perform filtering or other extra operations as part of the process. 
244   
245   
246  File Formats 
247  ============ 
248  When specifying the file format, use lowercase strings.  The same format 
249  names are also used in Bio.AlignIO and include the following: 
250   
251   - abif    - Applied Biosystem's sequencing trace format 
252   - ace     - Reads the contig sequences from an ACE assembly file. 
253   - embl    - The EMBL flat file format. Uses Bio.GenBank internally. 
254   - fasta   - The generic sequence file format where each record starts with 
255               an identifer line starting with a ">" character, followed by 
256               lines of sequence. 
257   - fastq   - A "FASTA like" format used by Sanger which also stores PHRED 
258               sequence quality values (with an ASCII offset of 33). 
259   - fastq-sanger - An alias for "fastq" for consistency with BioPerl and EMBOSS 
260   - fastq-solexa - Original Solexa/Illumnia variant of the FASTQ format which 
261               encodes Solexa quality scores (not PHRED quality scores) with an 
262               ASCII offset of 64. 
263   - fastq-illumina - Solexa/Illumina 1.3 to 1.7 variant of the FASTQ format 
264               which encodes PHRED quality scores with an ASCII offset of 64 
265               (not 33). Note as of version 1.8 of the CASAVA pipeline Illumina 
266               will produce FASTQ files using the standard Sanger encoding. 
267   - genbank - The GenBank or GenPept flat file format. 
268   - gb      - An alias for "genbank", for consistency with NCBI Entrez Utilities 
269   - ig      - The IntelliGenetics file format, apparently the same as the 
270               MASE alignment format. 
271   - imgt    - An EMBL like format from IMGT where the feature tables are more 
272               indented to allow for longer feature types. 
273   - phd     - Output from PHRED, used by PHRAP and CONSED for input. 
274   - pir     - A "FASTA like" format introduced by the National Biomedical 
275               Research Foundation (NBRF) for the Protein Information Resource 
276               (PIR) database, now part of UniProt. 
277   - seqxml  - SeqXML, simple XML format described in Schmitt et al (2011). 
278   - sff     - Standard Flowgram Format (SFF), typical output from Roche 454. 
279   - sff-trim - Standard Flowgram Format (SFF) with given trimming applied. 
280   - swiss   - Plain text Swiss-Prot aka UniProt format. 
281   - tab     - Simple two column tab separated sequence files, where each 
282               line holds a record's identifier and sequence. For example, 
283               this is used as by Aligent's eArray software when saving 
284               microarray probes in a minimal tab delimited text file. 
285   - qual    - A "FASTA like" format holding PHRED quality values from 
286               sequencing DNA, but no actual sequences (usually provided 
287               in separate FASTA files). 
288   - uniprot-xml - The UniProt XML format (replacement for the SwissProt plain 
289               text format which we call "swiss") 
290   
291  Note that while Bio.SeqIO can read all the above file formats, it cannot 
292  write to all of them. 
293   
294  You can also use any file format supported by Bio.AlignIO, such as "nexus", 
295  "phlip" and "stockholm", which gives you access to the individual sequences 
296  making up each alignment as SeqRecords. 
297  """ 
298   
299  from __future__ import print_function 
300  from Bio._py3k import basestring 
301   
302  __docformat__ = "epytext en"  # not just plaintext 
303   
304  #TODO 
305  # - define policy on reading aligned sequences with gaps in 
306  #   (e.g. - and . characters) including how the alphabet interacts 
307  # 
308  # - How best to handle unique/non unique record.id when writing. 
309  #   For most file formats reading such files is fine; The stockholm 
310  #   parser would fail. 
311  # 
312  # - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf) 
313  #   http://www.bioperl.org/wiki/MSF_multiple_alignment_format 
314   
315  """ 
316  FAO BioPython Developers 
317  ======================== 
318  The way I envision this SeqIO system working as that for any sequence file 
319  format we have an iterator that returns SeqRecord objects. 
320   
321  This also applies to interlaced fileformats (like clustal - although that 
322  is now handled via Bio.AlignIO instead) where the file cannot be read record 
323  by record.  You should still return an iterator, even if the implementation 
324  could just as easily return a list. 
325   
326  These file format specific sequence iterators may be implemented as: 
327  * Classes which take a handle for __init__ and provide the __iter__ method 
328  * Functions that take a handle, and return an iterator object 
329  * Generator functions that take a handle, and yield SeqRecord objects 
330   
331  It is then trivial to turn this iterator into a list of SeqRecord objects, 
332  an in memory dictionary, or a multiple sequence alignment object. 
333   
334  For building the dictionary by default the id propery of each SeqRecord is 
335  used as the key.  You should always populate the id property, and it should 
336  be unique in most cases. For some file formats the accession number is a good 
337  choice.  If the file itself contains ambiguous identifiers, don't try and 
338  dis-ambiguate them - return them as is. 
339   
340  When adding a new file format, please use the same lower case format name 
341  as BioPerl, or if they have not defined one, try the names used by EMBOSS. 
342   
343  See also http://biopython.org/wiki/SeqIO_dev 
344   
345  --Peter 
346  """ 
347   
348   
349  from Bio.File import as_handle 
350  from Bio.SeqRecord import SeqRecord 
351  from Bio.Align import MultipleSeqAlignment 
352  from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet 
353   
354  from . import AbiIO 
355  from . import AceIO 
356  from . import FastaIO 
357  from . import IgIO  # IntelliGenetics or MASE format 
358  from . import InsdcIO  # EMBL and GenBank 
359  from . import PdbIO 
360  from . import PhdIO 
361  from . import PirIO 
362  from . import SeqXmlIO 
363  from . import SffIO 
364  from . import SwissIO 
365  from . import TabIO 
366  from . import QualityIO  # FastQ and qual files 
367  from . import UniprotIO 
368   
369   
370  #Convention for format names is "mainname-subtype" in lower case. 
371  #Please use the same names as BioPerl or EMBOSS where possible. 
372  # 
373  #Note that this simple system copes with defining 
374  #multiple possible iterators for a given format/extension 
375  #with the -subtype suffix 
376  # 
377  #Most alignment file formats will be handled via Bio.AlignIO 
378   
379  _FormatToIterator = {"fasta": FastaIO.FastaIterator, 
380                       "gb": InsdcIO.GenBankIterator, 
381                       "genbank": InsdcIO.GenBankIterator, 
382                       "genbank-cds": InsdcIO.GenBankCdsFeatureIterator, 
383                       "embl": InsdcIO.EmblIterator, 
384                       "embl-cds": InsdcIO.EmblCdsFeatureIterator, 
385                       "imgt": InsdcIO.ImgtIterator, 
386                       "ig": IgIO.IgIterator, 
387                       "swiss": SwissIO.SwissIterator, 
388                       "pdb-atom": PdbIO.PdbAtomIterator, 
389                       "pdb-seqres": PdbIO.PdbSeqresIterator, 
390                       "phd": PhdIO.PhdIterator, 
391                       "ace": AceIO.AceIterator, 
392                       "tab": TabIO.TabIterator, 
393                       "pir": PirIO.PirIterator, 
394                       "fastq": QualityIO.FastqPhredIterator, 
395                       "fastq-sanger": QualityIO.FastqPhredIterator, 
396                       "fastq-solexa": QualityIO.FastqSolexaIterator, 
397                       "fastq-illumina": QualityIO.FastqIlluminaIterator, 
398                       "qual": QualityIO.QualPhredIterator, 
399                       "sff": SffIO.SffIterator, 
400                       #Not sure about this in the long run: 
401                       "sff-trim": SffIO._SffTrimIterator, 
402                       "uniprot-xml": UniprotIO.UniprotIterator, 
403                       "seqxml": SeqXmlIO.SeqXmlIterator, 
404                       "abi": AbiIO.AbiIterator, 
405                       "abi-trim": AbiIO._AbiTrimIterator, 
406                       } 
407   
408  _FormatToWriter = {"fasta": FastaIO.FastaWriter, 
409                     "gb": InsdcIO.GenBankWriter, 
410                     "genbank": InsdcIO.GenBankWriter, 
411                     "embl": InsdcIO.EmblWriter, 
412                     "imgt": InsdcIO.ImgtWriter, 
413                     "tab": TabIO.TabWriter, 
414                     "fastq": QualityIO.FastqPhredWriter, 
415                     "fastq-sanger": QualityIO.FastqPhredWriter, 
416                     "fastq-solexa": QualityIO.FastqSolexaWriter, 
417                     "fastq-illumina": QualityIO.FastqIlluminaWriter, 
418                     "phd": PhdIO.PhdWriter, 
419                     "qual": QualityIO.QualPhredWriter, 
420                     "sff": SffIO.SffWriter, 
421                     "seqxml": SeqXmlIO.SeqXmlWriter, 
422                     } 
423   
424  _BinaryFormats = ["sff", "sff-trim", "abi", "abi-trim"] 
425   
426   
427 -def write(sequences, handle, format):
428 """Write complete set of sequences to a file. 429 430 - sequences - A list (or iterator) of SeqRecord objects, or (if using 431 Biopython 1.54 or later) a single SeqRecord. 432 - handle - File handle object to write to, or filename as string 433 (note older versions of Biopython only took a handle). 434 - format - lower case string describing the file format to write. 435 436 You should close the handle after calling this function. 437 438 Returns the number of records written (as an integer). 439 """ 440 from Bio import AlignIO 441 442 #Try and give helpful error messages: 443 if not isinstance(format, basestring): 444 raise TypeError("Need a string for the file format (lower case)") 445 if not format: 446 raise ValueError("Format required (lower case string)") 447 if format != format.lower(): 448 raise ValueError("Format string '%s' should be lower case" % format) 449 450 if isinstance(sequences, SeqRecord): 451 #This raised an exception in order version of Biopython 452 sequences = [sequences] 453 454 if format in _BinaryFormats: 455 mode = 'wb' 456 else: 457 mode = 'w' 458 459 with as_handle(handle, mode) as fp: 460 #Map the file format to a writer class 461 if format in _FormatToWriter: 462 writer_class = _FormatToWriter[format] 463 count = writer_class(fp).write_file(sequences) 464 elif format in AlignIO._FormatToWriter: 465 #Try and turn all the records into a single alignment, 466 #and write that using Bio.AlignIO 467 alignment = MultipleSeqAlignment(sequences) 468 alignment_count = AlignIO.write([alignment], fp, format) 469 assert alignment_count == 1, \ 470 "Internal error - the underlying writer " \ 471 " should have returned 1, not %s" % repr(alignment_count) 472 count = len(alignment) 473 del alignment_count, alignment 474 elif format in _FormatToIterator or format in AlignIO._FormatToIterator: 475 raise ValueError("Reading format '%s' is supported, but not writing" 476 % format) 477 else: 478 raise ValueError("Unknown format '%s'" % format) 479 480 assert isinstance(count, int), "Internal error - the underlying %s " \ 481 "writer should have returned the record count, not %s" \ 482 % (format, repr(count)) 483 484 return count
485 486
487 -def parse(handle, format, alphabet=None):
488 r"""Turns a sequence file into an iterator returning SeqRecords. 489 490 - handle - handle to the file, or the filename as a string 491 (note older versions of Biopython only took a handle). 492 - format - lower case string describing the file format. 493 - alphabet - optional Alphabet object, useful when the sequence type 494 cannot be automatically inferred from the file itself 495 (e.g. format="fasta" or "tab") 496 497 Typical usage, opening a file to read in, and looping over the record(s): 498 499 >>> from Bio import SeqIO 500 >>> filename = "Fasta/sweetpea.nu" 501 >>> for record in SeqIO.parse(filename, "fasta"): 502 ... print("ID %s" % record.id) 503 ... print("Sequence length %i" % len(record)) 504 ... print("Sequence alphabet %s" % record.seq.alphabet) 505 ID gi|3176602|gb|U78617.1|LOU78617 506 Sequence length 309 507 Sequence alphabet SingleLetterAlphabet() 508 509 For file formats like FASTA where the alphabet cannot be determined, it 510 may be useful to specify the alphabet explicitly: 511 512 >>> from Bio import SeqIO 513 >>> from Bio.Alphabet import generic_dna 514 >>> filename = "Fasta/sweetpea.nu" 515 >>> for record in SeqIO.parse(filename, "fasta", generic_dna): 516 ... print("ID %s" % record.id) 517 ... print("Sequence length %i" % len(record)) 518 ... print("Sequence alphabet %s" % record.seq.alphabet) 519 ID gi|3176602|gb|U78617.1|LOU78617 520 Sequence length 309 521 Sequence alphabet DNAAlphabet() 522 523 If you have a string 'data' containing the file contents, you must 524 first turn this into a handle in order to parse it: 525 526 >>> data = ">Alpha\nACCGGATGTA\n>Beta\nAGGCTCGGTTA\n" 527 >>> from Bio import SeqIO 528 >>> try: 529 ... from StringIO import StringIO # Python 2 530 ... except ImportError: 531 ... from io import StringIO # Python 3 532 ... 533 >>> for record in SeqIO.parse(StringIO(data), "fasta"): 534 ... print("%s %s" % (record.id, record.seq)) 535 Alpha ACCGGATGTA 536 Beta AGGCTCGGTTA 537 538 Use the Bio.SeqIO.read(...) function when you expect a single record 539 only. 540 """ 541 #NOTE - The above docstring has some raw \n characters needed 542 #for the StringIO example, hense the whole docstring is in raw 543 #string mode (see the leading r before the opening quote). 544 from Bio import AlignIO 545 546 #Hack for SFF, will need to make this more general in future 547 if format in _BinaryFormats: 548 mode = 'rb' 549 else: 550 mode = 'rU' 551 552 #Try and give helpful error messages: 553 if not isinstance(format, basestring): 554 raise TypeError("Need a string for the file format (lower case)") 555 if not format: 556 raise ValueError("Format required (lower case string)") 557 if format != format.lower(): 558 raise ValueError("Format string '%s' should be lower case" % format) 559 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 560 isinstance(alphabet, AlphabetEncoder)): 561 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 562 563 with as_handle(handle, mode) as fp: 564 #Map the file format to a sequence iterator: 565 if format in _FormatToIterator: 566 iterator_generator = _FormatToIterator[format] 567 if alphabet is None: 568 i = iterator_generator(fp) 569 else: 570 try: 571 i = iterator_generator(fp, alphabet=alphabet) 572 except TypeError: 573 i = _force_alphabet(iterator_generator(fp), alphabet) 574 elif format in AlignIO._FormatToIterator: 575 #Use Bio.AlignIO to read in the alignments 576 i = (r for alignment in AlignIO.parse(fp, format, 577 alphabet=alphabet) 578 for r in alignment) 579 else: 580 raise ValueError("Unknown format '%s'" % format) 581 #This imposes some overhead... wait until we drop Python 2.4 to fix it 582 for r in i: 583 yield r
584 585
586 -def _force_alphabet(record_iterator, alphabet):
587 """Iterate over records, over-riding the alphabet (PRIVATE).""" 588 #Assume the alphabet argument has been pre-validated 589 given_base_class = _get_base_alphabet(alphabet).__class__ 590 for record in record_iterator: 591 if isinstance(_get_base_alphabet(record.seq.alphabet), 592 given_base_class): 593 record.seq.alphabet = alphabet 594 yield record 595 else: 596 raise ValueError("Specified alphabet %s clashes with " 597 "that determined from the file, %s" 598 % (repr(alphabet), repr(record.seq.alphabet)))
599 600
601 -def read(handle, format, alphabet=None):
602 """Turns a sequence file into a single SeqRecord. 603 604 - handle - handle to the file, or the filename as a string 605 (note older versions of Biopython only took a handle). 606 - format - string describing the file format. 607 - alphabet - optional Alphabet object, useful when the sequence type 608 cannot be automatically inferred from the file itself 609 (e.g. format="fasta" or "tab") 610 611 This function is for use parsing sequence files containing 612 exactly one record. For example, reading a GenBank file: 613 614 >>> from Bio import SeqIO 615 >>> record = SeqIO.read("GenBank/arab1.gb", "genbank") 616 >>> print("ID %s" % record.id) 617 ID AC007323.5 618 >>> print("Sequence length %i" % len(record)) 619 Sequence length 86436 620 >>> print("Sequence alphabet %s" % record.seq.alphabet) 621 Sequence alphabet IUPACAmbiguousDNA() 622 623 If the handle contains no records, or more than one record, 624 an exception is raised. For example: 625 626 >>> from Bio import SeqIO 627 >>> record = SeqIO.read("GenBank/cor6_6.gb", "genbank") 628 Traceback (most recent call last): 629 ... 630 ValueError: More than one record found in handle 631 632 If however you want the first record from a file containing 633 multiple records this function would raise an exception (as 634 shown in the example above). Instead use: 635 636 >>> from Bio import SeqIO 637 >>> record = next(SeqIO.parse("GenBank/cor6_6.gb", "genbank")) 638 >>> print("First record's ID %s" % record.id) 639 First record's ID X55053.1 640 641 Use the Bio.SeqIO.parse(handle, format) function if you want 642 to read multiple records from the handle. 643 """ 644 iterator = parse(handle, format, alphabet) 645 try: 646 first = next(iterator) 647 except StopIteration: 648 first = None 649 if first is None: 650 raise ValueError("No records found in handle") 651 try: 652 second = next(iterator) 653 except StopIteration: 654 second = None 655 if second is not None: 656 raise ValueError("More than one record found in handle") 657 return first
658 659
660 -def to_dict(sequences, key_function=None):
661 """Turns a sequence iterator or list into a dictionary. 662 663 - sequences - An iterator that returns SeqRecord objects, 664 or simply a list of SeqRecord objects. 665 - key_function - Optional callback function which when given a 666 SeqRecord should return a unique key for the dictionary. 667 668 e.g. key_function = lambda rec : rec.name 669 or, key_function = lambda rec : rec.description.split()[0] 670 671 If key_function is omitted then record.id is used, on the assumption 672 that the records objects returned are SeqRecords with a unique id. 673 674 If there are duplicate keys, an error is raised. 675 676 Example usage, defaulting to using the record.id as key: 677 678 >>> from Bio import SeqIO 679 >>> filename = "GenBank/cor6_6.gb" 680 >>> format = "genbank" 681 >>> id_dict = SeqIO.to_dict(SeqIO.parse(filename, format)) 682 >>> print(sorted(id_dict)) 683 ['AF297471.1', 'AJ237582.1', 'L31939.1', 'M81224.1', 'X55053.1', 'X62281.1'] 684 >>> print(id_dict["L31939.1"].description) 685 Brassica rapa (clone bif72) kin mRNA, complete cds. 686 687 A more complex example, using the key_function argument in order to 688 use a sequence checksum as the dictionary key: 689 690 >>> from Bio import SeqIO 691 >>> from Bio.SeqUtils.CheckSum import seguid 692 >>> filename = "GenBank/cor6_6.gb" 693 >>> format = "genbank" 694 >>> seguid_dict = SeqIO.to_dict(SeqIO.parse(filename, format), 695 ... key_function = lambda rec : seguid(rec.seq)) 696 >>> for key, record in sorted(seguid_dict.items()): 697 ... print("%s %s" % (key, record.id)) 698 /wQvmrl87QWcm9llO4/efg23Vgg AJ237582.1 699 BUg6YxXSKWEcFFH0L08JzaLGhQs L31939.1 700 SabZaA4V2eLE9/2Fm5FnyYy07J4 X55053.1 701 TtWsXo45S3ZclIBy4X/WJc39+CY M81224.1 702 l7gjJFE6W/S1jJn5+1ASrUKW/FA X62281.1 703 uVEYeAQSV5EDQOnFoeMmVea+Oow AF297471.1 704 705 This approach is not suitable for very large sets of sequences, as all 706 the SeqRecord objects are held in memory. Instead, consider using the 707 Bio.SeqIO.index() function (if it supports your particular file format). 708 """ 709 if key_function is None: 710 key_function = lambda rec: rec.id 711 712 d = dict() 713 for record in sequences: 714 key = key_function(record) 715 if key in d: 716 raise ValueError("Duplicate key '%s'" % key) 717 d[key] = record 718 return d
719 720
721 -def index(filename, format, alphabet=None, key_function=None):
722 """Indexes a sequence file and returns a dictionary like object. 723 724 - filename - string giving name of file to be indexed 725 - format - lower case string describing the file format 726 - alphabet - optional Alphabet object, useful when the sequence type 727 cannot be automatically inferred from the file itself 728 (e.g. format="fasta" or "tab") 729 - key_function - Optional callback function which when given a 730 SeqRecord identifier string should return a unique 731 key for the dictionary. 732 733 This indexing function will return a dictionary like object, giving the 734 SeqRecord objects as values: 735 736 >>> from Bio import SeqIO 737 >>> records = SeqIO.index("Quality/example.fastq", "fastq") 738 >>> len(records) 739 3 740 >>> sorted(records) 741 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 742 >>> print(records["EAS54_6_R1_2_1_540_792"].format("fasta")) 743 >EAS54_6_R1_2_1_540_792 744 TTGGCAGGCCAAGGCCGATGGATCA 745 <BLANKLINE> 746 >>> "EAS54_6_R1_2_1_540_792" in records 747 True 748 >>> print(records.get("Missing", None)) 749 None 750 751 If the file is BGZF compressed, this is detected automatically. Ordinary 752 GZIP files are not supported: 753 754 >>> from Bio import SeqIO 755 >>> records = SeqIO.index("Quality/example.fastq.bgz", "fastq") 756 >>> len(records) 757 3 758 >>> print(records["EAS54_6_R1_2_1_540_792"].seq) 759 TTGGCAGGCCAAGGCCGATGGATCA 760 761 Note that this pseudo dictionary will not support all the methods of a 762 true Python dictionary, for example values() is not defined since this 763 would require loading all of the records into memory at once. 764 765 When you call the index function, it will scan through the file, noting 766 the location of each record. When you access a particular record via the 767 dictionary methods, the code will jump to the appropriate part of the 768 file and then parse that section into a SeqRecord. 769 770 Note that not all the input formats supported by Bio.SeqIO can be used 771 with this index function. It is designed to work only with sequential 772 file formats (e.g. "fasta", "gb", "fastq") and is not suitable for any 773 interlaced file format (e.g. alignment formats such as "clustal"). 774 775 For small files, it may be more efficient to use an in memory Python 776 dictionary, e.g. 777 778 >>> from Bio import SeqIO 779 >>> records = SeqIO.to_dict(SeqIO.parse(open("Quality/example.fastq"), "fastq")) 780 >>> len(records) 781 3 782 >>> sorted(records) 783 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792'] 784 >>> print(records["EAS54_6_R1_2_1_540_792"].format("fasta")) 785 >EAS54_6_R1_2_1_540_792 786 TTGGCAGGCCAAGGCCGATGGATCA 787 <BLANKLINE> 788 789 As with the to_dict() function, by default the id string of each record 790 is used as the key. You can specify a callback function to transform 791 this (the record identifier string) into your preferred key. For example: 792 793 >>> from Bio import SeqIO 794 >>> def make_tuple(identifier): 795 ... parts = identifier.split("_") 796 ... return int(parts[-2]), int(parts[-1]) 797 >>> records = SeqIO.index("Quality/example.fastq", "fastq", 798 ... key_function=make_tuple) 799 >>> len(records) 800 3 801 >>> sorted(records) 802 [(413, 324), (443, 348), (540, 792)] 803 >>> print(records[(540, 792)].format("fasta")) 804 >EAS54_6_R1_2_1_540_792 805 TTGGCAGGCCAAGGCCGATGGATCA 806 <BLANKLINE> 807 >>> (540, 792) in records 808 True 809 >>> "EAS54_6_R1_2_1_540_792" in records 810 False 811 >>> print(records.get("Missing", None)) 812 None 813 814 Another common use case would be indexing an NCBI style FASTA file, 815 where you might want to extract the GI number from the FASTA identifer 816 to use as the dictionary key. 817 818 Notice that unlike the to_dict() function, here the key_function does 819 not get given the full SeqRecord to use to generate the key. Doing so 820 would impose a severe performance penalty as it would require the file 821 to be completely parsed while building the index. Right now this is 822 usually avoided. 823 824 See also: Bio.SeqIO.index_db() and Bio.SeqIO.to_dict() 825 """ 826 #Try and give helpful error messages: 827 if not isinstance(filename, basestring): 828 raise TypeError("Need a filename (not a handle)") 829 if not isinstance(format, basestring): 830 raise TypeError("Need a string for the file format (lower case)") 831 if not format: 832 raise ValueError("Format required (lower case string)") 833 if format != format.lower(): 834 raise ValueError("Format string '%s' should be lower case" % format) 835 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 836 isinstance(alphabet, AlphabetEncoder)): 837 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 838 839 #Map the file format to a sequence iterator: 840 from ._index import _FormatToRandomAccess # Lazy import 841 from Bio.File import _IndexedSeqFileDict 842 try: 843 proxy_class = _FormatToRandomAccess[format] 844 except KeyError: 845 raise ValueError("Unsupported format %r" % format) 846 repr = "SeqIO.index(%r, %r, alphabet=%r, key_function=%r)" \ 847 % (filename, format, alphabet, key_function) 848 return _IndexedSeqFileDict(proxy_class(filename, format, alphabet), 849 key_function, repr, "SeqRecord")
850 851
852 -def index_db(index_filename, filenames=None, format=None, alphabet=None, 853 key_function=None):
854 """Index several sequence files and return a dictionary like object. 855 856 The index is stored in an SQLite database rather than in memory (as in the 857 Bio.SeqIO.index(...) function). 858 859 - index_filename - Where to store the SQLite index 860 - filenames - list of strings specifying file(s) to be indexed, or when 861 indexing a single file this can be given as a string. 862 (optional if reloading an existing index, but must match) 863 - format - lower case string describing the file format 864 (optional if reloading an existing index, but must match) 865 - alphabet - optional Alphabet object, useful when the sequence type 866 cannot be automatically inferred from the file itself 867 (e.g. format="fasta" or "tab") 868 - key_function - Optional callback function which when given a 869 SeqRecord identifier string should return a unique 870 key for the dictionary. 871 872 This indexing function will return a dictionary like object, giving the 873 SeqRecord objects as values: 874 875 >>> from Bio.Alphabet import generic_protein 876 >>> from Bio import SeqIO 877 >>> files = ["GenBank/NC_000932.faa", "GenBank/NC_005816.faa"] 878 >>> def get_gi(name): 879 ... parts = name.split("|") 880 ... i = parts.index("gi") 881 ... assert i != -1 882 ... return parts[i+1] 883 >>> idx_name = ":memory:" #use an in memory SQLite DB for this test 884 >>> records = SeqIO.index_db(idx_name, files, "fasta", generic_protein, get_gi) 885 >>> len(records) 886 95 887 >>> records["7525076"].description 888 'gi|7525076|ref|NP_051101.1| Ycf2 [Arabidopsis thaliana]' 889 >>> records["45478717"].description 890 'gi|45478717|ref|NP_995572.1| pesticin [Yersinia pestis biovar Microtus str. 91001]' 891 892 In this example the two files contain 85 and 10 records respectively. 893 894 BGZF compressed files are supported, and detected automatically. Ordinary 895 GZIP compressed files are not supported. 896 897 See also: Bio.SeqIO.index() and Bio.SeqIO.to_dict() 898 """ 899 #Try and give helpful error messages: 900 if not isinstance(index_filename, basestring): 901 raise TypeError("Need a string for the index filename") 902 if isinstance(filenames, basestring): 903 #Make the API a little more friendly, and more similar 904 #to Bio.SeqIO.index(...) for indexing just one file. 905 filenames = [filenames] 906 if filenames is not None and not isinstance(filenames, list): 907 raise TypeError( 908 "Need a list of filenames (as strings), or one filename") 909 if format is not None and not isinstance(format, basestring): 910 raise TypeError("Need a string for the file format (lower case)") 911 if format and format != format.lower(): 912 raise ValueError("Format string '%s' should be lower case" % format) 913 if alphabet is not None and not (isinstance(alphabet, Alphabet) or 914 isinstance(alphabet, AlphabetEncoder)): 915 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 916 917 #Map the file format to a sequence iterator: 918 from ._index import _FormatToRandomAccess # Lazy import 919 from Bio.File import _SQLiteManySeqFilesDict 920 repr = "SeqIO.index_db(%r, filenames=%r, format=%r, alphabet=%r, key_function=%r)" \ 921 % (index_filename, filenames, format, alphabet, key_function) 922 923 def proxy_factory(format, filename=None): 924 """Given a filename returns proxy object, else boolean if format OK.""" 925 if filename: 926 return _FormatToRandomAccess[format](filename, format, alphabet) 927 else: 928 return format in _FormatToRandomAccess
929 930 return _SQLiteManySeqFilesDict(index_filename, filenames, 931 proxy_factory, format, 932 key_function, repr) 933 934
935 -def convert(in_file, in_format, out_file, out_format, alphabet=None):
936 """Convert between two sequence file formats, return number of records. 937 938 - in_file - an input handle or filename 939 - in_format - input file format, lower case string 940 - out_file - an output handle or filename 941 - out_format - output file format, lower case string 942 - alphabet - optional alphabet to assume 943 944 NOTE - If you provide an output filename, it will be opened which will 945 overwrite any existing file without warning. This may happen if even 946 the conversion is aborted (e.g. an invalid out_format name is given). 947 948 For example, going from a filename to a handle: 949 950 >>> from Bio import SeqIO 951 >>> try: 952 ... from StringIO import StringIO # Python 2 953 ... except ImportError: 954 ... from io import StringIO # Python 3 955 ... 956 >>> handle = StringIO("") 957 >>> SeqIO.convert("Quality/example.fastq", "fastq", handle, "fasta") 958 3 959 >>> print(handle.getvalue()) 960 >EAS54_6_R1_2_1_413_324 961 CCCTTCTTGTCTTCAGCGTTTCTCC 962 >EAS54_6_R1_2_1_540_792 963 TTGGCAGGCCAAGGCCGATGGATCA 964 >EAS54_6_R1_2_1_443_348 965 GTTGCTTCTGGCGTGGGTGGGGGGG 966 <BLANKLINE> 967 """ 968 #Hack for SFF, will need to make this more general in future 969 if in_format in _BinaryFormats: 970 in_mode = 'rb' 971 else: 972 in_mode = 'rU' 973 974 #Don't open the output file until we've checked the input is OK? 975 if out_format in ["sff", "sff_trim"]: 976 out_mode = 'wb' 977 else: 978 out_mode = 'w' 979 980 #This will check the arguments and issue error messages, 981 #after we have opened the file which is a shame. 982 from ._convert import _handle_convert # Lazy import 983 with as_handle(in_file, in_mode) as in_handle: 984 with as_handle(out_file, out_mode) as out_handle: 985 count = _handle_convert(in_handle, in_format, 986 out_handle, out_format, 987 alphabet) 988 return count
989 990 991 if __name__ == "__main__": 992 from Bio._utils import run_doctest 993 run_doctest() 994