1
2
3
4
5
6
7
8
9 """Sequence input/output as SeqRecord objects.
10
11 Bio.SeqIO is also documented at U{http://biopython.org/wiki/SeqIO} and by
12 a whole chapter in our tutorial:
13 - U{http://biopython.org/DIST/docs/tutorial/Tutorial.html}
14 - U{http://biopython.org/DIST/docs/tutorial/Tutorial.pdf}
15
16 Input
17 =====
18 The main function is Bio.SeqIO.parse(...) which takes an input file handle
19 (or in recent versions of Biopython alternatively a filename as a string),
20 and format string. This returns an iterator giving SeqRecord objects:
21
22 >>> from Bio import SeqIO
23 >>> for record in SeqIO.parse("Fasta/f002", "fasta"):
24 ... print record.id, len(record)
25 gi|1348912|gb|G26680|G26680 633
26 gi|1348917|gb|G26685|G26685 413
27 gi|1592936|gb|G29385|G29385 471
28
29 Note that the parse() function will invoke the relevant parser for the
30 format with its default settings. You may want more control, in which case
31 you need to create a format specific sequence iterator directly.
32
33 Input - Single Records
34 ======================
35 If you expect your file to contain one-and-only-one record, then we provide
36 the following 'helper' function which will return a single SeqRecord, or
37 raise an exception if there are no records or more than one record:
38
39 >>> from Bio import SeqIO
40 >>> record = SeqIO.read("Fasta/f001", "fasta")
41 >>> print record.id, len(record)
42 gi|3318709|pdb|1A91| 79
43
44 This style is useful when you expect a single record only (and would
45 consider multiple records an error). For example, when dealing with GenBank
46 files for bacterial genomes or chromosomes, there is normally only a single
47 record. Alternatively, use this with a handle when downloading a single
48 record from the internet.
49
50 However, if you just want the first record from a file containing multiple
51 record, use the iterator's next() method:
52
53 >>> from Bio import SeqIO
54 >>> record = SeqIO.parse("Fasta/f002", "fasta").next()
55 >>> print record.id, len(record)
56 gi|1348912|gb|G26680|G26680 633
57
58 The above code will work as long as the file contains at least one record.
59 Note that if there is more than one record, the remaining records will be
60 silently ignored.
61
62
63 Input - Multiple Records
64 ========================
65 For non-interlaced files (e.g. Fasta, GenBank, EMBL) with multiple records
66 using a sequence iterator can save you a lot of memory (RAM). There is
67 less benefit for interlaced file formats (e.g. most multiple alignment file
68 formats). However, an iterator only lets you access the records one by one.
69
70 If you want random access to the records by number, turn this into a list:
71
72 >>> from Bio import SeqIO
73 >>> records = list(SeqIO.parse("Fasta/f002", "fasta"))
74 >>> len(records)
75 3
76 >>> print records[1].id
77 gi|1348917|gb|G26685|G26685
78
79 If you want random access to the records by a key such as the record id,
80 turn the iterator into a dictionary:
81
82 >>> from Bio import SeqIO
83 >>> record_dict = SeqIO.to_dict(SeqIO.parse("Fasta/f002", "fasta"))
84 >>> len(record_dict)
85 3
86 >>> print len(record_dict["gi|1348917|gb|G26685|G26685"])
87 413
88
89 However, using list() or the to_dict() function will load all the records
90 into memory at once, and therefore is not possible on very large files.
91 Instead, for *some* file formats Bio.SeqIO provides an indexing approach
92 providing dictionary like access to any record. For example,
93
94 >>> from Bio import SeqIO
95 >>> record_dict = SeqIO.index("Fasta/f002", "fasta")
96 >>> len(record_dict)
97 3
98 >>> print len(record_dict["gi|1348917|gb|G26685|G26685"])
99 413
100
101 Many but not all of the supported input file formats can be indexed like
102 this. For example "fasta", "fastq", "qual" and even the binary format "sff"
103 work, but alignment formats like "phylip", "clustalw" and "nexus" will not.
104
105 In most cases you can also use SeqIO.index to get the record from the file
106 as a raw string (not a SeqRecord). This can be useful for example to extract
107 a sub-set of records from a file where SeqIO cannot output the file format
108 (e.g. the plain text SwissProt format, "swiss") or where it is important to
109 keep the output 100% identical to the input). For example,
110
111 >>> from Bio import SeqIO
112 >>> record_dict = SeqIO.index("Fasta/f002", "fasta")
113 >>> len(record_dict)
114 3
115 >>> print record_dict.get_raw("gi|1348917|gb|G26685|G26685").decode()
116 >gi|1348917|gb|G26685|G26685 human STS STS_D11734.
117 CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATCTTACTCTTTC
118 ATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTTTACAGATGTGAAACTTTCAA
119 GAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAAACCTGATGCTTTTATAAGCCATTGTGATTA
120 GGATGACTGTTACAGGCTTAGCTTTGTGTGAAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGT
121 TCATATTACTNTAAGTTCTATAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGG
122 AATTGNTTTGCCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA
123 <BLANKLINE>
124 >>> print record_dict["gi|1348917|gb|G26685|G26685"].format("fasta")
125 >gi|1348917|gb|G26685|G26685 human STS STS_D11734.
126 CGGAGCCAGCGAGCATATGCTGCATGAGGACCTTTCTATCTTACATTATGGCTGGGAATC
127 TTACTCTTTCATCTGATACCTTGTTCAGATTTCAAAATAGTTGTAGCCTTATCCTGGTTT
128 TACAGATGTGAAACTTTCAAGAGATTTACTGACTTTCCTAGAATAGTTTCTCTACTGGAA
129 ACCTGATGCTTTTATAAGCCATTGTGATTAGGATGACTGTTACAGGCTTAGCTTTGTGTG
130 AAANCCAGTCACCTTTCTCCTAGGTAATGAGTAGTGCTGTTCATATTACTNTAAGTTCTA
131 TAGCATACTTGCNATCCTTTANCCATGCTTATCATANGTACCATTTGAGGAATTGNTTTG
132 CCCTTTTGGGTTTNTTNTTGGTAAANNNTTCCCGGGTGGGGGNGGTNNNGAAA
133 <BLANKLINE>
134
135 Here the original file and what Biopython would output differ in the line
136 wrapping. Also note that under Python 3, the get_raw method will return a
137 bytes string, hence the use of decode to turn it into a (unicode) string.
138 This is uncessary on Python 2.
139
140
141 Input - Alignments
142 ==================
143 You can read in alignment files as alignment objects using Bio.AlignIO.
144 Alternatively, reading in an alignment file format via Bio.SeqIO will give
145 you a SeqRecord for each row of each alignment:
146
147 >>> from Bio import SeqIO
148 >>> for record in SeqIO.parse("Clustalw/hedgehog.aln", "clustal"):
149 ... print record.id, len(record)
150 gi|167877390|gb|EDS40773.1| 447
151 gi|167234445|ref|NP_001107837. 447
152 gi|74100009|gb|AAZ99217.1| 447
153 gi|13990994|dbj|BAA33523.2| 447
154 gi|56122354|gb|AAV74328.1| 447
155
156 Output
157 ======
158 Use the function Bio.SeqIO.write(...), which takes a complete set of
159 SeqRecord objects (either as a list, or an iterator), an output file handle
160 (or in recent versions of Biopython an output filename as a string) and of
161 course the file format::
162
163 from Bio import SeqIO
164 records = ...
165 SeqIO.write(records, "example.faa", "fasta")
166
167 Or, using a handle::
168
169 from Bio import SeqIO
170 records = ...
171 handle = open("example.faa", "w")
172 SeqIO.write(records, handle, "fasta")
173 handle.close()
174
175 You are expected to call this function once (with all your records) and if
176 using a handle, make sure you close it to flush the data to the hard disk.
177
178
179 Output - Advanced
180 =================
181 The effect of calling write() multiple times on a single file will vary
182 depending on the file format, and is best avoided unless you have a strong
183 reason to do so.
184
185 If you give a filename, then each time you call write() the existing file
186 will be overwriten. For sequential files formats (e.g. fasta, genbank) each
187 "record block" holds a single sequence. For these files it would probably
188 be safe to call write() multiple times by re-using the same handle.
189
190
191 However, trying this for certain alignment formats (e.g. phylip, clustal,
192 stockholm) would have the effect of concatenating several multiple sequence
193 alignments together. Such files are created by the PHYLIP suite of programs
194 for bootstrap analysis, but it is clearer to do this via Bio.AlignIO instead.
195
196
197 Conversion
198 ==========
199 The Bio.SeqIO.convert(...) function allows an easy interface for simple
200 file format conversions. Additionally, it may use file format specific
201 optimisations so this should be the fastest way too.
202
203 In general however, you can combine the Bio.SeqIO.parse(...) function with
204 the Bio.SeqIO.write(...) function for sequence file conversion. Using
205 generator expressions or generator functions provides a memory efficient way
206 to perform filtering or other extra operations as part of the process.
207
208
209 File Formats
210 ============
211 When specifying the file format, use lowercase strings. The same format
212 names are also used in Bio.AlignIO and include the following:
213
214 - abif - Applied Biosystem's sequencing trace format
215 - ace - Reads the contig sequences from an ACE assembly file.
216 - embl - The EMBL flat file format. Uses Bio.GenBank internally.
217 - fasta - The generic sequence file format where each record starts with
218 an identifer line starting with a ">" character, followed by
219 lines of sequence.
220 - fastq - A "FASTA like" format used by Sanger which also stores PHRED
221 sequence quality values (with an ASCII offset of 33).
222 - fastq-sanger - An alias for "fastq" for consistency with BioPerl and EMBOSS
223 - fastq-solexa - Original Solexa/Illumnia variant of the FASTQ format which
224 encodes Solexa quality scores (not PHRED quality scores) with an
225 ASCII offset of 64.
226 - fastq-illumina - Solexa/Illumina 1.3 to 1.7 variant of the FASTQ format
227 which encodes PHRED quality scores with an ASCII offset of 64
228 (not 33). Note as of version 1.8 of the CASAVA pipeline Illumina
229 will produce FASTQ files using the standard Sanger encoding.
230 - genbank - The GenBank or GenPept flat file format.
231 - gb - An alias for "genbank", for consistency with NCBI Entrez Utilities
232 - ig - The IntelliGenetics file format, apparently the same as the
233 MASE alignment format.
234 - imgt - An EMBL like format from IMGT where the feature tables are more
235 indented to allow for longer feature types.
236 - phd - Output from PHRED, used by PHRAP and CONSED for input.
237 - pir - A "FASTA like" format introduced by the National Biomedical
238 Research Foundation (NBRF) for the Protein Information Resource
239 (PIR) database, now part of UniProt.
240 - seqxml - SeqXML, simple XML format described in Schmitt et al (2011).
241 - sff - Standard Flowgram Format (SFF), typical output from Roche 454.
242 - sff-trim - Standard Flowgram Format (SFF) with given trimming applied.
243 - swiss - Plain text Swiss-Prot aka UniProt format.
244 - tab - Simple two column tab separated sequence files, where each
245 line holds a record's identifier and sequence. For example,
246 this is used as by Aligent's eArray software when saving
247 microarray probes in a minimal tab delimited text file.
248 - qual - A "FASTA like" format holding PHRED quality values from
249 sequencing DNA, but no actual sequences (usually provided
250 in separate FASTA files).
251 - uniprot-xml - The UniProt XML format (replacement for the SwissProt plain
252 text format which we call "swiss")
253
254 Note that while Bio.SeqIO can read all the above file formats, it cannot
255 write to all of them.
256
257 You can also use any file format supported by Bio.AlignIO, such as "nexus",
258 "phlip" and "stockholm", which gives you access to the individual sequences
259 making up each alignment as SeqRecords.
260 """
261
262
263 from __future__ import with_statement
264
265 __docformat__ = "epytext en"
266
267
268
269
270
271
272
273
274
275
276
277
278 """
279 FAO BioPython Developers
280 ========================
281 The way I envision this SeqIO system working as that for any sequence file
282 format we have an iterator that returns SeqRecord objects.
283
284 This also applies to interlaced fileformats (like clustal - although that
285 is now handled via Bio.AlignIO instead) where the file cannot be read record
286 by record. You should still return an iterator, even if the implementation
287 could just as easily return a list.
288
289 These file format specific sequence iterators may be implemented as:
290 * Classes which take a handle for __init__ and provide the __iter__ method
291 * Functions that take a handle, and return an iterator object
292 * Generator functions that take a handle, and yield SeqRecord objects
293
294 It is then trivial to turn this iterator into a list of SeqRecord objects,
295 an in memory dictionary, or a multiple sequence alignment object.
296
297 For building the dictionary by default the id propery of each SeqRecord is
298 used as the key. You should always populate the id property, and it should
299 be unique in most cases. For some file formats the accession number is a good
300 choice. If the file itself contains ambiguous identifiers, don't try and
301 dis-ambiguate them - return them as is.
302
303 When adding a new file format, please use the same lower case format name
304 as BioPerl, or if they have not defined one, try the names used by EMBOSS.
305
306 See also http://biopython.org/wiki/SeqIO_dev
307
308 --Peter
309 """
310
311
312 from Bio.File import as_handle
313 from Bio.SeqRecord import SeqRecord
314 from Bio.Align import MultipleSeqAlignment
315 from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet
316
317 import AbiIO
318 import AceIO
319 import FastaIO
320 import IgIO
321 import InsdcIO
322 import PdbIO
323 import PhdIO
324 import PirIO
325 import SeqXmlIO
326 import SffIO
327 import SwissIO
328 import TabIO
329 import QualityIO
330 import UniprotIO
331
332
333
334
335
336
337
338
339
340
341
342 _FormatToIterator = {"fasta": FastaIO.FastaIterator,
343 "gb": InsdcIO.GenBankIterator,
344 "genbank": InsdcIO.GenBankIterator,
345 "genbank-cds": InsdcIO.GenBankCdsFeatureIterator,
346 "embl": InsdcIO.EmblIterator,
347 "embl-cds": InsdcIO.EmblCdsFeatureIterator,
348 "imgt": InsdcIO.ImgtIterator,
349 "ig": IgIO.IgIterator,
350 "swiss": SwissIO.SwissIterator,
351 "pdb-atom": PdbIO.PdbAtomIterator,
352 "pdb-seqres": PdbIO.PdbSeqresIterator,
353 "phd": PhdIO.PhdIterator,
354 "ace": AceIO.AceIterator,
355 "tab": TabIO.TabIterator,
356 "pir": PirIO.PirIterator,
357 "fastq": QualityIO.FastqPhredIterator,
358 "fastq-sanger": QualityIO.FastqPhredIterator,
359 "fastq-solexa": QualityIO.FastqSolexaIterator,
360 "fastq-illumina": QualityIO.FastqIlluminaIterator,
361 "qual": QualityIO.QualPhredIterator,
362 "sff": SffIO.SffIterator,
363
364 "sff-trim": SffIO._SffTrimIterator,
365 "uniprot-xml": UniprotIO.UniprotIterator,
366 "seqxml": SeqXmlIO.SeqXmlIterator,
367 "abi": AbiIO.AbiIterator,
368 "abi-trim": AbiIO._AbiTrimIterator,
369 }
370
371 _FormatToWriter = {"fasta": FastaIO.FastaWriter,
372 "gb": InsdcIO.GenBankWriter,
373 "genbank": InsdcIO.GenBankWriter,
374 "embl": InsdcIO.EmblWriter,
375 "imgt": InsdcIO.ImgtWriter,
376 "tab": TabIO.TabWriter,
377 "fastq": QualityIO.FastqPhredWriter,
378 "fastq-sanger": QualityIO.FastqPhredWriter,
379 "fastq-solexa": QualityIO.FastqSolexaWriter,
380 "fastq-illumina": QualityIO.FastqIlluminaWriter,
381 "phd": PhdIO.PhdWriter,
382 "qual": QualityIO.QualPhredWriter,
383 "sff": SffIO.SffWriter,
384 "seqxml": SeqXmlIO.SeqXmlWriter,
385 }
386
387 _BinaryFormats = ["sff", "sff-trim", "abi", "abi-trim"]
388
389
390 -def write(sequences, handle, format):
391 """Write complete set of sequences to a file.
392
393 - sequences - A list (or iterator) of SeqRecord objects, or (if using
394 Biopython 1.54 or later) a single SeqRecord.
395 - handle - File handle object to write to, or filename as string
396 (note older versions of Biopython only took a handle).
397 - format - lower case string describing the file format to write.
398
399 You should close the handle after calling this function.
400
401 Returns the number of records written (as an integer).
402 """
403 from Bio import AlignIO
404
405
406 if not isinstance(format, basestring):
407 raise TypeError("Need a string for the file format (lower case)")
408 if not format:
409 raise ValueError("Format required (lower case string)")
410 if format != format.lower():
411 raise ValueError("Format string '%s' should be lower case" % format)
412
413 if isinstance(sequences, SeqRecord):
414
415 sequences = [sequences]
416
417 if format in _BinaryFormats:
418 mode = 'wb'
419 else:
420 mode = 'w'
421
422 with as_handle(handle, mode) as fp:
423
424 if format in _FormatToWriter:
425 writer_class = _FormatToWriter[format]
426 count = writer_class(fp).write_file(sequences)
427 elif format in AlignIO._FormatToWriter:
428
429
430 alignment = MultipleSeqAlignment(sequences)
431 alignment_count = AlignIO.write([alignment], fp, format)
432 assert alignment_count == 1, \
433 "Internal error - the underlying writer " \
434 " should have returned 1, not %s" % repr(alignment_count)
435 count = len(alignment)
436 del alignment_count, alignment
437 elif format in _FormatToIterator or format in AlignIO._FormatToIterator:
438 raise ValueError("Reading format '%s' is supported, but not writing"
439 % format)
440 else:
441 raise ValueError("Unknown format '%s'" % format)
442
443 assert isinstance(count, int), "Internal error - the underlying %s " \
444 "writer should have returned the record count, not %s" \
445 % (format, repr(count))
446
447 return count
448
449
450 -def parse(handle, format, alphabet=None):
451 r"""Turns a sequence file into an iterator returning SeqRecords.
452
453 - handle - handle to the file, or the filename as a string
454 (note older versions of Biopython only took a handle).
455 - format - lower case string describing the file format.
456 - alphabet - optional Alphabet object, useful when the sequence type
457 cannot be automatically inferred from the file itself
458 (e.g. format="fasta" or "tab")
459
460 Typical usage, opening a file to read in, and looping over the record(s):
461
462 >>> from Bio import SeqIO
463 >>> filename = "Fasta/sweetpea.nu"
464 >>> for record in SeqIO.parse(filename, "fasta"):
465 ... print "ID", record.id
466 ... print "Sequence length", len(record)
467 ... print "Sequence alphabet", record.seq.alphabet
468 ID gi|3176602|gb|U78617.1|LOU78617
469 Sequence length 309
470 Sequence alphabet SingleLetterAlphabet()
471
472 For file formats like FASTA where the alphabet cannot be determined, it
473 may be useful to specify the alphabet explicitly:
474
475 >>> from Bio import SeqIO
476 >>> from Bio.Alphabet import generic_dna
477 >>> filename = "Fasta/sweetpea.nu"
478 >>> for record in SeqIO.parse(filename, "fasta", generic_dna):
479 ... print "ID", record.id
480 ... print "Sequence length", len(record)
481 ... print "Sequence alphabet", record.seq.alphabet
482 ID gi|3176602|gb|U78617.1|LOU78617
483 Sequence length 309
484 Sequence alphabet DNAAlphabet()
485
486 If you have a string 'data' containing the file contents, you must
487 first turn this into a handle in order to parse it:
488
489 >>> data = ">Alpha\nACCGGATGTA\n>Beta\nAGGCTCGGTTA\n"
490 >>> from Bio import SeqIO
491 >>> from StringIO import StringIO
492 >>> for record in SeqIO.parse(StringIO(data), "fasta"):
493 ... print record.id, record.seq
494 Alpha ACCGGATGTA
495 Beta AGGCTCGGTTA
496
497 Use the Bio.SeqIO.read(...) function when you expect a single record
498 only.
499 """
500
501
502
503 from Bio import AlignIO
504
505
506 if format in _BinaryFormats:
507 mode = 'rb'
508 else:
509 mode = 'rU'
510
511
512 if not isinstance(format, basestring):
513 raise TypeError("Need a string for the file format (lower case)")
514 if not format:
515 raise ValueError("Format required (lower case string)")
516 if format != format.lower():
517 raise ValueError("Format string '%s' should be lower case" % format)
518 if alphabet is not None and not (isinstance(alphabet, Alphabet) or
519 isinstance(alphabet, AlphabetEncoder)):
520 raise ValueError("Invalid alphabet, %s" % repr(alphabet))
521
522 with as_handle(handle, mode) as fp:
523
524 if format in _FormatToIterator:
525 iterator_generator = _FormatToIterator[format]
526 if alphabet is None:
527 i = iterator_generator(fp)
528 else:
529 try:
530 i = iterator_generator(fp, alphabet=alphabet)
531 except TypeError:
532 i = _force_alphabet(iterator_generator(fp), alphabet)
533 elif format in AlignIO._FormatToIterator:
534
535 i = (r for alignment in AlignIO.parse(fp, format,
536 alphabet=alphabet)
537 for r in alignment)
538 else:
539 raise ValueError("Unknown format '%s'" % format)
540
541 for r in i:
542 yield r
543
544
558
559
560 -def read(handle, format, alphabet=None):
561 """Turns a sequence file into a single SeqRecord.
562
563 - handle - handle to the file, or the filename as a string
564 (note older versions of Biopython only took a handle).
565 - format - string describing the file format.
566 - alphabet - optional Alphabet object, useful when the sequence type
567 cannot be automatically inferred from the file itself
568 (e.g. format="fasta" or "tab")
569
570 This function is for use parsing sequence files containing
571 exactly one record. For example, reading a GenBank file:
572
573 >>> from Bio import SeqIO
574 >>> record = SeqIO.read("GenBank/arab1.gb", "genbank")
575 >>> print "ID", record.id
576 ID AC007323.5
577 >>> print "Sequence length", len(record)
578 Sequence length 86436
579 >>> print "Sequence alphabet", record.seq.alphabet
580 Sequence alphabet IUPACAmbiguousDNA()
581
582 If the handle contains no records, or more than one record,
583 an exception is raised. For example:
584
585 >>> from Bio import SeqIO
586 >>> record = SeqIO.read("GenBank/cor6_6.gb", "genbank")
587 Traceback (most recent call last):
588 ...
589 ValueError: More than one record found in handle
590
591 If however you want the first record from a file containing
592 multiple records this function would raise an exception (as
593 shown in the example above). Instead use:
594
595 >>> from Bio import SeqIO
596 >>> record = SeqIO.parse("GenBank/cor6_6.gb", "genbank").next()
597 >>> print "First record's ID", record.id
598 First record's ID X55053.1
599
600 Use the Bio.SeqIO.parse(handle, format) function if you want
601 to read multiple records from the handle.
602 """
603 iterator = parse(handle, format, alphabet)
604 try:
605 first = iterator.next()
606 except StopIteration:
607 first = None
608 if first is None:
609 raise ValueError("No records found in handle")
610 try:
611 second = iterator.next()
612 except StopIteration:
613 second = None
614 if second is not None:
615 raise ValueError("More than one record found in handle")
616 return first
617
618
619 -def to_dict(sequences, key_function=None):
620 """Turns a sequence iterator or list into a dictionary.
621
622 - sequences - An iterator that returns SeqRecord objects,
623 or simply a list of SeqRecord objects.
624 - key_function - Optional callback function which when given a
625 SeqRecord should return a unique key for the dictionary.
626
627 e.g. key_function = lambda rec : rec.name
628 or, key_function = lambda rec : rec.description.split()[0]
629
630 If key_function is omitted then record.id is used, on the assumption
631 that the records objects returned are SeqRecords with a unique id.
632
633 If there are duplicate keys, an error is raised.
634
635 Example usage, defaulting to using the record.id as key:
636
637 >>> from Bio import SeqIO
638 >>> filename = "GenBank/cor6_6.gb"
639 >>> format = "genbank"
640 >>> id_dict = SeqIO.to_dict(SeqIO.parse(filename, format))
641 >>> print sorted(id_dict)
642 ['AF297471.1', 'AJ237582.1', 'L31939.1', 'M81224.1', 'X55053.1', 'X62281.1']
643 >>> print id_dict["L31939.1"].description
644 Brassica rapa (clone bif72) kin mRNA, complete cds.
645
646 A more complex example, using the key_function argument in order to
647 use a sequence checksum as the dictionary key:
648
649 >>> from Bio import SeqIO
650 >>> from Bio.SeqUtils.CheckSum import seguid
651 >>> filename = "GenBank/cor6_6.gb"
652 >>> format = "genbank"
653 >>> seguid_dict = SeqIO.to_dict(SeqIO.parse(filename, format),
654 ... key_function = lambda rec : seguid(rec.seq))
655 >>> for key, record in sorted(seguid_dict.iteritems()):
656 ... print key, record.id
657 /wQvmrl87QWcm9llO4/efg23Vgg AJ237582.1
658 BUg6YxXSKWEcFFH0L08JzaLGhQs L31939.1
659 SabZaA4V2eLE9/2Fm5FnyYy07J4 X55053.1
660 TtWsXo45S3ZclIBy4X/WJc39+CY M81224.1
661 l7gjJFE6W/S1jJn5+1ASrUKW/FA X62281.1
662 uVEYeAQSV5EDQOnFoeMmVea+Oow AF297471.1
663
664 This approach is not suitable for very large sets of sequences, as all
665 the SeqRecord objects are held in memory. Instead, consider using the
666 Bio.SeqIO.index() function (if it supports your particular file format).
667 """
668 if key_function is None:
669 key_function = lambda rec: rec.id
670
671 d = dict()
672 for record in sequences:
673 key = key_function(record)
674 if key in d:
675 raise ValueError("Duplicate key '%s'" % key)
676 d[key] = record
677 return d
678
679
680 -def index(filename, format, alphabet=None, key_function=None):
681 """Indexes a sequence file and returns a dictionary like object.
682
683 - filename - string giving name of file to be indexed
684 - format - lower case string describing the file format
685 - alphabet - optional Alphabet object, useful when the sequence type
686 cannot be automatically inferred from the file itself
687 (e.g. format="fasta" or "tab")
688 - key_function - Optional callback function which when given a
689 SeqRecord identifier string should return a unique
690 key for the dictionary.
691
692 This indexing function will return a dictionary like object, giving the
693 SeqRecord objects as values:
694
695 >>> from Bio import SeqIO
696 >>> records = SeqIO.index("Quality/example.fastq", "fastq")
697 >>> len(records)
698 3
699 >>> sorted(records)
700 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792']
701 >>> print records["EAS54_6_R1_2_1_540_792"].format("fasta")
702 >EAS54_6_R1_2_1_540_792
703 TTGGCAGGCCAAGGCCGATGGATCA
704 <BLANKLINE>
705 >>> "EAS54_6_R1_2_1_540_792" in records
706 True
707 >>> print records.get("Missing", None)
708 None
709
710 If the file is BGZF compressed, this is detected automatically. Ordinary
711 GZIP files are not supported:
712
713 >>> from Bio import SeqIO
714 >>> records = SeqIO.index("Quality/example.fastq.bgz", "fastq")
715 >>> len(records)
716 3
717 >>> print records["EAS54_6_R1_2_1_540_792"].seq
718 TTGGCAGGCCAAGGCCGATGGATCA
719
720 Note that this pseudo dictionary will not support all the methods of a
721 true Python dictionary, for example values() is not defined since this
722 would require loading all of the records into memory at once.
723
724 When you call the index function, it will scan through the file, noting
725 the location of each record. When you access a particular record via the
726 dictionary methods, the code will jump to the appropriate part of the
727 file and then parse that section into a SeqRecord.
728
729 Note that not all the input formats supported by Bio.SeqIO can be used
730 with this index function. It is designed to work only with sequential
731 file formats (e.g. "fasta", "gb", "fastq") and is not suitable for any
732 interlaced file format (e.g. alignment formats such as "clustal").
733
734 For small files, it may be more efficient to use an in memory Python
735 dictionary, e.g.
736
737 >>> from Bio import SeqIO
738 >>> records = SeqIO.to_dict(SeqIO.parse(open("Quality/example.fastq"), "fastq"))
739 >>> len(records)
740 3
741 >>> sorted(records)
742 ['EAS54_6_R1_2_1_413_324', 'EAS54_6_R1_2_1_443_348', 'EAS54_6_R1_2_1_540_792']
743 >>> print records["EAS54_6_R1_2_1_540_792"].format("fasta")
744 >EAS54_6_R1_2_1_540_792
745 TTGGCAGGCCAAGGCCGATGGATCA
746 <BLANKLINE>
747
748 As with the to_dict() function, by default the id string of each record
749 is used as the key. You can specify a callback function to transform
750 this (the record identifier string) into your preferred key. For example:
751
752 >>> from Bio import SeqIO
753 >>> def make_tuple(identifier):
754 ... parts = identifier.split("_")
755 ... return int(parts[-2]), int(parts[-1])
756 >>> records = SeqIO.index("Quality/example.fastq", "fastq",
757 ... key_function=make_tuple)
758 >>> len(records)
759 3
760 >>> sorted(records)
761 [(413, 324), (443, 348), (540, 792)]
762 >>> print records[(540, 792)].format("fasta")
763 >EAS54_6_R1_2_1_540_792
764 TTGGCAGGCCAAGGCCGATGGATCA
765 <BLANKLINE>
766 >>> (540, 792) in records
767 True
768 >>> "EAS54_6_R1_2_1_540_792" in records
769 False
770 >>> print records.get("Missing", None)
771 None
772
773 Another common use case would be indexing an NCBI style FASTA file,
774 where you might want to extract the GI number from the FASTA identifer
775 to use as the dictionary key.
776
777 Notice that unlike the to_dict() function, here the key_function does
778 not get given the full SeqRecord to use to generate the key. Doing so
779 would impose a severe performance penalty as it would require the file
780 to be completely parsed while building the index. Right now this is
781 usually avoided.
782
783 See also: Bio.SeqIO.index_db() and Bio.SeqIO.to_dict()
784 """
785
786 if not isinstance(filename, basestring):
787 raise TypeError("Need a filename (not a handle)")
788 if not isinstance(format, basestring):
789 raise TypeError("Need a string for the file format (lower case)")
790 if not format:
791 raise ValueError("Format required (lower case string)")
792 if format != format.lower():
793 raise ValueError("Format string '%s' should be lower case" % format)
794 if alphabet is not None and not (isinstance(alphabet, Alphabet) or
795 isinstance(alphabet, AlphabetEncoder)):
796 raise ValueError("Invalid alphabet, %s" % repr(alphabet))
797
798
799 from _index import _FormatToRandomAccess
800 from Bio.File import _IndexedSeqFileDict
801 try:
802 proxy_class = _FormatToRandomAccess[format]
803 except KeyError:
804 raise ValueError("Unsupported format %r" % format)
805 repr = "SeqIO.index(%r, %r, alphabet=%r, key_function=%r)" \
806 % (filename, format, alphabet, key_function)
807 return _IndexedSeqFileDict(proxy_class(filename, format, alphabet),
808 key_function, repr, "SeqRecord")
809
810
811 -def index_db(index_filename, filenames=None, format=None, alphabet=None,
812 key_function=None):
813 """Index several sequence files and return a dictionary like object.
814
815 The index is stored in an SQLite database rather than in memory (as in the
816 Bio.SeqIO.index(...) function).
817
818 - index_filename - Where to store the SQLite index
819 - filenames - list of strings specifying file(s) to be indexed, or when
820 indexing a single file this can be given as a string.
821 (optional if reloading an existing index, but must match)
822 - format - lower case string describing the file format
823 (optional if reloading an existing index, but must match)
824 - alphabet - optional Alphabet object, useful when the sequence type
825 cannot be automatically inferred from the file itself
826 (e.g. format="fasta" or "tab")
827 - key_function - Optional callback function which when given a
828 SeqRecord identifier string should return a unique
829 key for the dictionary.
830
831 This indexing function will return a dictionary like object, giving the
832 SeqRecord objects as values:
833
834 >>> from Bio.Alphabet import generic_protein
835 >>> from Bio import SeqIO
836 >>> files = ["GenBank/NC_000932.faa", "GenBank/NC_005816.faa"]
837 >>> def get_gi(name):
838 ... parts = name.split("|")
839 ... i = parts.index("gi")
840 ... assert i != -1
841 ... return parts[i+1]
842 >>> idx_name = ":memory:" #use an in memory SQLite DB for this test
843 >>> records = SeqIO.index_db(idx_name, files, "fasta", generic_protein, get_gi)
844 >>> len(records)
845 95
846 >>> records["7525076"].description
847 'gi|7525076|ref|NP_051101.1| Ycf2 [Arabidopsis thaliana]'
848 >>> records["45478717"].description
849 'gi|45478717|ref|NP_995572.1| pesticin [Yersinia pestis biovar Microtus str. 91001]'
850
851 In this example the two files contain 85 and 10 records respectively.
852
853 BGZF compressed files are supported, and detected automatically. Ordinary
854 GZIP compressed files are not supported.
855
856 See also: Bio.SeqIO.index() and Bio.SeqIO.to_dict()
857 """
858
859 if not isinstance(index_filename, basestring):
860 raise TypeError("Need a string for the index filename")
861 if isinstance(filenames, basestring):
862
863
864 filenames = [filenames]
865 if filenames is not None and not isinstance(filenames, list):
866 raise TypeError(
867 "Need a list of filenames (as strings), or one filename")
868 if format is not None and not isinstance(format, basestring):
869 raise TypeError("Need a string for the file format (lower case)")
870 if format and format != format.lower():
871 raise ValueError("Format string '%s' should be lower case" % format)
872 if alphabet is not None and not (isinstance(alphabet, Alphabet) or
873 isinstance(alphabet, AlphabetEncoder)):
874 raise ValueError("Invalid alphabet, %s" % repr(alphabet))
875
876
877 from _index import _FormatToRandomAccess
878 from Bio.File import _SQLiteManySeqFilesDict
879 repr = "SeqIO.index_db(%r, filenames=%r, format=%r, alphabet=%r, key_function=%r)" \
880 % (index_filename, filenames, format, alphabet, key_function)
881
882 def proxy_factory(format, filename=None):
883 """Given a filename returns proxy object, else boolean if format OK."""
884 if filename:
885 return _FormatToRandomAccess[format](filename, format, alphabet)
886 else:
887 return format in _FormatToRandomAccess
888
889 return _SQLiteManySeqFilesDict(index_filename, filenames,
890 proxy_factory, format,
891 key_function, repr)
892
893
894 -def convert(in_file, in_format, out_file, out_format, alphabet=None):
895 """Convert between two sequence file formats, return number of records.
896
897 - in_file - an input handle or filename
898 - in_format - input file format, lower case string
899 - out_file - an output handle or filename
900 - out_format - output file format, lower case string
901 - alphabet - optional alphabet to assume
902
903 NOTE - If you provide an output filename, it will be opened which will
904 overwrite any existing file without warning. This may happen if even
905 the conversion is aborted (e.g. an invalid out_format name is given).
906
907 For example, going from a filename to a handle:
908
909 >>> from Bio import SeqIO
910 >>> from StringIO import StringIO
911 >>> handle = StringIO("")
912 >>> SeqIO.convert("Quality/example.fastq", "fastq", handle, "fasta")
913 3
914 >>> print handle.getvalue()
915 >EAS54_6_R1_2_1_413_324
916 CCCTTCTTGTCTTCAGCGTTTCTCC
917 >EAS54_6_R1_2_1_540_792
918 TTGGCAGGCCAAGGCCGATGGATCA
919 >EAS54_6_R1_2_1_443_348
920 GTTGCTTCTGGCGTGGGTGGGGGGG
921 <BLANKLINE>
922 """
923
924 if in_format in _BinaryFormats:
925 in_mode = 'rb'
926 else:
927 in_mode = 'rU'
928
929
930 if out_format in ["sff", "sff_trim"]:
931 out_mode = 'wb'
932 else:
933 out_mode = 'w'
934
935
936
937 from _convert import _handle_convert
938 with as_handle(in_file, in_mode) as in_handle:
939 with as_handle(out_file, out_mode) as out_handle:
940 count = _handle_convert(in_handle, in_format,
941 out_handle, out_format,
942 alphabet)
943 return count
944
945
946 if __name__ == "__main__":
947 from Bio._utils import run_doctest
948 run_doctest()
949