From Biopython
Revision as of 20:48, 18 August 2007 by Maubp (Talk | contribs)

This page describes the SeqRecord object, which Biopython uses to hold a sequence (as a Seq object) together with its identifiers (ID and name), a description, and optionally annotation and sub-features.

Most of the sequence file format parsers in Biopython can return SeqRecord objects (and may also offer a format-specific record object). The new SeqIO system returns only SeqRecord objects.

Extracting information from a SeqRecord

Let's look more closely at the well-annotated SeqRecord objects Biopython creates from a GenBank file, such as [ls_orchid.gbk]. This file contains 94 records:

from Bio import SeqIO
# Loop over every record in the GenBank file, numbering them from zero
for index, record in enumerate(SeqIO.parse(open("ls_orchid.gbk"), "genbank")) :
    print "index %i, ID = %s, length %i, with %i features" \
          % (index, record.id, len(record.seq), len(record.features))

Here is some of the output. Remember that Python counts from zero, so the 94 records in this file are labelled 0 to 93:

index 0, ID = Z78533.1, length 740, with 5 features
index 1, ID = Z78532.1, length 753, with 5 features
index 2, ID = Z78531.1, length 748, with 5 features
...
index 92, ID = Z78440.1, length 744, with 5 features
index 93, ID = Z78439.1, length 592, with 5 features
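
The zero-based numbering comes from Python's built-in enumerate() function rather than from Biopython. As a minimal sketch, the plain list below stands in for the parsed records (it holds only the first three IDs from the output above), and shows that enumerate() also accepts a start value if you would rather count from one. It is written with print() calls so that it also runs on newer versions of Python:

```python
# Stand-in for the parsed records: just the first three IDs shown above
ids = ["Z78533.1", "Z78532.1", "Z78531.1"]

# Default behaviour: numbering starts at zero
zero_based = list(enumerate(ids))

# Passing a start value of 1 numbers the same items from one instead
one_based = list(enumerate(ids, 1))

for index, record_id in one_based:
    print("record %i, ID = %s" % (index, record_id))
```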

Let's look in a little more detail at the final record. Once the loop has finished, the variable record still refers to the last record read:

print record

That should give you a hint of the sort of information held in this object:

ID: Z78439.1
Name: Z78439
Description: P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
/source=Paphiopedilum barbatum
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', ..., 'Paphiopedilum']
/keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2']
/references=[<Bio.SeqFeature.Reference ...>, <Bio.SeqFeature.Reference ...>]
/organism=Paphiopedilum barbatum

Let's look a little more closely, and use Python's dir() function to find out more about the SeqRecord object and what it does:

print dir(record)

If you didn't already know, the dir() function returns a list of all the methods and properties of an object (as strings). Those whose names start with an underscore are "special" and we'll be ignoring them in this discussion. For a SeqRecord, you'll be shown the following:

[..., 'annotations', 'dbxrefs', 'description', 'features', 'id', 'name', 'seq']
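
If you want to hide the "special" names automatically, you can filter the list that dir() returns. The following is only a sketch: StandInRecord and public_names are hypothetical names invented for this illustration, and the stand-in class merely copies the public property names of a SeqRecord so that the example runs without a GenBank file.

```python
class StandInRecord(object):
    """Hypothetical stand-in with the same public property names as a SeqRecord."""
    def __init__(self):
        self.seq = "ACGT"  # placeholder only; a real SeqRecord holds a Seq object here
        self.id = "Z78439.1"
        self.name = "Z78439"
        self.description = "P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA."
        self.dbxrefs = []
        self.annotations = {}
        self.features = []

def public_names(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]

record = StandInRecord()
print(public_names(record))
```

The same helper should work unchanged on a real SeqRecord, although newer versions of Biopython may list additional names.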

We'll start with the seq property:

print record.seq

That should display the record's sequence, which is 592 letters long according to the listing above.

The seq property is a Seq object, another important object type in Biopython, and worthy of its own page in the wiki documentation.

The next three properties are all simple strings:

print record.id
print record.name
print record.description

That should give:

Z78439.1
Z78439
P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.

Have a look at the raw GenBank file to see where these came from.

Next, we'll check the dbxrefs property, which holds any database cross-references:

print record.dbxrefs

How about the annotations property? This is a Python dictionary:

print record.annotations
print record.annotations["source"]

That should give:

{'source': 'Paphiopedilum barbatum', 'taxonomy': ...}
Paphiopedilum barbatum

In this case, most of the values in the dictionary are simple strings, but this isn't always the case.
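
To see which values are not simple strings, you can loop over the dictionary and check each value's type. This is a sketch only: the dictionary below is a hand-made stand-in holding three of the entries shown in the printed record above, and describe is a hypothetical helper name. With a real record you would loop over record.annotations instead.

```python
# Hand-made stand-in for record.annotations, copied from the printed record above
annotations = {
    "source": "Paphiopedilum barbatum",
    "organism": "Paphiopedilum barbatum",
    "keywords": ["5.8S ribosomal RNA", "5.8S rRNA gene",
                 "internal transcribed spacer", "ITS1", "ITS2"],
}

def describe(value):
    """Give a short description of what kind of value an annotation holds."""
    if isinstance(value, str):
        return "string"
    if isinstance(value, (list, tuple)):
        return "list of %i items" % len(value)
    # Anything else: fall back to the name of its type
    return type(value).__name__

# Sort the keys so the report comes out in a predictable order
for key in sorted(annotations):
    print("%s: %s" % (key, describe(annotations[key])))
```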
