(Difference between revisions)
Jump to: navigation, search
(Creating a SeqRecord object)
m (Extracting information from a SeqRecord: note on dbxrefs)
Line 108: Line 108:
An empty list? Disappointing...
An empty list? Disappointing... if we'd used a more recent GenBank file the genome sequencing project reference would show up here.
How about the '''annotations''' property? This is a python dictionary...
How about the '''annotations''' property? This is a python dictionary...

Revision as of 17:37, 9 April 2009

This page describes the SeqRecord object used in BioPython to hold a sequence (as a Seq object) with identifiers (ID and name), description and optionally annotation and sub-features.

Most of the sequence file format parsers in BioPython can return SeqRecord objects (and may offer a format specific record object too, see for example Bio.SwissProt). The SeqIO system will only return SeqRecord objects.

There is more information in the Tutorial (PDF), and the SeqIO page is also very relevant.

Creating a SeqRecord object

Most of the time you'll create SeqRecord objects by parsing a sequence file with Bio.SeqIO. However, it is useful to know how to create a SeqRecord directly. For example,

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
                   id="YP_025292.1", name="HokC",
                   description="toxic membrane protein, small")
print record

This would give the following output:

   ID: YP_025292.1
   Name: HokC
   Description: toxic membrane protein, small
   Number of features: 0

Extracting information from a SeqRecord

Lets look in closer detail at the well annotated SeqRecord objects Biopython creates from a GenBank file, such as ls_orchid.gbk, which we'll load using the SeqIO module. This file contains 94 records:

from Bio import SeqIO
for index, record in enumerate(SeqIO.parse(open("ls_orchid.gbk"), "genbank")) :
    print "index %i, ID = %s, length %i, with %i features" \
          % (index,, len(record.seq), len(record.features))

And this is some of the output. Remember python likes to count from zero, so the 94 records in this file have been labelled 0 to 93:

index 0, ID = Z78533.1, length 740, with 5 features
index 1, ID = Z78532.1, length 753, with 5 features
index 2, ID = Z78531.1, length 748, with 5 features
index 92, ID = Z78440.1, length 744, with 5 features
index 93, ID = Z78439.1, length 592, with 5 features

Lets look in a little more detail at the final record:

print record

That should give you a hint of the sort of information held in this object:

ID: Z78439.1
Name: Z78439
Desription: P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
/source=Paphiopedilum barbatum
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', ..., 'Paphiopedilum']
/keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2']
/references=[<Bio.SeqFeature.Reference ...>, <Bio.SeqFeature.Reference ...>]
/organism=Paphiopedilum barbatum

Lets look a little more closely... and use python's dir() function to find out more about the SeqRecord object and what it does:

>>> dir(record)
[..., 'annotations', 'dbxrefs', 'description', 'features', 'id', 'name', 'seq']

If you didn't already know, the dir() function returns a list of all the methods and properties of an object (as strings). Those starting with underscores in their name are "special" and we'll be ignoring them in this discussion. We'll start with the seq property:

>>> print record.seq
>>> print record.seq.__class__

This is a Seq object, another important object type in Biopython, and worth of its own page on the wiki documentation.

The next three properties are all simple strings:

>>> print
>>> print
>>> print record.description
P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.

Have a look at the raw GenBank file to see where these came from.

Next, we'll check the dxrefs property, which holds any database cross references:

>>> print record.dbxrefs
>>> print record.dbxrefs.__class__
<type 'list'>

An empty list? Disappointing... if we'd used a more recent GenBank file the genome sequencing project reference would show up here.

How about the annotations property? This is a python dictionary...

>>> print record.annotations
{'source': 'Paphiopedilum barbatum', 'taxonomy': ...}
>>> print record.annotations.__class__
<type 'dict'>
>>> print record.annotations["source"]
Paphiopedilum barbatum

In this case, most of the values in the dictionary are simple strings, but this isn't always the case - have a look at the references entry for this example - its a list of Reference objects:

>>> print record.annotations["references"].__class__
<type 'list'>
>>> print len(record.annotations["references"])
>>> for ref in record.annotations["references"] : print ref.authors
Cox,A.V., Pridgeon,A.M., Albert,V.A. and Chase,M.W.

That brings us finally to features which is another list property, and it contains SeqFeature objects:

>>> print record.features.__class__
<type 'list'>
>>> print len(record.features)

SeqFeature objects are complicated enough to warrent their own page...

Personal tools