SeqRecord
Revision as of 20:48, 18 August 2007
This page describes the SeqRecord object used in Biopython to hold a sequence (as a Seq object) together with identifiers (ID and name), a description, and optionally annotation and sub-features.
Most of the sequence file format parsers in Biopython can return SeqRecord objects (and may offer a format-specific record object too). The new SeqIO system will only return SeqRecord objects.
Extracting information from a SeqRecord
Let's look in closer detail at the well-annotated SeqRecord objects Biopython creates from a GenBank file, such as ls_orchid.gbk (http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk). This file contains 94 records:
from Bio import SeqIO
# Parse the GenBank file and print a one-line summary of each record
for index, record in enumerate(SeqIO.parse(open("ls_orchid.gbk"), "genbank")):
    print("index %i, ID = %s, length %i, with %i features"
          % (index, record.id, len(record.seq), len(record.features)))
Here is some of the output. Remember that Python counts from zero, so the 94 records in this file are labelled 0 to 93:
index 0, ID = Z78533.1, length 740, with 5 features
index 1, ID = Z78532.1, length 753, with 5 features
index 2, ID = Z78531.1, length 748, with 5 features
...
index 92, ID = Z78440.1, length 744, with 5 features
index 93, ID = Z78439.1, length 592, with 5 features
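The counting behaviour can be tried without Biopython or the GenBank file. The following is only a minimal sketch: the list record_ids is a hypothetical stand-in holding the first three IDs from the output above, not something SeqIO provides. It also shows that the loop variables keep their values from the final pass once the loop has finished.

```python
# Hypothetical stand-in data: the first three IDs from the output above.
record_ids = ["Z78533.1", "Z78532.1", "Z78531.1"]

labelled = []
for index, record_id in enumerate(record_ids):
    # enumerate() pairs each item with a counter that starts at zero
    labelled.append((index, record_id))
    print("index %i, ID = %s" % (index, record_id))

# After the loop, the loop variables still hold the values from the last pass.
print("last index %i, last ID %s" % (index, record_id))
```

With three items the indices run from 0 to 2, in the same way that 94 records are labelled 0 to 93.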
Once the loop has finished, the variable record still refers to the last record in the file, so let's look at that final record in a little more detail:
print(record)
That should give you a hint of the sort of information held in this object:
ID: Z78439.1
Name: Z78439
Description: P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
/source=Paphiopedilum barbatum
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', ..., 'Paphiopedilum']
/keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2']
/references=[<Bio.SeqFeature.Reference ...>, <Bio.SeqFeature.Reference ...>]
/data_file_division=PLN
/date=30-NOV-2006
/organism=Paphiopedilum barbatum
/gi=2765564
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC ...', IUPACAmbiguousDNA())
Let's look a little more closely, and use Python's dir() function to find out more about the SeqRecord object and what it offers:
dir(record)
In case you didn't already know, the dir() function returns a list of the names of all the methods and properties of an object (as strings). Names that start with an underscore are "special" and we'll be ignoring them in this discussion. For a SeqRecord, you'll be shown the following:
[..., 'annotations', 'dbxrefs', 'description', 'features', 'id', 'name', 'seq']
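If you would rather not skip over the underscore names by eye, they can be filtered out. This is only a sketch: FakeRecord is a hypothetical stand-in class carrying the attribute names listed above, used so the snippet runs without Biopython. On a real SeqRecord the same list comprehension applies, though other Biopython versions may list additional names.

```python
class FakeRecord(object):
    """Hypothetical stand-in with the public attribute names listed above."""

    def __init__(self):
        self.annotations = {}
        self.dbxrefs = []
        self.description = ""
        self.features = []
        self.id = ""
        self.name = ""
        self.seq = ""


record = FakeRecord()

# Keep only the names that do not start with an underscore.
public_names = [name for name in dir(record) if not name.startswith("_")]
print(public_names)
```

dir() returns its names in alphabetical order, which is why the listing above is sorted.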
We'll start with the seq property:
print(record.seq)
That should give:
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC ...', IUPACAmbiguousDNA())
This is a Seq object, another important object type in Biopython, and worthy of its own page in the wiki documentation.
The next three properties are all simple strings:
print(record.id)
print(record.name)
print(record.description)
Z78439.1
Z78439
P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
Have a look at the raw GenBank file to see where these came from.
Next, we'll check the dbxrefs property, which holds any database cross-references. For this record the list is empty:
print(record.dbxrefs)
[]
How about the annotations property? This is a Python dictionary, so individual entries can be looked up by key:
print(record.annotations)
print(record.annotations["source"])
{'source': 'Paphiopedilum barbatum', 'taxonomy': ...}
Paphiopedilum barbatum
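In this case, most of the values in the dictionary are simple strings, but this isn't always so. For example, the taxonomy and keywords entries shown in the printed record earlier are lists of strings, and the references entry is a list of Reference objects.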
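Because the value types vary, code that walks over the annotations may want to check each value before using it. The sketch below uses a plain dictionary as a stand-in, filled with values copied from the printed record earlier. It is an assumption that gi is held as a string, and the taxonomy and references entries are left out because they are abbreviated in the listing. With a real record you would use record.annotations in place of the stand-in.

```python
# Stand-in for record.annotations, using values from the printed record.
# Assumption: "gi" is stored as a string; taxonomy and references are
# omitted because the listing abbreviates them.
annotations = {
    "source": "Paphiopedilum barbatum",
    "keywords": ["5.8S ribosomal RNA", "5.8S rRNA gene",
                 "internal transcribed spacer", "ITS1", "ITS2"],
    "data_file_division": "PLN",
    "date": "30-NOV-2006",
    "organism": "Paphiopedilum barbatum",
    "gi": "2765564",
}

# Separate the keys whose values are plain strings from the rest.
string_keys = sorted(key for key, value in annotations.items()
                     if isinstance(value, str))
other_keys = sorted(key for key, value in annotations.items()
                    if not isinstance(value, str))

for key in string_keys:
    print("%s = %s" % (key, annotations[key]))
for key in other_keys:
    # Non-string values (here a list) are joined for display.
    print("%s = %s" % (key, "; ".join(annotations[key])))
```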