This page describes Biopython's support for the Gene Ontology (GO). GO support is currently in the discussion stage. As we implement more of it, this page will evolve into documentation for using Biopython packages to work with GO and GO annotations.
Brief Summary of GO
The GO consortium developed GO to standardize the vocabulary life scientists use to annotate genes, and to make such annotations amenable to computational assessment of their semantics. GO is actually three separate but related ontologies, one for each of three categories: biological process, cellular component, and molecular function. Each ontology is composed of terms. Relations connect the terms into child/parent relationships, where children transitively inherit the meanings of all of their ancestors. Thus, child terms are more specific in meaning than their ancestors. Using the GO terms as nodes and the relations as directed edges (from a child term to a parent term), we can construct a data structure known as a directed acyclic graph (DAG). We use this DAG to traverse the ontologies and identify relationships between GO terms.
Design of GO Support
GO support should cover both the GO ontologies themselves and GO annotations.
GO Directed Acyclic Graph
GO is best represented as a directed acyclic graph (DAG). To implement this data structure, we'll use NetworkX, a popular, well-supported Python graph library with no required dependencies other than Python. We'll use NetworkX's directed graph class, DiGraph, to represent the ontologies.
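As a rough illustration of the DiGraph idea, the sketch below builds a tiny graph with edges pointing from child term to parent term and then collects a term's ancestors. The term IDs and edges are typed in by hand for illustration and may not match a current GO release, and the helper name go_ancestors is hypothetical, not an existing Biopython function.

```python
import networkx as nx

# Toy subset of the biological process ontology. Edges run from a child
# term to its parent term. These edges are hand-typed for illustration
# and are NOT loaded from a real GO release.
go = nx.DiGraph()
go.add_node("GO:0008150", name="biological_process")
go.add_node("GO:0009987", name="cellular process")
go.add_node("GO:0012501", name="programmed cell death")
go.add_node("GO:0006915", name="apoptotic process")
go.add_edge("GO:0009987", "GO:0008150", relation="is_a")
go.add_edge("GO:0012501", "GO:0009987", relation="is_a")
go.add_edge("GO:0006915", "GO:0012501", relation="is_a")


def go_ancestors(graph, term_id):
    """Return the set of terms reachable by following child -> parent edges.

    Hypothetical helper. Because edges point from child to parent,
    NetworkX's "descendants" of a node are that term's GO ancestors.
    """
    return nx.descendants(graph, term_id)


# A GO graph must not contain cycles.
assert nx.is_directed_acyclic_graph(go)

ancestors = go_ancestors(go, "GO:0006915")
```

One consequence of the child-to-parent edge direction is the naming mismatch noted in the docstring: graph "descendants" correspond to ontology ancestors. Reversing the edge direction would avoid that, at the cost of departing from the convention described above.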
Sources of Code/Inspiration
- BioPerl has a package providing GO support. It is not yet obvious what each component does, and it seems quite heavily engineered. Initial GO support in Biopython will probably not reach this level of sophistication, but we can borrow ideas from the BioPerl library.
- Brad Chapman's PGML code, which apparently works with GO data that's already loaded into a database
Design of GO Annotation Support
GO annotations come in the form of GOA files. These are simple tab-delimited flat files, with one line per annotation, that contain gene identifiers, GO term IDs, and more. The format is detailed at http://www.geneontology.org/GO.format.annotation.shtml.
The goal of GOA support is a parser that renders the file into Python data structures, preferably as instances of some standard Biopython class.
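A minimal sketch of such a parser follows, assuming the column order of the GO annotation file layout (database, object ID, symbol, qualifier, GO ID, reference, evidence code, with/from, aspect, and so on) and that comment lines start with "!". The function name parse_goa, the dictionary keys, and the sample line are all made up for illustration; the column positions should be checked against the format page before relying on them.

```python
import io

# 0-based column positions, assumed from the annotation file layout.
DB, DB_OBJECT_ID, DB_OBJECT_SYMBOL, QUALIFIER, GO_ID = range(5)
EVIDENCE_CODE = 6
ASPECT = 8


def parse_goa(handle):
    """Yield one dict per annotation line (hypothetical helper).

    Skips blank lines and comment lines beginning with "!".
    Only a handful of columns are extracted in this sketch.
    """
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        yield {
            "db": fields[DB],
            "gene_id": fields[DB_OBJECT_ID],
            "symbol": fields[DB_OBJECT_SYMBOL],
            "qualifier": fields[QUALIFIER],
            "go_id": fields[GO_ID],
            "evidence_code": fields[EVIDENCE_CODE],
            "aspect": fields[ASPECT],
        }


# Invented sample data: one header comment and one 15-column annotation line.
sample_fields = [
    "ExampleDB", "EX00001", "GENE1", "", "GO:0006915", "PMID:0000000",
    "IEA", "", "P", "", "", "protein", "taxon:9606", "20091020", "ExampleDB",
]
sample = "!gaf-version: 2.0\n" + "\t".join(sample_fields) + "\n"

records = list(parse_goa(io.StringIO(sample)))
```

Writing the parser as a generator keeps memory use flat for large annotation files, and the plain dicts could later be swapped for whatever annotation class is settled on.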
We could store annotations in some sort of standard Biopython structure. SeqRecord is one option, but it would be awkward because it is intended to be used with a sequence. Note, however, that SeqRecord does have an annotations attribute, which is where we would like to place these GO annotations. (It's also important to keep in mind that each gene may have multiple annotations.) Peter Cock points out that the QUAL parser uses a class called UnknownSeq, since a QUAL file only provides information on the sequence length. We should take a look at this.
One possibility is to create an Annotation class as a fully supported Biopython class. More input from other Biopython developers would help here.
As far as I can tell, this data structure really doesn't have to be anything more than a data store with a useful __repr__(). --Gotgenes 05:05, 20 October 2009 (UTC)
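To make the "data store with a __repr__()" idea concrete, here is one possible sketch using a dataclass, which generates __repr__() automatically. The class name GOAnnotation and its fields are hypothetical and are not an existing Biopython class; the sample values are invented. Grouping by gene ID reflects the point above that a gene may have multiple annotations.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GOAnnotation:
    """Hypothetical plain data store for a single GO annotation."""

    gene_id: str
    go_id: str
    evidence_code: str
    aspect: str


# Invented example values, not taken from a real annotation file.
annotations = [
    GOAnnotation("GENE1", "GO:0006915", "IEA", "P"),
    GOAnnotation("GENE1", "GO:0005515", "IPI", "F"),
    GOAnnotation("GENE2", "GO:0005634", "IDA", "C"),
]

# A gene may carry several annotations, so group them by gene ID.
by_gene = defaultdict(list)
for annotation in annotations:
    by_gene[annotation.gene_id].append(annotation)

first_repr = repr(annotations[0])
```

Making the dataclass frozen keeps instances hashable, which would allow them to be stored in sets or used as dictionary keys; whether that matters depends on the final design.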