Google Summer of Code

Revision as of 13:39, 2 March 2012 by Peter

As part of the Open Bioinformatics Foundation, Biopython is participating in Google Summer of Code (GSoC) again in 2012. This page contains a list of project ideas for the upcoming summer; potential GSoC students can base an application on any of these ideas, or propose something new.

In 2009, Biopython was involved with GSoC in collaboration with our friends at NESCent, and had two projects funded. In 2010, another project was funded. In 2011, three projects were funded in Biopython via OBF.

Please read the GSoC page at the Open Bioinformatics Foundation and the main Google Summer of Code page for more details about the program. If you are interested in contributing as a mentor or student next year, please introduce yourself on the mailing list.

2012 Project ideas

SearchIO (DRAFT)

Biopython has general APIs for parsing and writing assorted sequence file formats (SeqIO), multiple sequence alignments (AlignIO), phylogenetic trees (Phylo) and motifs (Bio.Motif). An obvious omission is something equivalent to BioPerl's SearchIO. The goal of this proposal is to develop an easy-to-use Python interface in the same style as SeqIO, AlignIO, etc., but for pairwise search results. This would aim to cover EMBOSS needle & water, BLAST XML, BLAST tabular, HMMER, Bill Pearson's FASTA alignments, and so on.

Much of the low-level parsing code to handle these file formats already exists in Biopython. Just as the SeqIO and AlignIO modules are linked and share code, similar links would apply to the proposed SearchIO module when using pairwise alignment file formats. However, SearchIO will also support pairwise search results where the pairwise sequence alignment itself is not available (e.g. the default BLAST tabular output). A crucial aspect of this work will be to design a pairwise-search-result object hierarchy that reflects this, probably with a subclass inheriting from both the pairwise-search-result and the existing MultipleSequenceAlignment object.
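As a rough illustration of the kind of object hierarchy discussed above, the following sketch uses only the Python standard library. The class names (Hit, AlignedHit, SearchResult) and the function parse_blast_tabular are hypothetical and are not part of Biopython. It assumes the standard 12-column BLAST+ tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and that all rows for one query are contiguous. A plain subclass stands in for the proposed class that would also inherit from MultipleSequenceAlignment.

```python
import io
from dataclasses import dataclass, field
from itertools import groupby
from typing import Iterator, List, TextIO


@dataclass
class Hit:
    """One query/subject match, holding only what a tabular row reports."""
    subject_id: str
    percent_identity: float
    alignment_length: int
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int
    evalue: float
    bit_score: float


@dataclass
class AlignedHit(Hit):
    """Hypothetical subclass for formats that also carry the gapped alignment."""
    query_aligned: str = ""
    subject_aligned: str = ""


@dataclass
class SearchResult:
    """All hits reported for a single query sequence."""
    query_id: str
    hits: List[Hit] = field(default_factory=list)


def parse_blast_tabular(handle: TextIO) -> Iterator[SearchResult]:
    """Yield one SearchResult per query, SeqIO-style, from 12-column rows."""
    # Skip blank lines and the '#' comment lines some output modes add.
    rows = (line.rstrip("\n").split("\t") for line in handle
            if line.strip() and not line.startswith("#"))
    # Assumes rows for the same query are adjacent in the file.
    for query_id, group in groupby(rows, key=lambda cols: cols[0]):
        result = SearchResult(query_id)
        for cols in group:
            result.hits.append(Hit(
                subject_id=cols[1],
                percent_identity=float(cols[2]),
                alignment_length=int(cols[3]),
                query_start=int(cols[6]),
                query_end=int(cols[7]),
                subject_start=int(cols[8]),
                subject_end=int(cols[9]),
                evalue=float(cols[10]),
                bit_score=float(cols[11]),
            ))
        yield result


# Made-up rows for illustration only.
EXAMPLE = (
    "q1\ts1\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-100\t360\n"
    "q1\ts2\t80.00\t150\t30\t0\t5\t154\t1\t150\t2e-30\t120\n"
    "q2\ts3\t100.00\t50\t0\t0\t1\t50\t1\t50\t5e-20\t93.5\n"
)

for result in parse_blast_tabular(io.StringIO(EXAMPLE)):
    print(result.query_id, [hit.subject_id for hit in result.hits])
```

Keeping the alignment strings in a subclass means code that only needs scores and coordinates can treat hits from every format the same way.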

Beyond the initial challenge of an iterator-based parsing and writing framework, random access akin to the Bio.SeqIO.index and index_db functionality would be most desirable for working with large datasets.
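One way such random access could work is to record the byte offset at which each query's results begin, then seek to that offset on demand. The sketch below is a minimal standard-library illustration for tab-separated output. The function names index_by_query and fetch_rows are hypothetical, and it assumes the rows for each query are contiguous; it is not how Bio.SeqIO.index is implemented.

```python
import io
from typing import BinaryIO, Dict, List


def index_by_query(handle: BinaryIO) -> Dict[str, int]:
    """Map each query ID to the byte offset of its first data row."""
    offsets: Dict[str, int] = {}
    while True:
        offset = handle.tell()      # position before reading the line
        line = handle.readline()
        if not line:                # end of file
            break
        if line.startswith(b"#") or not line.strip():
            continue                # skip comment and blank lines
        query_id = line.split(b"\t", 1)[0].decode()
        offsets.setdefault(query_id, offset)  # keep the first offset only
    return offsets


def fetch_rows(handle: BinaryIO, offsets: Dict[str, int],
               query_id: str) -> List[List[str]]:
    """Seek to one query's block and read rows until the query ID changes."""
    handle.seek(offsets[query_id])
    rows: List[List[str]] = []
    for line in handle:
        cols = line.decode().rstrip("\n").split("\t")
        if cols[0] != query_id:
            break
        rows.append(cols)
    return rows


# Made-up, shortened rows for illustration only.
data = io.BytesIO(b"q1\ts1\t98.5\nq1\ts2\t80.0\nq2\ts3\t100.0\n")
offsets = index_by_query(data)
print(fetch_rows(data, offsets, "q2"))
```

The file is opened in binary mode so that offsets are exact byte positions. Only the offsets are held in memory, which is what makes this approach attractive for large result files.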

The project will cover a range of important file formats from major bioinformatics tools, and will therefore require familiarity with running these tools and with understanding their output and its meaning. Inter-converting file formats is part of this.
Involved toolkits or projects 
  • Biopython
Degree of difficulty and needed skills 
Medium/Hard depending on how many objectives are attempted. The student needs to be fluent in Python. Experience with the command line tools listed would be a clear advantage, as would first-hand experience using BioPerl's SearchIO. You will also need to know or learn the git version control system.
Mentors 
Peter Cock

Variant representation, parser, generator, and coordinate converter (DRAFT)

2012 GSoC updates are being considered. The text below is from a 2011 proposal and needs to be updated. Stay tuned.

Computational analysis of genomic variation requires the ability to reliably translate between human and computer representations of genomic variants. While several standards for human variation syntax have been proposed, community support is limited because of the technical complexity of the proposals and the lack of software libraries that implement them. The goal of this project is to initiate freely-available, language-neutral tools to parse, generate, and convert between representations of genomic variation.
Approach and Goals 
  • identify variation types to be represented (SNV, CNV, repeats, inversions, etc)
  • develop internal machine representation for variation types in Python, perhaps by implementing subclasses of Biopython's SeqFeature class.
  • develop language-neutral grammar for the (reasonably) supportable subset of the Human Genome Variation Society nomenclature guidelines
  • write a Python library to convert between machine and human representations of variation (i.e., parsing and generating)
  • develop coordinate mapping between genomic, cDNA, and protein sequences (at least)
  • release code to appropriate community efforts and write short manuscript
  • as time permits:
    • build Perl modules or Java libraries with identical functionality
    • develop syntactic and semantic validation
    • implement web service for coordinate conversion using NCBI Eutilities
    • develop a new variant syntax that is representation-complete
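To make the parsing and generating goals in the list above concrete, here is a minimal sketch that handles only one variant type, a single-nucleotide substitution written in the HGVS style "accession:c.76A>T". The class Substitution and the functions parse_substitution and format_substitution are hypothetical names, the regular expression covers only genomic (g.) and coding DNA (c.) substitutions, and the accession in the example is a made-up placeholder. A real grammar would need to cover far more of the nomenclature.

```python
import re
from dataclasses import dataclass

# Deliberately narrow pattern: accession, coordinate type, position, ref>alt.
_SUBSTITUTION = re.compile(
    r"^(?P<accession>[A-Za-z0-9_.]+):"
    r"(?P<kind>[gc])\."
    r"(?P<position>\d+)"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


@dataclass(frozen=True)
class Substitution:
    """Internal (machine) representation of a single-base substitution."""
    accession: str
    kind: str       # "g" for genomic, "c" for coding DNA
    position: int   # 1-based position in that coordinate system
    ref: str
    alt: str


def parse_substitution(text: str) -> Substitution:
    """Convert the human-readable string into the internal representation."""
    match = _SUBSTITUTION.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported variant description: {text!r}")
    return Substitution(
        accession=match["accession"],
        kind=match["kind"],
        position=int(match["position"]),
        ref=match["ref"],
        alt=match["alt"],
    )


def format_substitution(variant: Substitution) -> str:
    """Generate the human-readable string from the internal representation."""
    return (f"{variant.accession}:{variant.kind}."
            f"{variant.position}{variant.ref}>{variant.alt}")


# Placeholder accession, for illustration only.
variant = parse_substitution("NM_000000.1:c.76A>T")
print(variant)
print(format_substitution(variant))
```

Note that this checks syntax only. Confirming that the reference base really is A at position 76 would require the sequence itself, which is the semantic validation listed as a stretch goal.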
The major challenge in this project is to design an API which cleanly separates internal representations of variation from the multiple external representations. For example, coordinate conversion per se does not require any sequence information, but validating a variant does. Ideally, the libraries developed in this project will provide low-level functionality of coordinate conversion and parsing, and high-level functionality for the most common use cases. This aim requires analyzing the proposals to determine which aspects may be impossible or difficult to represent with a simple grammar.
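The point that coordinate conversion needs no sequence information can be illustrated with a small sketch that maps between transcript and genomic positions using exon boundaries alone. The function names and the exon coordinates are hypothetical. The sketch assumes a forward-strand transcript, exons listed in transcript order, and 1-based inclusive coordinates. It ignores the CDS start offset, so positions are transcript positions rather than true c. coordinates.

```python
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # (genomic_start, genomic_end), 1-based inclusive


def transcript_to_genomic(exons: List[Exon], position: int) -> int:
    """Map a 1-based transcript position to its genomic position."""
    if position < 1:
        raise ValueError("transcript positions are 1-based")
    remaining = position
    for start, end in exons:
        length = end - start + 1
        if remaining <= length:
            return start + remaining - 1
        remaining -= length         # skip this exon entirely
    raise ValueError(f"position {position} is beyond the transcript end")


def genomic_to_transcript(exons: List[Exon], position: int) -> Optional[int]:
    """Map a genomic position to a transcript position, or None if not exonic."""
    consumed = 0                    # transcript bases in earlier exons
    for start, end in exons:
        if start <= position <= end:
            return consumed + position - start + 1
        consumed += end - start + 1
    return None                     # intronic or outside the transcript


# Made-up exon structure: two exons of 50 and 60 bases.
exons = [(101, 150), (201, 260)]
print(transcript_to_genomic(exons, 55))
print(genomic_to_transcript(exons, 205))
print(genomic_to_transcript(exons, 175))
```

Only the exon intervals are consulted here. A validation layer that checks reference bases could be built on top of this without changing the mapping code, which is the kind of separation the paragraph above argues for.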
Involved toolkits or projects 
Degree of difficulty and needed skills 
Easy-to-Medium depending on how many objectives are attempted. The student will need to have skills in most or all of: basic molecular biology (genomes, transcripts, proteins), genomic variation, Python, Biopython, Perl, BioPerl, NCBI Eutilities and/or Ensembl API. Experience with computer grammars is highly desirable. You will also need to know or learn the git version control system.
Mentors 
Reece Hart (Locus Development, San Francisco); Brad Chapman