Google Summer of Code

(Difference between revisions)
m (Biopython and PyCogent interoperability: changed phyloXML capitalization, relinked my name)
(Link to new SearchIO page)
(44 intermediate revisions by 8 users not shown)
Line 1: Line 1:
Biopython was involved with the 2009 Google Summer of Code (GSoC) in collaboration with our friends at [https://www.nescent.org/wg_phyloinformatics/Main_Page NESCent], and had two projects funded:
+
As part of the Open Bioinformatics Foundation, Biopython is participating in Google Summer of Code (GSoC) again in 2012. This page contains a list of project ideas for the upcoming summer; potential GSoC students can base an application on any of these ideas, or propose something new.
 +
 
 +
In 2009, Biopython was involved with GSoC in collaboration with our friends at [https://www.nescent.org/wg_phyloinformatics/Main_Page NESCent], and had two projects funded:
  
 
* Nick Matzke worked on [https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2009#Biogeographical_Phylogenetics_for_BioPython Biogeographical Phylogenetics].
 
* Eric Talevich added support for [https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2009#Biopython_support_for_parsing_and_writing_phyloXML parsing and writing phyloXML].
+
* [[User:EricTalevich|Eric Talevich]] added support for [https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2009#Biopython_support_for_parsing_and_writing_phyloXML parsing and writing phyloXML].
  
In 2010 we hope to continue working with GSoC. If you are interested in contributing as a mentor or student, please introduce yourself on the [http://biopython.org/wiki/Mailing_lists mailing list].
+
In 2010, another project was funded:
  
Please read the [http://www.open-bio.org/wiki/Google_Summer_of_Code GSoC page at the Open Bioinformatics Foundation] and the main [http://code.google.com/soc Google Summer of Code] page for more details about the program.  
+
* João Rodrigues [[GSOC2010_Joao|worked on the Structural Biology module Bio.PDB]], adding several features used in everyday structural bioinformatics. These features are now gradually being merged into the mainline with João's help.
  
== 2010 Project ideas ==
+
In 2011, three projects were funded in Biopython via the OBF:
  
=== Biopython and PyCogent interoperability ===
+
* [[User:Mtrellet|Mikael Trellet]] added [[GSoC2011_mtrellet|support for biomolecular interface analysis]] to the Bio.PDB module.
 +
* Michele Silva wrote a [[GSOC2011_Mocapy|Python bridge for Mocapy++]] and linked it to Bio.PDB to enable statistical analysis of protein structures.
 +
* Justinas Daugmaudis also enhanced Mocapy++ in a complementary way, developing a [[GSOC2011_MocapyExt|plugin system for Mocapy++]] allowing users to easily write new nodes (probability distribution functions) in Python.
  
; Rationale : [http://pycogent.sourceforge.net/ PyCogent] and [http://biopython.org/wiki/Main_Page Biopython] are two widely used toolkits for performing computational biology and bioinformatics work in Python. The libraries have traditionally had different focuses, with Biopython focusing on sequence parsing and retrieval and PyCogent on evolutionary and phylogenetic processing. Both user communities would benefit from increased interoperability between the code bases, easing the development of complex workflows.
+
In 2012, two projects were funded in Biopython via the OBF:
  
; Approach : The student would focus on soliciting use case scenarios from developers and the larger communities associated with both projects, and use these as the basis for adding glue code and documentation to both libraries. Some use cases of immediate interest as a starting point are:
+
* Wibowo Arindrarto: ''[[SearchIO]] Implementation in Biopython'' ([http://bow.web.id/blog/tag/gsoc/ blog])
 +
* Lenna Peterson: ''Diff My DNA: Development of a Genomic Variant Toolkit for Biopython'' ([http://arklenna.tumblr.com/tagged/gsoc2012 blog])
  
:* Allow round-trip conversion between Biopython and PyCogent core objects (sequence, alignment, tree, etc.).
+
Please read the [http://www.open-bio.org/wiki/Google_Summer_of_Code GSoC page at the Open Bioinformatics Foundation] and the main [http://code.google.com/soc Google Summer of Code] page for more details about the program. If you are interested in contributing as a mentor or student next year, please introduce yourself on the [http://biopython.org/wiki/Mailing_lists mailing list].
:* Building workflows using Codon Usage analyses in PyCogent with clustering code in Biopython.
+
:* Connecting Biopython acquired sequences to PyCogent's alignment, phylogenetic tree preparation and tree visualization code.
+
:* Integrate Biopython's [http://biopython.org/wiki/Phylo phyloXML support], developed during GSoC 2009, with PyCogent.
+
:* Develop a standardised controller architecture for interrogation of genome databases by extending PyCogent's Ensembl code, including export to Biopython objects.
+
  
; Challenges : This project provides the student with a lot of freedom to create useful interoperability between two feature rich libraries. As opposed to projects which might require churning out more lines of code, the major challenge here will be defining useful APIs and interfaces for existing code. High level inventiveness and coding skill will be required for generating glue code; we feel library integration is an extremely beneficial skill. We also value clear use case based documentation to support the new interfaces.
+
== 2012 Project ideas ==
  
; Involved toolkits or projects :
+
=== SearchIO ===
  
:* [http://biopython.org/wiki/Main_Page Biopython]
+
; Rationale : Biopython has general APIs for parsing and writing assorted sequence file formats ([[SeqIO]]), multiple sequence alignments ([[AlignIO]]), phylogenetic trees ([[Phylo]]) and motifs (Bio.Motif). An obvious omission is something equivalent to [[bp:HOWTO:SearchIO|BioPerl's SearchIO]]. The goal of this proposal is to develop an easy-to-use Python interface in the same style as [[SeqIO]], [[AlignIO]], etc., but for pairwise search results. This would aim to cover EMBOSS needle & water, BLAST XML, BLAST tabular, HMMER, Bill Pearson's FASTA alignments, and so on.
:* [http://pycogent.sourceforge.net/ PyCogent]
+
  
; Degree of difficulty and needed skills : Medium to Hard. At a minimum, the student will need to be highly competent in Python and become familiar with core objects in PyCogent and Biopython. Sub-projects will require additional expertise, for instance: familiarity with concepts in phylogenetics and genome biology; understanding SQL dialects.
+
Much of the low level parsing code to handle these file formats already exists in Biopython, and much as the [[SeqIO]] and [[AlignIO]] modules are linked and share code, similar links apply to the proposed SearchIO module when using pairwise alignment file formats. However, SearchIO will also support pairwise search results where the pairwise sequence alignment itself is not available (e.g. the default BLAST tabular output). A crucial aspect of this work will be to design a pairwise-search-result object hierarchy that reflects this, probably with a subclass inheriting from both the pairwise-search-result and the existing MultipleSequenceAlignment object.
  
; Mentors : [http://jcsmr.anu.edu.au/org/dmb/compgen/ Gavin Huttley], [http://chem.colorado.edu/index.php?option=com_content&view=article&id=263:rob-knight Rob Knight], [http://bcbio.wordpress.com Brad Chapman], [[User:EricTalevich|Eric Talevich]]
+
Beyond the initial challenge of an iterator based parsing and writing framework, random access akin to the Bio.SeqIO.index and index_db functionality would be most desirable for working with large datasets.
  
=== Galaxy phylogenetics pipeline development ===
+
; Challenges : The project will cover a range of important file formats from major Bioinformatics tools, thus will require familiarity with running these tools, and understanding their output and its meaning. Inter-converting file formats is part of this.
 
+
; Rationale : [http://main.g2.bx.psu.edu/ Galaxy] is a popular web based interface for integrating biological tools and analysis pipelines. It is widely used by bench biologists for their analysis work, and by computational biologists for building interfaces to developed tools. [http://hyphy.org HyPhy] provides a popular package for molecular evolution and sequence statistical analysis, and the [http://www.datamonkey.org/ datamonkey.org] server provides web based workflows to perform a number of common tasks with HyPhy. This project bridges these two complementary projects by bringing HyPhy workflows into the Galaxy system, standardizing these analyses on a widely used platform.
+
 
+
; Approach : The student would bring existing workflows from datamonkey.org to Galaxy. The general approach would be to pick a datamonkey.org workflow, wrap the relevant tools using [http://bitbucket.org/galaxy/galaxy-central/wiki/AddToolTutorial Galaxy's XML tool definition language], and implement a shared pipeline with [http://screencast.g2.bx.psu.edu/galaxy/flash/WorkflowFromHistory.html Galaxy's workflow system]. Functional tests will be developed for tools and workflows, along with high level documentation for end users.
+
 
+
; Challenges : This project requires the student to become comfortable working in the existing Galaxy framework. This is a useful practical skill as Galaxy is widely used in the biological community. Similarly, the student should become familiar with the statistical evolutionary methods in HyPhy to feel comfortable wrapping and testing them in Galaxy. Since the tools would be widely used from the main Galaxy website and installed instances, we place a strong emphasis on students who feel comfortable building tests and examples that would ensure the developed workflows function as expected.
+
 
+
; Involved toolkits or projects :
+
 
+
:* [http://bitbucket.org/galaxy/galaxy-central/wiki/Home Galaxy]
+
:* [http://hyphy.org HyPhy]
+
:* [http://www.datamonkey.org Adaptive Evolution Server]
+
 
+
; Degree of difficulty and needed skills : Medium to Hard. As envisioned, the project would involve implementing full phylogenetic pipelines with the Galaxy toolkits. This would require becoming familiar with the Galaxy tool integration framework as well as being comfortable with HyPhy tools and current pipelines. This would involve comfort with XML for developing the tool interfaces, and Python for integrating scripts and tests with Galaxy and HyPhy.
+
 
+
; Mentors : [http://www.hyphy.org/sergei/ Sergei L Kosakovsky Pond], [http://bcbio.wordpress.com Brad Chapman], [http://www.bx.psu.edu/~anton/ Anton Nekrutenko]
+
 
+
=== Accessing R phylogenetic tools from Python ===
+
 
+
; Rationale : The [http://www.r-project.org/ R statistical language] is a powerful open-source environment for statistical computation and visualization. [http://www.python.org/ Python] serves as an excellent complement to R since it has a wide variety of available libraries to make data processing, analysis, and web presentation easier. The two can be smoothly interfaced using [http://bitbucket.org/lgautier/rpy2/ Rpy2], allowing programmers to leverage the best features of each language. Here we propose to build Rpy2 library components to help ease access to phylogenetic and biogeographical libraries in R.
+
 
+
; Approach : Rpy2 contains higher level interfaces to popular R libraries. For instance, the [http://rpy.sourceforge.net/rpy2/doc-2.1/html/graphics.html#package-ggplot2 ggplot2 interface] allows python users to access powerful plotting functionality in R with an intuitive API. Providing similar high level APIs for biological toolkits available in R would help expose these toolkits to a wider audience of Python programmers. A nice introduction to phylogenetic analysis in R is available from Rich Glor at the [http://bodegaphylo.wikispot.org/Phylogenetics_and_Comparative_Methods_in_R Bodega Bay Marine Lab wiki]. Some examples of R libraries for which integration would be welcomed are:
+
 
+
:* [http://ape.mpl.ird.fr/ ape (Analysis of Phylogenetics and Evolution)] -- an interactive library environment for phylogenetic and evolutionary analyses
+
:* [http://pbil.univ-lyon1.fr/ADE-4/home.php?lang=eng ade4] -- Data Analysis functions to analyse Ecological and Environmental data in the framework of Euclidean Exploratory methods
+
:* [http://cran.r-project.org/web/packages/geiger/index.html geiger] -- Running macroevolutionary simulation, and estimating parameters related to diversification from comparative phylogenetic data.
+
:* [http://picante.r-forge.r-project.org/ picante] -- R tools for integrating phylogenies and ecology
+
:* [http://mefa.r-forge.r-project.org/ mefa] -- multivariate data handling for ecological and biogeographical data
+
 
+
; Challenges : The student would have the opportunity to learn an available R toolkit, and then code in Python and R to make this available via an intuitive API. This will involve digging into the R code examples to discover the most useful parts for analysis, and then projecting this into a library that is intuitive to Python coders. Beyond the coding and design aspects, the student should feel comfortable writing up use case documentation to support the API and encourage its adoption.
+
  
 
; Involved toolkits or projects :
 
  
:* [http://ape.mpl.ird.fr/ ape (Analysis of Phylogenetics and Evolution)]
+
:* Biopython
:* [http://bitbucket.org/lgautier/rpy2/ Rpy2]
+
:* [http://biopython.org/wiki/Main_Page Biopython]
+
  
; Degree of difficulty and needed skills : Moderate. The project requires familiarity with coding in Python and R, and knowledge of phylogeny or biogeography. The student has plenty of flexibility to define the project based on their biological interests (e.g. [http://www.warwick.ac.uk/go/peter_cock/python/heatmap/ microarrays and heatmaps]); there is also the possibility to venture far into data visualization once access to analysis methods is made. [http://kiwi.cs.dal.ca/GenGIS/Main_Page GenGIS] can give ideas about what is possible.
+
; Degree of difficulty and needed skills : Medium/Hard depending on how many objectives are attempted. The student needs to be fluent in Python. Experience with all of the command line tools listed would be a clear advantage, as would first-hand experience using [[bp:HOWTO:SearchIO|BioPerl's SearchIO]]. You will also need to know or learn the git version control system.
  
; Mentors : [http://dk.linkedin.com/pub/laurent-gautier/8/81/869 Laurent Gautier], [http://bcbio.wordpress.com Brad Chapman], [http://www.scri.ac.uk/staff/petercock Peter Cock]
+
; Mentors : Peter Cock
  
 +
=== Representation and manipulation of genomic variants ===
  
=== PDB-Tidy: command-line tools for manipulating PDB files ===
+
; Rationale : Computational analysis of genomic variation requires the ability to reliably communicate and manipulate variants. The goal of this project is to provide facilities within BioPython to represent sequence variation objects, convert them to and from common human and file representations, and provide common manipulations on them.
  
; Rationale : The [http://www.rcsb.org/pdb/home/home.do Protein Data Bank] is an important data repository for protein structures, but the tools currently available for working with data in the PDB file format are usually specialized for a single specific task (e.g. visualization, homology modelling). Structural biologists would benefit from a command-line toolkit that makes structure data as easy to manipulate as sequence data already is.
+
; Approach and Goals
 +
* Object representation
 +
** identify variation types to be represented (SNV, CNV, repeats, inversions, etc)
 +
** develop internal machine representation for variation types
 +
** ensure coverage of essential standards, including HGVS, GFF, VCF
 +
* External representations
 +
** write parsers and generators between objects and external string and file formats
 +
* Manipulations
 +
** canonicalize variations with more than one valid representation (e.g., ins versus dup and left shifting repeats).
 +
** develop coordinate mapping between genomic, cDNA, and protein sequences (HGVS)
 +
* Other
 +
** release code to appropriate community efforts and write short manuscript
 +
** implement web service for HGVS conversion
  
; Approach : Use Bio.PDB to build a set of simple command-line tools for tidying up PDB files. For example:
+
; Challenges : The major challenge in this project is to design an API that separates internal representations of variation from the multiple external representations. Ideally, the libraries developed in this project will provide low-level functionality of coordinate conversion and parsing, and high-level functionality for the most common use cases. This aim requires analyzing the proposals to determine which aspects may be impossible or difficult to represent with a simple grammar.
 
+
:* Renumber residues starting from 1 (or N)
+
:* Select one or more chains and write them to a new PDB file
+
:* Check for collisions between atoms, implausible bond angles, etc.
+
:* Extract a SeqRecord from a PDB structure, and use SeqIO to write a new file in any supported format for sequence data
+
:* Incorporate predicted secondary-structure information into a PDB file (so that PyMol etc. can use it)
+
:* Extend and improve Bio.PDB as appropriate to support this effort
+
 
+
; Challenges : Many PDB files contain some inconsistent or surprising features -- some sensible assumptions, like continuous numbering of residues, do not hold in all cases. So, awareness of these issues, defensive coding, and extensive testing will be necessary.
+
 
+
; Involved toolkits or projects :
+
  
:* Biopython: Bio.PDB, and other modules as needed
+
; Resources
:* Protein Data Bank
+
* [http://biopython.org BioPython]
:* For gathering ideas: MolProbity, PyMol, etc.
+
* [https://github.com/jamescasbon/PyVCF PyVCF]
 +
* [http://www.cgat.org/~andreas/documentation/pysam/api.html#pysam.VCF pysam VCF support]
 +
* [http://biopython.org/wiki/GFF_Parsing Biopython GFF support]
 +
* [http://www.mutalyzer.nl/2.0/ HGVS "nomenclature"]
 +
* [http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 VCF], [http://vcftools.sourceforge.net/ VCFtools]
 +
* [http://www.sequenceontology.org/gff3.shtml GFF3], [http://www.sequenceontology.org/resources/gvf.html GVF]
 +
* [http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit GATK]
  
; Degree of difficulty and needed skills : Moderate. Knowledge of the types of information in a PDB file, and what they're used for, is valuable here. It is also good to be aware of functionality that is already available in other popular software, and aim for interoperability in those cases rather than duplicating major features.
+
; Degree of difficulty and needed skills : Easy-to-Medium depending on how many objectives are attempted. The student will need to have skills in most or all of: basic molecular biology (genomes, transcripts, proteins), genomic variation, Python, BioPython, Perl, BioPerl, NCBI Eutilities and/or Ensembl API. Experience with computer grammars is highly desirable. You will also need to know or learn the git version control system.
  
; Mentors : [[User:EricTalevich|Eric Talevich]]
+
; Mentors: [http://linkedin.com/in/reece Reece Hart] ([http://locusdevelopmentinc.com Locus Development], San Francisco); [http://bcbio.wordpress.com Brad Chapman]; [http://casbon.me James Casbon]

Revision as of 11:01, 28 May 2012

As part of the Open Bioinformatics Foundation, Biopython is participating in Google Summer of Code (GSoC) again in 2012. This page contains a list of project ideas for the upcoming summer; potential GSoC students can base an application on any of these ideas, or propose something new.

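In 2009, Biopython was involved with GSoC in collaboration with our friends at NESCent, and had two projects funded:

  • Nick Matzke worked on Biogeographical Phylogenetics.
  • Eric Talevich added support for parsing and writing phyloXML.

In 2010, another project was funded:

  • João Rodrigues worked on the Structural Biology module Bio.PDB, adding several features used in everyday structural bioinformatics. These features are now gradually being merged into the mainline with João's help.

In 2011, three projects were funded in Biopython via the OBF:

  • Mikael Trellet added support for biomolecular interface analysis to the Bio.PDB module.
  • Michele Silva wrote a Python bridge for Mocapy++ and linked it to Bio.PDB to enable statistical analysis of protein structures.
  • Justinas Daugmaudis also enhanced Mocapy++ in a complementary way, developing a plugin system for Mocapy++ allowing users to easily write new nodes (probability distribution functions) in Python.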

In 2012, two projects were funded in Biopython via the OBF:

  • Wibowo Arindrarto: SearchIO Implementation in Biopython (blog)
  • Lenna Peterson: Diff My DNA: Development of a Genomic Variant Toolkit for Biopython (blog)

Please read the GSoC page at the Open Bioinformatics Foundation and the main Google Summer of Code page for more details about the program. If you are interested in contributing as a mentor or student next year, please introduce yourself on the mailing list.

2012 Project ideas

SearchIO

Rationale 
Biopython has general APIs for parsing and writing assorted sequence file formats (SeqIO), multiple sequence alignments (AlignIO), phylogenetic trees (Phylo) and motifs (Bio.Motif). An obvious omission is something equivalent to BioPerl's SearchIO. The goal of this proposal is to develop an easy-to-use Python interface in the same style as SeqIO, AlignIO, etc., but for pairwise search results. This would aim to cover EMBOSS needle & water, BLAST XML, BLAST tabular, HMMER, Bill Pearson's FASTA alignments, and so on.

Much of the low level parsing code to handle these file formats already exists in Biopython, and much as the SeqIO and AlignIO modules are linked and share code, similar links apply to the proposed SearchIO module when using pairwise alignment file formats. However, SearchIO will also support pairwise search results where the pairwise sequence alignment itself is not available (e.g. the default BLAST tabular output). A crucial aspect of this work will be to design a pairwise-search-result object hierarchy that reflects this, probably with a subclass inheriting from both the pairwise-search-result and the existing MultipleSequenceAlignment object.
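To make the alignment-free case concrete, here is a minimal sketch of an iterator-based parser for the twelve default BLAST tabular columns, grouping rows by query. It uses only the standard library; the names Hsp and parse_blast_tab are hypothetical and are not part of Biopython or of the eventual SearchIO design.

```python
import io
from collections import namedtuple
from itertools import groupby

# The twelve default columns of BLAST tabular output.
FIELDS = ("qseqid sseqid pident length mismatch gapopen "
          "qstart qend sstart send evalue bitscore").split()
CASTS = (str, str, float, int, int, int, int, int, int, int, float, float)

# Hypothetical record type for illustration; not an existing Biopython class.
Hsp = namedtuple("Hsp", FIELDS)


def _parse_row(line):
    """Convert one tab-separated line into a typed Hsp record."""
    values = line.rstrip("\n").split("\t")
    return Hsp(*(cast(value) for cast, value in zip(CASTS, values)))


def parse_blast_tab(handle):
    """Lazily yield (query_id, [Hsp, ...]) pairs, one per query.

    Assumes all rows for a query are contiguous; '#' comment lines
    and blank lines are skipped.
    """
    rows = (_parse_row(line) for line in handle
            if line.strip() and not line.startswith("#"))
    for query_id, hsps in groupby(rows, key=lambda hsp: hsp.qseqid):
        yield query_id, list(hsps)


# Made-up sample data, for demonstration only.
SAMPLE = (
    "# example comment line\n"
    "q1\ts1\t98.50\t100\t1\t0\t1\t100\t5\t104\t1e-50\t180\n"
    "q1\ts2\t90.00\t80\t8\t0\t10\t89\t1\t80\t3e-20\t95.1\n"
    "q2\ts1\t85.00\t60\t9\t0\t1\t60\t200\t259\t2e-10\t60.2\n"
)

for query_id, hsps in parse_blast_tab(io.StringIO(SAMPLE)):
    print(query_id, [hsp.sseqid for hsp in hsps])
```

Each record carries coordinates and scores but no aligned sequences, which is why a result object for this format cannot simply be a MultipleSequenceAlignment.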

Beyond the initial challenge of an iterator based parsing and writing framework, random access akin to the Bio.SeqIO.index and index_db functionality would be most desirable for working with large datasets.
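One possible way to get random access is to record the byte offset at which each query's results begin, then seek to that offset on demand. The sketch below does this for tab-separated results using only the standard library; build_offset_index and fetch_lines are hypothetical names, and this is not a description of how Bio.SeqIO.index is implemented.

```python
import io


def build_offset_index(handle):
    """Map each query ID to the byte offset of its first result line.

    The handle must be opened in binary mode so that offsets are exact.
    """
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"#") or not line.strip():
            continue
        query_id = line.split(b"\t", 1)[0].decode()
        # Keep only the first offset seen for each query.
        index.setdefault(query_id, offset)
    return index


def fetch_lines(handle, index, query_id):
    """Seek to a query's first line and return its consecutive lines.

    Assumes the lines for one query are contiguous in the file.
    """
    handle.seek(index[query_id])
    prefix = query_id.encode() + b"\t"
    lines = []
    for line in iter(handle.readline, b""):
        if not line.startswith(prefix):
            break
        lines.append(line.decode().rstrip("\n"))
    return lines


# Made-up sample data, for demonstration only.
SAMPLE = (
    b"q1\ts1\t98.50\t100\t1\t0\t1\t100\t5\t104\t1e-50\t180\n"
    b"q1\ts2\t90.00\t80\t8\t0\t10\t89\t1\t80\t3e-20\t95.1\n"
    b"q2\ts1\t85.00\t60\t9\t0\t1\t60\t200\t259\t2e-10\t60.2\n"
)

handle = io.BytesIO(SAMPLE)
index = build_offset_index(handle)
print(fetch_lines(handle, index, "q2"))
```

Only the query IDs and offsets are held in memory, so the approach scales to files much larger than RAM; persisting the index to disk would be the analogue of index_db.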

Challenges 
The project will cover a range of important file formats from major Bioinformatics tools, thus will require familiarity with running these tools, and understanding their output and its meaning. Inter-converting file formats is part of this.
Involved toolkits or projects 
  • Biopython
Degree of difficulty and needed skills 
Medium/Hard depending on how many objectives are attempted. The student needs to be fluent in Python. Experience with all of the command line tools listed would be a clear advantage, as would first-hand experience using BioPerl's SearchIO. You will also need to know or learn the git version control system.
Mentors 
Peter Cock

Representation and manipulation of genomic variants

Rationale 
Computational analysis of genomic variation requires the ability to reliably communicate and manipulate variants. The goal of this project is to provide facilities within BioPython to represent sequence variation objects, convert them to and from common human and file representations, and provide common manipulations on them.
Approach and Goals
  • Object representation
    • identify variation types to be represented (SNV, CNV, repeats, inversions, etc)
    • develop internal machine representation for variation types
    • ensure coverage of essential standards, including HGVS, GFF, VCF
  • External representations
    • write parsers and generators between objects and external string and file formats
  • Manipulations
    • canonicalize variations with more than one valid representation (e.g., ins versus dup and left shifting repeats).
    • develop coordinate mapping between genomic, cDNA, and protein sequences (HGVS)
  • Other
    • release code to appropriate community efforts and write short manuscript
    • implement web service for HGVS conversion
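The "left shifting repeats" manipulation listed above can be sketched as follows. This is a minimal illustration on plain Python strings with 0-based, half-open coordinates; the function names are hypothetical and the coordinate convention is an assumption, not something the project has fixed.

```python
def left_shift_deletion(ref, start, end):
    """Return the leftmost (start, end) equivalent to deleting ref[start:end].

    Sliding a deletion one base to the left yields the same sequence
    whenever the base entering the window equals the base leaving it.
    """
    if not 0 <= start < end <= len(ref):
        raise ValueError("expected 0 <= start < end <= len(ref)")
    while start > 0 and ref[start - 1] == ref[end - 1]:
        start -= 1
        end -= 1
    return start, end


def left_shift_insertion(ref, pos, seq):
    """Return the leftmost (pos, seq) equivalent to inserting seq before ref[pos].

    Each step rotates the inserted sequence by one base.
    """
    if not seq or not 0 <= pos <= len(ref):
        raise ValueError("expected a non-empty seq and 0 <= pos <= len(ref)")
    while pos > 0 and ref[pos - 1] == seq[-1]:
        seq = seq[-1] + seq[:-1]
        pos -= 1
    return pos, seq


# Deleting the last "CA" unit of a CA repeat is the same event as
# deleting the first one, so both descriptions normalize identically.
ref = "GCACACAT"
start, end = left_shift_deletion(ref, 5, 7)
print(start, end, ref[:start] + ref[end:])
```

Normalizing every indel to one end of its repeat gives each event a single canonical description, which makes variants from different sources directly comparable.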
Challenges 
The major challenge in this project is to design an API that separates internal representations of variation from the multiple external representations. Ideally, the libraries developed in this project will provide low-level functionality of coordinate conversion and parsing, and high-level functionality for the most common use cases. This aim requires analyzing the proposals to determine which aspects may be impossible or difficult to represent with a simple grammar.
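As a rough illustration of that separation, the sketch below keeps one internal object for a single-base substitution and puts each external format in its own function. Everything here is hypothetical (the Substitution class, the function names, and the choice to reuse the reference identifier as the VCF CHROM value); it covers only the simple genomic substitution form of HGVS and the first five VCF columns.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Substitution:
    """Hypothetical internal representation of a single-base substitution.

    Positions are 1-based on the named reference sequence.
    """
    reference: str
    position: int
    ref: str
    alt: str


# Matches only simple genomic substitutions such as "ACCESSION:g.123A>G".
_HGVS_SUBSTITUTION = re.compile(
    r"^(?P<reference>[^:]+):g\.(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def from_hgvs(text):
    """Parse a simple HGVS genomic substitution into a Substitution."""
    match = _HGVS_SUBSTITUTION.match(text)
    if match is None:
        raise ValueError("unsupported HGVS expression: %r" % text)
    return Substitution(match["reference"], int(match["position"]),
                        match["ref"], match["alt"])


def to_hgvs(variant):
    """Format a Substitution as an HGVS genomic substitution string."""
    return "%s:g.%d%s>%s" % (variant.reference, variant.position,
                             variant.ref, variant.alt)


def to_vcf_fields(variant):
    """Return the first five VCF columns: CHROM, POS, ID, REF, ALT.

    Reusing the reference identifier as CHROM is a simplification.
    """
    return [variant.reference, str(variant.position), ".",
            variant.ref, variant.alt]


variant = from_hgvs("NC_000023.10:g.33038255C>A")
print(to_hgvs(variant))
print("\t".join(to_vcf_fields(variant)))
```

Because the formatters depend only on the internal object, supporting another external format means adding a function rather than changing the representation.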
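Resources
  • BioPython
  • PyVCF
  • pysam VCF support
  • Biopython GFF support
  • HGVS "nomenclature"
  • VCF, VCFtools
  • GFF3, GVF
  • GATK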
Degree of difficulty and needed skills 
Easy-to-Medium depending on how many objectives are attempted. The student will need to have skills in most or all of: basic molecular biology (genomes, transcripts, proteins), genomic variation, Python, BioPython, Perl, BioPerl, NCBI Eutilities and/or Ensembl API. Experience with computer grammars is highly desirable. You will also need to know or learn the git version control system.
Mentors
Reece Hart (Locus Development, San Francisco); Brad Chapman; James Casbon