GSOC2011 MocapyExt

(Difference between revisions)
Jump to: navigation, search
(References)
Line 1: Line 1:
 
BioPython is a very popular library in Bioinformatics and Computational Biology.Mocapy++1 is a machine learning toolkit for parameter learning and inference in dynamic Bayesian networks (DBNs)<cite>PaluszewskiHamelryck2010</cite>, which encode probabilistic relationships among random variables in a domain. The library supports a wide spectrum of DBN architectures and probability distributions, including distributions from directional statistics. Notably, Kent distribution on the sphere and the bivariate von Mises distribution on the torus, which have proven to be useful in formulating probabilistic models of protein and RNA structure.
 
BioPython is a very popular library in Bioinformatics and Computational Biology.Mocapy++1 is a machine learning toolkit for parameter learning and inference in dynamic Bayesian networks (DBNs)<cite>PaluszewskiHamelryck2010</cite>, which encode probabilistic relationships among random variables in a domain. The library supports a wide spectrum of DBN architectures and probability distributions, including distributions from directional statistics. Notably, Kent distribution on the sphere and the bivariate von Mises distribution on the torus, which have proven to be useful in formulating probabilistic models of protein and RNA structure.
  
Such a highly useful and powerful library, which has been used in such projects as TorusDBN <cite>BMTFKH2008</cite>, Basilisk<cite>HBPFJH2010</cite><cite>PaluszewskiWinter2008</cite> with great success, is the result of the long-term effort. The original Mocapy implementation dates back to 2004, and since then the library has been rewritten in C++. However, C++ is a statically typed and compiled programming language, which does not facilitate rapid prototyping. As a result, currently Mocapy++ has
+
Such a highly useful and powerful library, which has been used in such projects as TorusDBN<cite>BMTFKH2008</cite>, Basilisk<cite>HBPFJH2010</cite>, FB5HMM<cite>HKK2006</cite><cite>PaluszewskiWinter2008</cite> with great success, is the result of the long-term effort. The original Mocapy implementation dates back to 2004, and since then the library has been rewritten in C++. However, C++ is a statically typed and compiled programming language, which does not facilitate rapid prototyping. As a result, currently Mocapy++ has
 
no provisions for dynamic loading of custom node types, and a mechanism to plug-in new node types that would not require to modify and recompile the library is of interest. Such a plug-in interface would assist rapid prototyping by allowing to quickly implement and test new probability distributions, which, in turn, could substantially reduce development time and effort; the user would be empowered to extend Mocapy++ without modifications and subsequent recompilations. Recognizing this need, the project (herein referred as MocapyEXT), with the aim to improve the current Mocapy++ node type extension mechanism, has been proposed by T. Hamelryck.
 
no provisions for dynamic loading of custom node types, and a mechanism to plug-in new node types that would not require to modify and recompile the library is of interest. Such a plug-in interface would assist rapid prototyping by allowing to quickly implement and test new probability distributions, which, in turn, could substantially reduce development time and effort; the user would be empowered to extend Mocapy++ without modifications and subsequent recompilations. Recognizing this need, the project (herein referred as MocapyEXT), with the aim to improve the current Mocapy++ node type extension mechanism, has been proposed by T. Hamelryck.
  

Revision as of 16:51, 26 August 2011

BioPython is a very popular library in Bioinformatics and Computational Biology.Mocapy++1 is a machine learning toolkit for parameter learning and inference in dynamic Bayesian networks (DBNs)[1], which encode probabilistic relationships among random variables in a domain. The library supports a wide spectrum of DBN architectures and probability distributions, including distributions from directional statistics. Notably, Kent distribution on the sphere and the bivariate von Mises distribution on the torus, which have proven to be useful in formulating probabilistic models of protein and RNA structure.

Such a highly useful and powerful library, which has been used in such projects as TorusDBN[2], Basilisk[3], FB5HMM[4][5] with great success, is the result of the long-term effort. The original Mocapy implementation dates back to 2004, and since then the library has been rewritten in C++. However, C++ is a statically typed and compiled programming language, which does not facilitate rapid prototyping. As a result, currently Mocapy++ has no provisions for dynamic loading of custom node types, and a mechanism to plug-in new node types that would not require to modify and recompile the library is of interest. Such a plug-in interface would assist rapid prototyping by allowing to quickly implement and test new probability distributions, which, in turn, could substantially reduce development time and effort; the user would be empowered to extend Mocapy++ without modifications and subsequent recompilations. Recognizing this need, the project (herein referred as MocapyEXT), with the aim to improve the current Mocapy++ node type extension mechanism, has been proposed by T. Hamelryck.


Contents

Introduction

Mocapy++ is an open source toolkit that is freely available under the GNU General Public Licence (GPL) from SourceForge. Mocapy++ is used for parameter learning and inference in dynamic Bayesian networks (DBNs), which encode probabilistic relationships among random variables in a domain. The library supports a wide spectrum of DBN architectures and probability distributions, including distributions from directional statistics; notably, Kent distribution on the sphere and the bivariate von Mises distribution on the torus, which have proven to be useful in formulating probabilistic models of protein and RNA structure. This makes Mocapy++ especially suitable for biomolecular structure prediction.

The goal of the MocapyExt project is to develop a plugin interface for Mocapy++. This plugin interface will allow to quickly implement and test new probability distributions and thus save development time and effort as it then becomes possible to extend Mocapy++ without undue modifications and subsequent recompilations.

Author & Mentors

Justinas V. Daugmaudis vygis.d@gmail.com

Mentors

Thomas Hamelryck
Eric Talevich

Work Plan

Gain understanding of S-EM and directional statistics

  • Review the course material in SCI-B2-1011-Structural bioinformatics;
  • Study the theory for Markov chain Monte Carlo methods and dynamic Bayesian networks;
  • Supplement the base knowledge by reading relevant papers.


Study of Mocapy++ use cases

  • Study the Mocapy++ implementation of Stochastic Expectation Maximization (S-EM) and parameter learning of Bayesian networks;
  • Review examples distributed with Mocapy++ library.


Study of Mocapy++ internals and code

  • Distill the relevant concepts for nodes and densities. Currently, there are no such concepts defined within the Mocapy++ framework;
  • Compare with other examples that implement probability distributions.


Design of Mocapy++ plugin interface

  • Plugin module has to implement the model of NodeConcept/DensityConcept;
  • All the models have to be uniformly accessible through the plugin system.


Implement Mocapy++ plugin module

  • Implement test cases in C++ using the new plugin interface;
  • Provide Python bindings for the defined interface;
  • Implement test cases in Python. Verify that this plugin system can be used transparently in Python.


Experiment with the modular Mocapy++ architecture

  • Create sample applications that showcase the advantages of modularity that is provided by the plugin system in Mocapy++;
  • Measure performance.

Schedule

Week 1-2 [May 23th -- June 5th]

Design of interface strategy: it is hard to overstate the importance of the plugin API feeling "natural" to a Mocapy++ user. So much of the time will be devoted to the design of the interface.


Week 3-5 [June 6th -- June 19th]

Implement the plugin module.


Week 6-7 [June 20th -- June 30th]

Implement some examples that would showcase Mocapy++ plugin system in action; for example, how to use externally implemented logistic distribution.


Week 7-8 (Midterm) [July 1st -- July 10th]

  • Thorough testing and consolidating the features. Writing the documentation for each feature.
  • Mid-term Evaluations. Discuss with the mentor the state of the project and adjust the schedule to comply with project's needs.


Week 9-10 [July 11th -- July 24th]

Example applications. Should focus on the type reflection capabilities of the plugin module.


Week 11-12 [July 25th -- August 7th]

Update the documentation to reflect the new functions. Also document the examples, do any scrub work of code, etc. Planned 'pencils down' date.


Week 13-14 [August 8th -- August 21st]

Introduce the bindings to the wider audiences, gather the opinions and the reviews in community.


Week 15 [August 22nd]

The end of the project.

Source Code

Hosted at the gSoC11 Mocapy branch

Progress

Definitions

Prototype implementation

Embedding the Python interpreter

For a client program to be able to execute Python code, it is necessary to initialize the scripting environment. It is done by invoking the Py_Initialize() function. The interpreter is released by invoking the Py_Finalize() function.

It is important to note, however, that any statement between Py_Initialize() and Py_Finalize() calls could throw an exception. If an exception is thrown, it must be either handled in a try/catch block, or the program must be terminated. Having this in mind, an previous example program could be made more robust by enveloping the statements between Py_Initialize() and Py_Finalize() in a try/catch block. Py_Finalize() safety could not be overstated. Currently Boost.Python has several global (or function-static) objects whose existence keeps reference counts from dropping to zero until the Boost.Python shared object is unloaded. This can cause a crash because when the reference counts do go to zero, there is no interpreter. This poses a question if such a method of initializing the Python interpreter could be considered to be "easy to use", "safe" or even "non-intrusive", which are the major MocapyEXT design tenets.

The solution to this issue that is easy to use, safe and non-intrusive is surprisingly elegant and showcases the RAII (Resource Acquisition Is Initialization) idiom.

The lifetime of the Python interpreter

The Python interpreter is initialized before the entry to the main function, and is released after the exit from the main function. No additional end-user effort is required, except to include the headers that define the necessary Python plug-in wrappers.

Compile/link dependencies

The Python interpreter shall be linked to the client program (not to the Mocapy++ library). The end-user of the MocapyEXT shall be responsible for additional include and/or library paths to the Python headers and libraries that are necessary to compile and link the client program successfully.

MocapyEXT also instantiates its static data members in such a manner that they are instantiated only once and in the library's compilation unit. In the C++ standard it is explicitly stated that

"... in particular, the initialization (and any associated side-effects) of a static data member does not occur unless the static data member is itself used in a way that requires the definition of the static data member to exist."[cpp_std2003, temp.inst]

The user shall not need to manage the instantiated static data members.

Thread safety guarantees

The MocapyEXT plug-ins have the following thread-safety guarantees:

  • it is safe to call plug-in member functions from multiple threads;
  • it is safe to perform concurrent read-only accesses of the passed parameter containers;
  • it is safe to perform concurrent read/write accesses, if there is only one read or write access to the passed factual parameters at a time.

Concurrent mutable access of the passed parameter containers requires synchronization, for example via reader-writer lock. Mutable access includes altering values within the container, invoking member functions that invalidate iterators, moving the container via the move constructor.

The module search path

When a module foo that contains ESS or Densities definitions is being imported, the embedded Python interpreter searches for a file named foo.py in the list of directories specified by the environment variable PYTHONPATH and the current path.

Before proceeding to import a module, sys.path list is explicitly appended with the current path. A user that relies on scripts being in some other directory than a client program should list the paths in the PYTHONPATH environment variable.

Note that because the interpreter also searches in an installation-dependent default path, it is important that the file containing ESS or Densities definitions not have the same name as a standard module.

Plugin registration

The minimal interface import example shows the usage of three plug-in classes: densities_plugin, ess_plugin and aggregate_plugin. The intended usage pattern for densities_plugin and ess_plugin are cases when types ESS and Densities are implemented in separate files -- one-class-per-file. Each class is loaded in their own interpreter context. The purpose of aggregate_plugin is to simplify the management of the the loaded types and to optimize the resources allocated for the embedded Python interpreter as aggregate_plugin creates only one instance of the embedded interpreter for both types ESS and Densities.

 #include <mocapy/plugin/ess_plugin.hpp>
 #include <mocapy/plugin/densities_plugin.hpp>
 #include <mocapy/plugin/aggregate_plugin.hpp>
 
 int main()
 {
   using namespace mocapy::ext;
   densities_plugin dens_pl("plugin_tests", "DensitiesPython");
   ess_plugin ess_pl("plugin_tests", "ESSPython");
   aggregate_plugin aggr_pl("plugin_tests", "ESSPython", "DensitiesPython");
   return 0;
 }

Basically, a user provides a name of a python library, a name of class and a list of factual arguments necessary for the construction of the model of the class. The MocapyExt library in turn then returns an instance of the new class.

 densities_plugin p("test_module", "DensitiesClassName");
 densities_adapter n = p.densities(arg1, arg2, ..., argN);

The given argument list (up to N = 6 arguments are supported) would parametrize the initialization of Densities object. Parameters are forwarded to the Python API and converted using the type conversion facilities provided by the Boost.Python library. The mechanism is general and the user can convert any arbitrary type T to its equivalent in Python.

The parameters are forwarded via const-references. This method accepts and forwards arbitrary typed arguments, at the cost of always treating the argument as const. This solution is typically used for constructor arguments. An esoteric problem with this approach is that it is not possible to form a const reference to a function type, but this deficiency, which is being addressed in Core Issue 295, actually works in our favor, because this prevents ill-intended attempts to pass a function to the Python interpreter.

Any further invocation of ess and dens member functions automatically calls a corresponding class method in Python.

The lifetime of a class instance

The lifetime of ESS computer and Densities objects (respectively, ess and dens) may exceed the lifetime of their respective factory plugin(s). Objects ess and dens preserve reference counted pointers to the Python interpreter, thus a valid Python interpreter instance exists until the destruction of objects ess and dens.

Creation of a node

This is a simple example that showcases two different methods to initialize a plugin node. In this particular case, the module plugin_tests that has been used in the program implements a dummy discrete node of fixed length 1, and always returns [0,] as sampled value.

 #include <mocapy/plugin/ess_plugin.hpp>
 #include <mocapy/plugin/densities_plugin.hpp>
 #include <mocapy/plugin/aggregate_plugin.hpp>
 #include <mocapy/plugin/plugin_node.hpp>
 
 int main()
 {
   using namespace mocapy::ext;
   // Node initialization from two separately loaded modules
   {
     ess_plugin ess("plugin_tests", "ESSPython");
     densities_plugin dens("plugin_tests", "DensitiesPython");
     plugin_node_type node(ess.ess(arg1, arg2, ..., argN), dens.densities(arg1, arg2, ..., argN));
   }
   // Node initialization from a single loaded module
   {
     aggregate_plugin pl("plugin_tests", "ESSPython", "DensitiesPython");
     plugin_node_type node(pl.ess(arg1, arg2, ..., argN), pl.densities(arg1, arg2, ..., argN));
   }
   return 0;
 }

plugin_node_type is a typedef of the ChildNode class template. Here we also note that the lifetime of a node is independent from the lifetime of plug-in factories.

Node output

A plugin node is Streamable, i.e., it can be output to a std::ostream via the operator

std::ostream& operator<<(std::ostream&, const plugin_node_type&);

The content of the output is the same as repr(Z), where Z is either ESS computer or Densities object.

Node serialization

The MocapyEXT preserves the old Mocapy++ ChildNode class template serialization behaviour, which means that for ESS computers that do not need to be serialized (this includes kentess, gaussianess, poissoness, and others) no serialization will happen. However, the MocapyEXT necessarily has to serialize the Python plugin that implements an ESS computer in order to preserve the module and the name of the class, so that the ESS computer could be later deserialized.

It also must be noted that after deserialization the ESS computer and Densities will not share the same Python interpreter instance, even if they are implemented in the same module.

Measuring performance

It is interesting to measure how much plugin interface costs compared to the implementation of ESS and Densities computers themselves.

We measure the relative performance of tests named N, S, and A. All tests invoke the member functions of the ESS and Densities computers, albeit no particular computation is carried out.

N test stands for N(ative), which means that pure C++ implementation of ESS and Densities computers were used in the test.

S test stands for S(eparate); Python classes are loaded in separate Python interpreter instances.

A test stands for A(ggregate); ESS and Densities computers are loaded in the same Python interpreter instance.


            Conf. int.     4.53e+04     5.03e+04     5.53e+04

Speedup percentages based on Wilcoxon matched pair confidence intervals.

  Func. A vs   Func. B      Minimum       Median      Maximum
                         (% faster)   (% faster)   (% faster)
     N    vs   S                717          719          724
     N    vs   A                718          725          730
     A    vs   S             -0.673       -0.251       0.0774


Test results show that using MocapyEXT plugin interface incurs some performance penalty. It must be noted, however, that tests also performed repetitive construction of ESS and Densities instances.


            Conf. int.     1.94e+04     2.15e+04     2.37e+04

Speedup percentages based on Wilcoxon matched pair confidence intervals.

  Func. A vs   Func. B      Minimum       Median      Maximum
                         (% faster)   (% faster)   (% faster)
     N    vs   S                240          242          243
     N    vs   A                239          243          245
     A    vs   S              -1.24        -0.51        -0.25


Repeated tests without the construction of ESS and Densities objects show that invocation of a method through the MocapyEXT interface is only ~2,5 times slower than to invoke a member function of natively implemented ESS and Densities objects. It is obvious that in a real-world scenario node construction, sampling, likelihood computation, etc. would be more likely to constitute the better part of the running time as compared to the invocations of methods themselves. Basically, these tests show that member function invocation takes much less time than even such a 'lightweight' operation as node construction. In principle it is now possible to write quite generic algorithms. Consider this example:

  template<typename Densities>
  inline void do_something_with_densities(Densities dens) // Generic function 
  {
    std::vector<unsigned int> ps(1, 2);
    dens.construct(ps);
    /* do something here */
  }
 
  // Bippo Densities though the Python interface
  densities_plugin dens("mocapy.bippo", "BippoDensities");
  do_something_with_densities( dens.densities(lambdas, thetas, ns) );
 
  // Direct use of Bippo Densities
  mocapy::BippoDensities bd(lambdas, thetas, ns);
  do_something_with_densities(bd);

We can be reasonably sure that performance penalty of using MocapyEXT, while not negligible, will not outweight the added value of genericity. And in the pure C++ scenario it does not incur any performance cost.

References

  1. M. Paluszewski and T. Hamelryck. Mocapy++ -- a toolkit for inference and learning in dynamic bayesian networks. BMC bioinformatics, 11(1):126, 2010. [PaluszewskiHamelryck2010]
  2. W. Boomsma, K.V. Mardia, C.C. Taylor, J. Ferkinghoff-Borg, A. Krogh, and T. Hamelryck. A generative, probabilistic model of local protein structure. Proceedings of the National Academy of Sciences, 105(26):8932–8937, 2008. [BMTFKH2008]
  3. Tim Harder, Wouter Boomsma, Martin Paluszewski, Jes Frellsen, Kristoffer Johansson, and Thomas Hamelryck. Beyond rotamers: a generative, probabilistic model of side chains in proteins. BMC Bioinformatics, 11(1):306, 2010. [HBPFJH2010]
  4. M. Paluszewski and P. Winter. Protein decoy generation using branch and bound with efficient bounding. Algorithms in Bioinformatics, pages 382–393, 2008. [PaluszewskiWinter2008]
Personal tools
Namespaces
Variants
Actions
Navigation
Toolbox