Phylo cookbook

Revision as of 04:14, 26 June 2010

Here are some examples of using Bio.Phylo for common tasks. Some of these functions might be added to Biopython in a later release, but you can already use them in your own code with Biopython 1.54.

Convenience functions

Index clades by name

For large trees it can be useful to be able to select a clade by name, or some other unique identifier, rather than searching the whole tree for it during each operation.

def lookup_by_names(tree):
    names = {}
    for clade in tree.find_clades():
        if clade.name:
            if clade.name in names:
                raise ValueError("Duplicate key: %s" % clade.name)
            names[clade.name] = clade
    return names

Now you can retrieve a clade by name in constant time:

from Bio import Phylo

tree = Phylo.read('ncbi_taxonomy.xml', 'phyloxml')
names = lookup_by_names(tree)
for phylum in ('Apicomplexa', 'Euglenozoa', 'Fungi'):
    print("Phylum size %d" % len(names[phylum].get_terminals()))

A potential issue: lookup_by_names skips unnamed clades, which are generally internal nodes. We can fix this by giving every clade a unique identifier. Here, each existing clade name is prefixed with a unique number, and unnamed clades are named by the number alone (the numbers can be useful for searching, too). Note that this renames the clades in the tree itself:

def tabulate_names(tree):
    names = {}
    for idx, clade in enumerate(tree.find_clades()):
        if clade.name:
            clade.name = '%d_%s' % (idx, clade.name)
        else:
            clade.name = str(idx)
        names[clade.name] = clade
    return names
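
The same indexing idea extends to "some other unique identifier" besides the name. The following is a minimal sketch rather than part of Bio.Phylo: lookup_by and its key argument are hypothetical names introduced here for illustration, and the only Bio.Phylo call it assumes is tree.find_clades().

```python
def lookup_by(tree, key):
    """Index the clades of a tree by key(clade).

    `key` is any function that takes a clade and returns a hashable
    identifier, or None to leave that clade out of the index.
    (Hypothetical helper, not a Bio.Phylo function.)
    """
    index = {}
    for clade in tree.find_clades():
        value = key(clade)
        if value is None:
            # No identifier for this clade, so it cannot be indexed
            continue
        if value in index:
            raise ValueError("Duplicate key: %s" % (value,))
        index[value] = clade
    return index
```

With key set to lambda clade: clade.name, this builds the same kind of dictionary as lookup_by_names, except that an empty-string name is indexed rather than skipped.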


Comparing trees

TODO:

  • Symmetric difference / partition metric, a.k.a. topological distance
  • Quartets distance
  • Nearest-neighbor interchange
  • Path-length-difference
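
As a starting point for the first item, the symmetric difference can be sketched by collecting the tip-name splits (bipartitions) implied by each internal clade and counting the splits found in only one of the two trees. This is a rough sketch under stated assumptions, not a Bio.Phylo function: bipartitions and symmetric_difference are hypothetical names, every tip is assumed to have a unique, non-empty name, both trees are treated as unrooted, and the only tree methods used are get_terminals() and get_nonterminals().

```python
def bipartitions(tree):
    """Return (tip names, set of non-trivial splits) for a tree.

    Each split is stored as the frozenset of tip names on the side
    that does NOT contain a fixed reference tip, so a split and its
    complement are counted once (the tree is treated as unrooted).
    """
    all_tips = frozenset(tip.name for tip in tree.get_terminals())
    reference = min(all_tips)
    splits = set()
    for clade in tree.get_nonterminals():
        side = frozenset(tip.name for tip in clade.get_terminals())
        if reference in side:
            side = all_tips - side
        # Keep only splits with at least two tips on each side
        if 1 < len(side) < len(all_tips) - 1:
            splits.add(side)
    return all_tips, splits


def symmetric_difference(tree1, tree2):
    """Count the splits present in exactly one of the two trees."""
    tips1, splits1 = bipartitions(tree1)
    tips2, splits2 = bipartitions(tree2)
    if tips1 != tips2:
        raise ValueError("Trees must have the same set of tip names")
    return len(splits1 ^ splits2)
```

Two trees with identical unrooted topology give 0, regardless of where each one is rooted.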


Consensus methods

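TODO:

  • Majority-rules consensus
  • Adams (Adams 1972)
  • Asymmetric median tree (Phillips and Warnow 1996)
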

Rooting methods

TODO:

  • Root at the midpoint between the two most distant nodes (or "center" of all tips)
  • Root with the given outgroup (terminal or nonterminal)


Graphics

TODO:

  • Party tricks with draw_graphviz, covering each keyword argument


Exporting to other types

Convert to a PyCogent tree

The tree objects used by Biopython and PyCogent are different. Nonetheless, both toolkits support the Newick file format, so interoperability is straightforward at that level:

from Bio import Phylo
import cogent

# bptree is an existing Bio.Phylo tree object
Phylo.write(bptree, 'mytree.nwk', 'newick')  # write the Biopython tree to a Newick file
ctree = cogent.LoadTree('mytree.nwk')        # load that file as a PyCogent tree

TODO:

  • Convert objects directly, preserving some PhyloXML annotations if possible


Convert to a NumPy array or matrix

TODO:

  • Adjacency matrix: cells are True if parent-child relationship exists, otherwise False
  • Distance matrix: cells are branch lengths if a branch exists, otherwise Inf or NaN
  • Relationship matrix? See Martins and Housworth 2002
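
The first two items could be sketched as follows. This is one possible approach under stated assumptions, not an existing Bio.Phylo feature: to_adjacency_matrix and to_distance_matrix are hypothetical names, rows index parents and columns index children (so the matrices are directed), and a branch whose length is unset is recorded as NaN. The only tree features assumed are tree.find_clades() and each clade's clades and branch_length attributes.

```python
import numpy


def _clade_index(tree):
    """List all clades and map each clade's id() to its row/column."""
    clades = list(tree.find_clades())
    # Keyed by id() so the lookup does not depend on how clades hash
    index = dict((id(clade), i) for i, clade in enumerate(clades))
    return clades, index


def to_adjacency_matrix(tree):
    """Return (clades, matrix); a cell is True where row is parent of column."""
    clades, index = _clade_index(tree)
    adjmat = numpy.zeros((len(clades), len(clades)), dtype=bool)
    for parent in clades:
        for child in parent.clades:
            adjmat[index[id(parent)], index[id(child)]] = True
    return clades, adjmat


def to_distance_matrix(tree):
    """Return (clades, matrix) of branch lengths; Inf where no branch exists."""
    clades, index = _clade_index(tree)
    distmat = numpy.full((len(clades), len(clades)), numpy.inf)
    for parent in clades:
        for child in parent.clades:
            length = child.branch_length
            # A branch with no recorded length is marked NaN, not Inf
            distmat[index[id(parent)], index[id(child)]] = (
                numpy.nan if length is None else length)
    return clades, distmat
```

The returned list of clades gives the row and column order, so clades[i] is the clade described by row i and column i of the matrix.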