# Phylo cookbook

EricTalevich (Talk | contribs) (→Comparing trees: techniques mentioned in Inferring Phylogenies (Felsenstein 2004)) |
m (→Convert to an 'ape' tree, via Rpy2) |
||

(13 intermediate revisions by one user not shown) | |||

Line 1: | Line 1: | ||

Here are some examples of using [[Phylo|Bio.Phylo]] for some likely tasks. Some of these functions might be added to Biopython in a later release, but you can use them in your own code with Biopython 1.54. | Here are some examples of using [[Phylo|Bio.Phylo]] for some likely tasks. Some of these functions might be added to Biopython in a later release, but you can use them in your own code with Biopython 1.54. | ||

− | == | + | ==Convenience functions== |

− | + | ===Get the parent of a clade=== | |

− | + | The Tree data structures in Bio.Phylo don't store parent references for each clade. Instead, the <code>get_path</code> method can be used to trace the path of parent-child links from the tree root to the clade of choice: | |

− | + | ||

− | + | <python> | |

+ | def get_parent(tree, child_clade): | ||

+ | node_path = tree.get_path(child_clade) | ||

+ | return node_path[-2] | ||

+ | |||

+ | # Select a clade | ||

+ | myclade = tree.find_clades("foo").next() | ||

+ | # Test the function | ||

+ | parent = get_parent(tree, myclade) | ||

+ | assert myclade in parent | ||

+ | </python> | ||

+ | |||

+ | Note that <code>get_path</code> has a linear run time with respect to the size of the tree -- i.e. for best performance, don't call <code>get_parent</code> or <code>get_path</code> inside a time-critical loop. If possible, call <code>get_path</code> outside the loop, and look up parents in the list returned by that function. | ||

+ | |||

+ | Alternately, if you need to repeatedly look up the parents of arbitrary tree elements, create a dictionary mapping all nodes to their parents: | ||

+ | |||

+ | <python> | ||

+ | def all_parents(tree): | ||

+ | parents = {} | ||

+ | for clade in tree.find_clades(order='level'): | ||

+ | for child in clade: | ||

+ | parents[child] = clade | ||

+ | return parents | ||

+ | |||

+ | # Example | ||

+ | parents = all_parents(tree) | ||

+ | myclade = tree.find_clades("foo").next() | ||

+ | parent_of_myclade = parents[myclade] | ||

+ | assert myclade in parent_of_myclade | ||

+ | </python> | ||

+ | |||

+ | |||

+ | ===Index clades by name=== | ||

+ | |||

+ | For large trees it can be useful to be able to select a clade by name, or some other unique identifier, rather than searching the whole tree for it during each operation. | ||

+ | |||

+ | <python> | ||

+ | def lookup_by_names(tree): | ||

+ | names = {} | ||

+ | for clade in tree.find_clades(): | ||

+ | if clade.name: | ||

+ | if clade.name in names: | ||

+ | raise ValueError("Duplicate key: %s" % clade.name) | ||

+ | names[clade.name] = clade | ||

+ | return names | ||

+ | </python> | ||

+ | |||

+ | Now you can retrieve a clade by name in constant time: | ||

+ | |||

+ | <python> | ||

+ | tree = Phylo.read('ncbi_taxonomy.xml', 'phyloxml') | ||

+ | names = lookup_by_names(tree) | ||

+ | for phylum in ('Apicomplexa', 'Euglenozoa', 'Fungi'): | ||

+ | print "Phylum size", len(names[phylum].get_terminals()) | ||

+ | </python> | ||

+ | |||

+ | A potential issue: The above implementation of lookup_by_names doesn't include unnamed clades, generally internal nodes. We can fix this by adding a unique identifier for each clade. Here, all clade names are prefixed with a unique number (which can be useful for searching, too): | ||

+ | |||

+ | <python> | ||

+ | def tabulate_names(tree): | ||

+ | names = {} | ||

+ | for idx, clade in enumerate(tree.find_clades()): | ||

+ | if clade.name: | ||

+ | clade.name = '%d_%s' % (idx, clade.name) | ||

+ | else: | ||

+ | clade.name = str(idx) | ||

+ | names[clade.name] = clade | ||

+ | return clade | ||

+ | </python> | ||

+ | |||

+ | |||

+ | ===Calculate distances between neighboring terminals=== | ||

+ | |||

+ | ''Suggested by Joel Berendzen'' | ||

+ | |||

+ | <python> | ||

+ | import itertools | ||

+ | |||

+ | def terminal_neighbor_dists(self): | ||

+ | """Return a list of distances between adjacent terminals.""" | ||

+ | def generate_pairs(self): | ||

+ | pairs = itertools.tee(self) | ||

+ | pairs[1].next() | ||

+ | return itertools.izip(pairs[0], pairs[1]) | ||

+ | return [self.distance(*i) for i in | ||

+ | generate_pairs(self.find_clades(terminal=True))] | ||

+ | </python> | ||

+ | |||

+ | |||

+ | ===Test for "semi-preterminal" clades=== | ||

+ | |||

+ | ''Suggested by Joel Berendzen'' | ||

+ | |||

+ | The existing tree method <code>is_preterminal</code> returns True if all of the direct descendants are terminal. This snippet will instead return True if ''any'' direct descendent is terminal, but still False if the given clade itself is terminal. | ||

+ | |||

+ | <python> | ||

+ | def is_semipreterminal(clade): | ||

+ | """True if any direct descendent is terminal.""" | ||

+ | for child in clade: | ||

+ | if child.is_terminal(): | ||

+ | return True | ||

+ | return False | ||

+ | </python> | ||

+ | |||

+ | In Python 2.5 and later, this is simplified with the built-in <code>any</code> function: | ||

+ | |||

+ | <python> | ||

+ | def is_semipreterminal(clade): | ||

+ | return any(child.is_terminal() for child in clade) | ||

+ | </python> | ||

==Comparing trees== | ==Comparing trees== | ||

Line 13: | Line 122: | ||

''TODO:'' | ''TODO:'' | ||

− | * Symmetric difference / partition metric, a.k.a. topological distance | + | * Symmetric difference / partition metric, a.k.a. topological distance (Robinson-Foulds) |

* Quartets distance | * Quartets distance | ||

* Nearest-neighbor interchange | * Nearest-neighbor interchange | ||

* Path-length-difference | * Path-length-difference | ||

+ | |||

+ | ==Consensus methods== | ||

+ | |||

+ | ''TODO:'' | ||

+ | |||

+ | * Majority-rules consensus | ||

+ | * Strict consensus | ||

+ | * Adams ([http://www.faculty.biol.ttu.edu/Strauss/Phylogenetics/Readings/Adams1972.pdf Adams 1972]) | ||

+ | * Asymmetric median tree ([http://www.springerlink.com/content/y1x70058822qg257/ Phillips and Warnow 1996]) | ||

==Rooting methods== | ==Rooting methods== | ||

+ | |||

+ | The basic method on the Tree class (not TreeMixin) is <code>root_with_outgroup</code>: | ||

+ | |||

+ | <python> | ||

+ | tree = Phylo.read('example.nwk', 'newick') | ||

+ | print tree | ||

+ | # ... | ||

+ | tree.root_with_outgroups({'name': 'A'}) # Operates in-place | ||

+ | print tree | ||

+ | </python> | ||

+ | |||

+ | Normally you'll want the outgroup to be a monophyletic group, rather than a single taxon. This isn't automatically checked, but you can do it yourself with the <code>is_monophyletic</code> method. | ||

+ | |||

+ | To save some typing, try keeping the query in a separate list and reusing it: | ||

+ | |||

+ | <python> | ||

+ | outgroup = [{'name': taxon_name} for taxon_name in ('E', 'F', 'G')] | ||

+ | if tree.is_monophyletic(outgroup): | ||

+ | tree.root_with_outgroup(*outgroup) | ||

+ | else: | ||

+ | raise ValueError("outgroup is paraphyletic") | ||

+ | </python> | ||

''TODO:'' | ''TODO:'' | ||

* Root at the midpoint between the two most distant nodes (or "center" of all tips) | * Root at the midpoint between the two most distant nodes (or "center" of all tips) | ||

− | |||

− | |||

==Graphics== | ==Graphics== | ||

Line 31: | Line 169: | ||

* Party tricks with <code>draw_graphviz</code>, covering each keyword argument | * Party tricks with <code>draw_graphviz</code>, covering each keyword argument | ||

+ | |||

==Exporting to other types== | ==Exporting to other types== | ||

+ | |||

+ | ===Convert to an 'ape' tree, via Rpy2=== | ||

+ | |||

+ | The R statistical programming environment provides support for phylogenetics through the '[http://ape.mpl.ird.fr/ ape]' package and several others that build on top of 'ape'. The Python package [http://rpy.sourceforge.net/rpy2.html rpy2] provides an interface between R and Python, so it's possible to convert a Bio.Phylo tree into an 'ape' tree object: | ||

+ | |||

+ | <python> | ||

+ | import tempfile | ||

+ | from rpy2.robjects import r | ||

+ | |||

+ | def to_ape(tree): | ||

+ | """Convert a tree to the type used by the R package `ape`, via rpy2. | ||

+ | |||

+ | Requirements: | ||

+ | - Python package `rpy2` | ||

+ | - R package `ape` | ||

+ | """ | ||

+ | with tempfile.NamedTemporaryFile() as tmpf: | ||

+ | Phylo.write(tree, tmpf, 'newick') | ||

+ | tmpf.flush() | ||

+ | rtree = r(""" | ||

+ | library('ape') | ||

+ | read.tree('%s') | ||

+ | """ % tmpf.name) | ||

+ | return rtree | ||

+ | </python> | ||

+ | |||

+ | See that it works: | ||

+ | |||

+ | <python> | ||

+ | >>> from StringIO import StringIO | ||

+ | >>> from Bio import Phylo | ||

+ | >>> tree = Phylo.read(StringIO('(A,(B,C),(D,E));'), 'newick') | ||

+ | >>> rtree = to_ape(tree) | ||

+ | >>> len(rtree) | ||

+ | 3 | ||

+ | >>> print r.summary(rtree) | ||

+ | Phylogenetic tree: structure(list(edge = structure(c(6, 6, 7, 7, 6, 8, 8, 1, 7, 2, 3, 8, 4, 5), | ||

+ | .Dim = c(7L, 2L)), tip.label = c("A", "B", "C", "D", "E"), Nnode = 3L), | ||

+ | .Names = c("edge", "tip.label", "Nnode" ), class = "phylo") | ||

+ | |||

+ | Number of tips: 5 | ||

+ | Number of nodes: 3 | ||

+ | No branch lengths. | ||

+ | No root edge. | ||

+ | Tip labels: A | ||

+ | B | ||

+ | C | ||

+ | D | ||

+ | E | ||

+ | No node labels. | ||

+ | NULL | ||

+ | >>> r.plot(rtree) | ||

+ | </python> | ||

+ | |||

+ | See the rpy2 documentation for further guidance. | ||

===Convert to a PyCogent tree=== | ===Convert to a PyCogent tree=== | ||

Line 53: | Line 247: | ||

===Convert to a NumPy array or matrix=== | ===Convert to a NumPy array or matrix=== | ||

− | |||

− | + | Adjacency matrix: cells are 1 (true) if a parent-child relationship exists, otherwise 0 (false). | |

− | + | ||

+ | <python> | ||

+ | import numpy | ||

+ | |||

+ | def to_adjacency_matrix(tree): | ||

+ | """Create an adjacency matrix (NumPy array) from clades/branches in tree. | ||

+ | |||

+ | Also returns a list of all clades in tree ("allclades"), where the position | ||

+ | of each clade in the list corresponds to a row and column of the numpy | ||

+ | array: a cell (i,j) in the array is 1 if there is a branch from allclades[i] | ||

+ | to allclades[j], otherwise 0. | ||

+ | |||

+ | Returns a tuple of (allclades, adjacency_matrix) where allclades is a list | ||

+ | of clades and adjacency_matrix is a NumPy 2D array. | ||

+ | """ | ||

+ | allclades = list(tree.find_clades(order='level')) | ||

+ | lookup = {} | ||

+ | for i, elem in enumerate(allclades): | ||

+ | lookup[elem] = i | ||

+ | adjmat = numpy.zeros((len(allclades), len(allclades))) | ||

+ | for parent in tree.find_clades(terminal=False, order='level'): | ||

+ | for child in parent.clades: | ||

+ | adjmat[lookup[parent], lookup[child]] = 1 | ||

+ | if not tree.rooted: | ||

+ | # Branches can go from "child" to "parent" in unrooted trees | ||

+ | adjmat += adjmat.transpose | ||

+ | return (allclades, numpy.matrix(adjmat)) | ||

+ | </python> | ||

+ | |||

+ | |||

+ | Distance matrix: cell values are branch lengths if a branch exists, otherwise infinity. (This plays well with graph algorithms.) | ||

+ | |||

+ | <python> | ||

+ | import numpy | ||

+ | |||

+ | def to_distance_matrix(tree): | ||

+ | """Create a distance matrix (NumPy array) from clades/branches in tree. | ||

+ | |||

+ | A cell (i,j) in the array is the length of the branch between allclades[i] | ||

+ | and allclades[j], if a branch exists, otherwise infinity. | ||

+ | |||

+ | Returns a tuple of (allclades, distance_matrix) where allclades is a list of | ||

+ | clades and distance_matrix is a NumPy 2D array. | ||

+ | """ | ||

+ | allclades = list(tree.find_clades(order='level')) | ||

+ | lookup = {} | ||

+ | for i, elem in enumerate(allclades): | ||

+ | lookup[elem] = i | ||

+ | distmat = numpy.repeat(numpy.inf, len(allclades)**2) | ||

+ | distmat.shape = (len(allclades), len(allclades)) | ||

+ | for parent in tree.find_clades(terminal=False, order='level'): | ||

+ | for child in parent.clades: | ||

+ | if child.branch_length: | ||

+ | distmat[lookup[parent], lookup[child]] = child.branch_length | ||

+ | if not tree.rooted: | ||

+ | distmat += distmat.transpose | ||

+ | return (allclades, numpy.matrix(distmat)) | ||

+ | </python> | ||

+ | |||

+ | |||

+ | Enhancements: | ||

+ | * Use an OrderedDict for <code>allclades</code>, so the separate dictionary <code>lookup</code> isn't needed. (Python 2.7+) | ||

+ | * Use NumPy's [http://www.scipy.org/RecordArrays record array] to assign clade names to rows and columns of the matrix, so <code>allclades</code> isn't needed either. (This works nicely along with the <code>tabulate_names</code> function given earlier.) | ||

+ | |||

+ | |||

+ | ''TODO:'' | ||

* Relationship matrix? See [http://www.jstor.org/stable/3070822 Martins and Housworth 2002] | * Relationship matrix? See [http://www.jstor.org/stable/3070822 Martins and Housworth 2002] | ||

[[Category:Cookbook]] | [[Category:Cookbook]] |

## Revision as of 19:38, 28 May 2012

Here are some examples of using Bio.Phylo for some likely tasks. Some of these functions might be added to Biopython in a later release, but you can use them in your own code with Biopython 1.54.

## Contents |

## Convenience functions

### Get the parent of a clade

The Tree data structures in Bio.Phylo don't store parent references for each clade. Instead, the `get_path`

method can be used to trace the path of parent-child links from the tree root to the clade of choice:

def get_parent(tree, child_clade): node_path = tree.get_path(child_clade) return node_path[-2] # Select a clade myclade = tree.find_clades("foo").next() # Test the function parent = get_parent(tree, myclade) assert myclade in parent

Note that `get_path`

has a linear run time with respect to the size of the tree -- i.e. for best performance, don't call `get_parent`

or `get_path`

inside a time-critical loop. If possible, call `get_path`

outside the loop, and look up parents in the list returned by that function.

Alternately, if you need to repeatedly look up the parents of arbitrary tree elements, create a dictionary mapping all nodes to their parents:

def all_parents(tree): parents = {} for clade in tree.find_clades(order='level'): for child in clade: parents[child] = clade return parents # Example parents = all_parents(tree) myclade = tree.find_clades("foo").next() parent_of_myclade = parents[myclade] assert myclade in parent_of_myclade

### Index clades by name

For large trees it can be useful to be able to select a clade by name, or some other unique identifier, rather than searching the whole tree for it during each operation.

def lookup_by_names(tree): names = {} for clade in tree.find_clades(): if clade.name: if clade.name in names: raise ValueError("Duplicate key: %s" % clade.name) names[clade.name] = clade return names

Now you can retrieve a clade by name in constant time:

tree = Phylo.read('ncbi_taxonomy.xml', 'phyloxml') names = lookup_by_names(tree) for phylum in ('Apicomplexa', 'Euglenozoa', 'Fungi'): print "Phylum size", len(names[phylum].get_terminals())

A potential issue: The above implementation of lookup_by_names doesn't include unnamed clades, generally internal nodes. We can fix this by adding a unique identifier for each clade. Here, all clade names are prefixed with a unique number (which can be useful for searching, too):

def tabulate_names(tree): names = {} for idx, clade in enumerate(tree.find_clades()): if clade.name: clade.name = '%d_%s' % (idx, clade.name) else: clade.name = str(idx) names[clade.name] = clade return clade

### Calculate distances between neighboring terminals

*Suggested by Joel Berendzen*

import itertools def terminal_neighbor_dists(self): """Return a list of distances between adjacent terminals.""" def generate_pairs(self): pairs = itertools.tee(self) pairs[1].next() return itertools.izip(pairs[0], pairs[1]) return [self.distance(*i) for i in generate_pairs(self.find_clades(terminal=True))]

### Test for "semi-preterminal" clades

*Suggested by Joel Berendzen*

The existing tree method `is_preterminal`

returns True if all of the direct descendants are terminal. This snippet will instead return True if *any* direct descendent is terminal, but still False if the given clade itself is terminal.

def is_semipreterminal(clade): """True if any direct descendent is terminal.""" for child in clade: if child.is_terminal(): return True return False

In Python 2.5 and later, this is simplified with the built-in `any`

function:

def is_semipreterminal(clade): return any(child.is_terminal() for child in clade)

## Comparing trees

*TODO:*

- Symmetric difference / partition metric, a.k.a. topological distance (Robinson-Foulds)
- Quartets distance
- Nearest-neighbor interchange
- Path-length-difference

## Consensus methods

*TODO:*

- Majority-rules consensus
- Strict consensus
- Adams (Adams 1972)
- Asymmetric median tree (Phillips and Warnow 1996)

## Rooting methods

The basic method on the Tree class (not TreeMixin) is `root_with_outgroup`

:

tree = Phylo.read('example.nwk', 'newick') print tree # ... tree.root_with_outgroups({'name': 'A'}) # Operates in-place print tree

Normally you'll want the outgroup to be a monophyletic group, rather than a single taxon. This isn't automatically checked, but you can do it yourself with the `is_monophyletic`

method.

To save some typing, try keeping the query in a separate list and reusing it:

outgroup = [{'name': taxon_name} for taxon_name in ('E', 'F', 'G')] if tree.is_monophyletic(outgroup): tree.root_with_outgroup(*outgroup) else: raise ValueError("outgroup is paraphyletic")

*TODO:*

- Root at the midpoint between the two most distant nodes (or "center" of all tips)

## Graphics

*TODO:*

- Party tricks with
`draw_graphviz`

, covering each keyword argument

## Exporting to other types

### Convert to an 'ape' tree, via Rpy2

The R statistical programming environment provides support for phylogenetics through the 'ape' package and several others that build on top of 'ape'. The Python package rpy2 provides an interface between R and Python, so it's possible to convert a Bio.Phylo tree into an 'ape' tree object:

import tempfile from rpy2.robjects import r def to_ape(tree): """Convert a tree to the type used by the R package `ape`, via rpy2. Requirements: - Python package `rpy2` - R package `ape` """ with tempfile.NamedTemporaryFile() as tmpf: Phylo.write(tree, tmpf, 'newick') tmpf.flush() rtree = r(""" library('ape') read.tree('%s') """ % tmpf.name) return rtree

See that it works:

>>> from StringIO import StringIO >>> from Bio import Phylo >>> tree = Phylo.read(StringIO('(A,(B,C),(D,E));'), 'newick') >>> rtree = to_ape(tree) >>> len(rtree) 3 >>> print r.summary(rtree) Phylogenetic tree: structure(list(edge = structure(c(6, 6, 7, 7, 6, 8, 8, 1, 7, 2, 3, 8, 4, 5), .Dim = c(7L, 2L)), tip.label = c("A", "B", "C", "D", "E"), Nnode = 3L), .Names = c("edge", "tip.label", "Nnode" ), class = "phylo") Number of tips: 5 Number of nodes: 3 No branch lengths. No root edge. Tip labels: A B C D E No node labels. NULL >>> r.plot(rtree)

See the rpy2 documentation for further guidance.

### Convert to a PyCogent tree

The tree objects used by Biopython and PyCogent are different. Nonetheless, both toolkits support the Newick file format, so interoperability is straightforward at that level:

from Bio import Phylo import cogent Phylo.write(bptree, 'mytree.nwk', 'newick') # Biopython tree ctree = cogent.LoadTree('mytree.nwk') # PyCogent tree

*TODO:*

- Convert objects directly, preserving some PhyloXML annotations if possible

### Convert to a NumPy array or matrix

Adjacency matrix: cells are 1 (true) if a parent-child relationship exists, otherwise 0 (false).

import numpy def to_adjacency_matrix(tree): """Create an adjacency matrix (NumPy array) from clades/branches in tree. Also returns a list of all clades in tree ("allclades"), where the position of each clade in the list corresponds to a row and column of the numpy array: a cell (i,j) in the array is 1 if there is a branch from allclades[i] to allclades[j], otherwise 0. Returns a tuple of (allclades, adjacency_matrix) where allclades is a list of clades and adjacency_matrix is a NumPy 2D array. """ allclades = list(tree.find_clades(order='level')) lookup = {} for i, elem in enumerate(allclades): lookup[elem] = i adjmat = numpy.zeros((len(allclades), len(allclades))) for parent in tree.find_clades(terminal=False, order='level'): for child in parent.clades: adjmat[lookup[parent], lookup[child]] = 1 if not tree.rooted: # Branches can go from "child" to "parent" in unrooted trees adjmat += adjmat.transpose return (allclades, numpy.matrix(adjmat))

Distance matrix: cell values are branch lengths if a branch exists, otherwise infinity. (This plays well with graph algorithms.)

import numpy def to_distance_matrix(tree): """Create a distance matrix (NumPy array) from clades/branches in tree. A cell (i,j) in the array is the length of the branch between allclades[i] and allclades[j], if a branch exists, otherwise infinity. Returns a tuple of (allclades, distance_matrix) where allclades is a list of clades and distance_matrix is a NumPy 2D array. """ allclades = list(tree.find_clades(order='level')) lookup = {} for i, elem in enumerate(allclades): lookup[elem] = i distmat = numpy.repeat(numpy.inf, len(allclades)**2) distmat.shape = (len(allclades), len(allclades)) for parent in tree.find_clades(terminal=False, order='level'): for child in parent.clades: if child.branch_length: distmat[lookup[parent], lookup[child]] = child.branch_length if not tree.rooted: distmat += distmat.transpose return (allclades, numpy.matrix(distmat))

Enhancements:

- Use an OrderedDict for
`allclades`

, so the separate dictionary`lookup`

isn't needed. (Python 2.7+) - Use NumPy's record array to assign clade names to rows and columns of the matrix, so
`allclades`

isn't needed either. (This works nicely along with the`tabulate_names`

function given earlier.)

*TODO:*

- Relationship matrix? See Martins and Housworth 2002