PAML
m (Fixed the Codeml "read() and Results" section heading) |
|||
| Line 120: | Line 120: | ||
If the ''codeml'' process exits with an error, a PamlError exception will be raised. | If the ''codeml'' process exits with an error, a PamlError exception will be raised. | ||
| − | + | ===read() and Results=== | |
As previously stated, the Codeml.run() method returns a dictionary containing results parsed from ''codeml'''s main output file. Alternatively, an existing output file may be parsed using the read() function: | As previously stated, the Codeml.run() method returns a dictionary containing results parsed from ''codeml'''s main output file. Alternatively, an existing output file may be parsed using the read() function: | ||
<python>results = codeml.read()</python> | <python>results = codeml.read()</python> | ||
Revision as of 12:38, 29 June 2011
This module provides an interface to the PAML (Phylogenetic Analysis by Maximum Likelihood) package of programs. Currently, interfaces to the programs codeml, baseml and yn00 as well as a Python re-implementation of chi2 have been included.
Contents |
Availability
This module has not yet been integrated into Biopython and is available as a development branch of the source code, which may be downloaded at Github.
To use this module, you must have PAML version 4.1 or greater installed on your system.
codeml
The codeml interface is provided in the following module:
from Bio.Phylo.PAML import codeml
The Codeml object
Initialization
The interface is implemented as an object which maintains program options. In order to run codeml, typically one provides a control file which indicates the locations of an alignment file, a tree file and an output file, along with a series of options dictating how the software is to be run. In the Codeml object, the three file locations are stored as string attributes. The three file locations, as well as the location of the desired working directory, may be specified in the Codeml constructor as follows:
cml = codeml.Codeml(alignment = "align.phylip", tree = "species.tree", out_file = "results.out", working_dir = "./scratch")
They may also be set individually:
cml = codeml.Codeml() cml.alignment = "align.phylip" cml.tree = "species.tree" cml.out_file = "results.out" cml.working_dir = "./scratch"
Note that all file locations are converted to locations relative to the working directory. PAML programs have a limit on the lengths of file location strings; converting to relative locations will shorten these strings, allowing you generally to sidestep this limitation.
Options
The codeml runtime options are stored in a dictionary object that is keyed by the option names. For more information on the usage of these options, please refer to the PAML user manual.
The complete set of options and their current values may be printed:
>>> cml.print_options() verbose = None CodonFreq = None cleandata = None fix_blength = None NSsites = None fix_omega = None clock = None ncatG = None runmode = None fix_kappa = None fix_alpha = None Small_Diff = None method = None Malpha = None aaDist = None RateAncestor = None aaRatefile = None icode = None alpha = None seqtype = None omega = None getSE = None noisy = None Mgene = None kappa = None model = None ndata = None
Setting an option to a value of None will cause it to be ignored by codeml (it will be omitted from the final control file). Options may be set by the set_option() function and their values may be retrieved by the get_option() function:
>>> cml.set_option("clock", 1) >>> cml.set_option("NSsites", [0,1,2]) >>> cml.set_option("aaRatefile", "wag.dat") >>> cml.get_option("NSsites") [0, 1, 2]
The "NSsites" option deserves special attention: in a codeml control file, NSsites models are entered as a space-delimited list of numbers, such as 0 1 2, whereas in the Codeml object it is stored as a Python list.
Finally, options may be read in from an existing codeml control file or they may be written to a new file. Writing to a file isn't necessary, as this is done automatically when executing the run() method (see below). The control file to read is provided as an argument to the read_ctl_file() method, while the write_ctl_file() method writes to the Codeml object's ctl_file attribute
>>> cml.read_ctl_file("codeml.ctl") >>> cml.print_options() verbose = 1 CodonFreq = 2 cleandata = 1 fix_blength = None NSsites = 0 fix_omega = 0 clock = 0 ncatG = 8 runmode = 0 fix_kappa = 0 fix_alpha = 1 Small_Diff = 5e-07 method = 0 Malpha = 0 aaDist = 0 RateAncestor = 1 aaRatefile = dat/jones.dat icode = 0 alpha = 0.0 seqtype = 2 omega = 0.4 getSE = 0 noisy = 9 Mgene = 0 kappa = 2 model = 2 ndata = None >>> cml.ctl_file = "control2.ctl" >>> cml.write_ctl_file()
Running the program
Executing the object's run() method will run codeml with the current options and will return the parsed results in a dictionary object (see below). You can also pass a number of optional arguments to the run() method:
- verbose (boolean): codeml's screen output is suppresed by default; set this argument to True to see all of the output as it is generated. This is useful if codeml is failing and you need to see its error messages.
- parse (boolean): set to False to skip parsing the results. run() will instead return None
- ctl_file (string): provide a path to an existing control file to execute. The file is not parsed and read into Codeml's options dictionary. If set to None (default), the options dictionary is written to a control file, which is then used by codeml.
- command (string): provide a path to the codeml executable. This is set to "codeml" by default, so if the program is in your system path, that should suffice. If it's not in your system path or, for example, if you use multiple versions of PAML, you may instead provide the full path to the executable.
If the codeml process exits with an error, a PamlError exception will be raised.
read() and Results
As previously stated, the Codeml.run() method returns a dictionary containing results parsed from codeml's main output file. Alternatively, an existing output file may be parsed using the read() function:
results = codeml.read()
The results dictionary is organized hierarchically and is (in most cases) keyed in accordance with the terminology used in the output file. Numeric values are automatically converted to numeric Python types. As the output of codeml varies widely depending on the parameters used, the contents of the results dictionary will vary accordingly. Similarly, in the case of a runtime error, the results dictionary may be empty or missing contents. Thus, it is advisable to use Python's dict.get(key) method rather than dict[key] in order to handle missing elements gracefully.
All possible keys of the results dictionary are organized as follows:
- "version" : the program version number that generated the results
- "model" : the model description
- "codon model" : the codon frequency model used
- "site-class model" : the site-class model used, in the case of a single model
- "lnL max" : the maximum log-likelihood
- "NSsites" : NSsites results
- model_number (integer) : the NSsites model number
- "description" : description of the NSsites model
- "lnL" : log-likelihood
- "parameter list" : string list of model parameters (found beneath the "lnL" line in the output file)
- "SEs" : string list of standard error values
- "tree length" : tree length
- "dS tree" : tree with dS estimates as branch labels
- "dN tree" : tree with dN estimates as branch labels
- "omega tree" : tree with dN/dS estimates as branch labels
- "parameters" : model parameters
- "rates" : rate estimates
- "kappa" : kappa estimate
- "omega" : dN/dS estimate(s)
- "genes" : results for genes in files containing multiple alignments
- gene number (integer)
- "kappa" : gene kappa estimate
- "omega" : gene omega estimate
- gene number (integer)
- "dN" : dN estimate
- "dS" : dS estimate
- "site classes" : results for multiple site classes
- site class number (integer)
- "proportion" : proportion of sites belonging to this class
- "omega" : omega estimate for this class
- "branch types" : results for branch types under Clade Model C or Branch Site A
- branch type number (integer) : omega estimate for this branch type (Clade Model C)
- "foreground" : omega estimate for the foreground branch (Branch Site A)
- "background" : omega estimate for the background branch (Branch Site A)
- site class number (integer)
- "branches" : branch-specific results
- "branch" : branch description (ie "6..7")
- "t" : t estimate for this branch
- "N" : N estimate for this branch
- "S" : S estimate for this branch
- "omega" : omega estimate for this branch
- "dN" : dN estimate for this branch
- "dS" : dS estimate for this branch
- "N*dN" : N*dN for this branch
- "S*dS" : S*dS for this branch
- "branch" : branch description (ie "6..7")
- model_number (integer) : the NSsites model number
- "pairwise" : results for pairwise comparisons
- sequence1 : the first sequence's name
- sequence2 : the second sequence's name (note that order doesn't matter; seq1 vs seq2 and seq2 vs seq1 are both stored)
- "t" : t estimate for this comparison
- "S" : S estimate for this comparison
- "N" : N estimate for this comparison
- "omega" : dN/dS estimate for this comparison
- "dN" : dN estimate for this comparison
- "dS" : dS estimate for this comparison
- sequence2 : the second sequence's name (note that order doesn't matter; seq1 vs seq2 and seq2 vs seq1 are both stored)
- sequence1 : the first sequence's name
- "distances" : amino acid distances
- "raw" : raw amino acid distances
- sequence1 : the first sequence's name
- sequence2 : the distance to the second sequence's amino acid sequence
- "ml" : maximum likelihood amino acid distances
- sequence1 : the first sequence's name
- sequence2 : the distance to the second sequence's amino acid sequence
- "raw" : raw amino acid distances
baseml & yn00
Interfaces to baseml and yn00 are provided in the following modules:
from Bio.Phylo.PAML import baseml, yn00 bml = baseml.Baseml() yn = yn00.Yn00()
Baseml and Yn00 share the same methods and attributes as Codeml, and are thus used in the same manner. It should be noted, however, that Yn00 does not have a tree attribute, as yn00 does not require a tree file.
Baseml
The parameters available in the Baseml options are:
>>> bml.print_options() verbose = None cleandata = None fix_blength = None nparK = None model_options = None clock = None ncatG = None runmode = None fix_kappa = None fix_alpha = None noisy = None method = None fix_rho = None Malpha = None nhomo = None RateAncestor = None icode = None rho = None alpha = None getSE = None Small_Diff = None Mgene = None kappa = None model = None ndata = None
All of the possible contents of the Baseml results file are:
- "version" : the version number of Baseml used to generate the results
- "lnL max" : the maximum log-likelihood
- "lnL" : the log-likelihood
- "tree length" : the length of the estimated tree
- "tree" : the estimated tree
- "parameters" : model parameters
- "parameter list" : string list of model parameters (found beneath the "lnL" line in the output file)
- "SEs" : string list of standard error values
- "kappa" : kappa estimate(s)
- "branches" : branch-specific estimates
- branch number (integer)
- "t" : t estimate for this branch
- "kappa" : kappa estimate for this branch
- "TS" : TS estimate for this branch
- "TV" : TV estimate for this branch
- branch number (integer)
- "rate parameters" : list of rate parameters
- "rates" : list of rates
- "Q matrix" : Q matrix
- "matrix" : the actual matrix stored as a list of lists (4x4)
- "average Ts/Tv" : average Ts/Tv for the matrix
- "alpha" : alpha estimate
- "base frequencies" : dictionary of base frequencies
- base ("A", "T", "C" or "G") : frequency of base
- "rate frequencies" : list of rate frequencies
- "nodes" : branch-specific frequency parameters
- node number (integer)
- "root" : boolean indicating if the node is the root
- "frequency parameters" : list of frequency parameters for this node
- "base frequencies" : dictionary of base frequencies
- base ("A", "T", "C" or "G") : frequency of base
- node number (integer)
- "parameters" : model parameters
Yn00
The parameters available in the Yn00 options are:
>>> yn.print_options() commonf3x4 = None weighting = None icode = None ndata = None verbose = None
All of the possible contents of the Yn00 results file are:
- sequence1 : first sequence name
- sequence2 : second sequence name (order does not matter; both seq1 vs seq2 and seq2 vs seq1 are stored)
- "NG86" : results for the Nei & Gojobori (1986) method
- "omega" : dN/dS estimate
- "dN" : dN estimate
- "dS" : dS estimate
- "YN00" : results for the Yang & Nielsen (2000) method
- "S" : S estimate
- "N" : N estimate
- "t" : t estimate
- "kappa" : kappa estimate
- "omega" : dN/dS estimate
- "dN" : dN estimate
- "dN SE": dN standard error
- "dS" : dS estimate
- "dS SE" : dS standard error
- "LWL85" : results for the LWL85 method
- "dS" : dS estimate
- "dN" : dN estimate
- "omega" : dN/dS estimate
- "S" : S estimate
- "N" : N estimate
- "LWL85m" : results for the LWL85m method
- "dS" : dS estimate
- "dN" : dN estimate
- "omega" : dN/dS estimate
- "S" : S estimate
- "N" : N estimate
- "rho" : rho estimate
- "LPB93" : results for the LPB93 method
- "dS" : dS estimate
- "dN" : dN estimate
- "omega" : dN/dS estimate
- "NG86" : results for the Nei & Gojobori (1986) method
- sequence2 : second sequence name (order does not matter; both seq1 vs seq2 and seq2 vs seq1 are stored)
chi2
The chi2 module offers an easy method to retrieve p-values from a Chi-squared cumulative distribution function for likelihood ratio tests, which are performed frequently when using PAML programs. As of the current version of PAML, the chi2 program does not allow passing both a test statistic and the degrees of freedom as command-line arguments. Thus, the chi2 module currently consists of a re-implementation of Ziheng Yang's original code in pure Python. For most cases, this should not affect you, however using very large numbers of degrees of freedom, such as in the FMutSel vs FMutSel0 model test in codeml (41 degrees of freedom), this Python version runs significantly slower than the original.
To retrieve a p-value, simply import the module
from Bio.Phylo.PAML.chi2 import cdf_chi2
And use the cdf_chi2() function:
>>> df = 2 >>> statistic = 7.21 >>> cdf_chi2(df, statistic) 0.027187444813595696