PopGen Genepop

Revision as of 11:51, 9 August 2009

Two interfaces are supplied: a general one that is more complex and more efficient (GenePopController), and a simplified one that is easier to use but incomplete and less efficient (EasyController). Because of its interface, EasyController might not be able to handle very large files. On the other hand, it provides utility functions to compute some very simple statistics, such as allele counts, which are not directly available in the general interface.

The general interface assumes a more proficient Python developer (it relies on iterators, for example) and is not documented for now. Even experienced Python developers may find EasyController convenient, as long as it exposes the required functionality and its performance is acceptable.

Genepop must be installed on the system before the controllers can be used; it can be downloaded from the Genepop website.

EasyController tutorial

Before we start, let's test the installation (for this you need a Genepop-formatted file):

from Bio.PopGen.GenePop.EasyController import EasyController
 
ctrl = EasyController(your_file_here)
print(ctrl.get_basic_info())

Replace your_file_here with the name and path of your file. If you get an IOError: Genepop not found, then Biopython cannot find your Genepop executable. If Genepop is not on the PATH, you can pass its location to the constructor:

ctrl = EasyController(your_file_here, path_to_genepop_here)

If everything is working, we can go on and use Genepop. For the examples below, we will use the Genepop file big.gen, made available with the unit tests. We will also assume that there is a ctrl object initialized with the relevant file.

We start by getting some basic info:

pop_names, loci_names = ctrl.get_basic_info()

This returns the list of population names and the list of locus names available in the file.

Caveat: Most existing Genepop files provide erroneous data regarding population names, so in many cases that information cannot be trusted. Most of the time, populations are accessed by their relative position in the file, not by name. So the first population in the file is index 0, the second is index 1, and so on...

Statistics

Heterozygosity

Let's get heterozygosity info for a certain population and a certain locus:

(exp_homo, obs_homo, exp_hetero, obs_hetero) = ctrl.get_heterozygosity_info(0,"Locus2")

This gets the expected and observed homozygosity and heterozygosity for population 0 and Locus2 (of the file big.gen; if you are using another file, adjust the population position and locus name accordingly).

Existing alleles

It is possible to get the list of all alleles of a certain locus in a certain population:

allele_list = ctrl.get_alleles(0,"Locus2")

allele_list will be [3, 20] (i.e., alleles 3 and 20 are present in the population).

The number of alleles is simply len(allele_list).

It is also possible to get the list of all alleles of a certain locus for all populations:

all_allele_list = ctrl.get_alleles_all_pops("Locus2")

all_allele_list will be [3, 20].


Allele and genotype frequencies

It is possible to get the frequency of alleles in a certain population:

allele_data = ctrl.get_allele_frequency(0, "Locus2")

allele_data will be (62, {3: 0.88700000000000001, 20: 0.113}). That is, there are 62 genes: 88.7% are allele 3 and 11.3% are allele 20.
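As a small illustration of how the returned tuple can be used, the sketch below unpacks it and recovers approximate gene counts per allele. The values are copied from the example output above (with the float noise trimmed); the variable names total_genes, freqs and counts are only illustrative.

```python
# Values copied from the example output above (population 0, Locus2 of big.gen).
allele_data = (62, {3: 0.887, 20: 0.113})

# First element: total number of genes; second: allele id -> frequency.
total_genes, freqs = allele_data

# The frequencies are rounded, so round the product back to whole gene counts.
counts = {allele: round(total_genes * freq) for allele, freq in freqs.items()}

for allele in sorted(counts):
    print(allele, counts[allele], freqs[allele])
```

The recovered counts can be cross-checked against the per-allele counts that get_fis reports for the same locus.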

We can get similar information for genotypes (diploid data). Expected frequencies will also be reported:

genotype_list = ctrl.get_genotype_frequency(0, "Locus2")

genotype_list will be: [(3, 3, 24, 24.3443), (20, 3, 7, 6.3114999999999997), (20, 20, 0, 0.34429999999999999)]

Let's interpret the first element: there are 24 individuals with genotype (3, 3), whereas the expected number of individuals with that genotype is 24.3443.
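The expected counts in genotype_list appear to be consistent with Hardy-Weinberg expectations computed with Levene's small-sample correction. The sketch below reproduces them from the allele counts (55 copies of allele 3 and 7 copies of allele 20). This is an observation about the numbers, not a description of Genepop's internals, and expected_genotype_counts is a hypothetical helper, not part of Biopython.

```python
from itertools import combinations_with_replacement


def expected_genotype_counts(allele_counts):
    """Expected genotype counts under Hardy-Weinberg with Levene's correction.

    allele_counts maps allele id -> number of gene copies in the sample.
    """
    total = sum(allele_counts.values())  # 2N gene copies for N diploid individuals
    expected = {}
    for a, b in combinations_with_replacement(sorted(allele_counts), 2):
        if a == b:
            # Homozygote: n_a * (n_a - 1) / (2 * (2N - 1))
            n = allele_counts[a]
            expected[(a, b)] = n * (n - 1) / (2 * (total - 1))
        else:
            # Heterozygote: n_a * n_b / (2N - 1)
            expected[(a, b)] = allele_counts[a] * allele_counts[b] / (total - 1)
    return expected


expected = expected_genotype_counts({3: 55, 20: 7})
for genotype, value in expected.items():
    print(genotype, round(value, 4))
```

The expected counts sum to 31, the number of diploid individuals (62 genes / 2).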

Fis

We will now get the Fis of a certain locus/population plus a few other statistics:

allele_dict, summary_fis =  ctrl.get_fis(0,"Locus2")

Let's have a detailed look at the output of get_fis:

summary_fis = (62, -0.1111, -0.11269999999999999)
 
allele_dict = {
    3: (55, 0.8871, -0.1111),
    20: (7, 0.1129, -0.1111)
}

summary_fis holds a triple with: the total number of genes (allele copies), the Cockerham and Weir Fis, and the Robertson and Hill Fis.

allele_dict holds, for each allele (the allele being the key), the number of copies of the allele, its frequency, and its Cockerham and Weir Fis.

So, from the above results the following can be read: there are 62 genes with 2 different alleles (55 are of type 3, and 7 of type 20). 3 has frequency 0.89 and 20 has frequency 0.11. All CW Fis are -0.111 and the RH Fis is -0.113.
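To make the structure of the two return values concrete, the following sketch unpacks them and checks that the per-allele numbers agree with the summary. The data is copied from the example output above (with float noise trimmed); the variable names are only illustrative.

```python
# Example output of get_fis, copied from above.
summary_fis = (62, -0.1111, -0.1127)
allele_dict = {
    3: (55, 0.8871, -0.1111),
    20: (7, 0.1129, -0.1111),
}

total_genes, cw_fis, rh_fis = summary_fis

# Frequencies recomputed from the counts, rounded like the reported values.
recomputed = {}
for allele, (count, freq, allele_cw_fis) in sorted(allele_dict.items()):
    recomputed[allele] = round(count / total_genes, 4)
    print(allele, count, freq, recomputed[allele], allele_cw_fis)

# The per-allele counts add up to the total number of genes.
count_sum = sum(count for count, _, _ in allele_dict.values())
print(count_sum == total_genes)
```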

Migration

We can get an estimation of the number of migrants:

samp_size, priv_allele_freq, mig10, mig25, mig50, migcorr = ctrl.estimate_nm()
print(ctrl.get_avg_fst_pair_locus("Locus4"))
print(ctrl.get_avg_fst_pair())
print(ctrl.get_avg_fis())


print(ctrl.get_multilocus_f_stats())
print(ctrl.get_f_stats("Locus2"))

Tests

Tests are usually computationally intensive because they are usually based on a Markov chain algorithm. In some cases full enumeration approaches are available, but those can only be applied to loci with a very low number of alleles. This means that most tests will take quite some time to complete.

For more details about the Markov chain parameters used below (dememorization, batches and iterations), please consult the Genepop manual. Also consult the manual to understand when full enumeration is applicable.


Let's start by testing Hardy-Weinberg equilibrium for each locus in each population:

ctrl.test_hw(1, "excess")

The second parameter can be probability, excess or deficiency. probability is the standard Haldane HW test. Use deficiency when you are interested in heterozygote deficiency, or excess if you are interested in heterozygote excess.


pop_test, loc_test, all_test = ctrl.test_hw_global("deficiency")

Use deficiency when you are interested in heterozygote deficiency, or excess if you are interested in heterozygote excess. Unlike in test_hw, probability does not apply here.

Both pop_test


print(ctrl.test_ld_all_pair("Locus1", "Locus2",
    dememorization=1000, batches=10, iterations=100))

Isolation By Distance (IBD)

Isolation By Distance (IBD) analysis requires a special form of Genepop files:

  1. One individual per population
  2. The name of the individual has to be its coordinates

Example:

...
Pop
0 15, 0201  0303 0102 0302 1011
Pop
0 30, 0202  0301 0102 0303 1111
Pop
0 45, 0102  0401 0202 0102 1010
Pop
0 60, 0103  0202 0101 0202 1011
Pop
0 75, 0203  0204 0101 0102 1010
POP
15 15, 0102 0202 0201 0405 0807
...

Note that the example file we have been using so far (big.gen) cannot be used for this analysis.
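To illustrate the naming convention, the sketch below reads the coordinates back out of a few individual lines taken from the example and computes the geographic distance between each pair of samples. It assumes the name holds two whitespace-separated numbers interpreted as planar (x, y) coordinates; parse_coordinates and pair_distance are hypothetical names, not part of Biopython or Genepop.

```python
import math

# Individual lines copied from the example above; the text before the comma
# is the individual's name, which doubles as its coordinates.
sample_lines = [
    "0 15, 0201  0303 0102 0302 1011",
    "0 30, 0202  0301 0102 0303 1111",
    "15 15, 0102 0202 0201 0405 0807",
]


def parse_coordinates(line):
    """Return (x, y) parsed from the name part of an individual line."""
    name = line.split(",", 1)[0]
    x, y = (float(value) for value in name.split())
    return x, y


coords = [parse_coordinates(line) for line in sample_lines]

# Euclidean distance for every pair of samples, keyed by 0-based indices (i > j).
pair_distance = {
    (i, j): math.hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
    for i in range(len(coords))
    for j in range(i)
}

for pair in sorted(pair_distance):
    print(pair, round(pair_distance[pair], 4))
```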

There is a single call for IBD analysis:

estimate, distance, (a, b), (bb, bblow, bbhigh) = \
    ctrl.calc_ibd(is_diplo=True, stat="a", scale="Log", min_dist=0.00001)

is_diplo specifies if data is diploid (True) or haploid (False).

stat is either a or e (see the Genepop manual for details).

scale is either Log or Linear. Log is used for 2D coordinates and Linear for 1D.

Only pairwise comparisons above min_dist are used to compute regression coefficients.

The method returns:

estimate, a triangular matrix containing genetic distances among samples according to the chosen statistic.

distance, a triangular matrix containing distances (log or linear) among samples.

a and b are the parameter fits for the regression. bblow and bbhigh are the bootstrap confidence intervals for the b parameter (bb should be very close to b).
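As a rough sketch of how the two matrices relate to the fitted parameters, the code below flattens a pair of triangular matrices and fits an ordinary least-squares line of genetic distance against geographic distance. The matrix values are made up for illustration, a is assumed to be the intercept and b the slope, and flatten and fit_line are hypothetical helpers. Genepop's own procedure (including the min_dist filter and the bootstrap) is not reproduced here.

```python
def flatten(tri):
    """Flatten a triangular list of lists into a single list, row by row."""
    return [value for row in tri for value in row]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical matrices for three samples (not output from a real run).
distance = [[1.0], [2.0, 3.0]]
estimate = [[0.15], [0.25, 0.35]]

a, b = fit_line(flatten(distance), flatten(estimate))
print(round(a, 4), round(b, 4))
```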

The triangular matrices should be interpreted as follows. In Python, each matrix is implemented as a list of lists of numbers, like this:

[
   [0.1],
   [0.2, 0.3],
   [0.4, 0.5, 0.6]
]

The above data structure corresponds to the following triangular matrix:

      1    2    3
2   0.1
3   0.2  0.3
4   0.4  0.5  0.6
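A lookup into this structure can be sketched as follows, using the 1-based sample labels of the table above. tri_get is a hypothetical helper, not part of Biopython.

```python
tri = [
    [0.1],
    [0.2, 0.3],
    [0.4, 0.5, 0.6],
]


def tri_get(matrix, i, j):
    """Return the value for samples i and j (1-based labels, i != j)."""
    if i == j:
        raise ValueError("the diagonal is not stored in the triangular matrix")
    high, low = max(i, j), min(i, j)
    # Rows start at sample 2 and columns at sample 1.
    return matrix[high - 2][low - 1]


print(tri_get(tri, 3, 2))
print(tri_get(tri, 1, 4))
```

Because the matrix is symmetric, the helper accepts the two labels in either order.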