PopGen dev Statistics
Developing Statistics for the population genetics module
A few observations about population genetics and bioinformatics
- Population genetics is used in a wide variety of settings, from cancer studies to conservation of endangered species. Even names might not be standartized, e.g. a human geneticist might use Short Tandem Repeat(STR) while an animal geneticist might use microsatellite.
- Different research communities have completely different sets of requirements: Markers change (SNPs, Microsatellites/STSs, RFLPs, sequences); number of individuals sampled change (for few units, to thousands to very large synthetic datasets); Number of populations change; The density of markers change (from 10 million SNPs in the HapMap for humans and full sequences appearing for many individuals on parasites, to the opposite, where only 20-30 loci are available)
- A good implementation should support the vast majority of scenarios above: should support as many markers as possible, big and small datasets
- The most used format in population genetics is still the genepop format. This format is not used much outside the popgen community. For an overview of the importance of the format, read Excoffier and Heckel http://www.nature.com/nrg/journal/v7/n10/full/nrg1904.html, Nature Reviews Genetics. This happens to be a marker-independent format.
- The popgen community while producing lots of free software cares very little about licensing or data-standardization issues. Different ad-hoc formats exist (again, see the above paper).
As such it is important to characterize all types of existing statistics and to create a framework that accommodates all of them if people decide to implement them in the future.
Restricted use cases should be handled with extreme care. If possible people from different backgrounds should contribute.
Here we characterize the different dimensions that need to be accounted for. For a good intro see page 89 of the Arlequin3 manual
Marker dependent versus marker independent statistics
Some statistics require a special kind of marker. For instance Tajima D requires a sequence or a RFLP. Allelic range requires microsatellites/STRs. Other statistics are marker independent. For instance Fst relies on allele counts per population.
Intra-Population versus Inter-population statistics
As an example observed heterozygosity has no notion of population structure.
Fst is an example of a statistic that exists to measure population differentiation. These kind of statistics require some notion of population differentiation.
haplotypic, genotypic (phase unknown), genotypic (phase known), genoptypic dominant, frequency only.
Say, for expected heterozygosity frequencies are enough, for observed heterozygosity genotypic (phase unknown) data is necessary.
Single locus versus multi-loci
Single locus as in allelic richness, ExpHe, Fst.
Multi-loci as in number of polimorphic sites, LD or EHH.
Temporal/longitudinal vs single point in time
Say temporal-Fst versus Fst.
Population versus Landscape
This issue I suggest abandon for now.
Example of statistic classification
- ExpHz non-temporal, intra, single-locus, marker independent, genotypic - gametic unk
- ObsHz non-temporal, intra, single-locus, independent, genotypic - gametic kn
- Fst(CW) non-temporal, inter, single-locus, indep, genotypic - gametic unk
- temporal-Fst temporal, intra, single-locus, indep, genotypic - gametic unk
- LD(D') non-temporal, intra, multi-locus, indep, haplo/geno
- Fk temporal, intra, single-locus, indep, geno
- S (polimorphic sites), non-temporal, intra, multi-locus, indep, haplo/geno
- Alleic range, nt, intra, single-locus, microsat, haplo/geno
- EHH, nt, positional
- Tajima D, nt, intra, single-locus, sequence/rflp
The design will have to be able to cope with all the dimensions above
Existing development code can be found here. If you want to work on statistics feel free to add your github repository here.
There is still the issue of statistical tests (say Hardy-Weinberg deviation).