Molecular population genetic analysis.
#include <Sequence/PolySNP.hpp>
Public Member Functions | |
| PolySNP (const Sequence::PolyTable *data, bool haveOutgroup=false, unsigned outgroup=0, bool totMuts=true) | |
| virtual double | ThetaPi (void) |
| virtual double | ThetaW (void) |
| virtual double | ThetaH (void) |
| virtual double | ThetaL (void) |
| double | VarPi (void) |
| double | StochasticVarPi (void) |
| double | SamplingVarPi (void) |
| double | VarThetaW (void) |
| unsigned | NumPoly (void) |
| virtual unsigned | NumMutations (void) |
| virtual unsigned | NumSingletons (void) |
| virtual unsigned | NumExternalMutations (void) |
| virtual double | TajimasD (void) |
| virtual double | Hprime (bool likeThorntonAndolfatto=false) |
| virtual double | Dnominator (void) |
| virtual double | FuLiD (void) |
| virtual double | FuLiF (void) |
| virtual double | FuLiDStar (void) |
| virtual double | FuLiFStar (void) |
| double | DandVH (void) |
| unsigned | DandVK (void) |
| virtual double | WallsB (void) |
| virtual unsigned | WallsBprime (void) |
| virtual double | WallsQ (void) |
| double | HudsonsC (void) |
| virtual unsigned | Minrec (void) |
| std::vector< std::vector < double > > | Disequilibrium (const unsigned &mincount=1, const double &max_marker_distance=std::numeric_limits< double >::max()) |
Protected Member Functions | |
| void | DepaulisVeuilleStatistics (void) |
| virtual void | WallStats (void) |
| double | a_sub_n (void) |
| double | a_sub_n_plus1 (void) |
| double | b_sub_n (void) |
| double | b_sub_n_plus1 (void) |
| double | c_sub_n (void) |
| double | d_sub_n (void) |
Protected Attributes | |
| std::auto_ptr< _PolySNPImpl > | rep |
Molecular population genetic analysis.
Example Use:
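#include <cassert>
#include <iostream>
#include <vector>
#include <Sequence/Alignment.hpp> // Alignment::GetData and related routines
#include <Sequence/Fasta.hpp>
#include <Sequence/PolySites.hpp>
#include <Sequence/PolySNP.hpp>

using namespace std;
using namespace Sequence;

int main()
{
  // Read the aligned sequences from a FASTA file.
  vector<Fasta> data;
  Alignment::GetData(data, "popdata.fasta");
  assert(Alignment::IsAlignment(data));
  if (Alignment::Gapped(data))
    {
      Alignment::RemoveTerminalGaps(data);
    }
  // Build the polymorphism table, then analyze it (no outgroup).
  PolySites *polytable = new PolySites(data);
  PolySNP *analyze = new PolySNP(polytable, false, 0);
  cout << "Tajima's D is " << analyze->TajimasD() << endl;
  // Free the analysis object before the table it points to.
  delete analyze;
  delete polytable;
  return 0;
}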
N character). However, all summary statistics involved in "tests of neutrality" are, strictly speaking, undefined if missing data are present. This is because the denominators of the statistics are functions of the sample size, and no explicit formulae exist when the sample size varies from site to site (which is the case when there are missing data). In short, if you want to be rigorous, you can only calculate nucleotide diversity and a few other statistics if your data contain untyped SNPs.
The routines in libsequence will nevertheless calculate the summary statistics for you, and it is up to you to be aware that your program may give biased results. To date, the magnitude and direction of the bias remain unknown. Functions (and hence the statistics) that are affected have warnings in their documentation.
Definition at line 83 of file PolySNP.hpp.
| Sequence::PolySNP::PolySNP | ( | const Sequence::PolyTable * data, bool haveOutgroup = false, unsigned outgroup = 0, bool totMuts = true | ) | [explicit] |
| data | a valid object of type Sequence::PolyTable | |
| haveOutgroup | true if an outgroup is present, false otherwise | |
| outgroup | if haveOutgroup is true, outgroup is the index of that sequence in data | |
| totMuts | if true (the default) use the total number of inferred mutations, otherwise use the total number of polymorphic sites in calculations |
Definition at line 157 of file PolySNP.cc.
| double Sequence::PolySNP::a_sub_n | ( | void | ) | [protected] |
This is the denominator of Watterson's Theta (see PolySNP::ThetaW)
Definition at line 1149 of file PolySNP.cc.
| double Sequence::PolySNP::a_sub_n_plus1 | ( | void | ) | [protected] |
| double Sequence::PolySNP::b_sub_n | ( | void | ) | [protected] |
| double Sequence::PolySNP::b_sub_n_plus1 | ( | void | ) | [protected] |
| double Sequence::PolySNP::c_sub_n | ( | void | ) | [protected] |
| double Sequence::PolySNP::d_sub_n | ( | void | ) | [protected] |
| double Sequence::PolySNP::DandVH | ( | void | ) |
To check if two sequences are unique, Sequence::Comparisons::Different is used, which does not allow missing data to result in 2 sequences being considered different (as they would be if you simply used the std::string comparison operators == or !=)
Definition at line 1256 of file PolySNP.cc.
| unsigned Sequence::PolySNP::DandVK | ( | void | ) |
To check if two sequences are unique, Sequence::Comparisons::Different is used, which does not allow missing data to result in 2 sequences being considered different (as they would be if you simply used the std::string comparison operators == or !=)
Definition at line 1273 of file PolySNP.cc.
| void Sequence::PolySNP::DepaulisVeuilleStatistics | ( | void | ) | [protected] |
Calculate the number of haplotypes in the sample, and haplotype diversity. Unlike Depaulis and Veuille's original paper, this routine uses an unbiased calculation of haplotype diversity (i.e. divide by n choose 2).
To check if two sequences are unique, Sequence::Comparisons::Different is used, which does not allow missing data to result in 2 sequences being considered different (as they would be if you simply used the std::string comparison operators == or !=)
Definition at line 753 of file PolySNP.cc.
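The missing-data rule and the "divide by n choose 2" calculation described above can be sketched as follows. This is a standalone illustration, not the libsequence implementation: `differ_ignoring_missing` is a hypothetical stand-in for Sequence::Comparisons::Different (whose signature is not shown on this page), and `haplotype_diversity` is likewise an assumed name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for Sequence::Comparisons::Different: two sequences
// differ only if some position holds two different, non-missing characters.
bool differ_ignoring_missing(const std::string &a, const std::string &b)
{
  assert(a.size() == b.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    {
      if (a[i] != 'N' && b[i] != 'N' && a[i] != b[i])
        return true;
    }
  return false;
}

// Haplotype diversity as the fraction of the n-choose-2 pairs that differ.
double haplotype_diversity(const std::vector<std::string> &sample)
{
  const std::size_t n = sample.size();
  assert(n >= 2);
  std::size_t ndiff = 0;
  for (std::size_t i = 0; i + 1 < n; ++i)
    for (std::size_t j = i + 1; j < n; ++j)
      if (differ_ignoring_missing(sample[i], sample[j]))
        ++ndiff;
  return static_cast<double>(ndiff) / (static_cast<double>(n * (n - 1)) / 2.0);
}
```

Under this rule "ANT" and "AAT" are treated as the same haplotype, whereas the std::string operator != would report them as different.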
| std::vector< std::vector< double > > Sequence::PolySNP::Disequilibrium | ( | const unsigned & mincount = 1, const double & max_marker_distance = std::numeric_limits<double>::max() | ) |
| mincount | a frequency filter. A polymorphism must be present at least mincount times in the data |
Definition at line 1420 of file PolySNP.cc.
| double Sequence::PolySNP::Dnominator | ( | void | ) | [virtual] |
Reimplemented in Sequence::PolySIM.
Definition at line 719 of file PolySNP.cc.
| double Sequence::PolySNP::FuLiD | ( | void | ) | [virtual] |
Reimplemented in Sequence::PolySIM.
Definition at line 1009 of file PolySNP.cc.
| double Sequence::PolySNP::FuLiDStar | ( | void | ) | [virtual] |
Reimplemented in Sequence::PolySIM.
Definition at line 1074 of file PolySNP.cc.
| double Sequence::PolySNP::FuLiF | ( | void | ) | [virtual] |
Reimplemented in Sequence::PolySIM.
Definition at line 1038 of file PolySNP.cc.
| double Sequence::PolySNP::FuLiFStar | ( | void | ) | [virtual] |
Fu and Li (1993) F* statistic. Incorporates correction from Simonsen et al. (1995) Genetics 141: 413, eqn A5.
Reimplemented in Sequence::PolySIM.
Definition at line 1108 of file PolySNP.cc.
| double Sequence::PolySNP::Hprime | ( | bool likeThorntonAndolfatto = false | ) | [virtual] |
| likeThorntonAndolfatto | The calculation of H' requires calculation of $\theta^2$. In Thornton and Andolfatto, we simply used $\hat{\theta}_W^2$, which is slightly biased. By default, this function calculates $\theta^2 = \frac{S(S-1)}{a_n^2 + b_n}$, unless this bool is set to true, in which case $\hat{\theta}_W^2$ is used. |
Reimplemented in Sequence::PolySIM.
Definition at line 668 of file PolySNP.cc.
| double Sequence::PolySNP::HudsonsC | ( | void | ) |
Hudson's C, an estimator of the population recombination rate that depends on the variance of the site frequencies. The calculation is made by a call to Recombination::HudsonsC.
Definition at line 1290 of file PolySNP.cc.
| unsigned Sequence::PolySNP::Minrec | ( | void | ) | [virtual] |
Reimplemented in Sequence::PolySIM.
Definition at line 1306 of file PolySNP.cc.
| unsigned Sequence::PolySNP::NumExternalMutations | ( | void | ) | [virtual] |
Reimplemented in Sequence::PolySIM.
Definition at line 616 of file PolySNP.cc.
| unsigned Sequence::PolySNP::NumMutations | ( | void | ) | [virtual] |
Reimplemented in Sequence::PolySIM.
Definition at line 564 of file PolySNP.cc.
| unsigned Sequence::PolySNP::NumPoly | ( | void | ) |
Definition at line 547 of file PolySNP.cc.
| unsigned Sequence::PolySNP::NumSingletons | ( | void | ) | [virtual] |
Reimplemented in Sequence::PolySIM.
Definition at line 583 of file PolySNP.cc.
| double Sequence::PolySNP::SamplingVarPi | ( | void | ) |
Component of variance of mean pairwise differences from sampling. Tajima in Takahata/Clark book, (15)
Definition at line 976 of file PolySNP.cc.
| double Sequence::PolySNP::StochasticVarPi | ( | void | ) |
Stochastic variance of mean pairwise differences. Tajima in Takahata/Clark book, (14).
Definition at line 962 of file PolySNP.cc.
| double Sequence::PolySNP::TajimasD | ( | void | ) | [virtual] |
A common summary of the site frequency spectrum. Proportional to $\hat{\theta}_\pi - \hat{\theta}_W$. This routine does calculate the denominator of the test statistic.
Reimplemented in Sequence::PolySIM.
Definition at line 647 of file PolySNP.cc.
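For orientation, the statistic can be computed from three summary values using the constants of Tajima (1989). The sketch below is a standalone illustration under that assumption and is not the libsequence code; the function name `tajimas_d` and the local names `a1`, `a2`, `b1`, `b2`, `c1`, `c2`, `e1`, `e2` are chosen here, and no correspondence with the protected members of this class is implied. It also assumes a single sample size for all sites, so it does not cover missing data.

```cpp
#include <cassert>
#include <cmath>

// Sketch of Tajima's (1989) D from summary values; not libsequence code.
// n: sample size, S: number of segregating sites,
// pi: mean number of pairwise differences.
double tajimas_d(unsigned n, unsigned S, double pi)
{
  // The variance constants below are both zero for n < 4.
  assert(n >= 4 && S >= 1);
  double a1 = 0.0, a2 = 0.0;
  for (unsigned i = 1; i < n; ++i)
    {
      a1 += 1.0 / i;
      a2 += 1.0 / (static_cast<double>(i) * i);
    }
  const double dn = n;
  const double b1 = (dn + 1.0) / (3.0 * (dn - 1.0));
  const double b2 = 2.0 * (dn * dn + dn + 3.0) / (9.0 * dn * (dn - 1.0));
  const double c1 = b1 - 1.0 / a1;
  const double c2 = b2 - (dn + 2.0) / (a1 * dn) + a2 / (a1 * a1);
  const double e1 = c1 / a1;
  const double e2 = c2 / (a1 * a1 + a2);
  const double dS = S;
  // Numerator: theta_pi - theta_W. Denominator: its estimated standard deviation.
  return (pi - dS / a1) / std::sqrt(e1 * dS + e2 * dS * (dS - 1.0));
}
```

A negative value indicates that the estimate based on pairwise differences is smaller than Watterson's estimate, as happens with an excess of rare variants.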
| double Sequence::PolySNP::ThetaH | ( | void | ) | [virtual] |
Calculate Theta ( = 4Nu) from site homozygosity, a la Fay and Wu (2000). This statistic is problematic in general to calculate when there are multiple hits. The test requires that the ancestral state (inferred from the outgroup) still be segregating in the ingroup. If that is not true, the site is skipped.
If there are >= 2 derived states inferred, a "missing data" approach is taken.
For example:
Outgroup :
A
Ingroup :
A
A
A
G
G
T
Gets treated as two sites:
A A
A A
A A
G N
G N
N T
This keeps the expectation of the statistic equal to $\theta$, and uses the correct number of derived mutations observed in the data.
Reimplemented in Sequence::PolySIM.
Definition at line 292 of file PolySNP.cc.
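The "missing data" treatment of a site with two or more derived states can be sketched as below. This is an illustration of the recoding shown in the example, not the routine used in PolySNP.cc; the function name `split_multiple_hits` is hypothetical, and the order of the output columns (first appearance of each derived state) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sketch only (not libsequence code): split one ingroup column with several
// derived states into one column per derived state. In each new column the
// other derived states are recoded as missing data ('N').
std::vector<std::string> split_multiple_hits(char ancestral, const std::string &column)
{
  assert(ancestral != 'N'); // the ancestral state must be known
  std::string derived;      // distinct derived states, in order of first appearance
  for (std::size_t i = 0; i < column.size(); ++i)
    {
      const char c = column[i];
      if (c != ancestral && c != 'N' && derived.find(c) == std::string::npos)
        derived += c;
    }
  std::vector<std::string> columns;
  for (std::size_t d = 0; d < derived.size(); ++d)
    {
      std::string col(column);
      for (std::size_t i = 0; i < col.size(); ++i)
        if (col[i] != ancestral && col[i] != derived[d])
          col[i] = 'N';
      columns.push_back(col);
    }
  return columns;
}
```

Called with ancestral state 'A' and the ingroup column "AAAGGT", this yields the two columns "AAAGGN" and "AAANNT", matching the example above.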
| double Sequence::PolySNP::ThetaL | ( | void | ) | [virtual] |
Calculate Theta ( = 4Nu) from site homozygosity, corresponding to equation 1 in Thornton and Andolfatto (Genetics), "Approximate Bayesian Inference reveals evidence for a recent, severe, bottleneck in a Netherlands population of Drosophila melanogaster" (although the statistic was labelled differently in that paper). The test requires that the ancestral state (inferred from the outgroup) still be segregating in the ingroup. If that is not true, the site is skipped.
If there are >= 2 derived states inferred, a "missing data" approach is taken.
For example:
Outgroup :
A
Ingroup :
A
A
A
G
G
T
Gets treated as two sites:
A A
A A
A A
G N
G N
N T
This keeps the expectation of the statistic equal to $\theta$, and uses the correct number of derived mutations observed in the data.
Reimplemented in Sequence::PolySIM.
Definition at line 423 of file PolySNP.cc.
| double Sequence::PolySNP::ThetaPi | ( | void | ) | [virtual] |
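Calculated here as the sum, across sites, of 1.0 minus the site homozygosity:
$$\hat{\theta}_\pi = \sum_{i=1}^{S}\left(1 - \sum_{j}\frac{k_{j,i}(k_{j,i}-1)}{n_i(n_i-1)}\right)$$
where $S$ is the number of segregating sites, $k_{j,i}$ is the number of occurrences of the $j$-th character state at site $i$, and $n_i$ is the sample size at site $i$. Calculating the statistic in this manner makes it easy to generalize to an arbitrary number of character states per polymorphic site.
Also equivalent to the sum of site heterozygosities:
$$\hat{\theta}_\pi = \sum_{i=1}^{S}\frac{n_i}{n_i-1}\left(1 - \sum_{j}\left(\frac{k_{j,i}}{n_i}\right)^2\right)$$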
Also equivalent to mean pairwise differences, but that's slow to calculate.
If there is missing data (indicated by 'N' characters), the sample size is reduced for that site. For example, if the data for the $i$-th site is:
A
A
A
N
N
G
Then ThetaPi is calculated for that site as if the sample size were 4 (not 6), and the polymorphic site frequencies are 3/4 for A and 1/4 for G
Reimplemented in Sequence::PolySIM.
Definition at line 176 of file PolySNP.cc.
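The per-site calculation with a reduced sample size can be sketched as follows. This is a minimal standalone illustration and not the code in PolySNP.cc; the function name `site_diversity` and the representation of a site as a std::string column are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Sketch only: contribution of one site to ThetaPi,
// 1 - sum_j k_j (k_j - 1) / (n (n - 1)), where 'N' characters are skipped
// and therefore reduce the sample size n for that site.
double site_diversity(const std::string &column)
{
  std::map<char, unsigned> counts;
  unsigned n = 0;
  for (std::size_t i = 0; i < column.size(); ++i)
    {
      if (column[i] != 'N')
        {
          ++counts[column[i]];
          ++n;
        }
    }
  assert(n >= 2); // at least two typed sequences are needed
  double homozygosity = 0.0;
  for (std::map<char, unsigned>::const_iterator it = counts.begin();
       it != counts.end(); ++it)
    {
      const double k = it->second;
      homozygosity += k * (k - 1.0) / (static_cast<double>(n) * (n - 1.0));
    }
  return 1.0 - homozygosity;
}
```

For the column "AAANNG" from the example, the sample size is 4, the counts are 3 and 1, and the site contributes 1 - 6/12 = 0.5. Summing this value over all segregating sites gives the statistic.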
| double Sequence::PolySNP::ThetaW | ( | void | ) | [virtual] |
The classic "Watterson's Theta" statistic, generalized to missing data and multiple mutations per site:
$$\hat{\theta}_W = \sum_{i=1}^{S}\frac{1}{\sum_{k=1}^{n_i-1}\frac{1}{k}}$$
For this statistic, $S$ is either the number of segregating sites or the number of mutations on the genealogy, and $n_i$ is the sample size at site $i$. If totMuts is true, the number of mutations is used; otherwise the number of segregating sites is used.
Reimplemented in Sequence::PolySIM.
Definition at line 246 of file PolySNP.cc.
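A per-site version of Watterson's estimator can be sketched as below, assuming the caller has already determined the sample size at each segregating site (or for each mutation). The function name `watterson_theta` and the vector-of-sample-sizes interface are hypothetical and are not the libsequence API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch only: Watterson's theta with a per-site sample size.
// sample_sizes[i] is n_i for the i-th segregating site (or mutation).
double watterson_theta(const std::vector<unsigned> &sample_sizes)
{
  double theta = 0.0;
  for (std::size_t i = 0; i < sample_sizes.size(); ++i)
    {
      assert(sample_sizes[i] >= 2);
      double a = 0.0; // a_{n_i} = sum_{k=1}^{n_i - 1} 1/k
      for (unsigned k = 1; k < sample_sizes[i]; ++k)
        a += 1.0 / k;
      theta += 1.0 / a;
    }
  return theta;
}
```

When every site has the same sample size n, the sum reduces to the familiar S divided by a_n.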
| double Sequence::PolySNP::VarPi | ( | void | ) |
Total variance of mean pairwise differences. Tajima in Takahata/Clark book, (13).
Definition at line 948 of file PolySNP.cc.
| double Sequence::PolySNP::VarThetaW | ( | void | ) |
Definition at line 993 of file PolySNP.cc.
| double Sequence::PolySNP::WallsB | ( | void | ) | [virtual] |
Reimplemented in Sequence::PolySIM.
Definition at line 827 of file PolySNP.cc.
| unsigned Sequence::PolySNP::WallsBprime | ( | void | ) | [virtual] |
Reimplemented in Sequence::PolySIM.
Definition at line 917 of file PolySNP.cc.
| double Sequence::PolySNP::WallsQ | ( | void | ) | [virtual] |
Reimplemented in Sequence::PolySIM.
Definition at line 932 of file PolySNP.cc.