Sequence::PolySNP Class Reference
[Analysis of molecular population genetic data]

Molecular population genetic analysis. More...

#include <Sequence/PolySNP.hpp>

Inheritance diagram for Sequence::PolySNP:
Sequence::PolySIM

List of all members.

Public Member Functions

 PolySNP (const Sequence::PolyTable *data, bool haveOutgroup=false, unsigned outgroup=0, bool totMuts=true)
virtual double ThetaPi (void)
virtual double ThetaW (void)
virtual double ThetaH (void)
virtual double ThetaL (void)
double VarPi (void)
double StochasticVarPi (void)
double SamplingVarPi (void)
double VarThetaW (void)
unsigned NumPoly (void)
virtual unsigned NumMutations (void)
virtual unsigned NumSingletons (void)
virtual unsigned NumExternalMutations (void)
virtual double TajimasD (void)
virtual double Hprime (bool likeThorntonAndolfatto=false)
virtual double Dnominator (void)
virtual double FuLiD (void)
virtual double FuLiF (void)
virtual double FuLiDStar (void)
virtual double FuLiFStar (void)
double DandVH (void)
unsigned DandVK (void)
virtual double WallsB (void)
virtual unsigned WallsBprime (void)
virtual double WallsQ (void)
double HudsonsC (void)
virtual unsigned Minrec (void)
std::vector< std::vector
< double > > 
Disequilibrium (const unsigned &mincount=1, const double &max_marker_distance=std::numeric_limits< double >::max())

Protected Member Functions

void DepaulisVeuilleStatistics (void)
virtual void WallStats (void)
double a_sub_n (void)
double a_sub_n_plus1 (void)
double b_sub_n (void)
double b_sub_n_plus1 (void)
double c_sub_n (void)
double d_sub_n (void)

Protected Attributes

std::auto_ptr< _PolySNPImplrep

Detailed Description

Molecular population genetic analysis.

Example Use:

  #include <iostream>
  #include <vector>
  #include <Sequence/PolySNP.hpp>
  #include <Sequence/Fasta.hpp>
  #include <Sequence/PolySites.hpp>
  using namespace std;
  using namespace Sequence;
  int main
  {
    vector<Fasta> data;
    Alignment::GetData(data,"popdata.fasta");
    assert(Alignment::IsAlignment(data))
    if (Alignment::Gapped(data))
    {
      Alignment::RemoveTerminalGaps(data);
    }
    PolySites *polytable = new PolySites(data);
    PolySNP *analyze = new PolySNP(data,false,0);
    cout << "Tajima's D is " << analyze->TajimasD() << endl;
    delete polytable;
    delete analyze;
    exit(1);
  }
Warning:
The routines to calculate nucleotide diversity, numbers of singletons, etc. all handle missing data (the N character). However, all summary statistics involved in "tests of neutrality" are, strictly speaking, undefined if missing data are present. The reason for this is because the denominators of the statistics are functions of the sample sizes, and no explicit formulae exist when the sample size varies from site to site (which is the case when there are missing data). In short, if you want to be rigorous, you can only really count up nucleotide diversity and a few other statistics if your data contain untyped SNPs. However, the routines present in libsequence will happily go and calculate the summary statistics for you, and it is up to you to be aware that you are writing a program that may give biased results. To date, the magnitude and direction of the bias remains unknown. Functions (and hence the statistics) that are affected have warnings in their documentation.
Note:
As of libsequence 1.4.1, the routines in this class explicity check for gaps in the polymorphism table. This provides an additional safeguard for cases where Sequence::PolyTable objects are created and sites with gaps are left in.
Examples:

slidingWindow.cc, and slidingWindow2.cc.

Definition at line 83 of file PolySNP.hpp.


Constructor & Destructor Documentation

Sequence::PolySNP::PolySNP ( const Sequence::PolyTable data,
bool  haveOutgroup = false,
unsigned  outgroup = 0,
bool  totMuts = true 
) [explicit]
Parameters:
data a valid object of type Sequence::PolyTable
haveOutgroup true if an outgroup is present, false otherwise
outgroup if haveOutgroup is true, outgroup is the index of that sequence in data
totMuts if true (the default) use the total number of inferred mutations, otherwise use the total number of polymorphic sites in calculations
Note:
this constructor allocates a pointer to an implementation class, which automatically pre-processes the SNP data

Definition at line 157 of file PolySNP.cc.


Member Function Documentation

double Sequence::PolySNP::a_sub_n ( void   )  [protected]

\[a_n=\sum_{i=1}^{i=n-1}\frac{1}{i}.\ \]

This is the denominator of Watterson's Theta (see PolySNP::ThetaW)

Warning:
statistic undefined if there are untyped SNPs

Definition at line 1149 of file PolySNP.cc.

double Sequence::PolySNP::a_sub_n_plus1 ( void   )  [protected]

\[a_{n+1}=\sum_{i=1}^{i=n}\frac{1}{i}\ \]

Warning:
statistic undefined if there are untyped SNPs

Definition at line 1165 of file PolySNP.cc.

double Sequence::PolySNP::b_sub_n ( void   )  [protected]

\[b_n=\sum_{i=1}^{i=n-1}\frac{1}{i^2}\ \]

Warning:
statistic undefined if there are untyped SNPs

Definition at line 1182 of file PolySNP.cc.

double Sequence::PolySNP::b_sub_n_plus1 ( void   )  [protected]

\[b_n=\sum_{i=1}^{i=n}\frac{1}{i^2}\ \]

Warning:
statistic undefined if there are untyped SNPs
Author:
Joshua Shapiro

Definition at line 1197 of file PolySNP.cc.

double Sequence::PolySNP::c_sub_n ( void   )  [protected]

\[ c_n=\left\{\begin{array}{cl} 1 , & when\ n = 2 \\ \frac{2 \times (n \times a_n - 2 \times (n-1))}{(n-1) \times (n-2)}, & when\ n > 2 \\ \end{array}\right.\ \]

Warning:
statistic undefined if there are untyped SNPs

Definition at line 1213 of file PolySNP.cc.

double Sequence::PolySNP::d_sub_n ( void   )  [protected]

\[\ d_n=\frac{2}{n-1} \times (1.5 - \frac{2 \times a_{n+1}}{n-2} - \frac{1}{n})\ \]

Warning:
statistic undefined if there are untyped SNPs

Definition at line 1239 of file PolySNP.cc.

double Sequence::PolySNP::DandVH ( void   ) 

To check if two sequences are unique, Sequence::Comparisons::Different is used, which does not allow missing data to result in 2 sequences being considered different (as they would be if you simply used thestd::string comparison operators == or !=)

Returns:
the haplotype diversity of the data.
Examples:
bottleneck.cc.

Definition at line 1256 of file PolySNP.cc.

unsigned Sequence::PolySNP::DandVK ( void   ) 

To check if two sequences are unique, Sequence::Comparisons::Different is used, which does not allow missing data to result in 2 sequences being considered different (as they would be if you simply used the std::string comparison operators == or !=)

Returns:
number of haplotypes in the sample

Definition at line 1273 of file PolySNP.cc.

void Sequence::PolySNP::DepaulisVeuilleStatistics ( void   )  [protected]

Calculate the number of haplotypes in the sample, and haplotype diversity. Unlike Depaulis and Veuille's original paper, this routine uses an unbiased calculation of haplotype diversity (i.e. divide by n choose 2).
To check if two sequences are unique, Sequence::Comparisons::Different is used, which does not allow missing data to result in 2 sequences being considered different (as they would be if you simply used the std::string comparison operators == or !=)

Definition at line 753 of file PolySNP.cc.

std::vector< std::vector< double > > Sequence::PolySNP::Disequilibrium ( const unsigned &  mincount = 1,
const double &  max_marker_distance = std::numeric_limits<double>::max() 
)
Returns:
A vector of statistics related to LD and distance in the sample. An empty vector is returned if there are < 2 polymorphic sites in the sample. See the documentation for Recombination::Disequilibrium for a description of the return vector.
Parameters:
mincount a frequency filter. A polymorphism must be present at least mincount times in the data
Note:
For D and D', the 11 gamete is defined as follows: If no outgroup is present, it refers to the genotype of minor alleles at both sites. If there is an outgroup, it is based on the genotype of derived alleles at both sites.

Definition at line 1420 of file PolySNP.cc.

double Sequence::PolySNP::Dnominator ( void   )  [virtual]
Warning:
statistic undefined if there are untyped SNPs
Returns:
Denominator of Tajima's D, or nan if there are no polymorphic sites

Reimplemented in Sequence::PolySIM.

Definition at line 719 of file PolySNP.cc.

double Sequence::PolySNP::FuLiD ( void   )  [virtual]
Returns:
The Fu and Li (1993) D statistic, or nan if there are no polymorphic sites.
Note:
For sequence data, an outgroup is required. This requirement is checked by assert()
Warning:
statistic undefined if there are untyped SNPs

Reimplemented in Sequence::PolySIM.

Definition at line 1009 of file PolySNP.cc.

double Sequence::PolySNP::FuLiDStar ( void   )  [virtual]
Warning:
statistic undefined if there are untyped SNPs
Returns:
Fu and Li (1993) D*, or nan if there are no polymorphic sites

Reimplemented in Sequence::PolySIM.

Definition at line 1074 of file PolySNP.cc.

double Sequence::PolySNP::FuLiF ( void   )  [virtual]
Returns:
Fu and Li (1993) F statistic, or nan if there are no polymorphic sites
Note:
For sequence data, an outgroup is required, else undefined
Warning:
statistic undefined if there are untyped SNPs

Reimplemented in Sequence::PolySIM.

Definition at line 1038 of file PolySNP.cc.

double Sequence::PolySNP::FuLiFStar ( void   )  [virtual]

Fu and Li (1993) F* statistic. Incorporates correction from Simonsen et al. (1995) Genetics 141: 413, eqn A5.

Warning:
statistic undefined if there are untyped SNPs
Returns:
Fu and Li (1993) F* statistic, or nan if there are no polymorphic sites

Reimplemented in Sequence::PolySIM.

Definition at line 1108 of file PolySNP.cc.

double Sequence::PolySNP::Hprime ( bool  likeThorntonAndolfatto = false  )  [virtual]
Returns:
ThetaPi-ThetaH/(~Var(ThetaPi-ThetaH)). This corresponds to Equation 5 in Thornton and Andolfatto (Genetics) "Approximate Bayesian Inference reveals evidence for a recent, severe, bottleneck in a Netherlands population of Drosophila melanogaster" and Equation 13 of Zeng et al. (2006) Genetics 1431-1439
Parameters:
likeThorntonAndolfatto The calculation of H' requires calculation of $\theta^2$. In Thornton and Andolfatto, we simply used $\widehat\theta_W^2$, which is slightly biased. By default, this function calculates $\theta^2=\frac{S(S-1)}{a_n^2+b_n}$, unless this bool is set to false, in which case $\widehat\theta_W^2$ is used.
Note:
returns nan if there are 0 polymorphic sites
Author:
Joshua Shapiro

Reimplemented in Sequence::PolySIM.

Definition at line 668 of file PolySNP.cc.

double Sequence::PolySNP::HudsonsC ( void   ) 
Returns:
Hudson's (1987) estimator of $\rho=4Nc$, an estimator of the population recombination rate that depends on the variance of the site frequencies. The calculation is made by a call to Recombination::HudsonsC
Note:
Will return nan if there are no polymorphic sites

Definition at line 1290 of file PolySNP.cc.

unsigned Sequence::PolySNP::Minrec ( void   )  [virtual]
Returns:
The minimum number of recombination events observed in the sample (Hudson and Kaplan 1985). Will return SEQMAXUNSIGNED if there are < 2 segregating sites.
Note:
Code is a modification of that provided by Jeff Wall

Reimplemented in Sequence::PolySIM.

Definition at line 1306 of file PolySNP.cc.

unsigned Sequence::PolySNP::NumExternalMutations ( void   )  [virtual]
Returns:
the number of derived singletons.
Note:
For sequence data, an outgroup is required. Will return SEQMAXUNSIGNED if that is not the case.

Reimplemented in Sequence::PolySIM.

Definition at line 616 of file PolySNP.cc.

unsigned Sequence::PolySNP::NumMutations ( void   )  [virtual]
Returns:
the total number of mutations in the data. The number of mutations per site = number of states per site - 1

Reimplemented in Sequence::PolySIM.

Definition at line 564 of file PolySNP.cc.

unsigned Sequence::PolySNP::NumPoly ( void   ) 
Returns:
the number of polymorphic (segregating) sites in data
Examples:
msstats.cc.

Definition at line 547 of file PolySNP.cc.

unsigned Sequence::PolySNP::NumSingletons ( void   )  [virtual]
Returns:
number of polymorphisms that appear once in the data, without respect to ancestral/derived

Reimplemented in Sequence::PolySIM.

Definition at line 583 of file PolySNP.cc.

double Sequence::PolySNP::SamplingVarPi ( void   ) 

Component of variance of mean pairwise differences from sampling. Tajima in Takahata/Clark book, (15)

Warning:
statistic undefined if there are untyped SNPs

Definition at line 976 of file PolySNP.cc.

double Sequence::PolySNP::StochasticVarPi ( void   ) 

Stochastic variance of mean pairwise differences. Tajima in Takahata/Clark book, (14).

Warning:
statistic undefined if there are untyped SNPs

Definition at line 962 of file PolySNP.cc.

double Sequence::PolySNP::TajimasD ( void   )  [virtual]

A common summary of the site frequency spectrum. Proportional to $\widehat\theta_\pi-\widehat\theta_W$. This routine does calculate the denominator of the test statistic.

Warning:
statistic undefined if there are untyped SNPs
Returns:
Tajima's D, or nan if there are no polymorphic sites

Reimplemented in Sequence::PolySIM.

Examples:
slidingWindow.cc, and slidingWindow2.cc.

Definition at line 647 of file PolySNP.cc.

double Sequence::PolySNP::ThetaH ( void   )  [virtual]

Calculate Theta ( = 4Nu) from site homozygosity, a la Fay and Wu (2000). This statistic is problematic in general to calculate when there are multiple hits. The test requires that the ancestral state (inferred from the outgroup) still be segregating in the ingroup. If that is not true, the site is skipped.

If there are >= 2 derived states inferred, a "missing data" approach is taken.
For example:
Outgroup :
A
Ingroup :
A
A
A
G
G
T
Gets treated as two sites:
A A
A A
A A
G N
G N
N T

This keeps the expectation of the statistic equal to $\theta$, and uses the correct number of derived mutations ovserved in the data.

Note:
When using PolySNP, an outgroup is required. When using PolySIM, which is constructed with a SimData *, an outgroup is not required (as the 0,1 coding of the data refers to ancestral and derived, respectively).

Reimplemented in Sequence::PolySIM.

Definition at line 292 of file PolySNP.cc.

double Sequence::PolySNP::ThetaL ( void   )  [virtual]

Calculate Theta ( = 4Nu) from site homozygosity, corresponding to equation 1 in Thornton and Andolfatto (Genetics) "Approximate Bayesian Inference reveals evidence for a recent, severe, bottleneck in a Netherlands population of Drosophila melanogaster," (although we labelled in $theta_\eta$ in that paper) The test requires that the ancestral state (inferred from the outgroup) still be segregating in the ingroup. If that is not true, the site is skipped.

If there are >= 2 derived states inferred, a "missing data" approach is taken.
For example:
Outgroup :
A
Ingroup :
A
A
A
G
G
T
Gets treated as two sites:
A A
A A
A A
G N
G N
N T

This keeps the expectation of the statistic equal to $\theta$, and uses the correct number of derived mutations ovserved in the data.

Note:
For sequence data, an outgroup is required. This requirement is checked by assert()
Author:
Joshua Shapiro

Reimplemented in Sequence::PolySIM.

Definition at line 423 of file PolySNP.cc.

double Sequence::PolySNP::ThetaPi ( void   )  [virtual]

Calculated here as the sum of 1.0 - sum of site homozygosity accross sites.

\[ \widehat\theta_\pi=\sum_{i=1}^{i=S}(1-\sum_{j=1}^{j=4}\frac{k_{j,i} \times (k_{j,i}-1)}{n_i \times (n_i - 1)});k_{j,i}>0 \]

Where $S$ is the number of segregating sites, $k_{j,i}$ is the number of occurences of the $j_{th}$ character state at site $i$, and $n_i$ is the sample size at site $i$. Calculating the statistic in this manner makes it easy to generalize to an arbitrary number of character states per polymorphic site
Also equivalent to sum of site heterozygosities:

\[ \widehat\theta_\pi=\frac{\sum_{i=1}^{i=S}2{p_i}{q_i}}{(\genfrac{}{}{0pt}{}{n}{2})} \]


Also equivalent to mean pairwise differences, but that's slow to calculate.

If there is missing data (indicated by 'N' characters), the sample size is reduced for that site. For example, if the data for the $i_{th}$ site is:
A
A
A
N
N
G
Then ThetaPi is calculated for that site as if the sample size were 4 (not 6), and the polymorphic site frequencies are 3/4 for A and 1/4 for G

Reimplemented in Sequence::PolySIM.

Definition at line 176 of file PolySNP.cc.

double Sequence::PolySNP::ThetaW ( void   )  [virtual]

The classic "Watterson's Theta" statistic, generalized to missing data and multiple mutations per site:

\[ \widehat\theta_w=\sum_{i=1}^{i=S}\frac{S}{\sum_{j=1}^{j=n_i-1}\frac{1}{j}} \]


For this statistic, $S$ is either the number of segregating sites, or the number of mutations on the genealogy and $n_i$ is the sample size at site i. If totMuts == 1, the number of mutations is used, else the number ofsegregating sites is used.

Warning:
Statistic undefined if there are untyped SNPs. In the presence of missing data, ThetaW is calculated as the sum (over all segregating sites) of 1/a_sub_n, where a_sub_n is the denominator of ThetaW, using the number of alleles without missing data as the sample size. More formally, the routine returns a calculation base on an unweighted sample size adjustment accross sites.

Reimplemented in Sequence::PolySIM.

Definition at line 246 of file PolySNP.cc.

double Sequence::PolySNP::VarPi ( void   ) 

Total variance of mean pairwise differences. Tajima in Takahata/Clark book, (13).

Warning:
statistic undefined if there are untyped SNPs

Definition at line 948 of file PolySNP.cc.

double Sequence::PolySNP::VarThetaW ( void   ) 
Returns:
Variance of Watterson's Theta (ThetaW()).
Warning:
statistic undefined if there are untyped SNPs

Definition at line 993 of file PolySNP.cc.

double Sequence::PolySNP::WallsB ( void   )  [virtual]
Returns:
Wall's B Statistic. Wall, J. (1999) Genetical Research 74, pp 65-79
Author:
Kevin Thornton

Reimplemented in Sequence::PolySIM.

Definition at line 827 of file PolySNP.cc.

unsigned Sequence::PolySNP::WallsBprime ( void   )  [virtual]
Returns:
Wall's B' Statistic. Wall, J. (1999) Genetical Research 74, pp 65-79
Author:
Kevin Thornton

Reimplemented in Sequence::PolySIM.

Definition at line 917 of file PolySNP.cc.

double Sequence::PolySNP::WallsQ ( void   )  [virtual]
Returns:
Wall's Q Statistic. Wall, J. (1999) Genetical Research 74, pp 65-79
Author:
Kevin Thornton

Reimplemented in Sequence::PolySIM.

Definition at line 932 of file PolySNP.cc.


The documentation for this class was generated from the following files:
Generated on Thu Aug 11 13:22:03 2011 for libsequence by  doxygen 1.6.3