The popgenstat module provides functions that calculate some common population genetic summary statistics.
For example, given a DnaCharacterMatrix as an argument, the num_segregating_sites function returns the raw number of segregating sites, average_number_of_pairwise_differences returns the average number of pairwise differences, and nucleotide_diversity returns the nucleotide diversity.
More complex statistics are provided by the PopulationPairSummaryStatistics class. Objects of this class are instantatiated with two lists of CharacterDataVector objects as arguments, each representing a sample of DNA sequences drawn from two distinct but related populations. Once instantiated, the following attributes of the PopulationPairSummaryStatistics object are available:
- The average number of pairwise differences between every sequence across both populations.
- The average number of pairwise differences between every sequence between both populations.
- The average number of pairwise differences between every sequence within each population.
- The net number of pairwise differences.
- The number of segregating sites.
- Watterson’s theta.
- Wakeley’s psi.
- Tajima’s D.
The following example calculates the suite of population genetic summary statistics for sequences drawn from two populations of sticklebacks. The original data consists of 23 sequences, with individuals from Eastern Pacific populations identified by their taxon labels beginning with “EPAC” and individuals from Western Pacific populations identified by their taxon labels beginning with “WPAC”. The taxon labels thus are used as the basis for sorting the sequences into the required lists of CharacterDataVector objects, p1 and p2.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
#! /usr/bin/env python import dendropy from dendropy import popgenstat seqs = dendropy.DnaCharacterMatrix.get_from_path("orti1994.nex", schema="nexus") p1 =  p2 =  for idx, t in enumerate(seqs.taxon_set): if t.label.startswith('EPAC'): p1.append(seqs[t]) else: p2.append(seqs[t]) pp = popgenstat.PopulationPairSummaryStatistics(p1, p2) print('Average number of pairwise differences (total): %s' \ % pp.average_number_of_pairwise_differences) print('Average number of pairwise differences (between populations): %s' \ % pp.average_number_of_pairwise_differences_between) print('Average number of pairwise differences (within populations): %s' \ % pp.average_number_of_pairwise_differences_within) print('Average number of pairwise differences (net): %s' \ % pp.average_number_of_pairwise_differences_net) print('Number of segregating sites: %s' \ % pp.num_segregating_sites) print("Watterson's theta: %s" \ % pp.wattersons_theta) print("Wakeley's Psi: %s" \ % pp.wakeleys_psi) print("Tajima's D: %s" \ % pp.tajimas_d)
Lines 6-12 build up the two lists of CharacterDataVector objects by sorting the original sequences into their source populations based on the taxon label (with operational taxonomic units with labels beginning with “EPAC” coming from the Eastern Pacific, and assigned to the list p1, while those that begin with “WPAC” coming from the Western Pacific, and assigned to the list p2). These lists are then passed as the instantiation arguments to the PopulationPairSummaryStatistics constructor in line 14. The calculations are performed immediately, and the results reported in the following lines.