# 3.3. Tree Statistics, Metrics, and Calculations¶

## 3.3.1. Tree Length¶

The length method returns the sum of edge lengths of a Tree object, with edges that do not have any length assigned being treated as edges with length 0. The following example shows how to identify the “critical” value for an Archie-Faith-Cranston or PTP test from a sample of Tree objects, i.e. a tree length equal to or greater than 95% of the trees in the sample:

 ``` 1 2 3 4 5 6 7 8 9 10 11 12``` ```#! /usr/bin/env python import dendropy trees = dendropy.TreeList.get_from_path("pythonidae.random.bd0301.tre", "nexus") tree_lengths = [tree.length() for tree in trees] tree_lengths.sort() crit_index_95 = int(0.95 * len(tree_lengths)) crit_index_99 = int(0.99 * len(tree_lengths)) print("95%% critical value: %s" % tree_lengths[crit_index_95]) print("99%% critical value: %s" % tree_lengths[crit_index_99]) ```

## 3.3.2. Node Ages¶

The calc_node_ages method calculates the age of a node (i.e., the sum of edge lengths from the node to a tip) and assigns it to the age attribute. The following example iterates through the post-burn-in of an MCMC sample of ultrametric trees, calculating the age of the MRCA of two taxa, and reports the mean age of the node.

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16``` ```#! /usr/bin/env python import dendropy trees = dendropy.TreeList.get_from_path("pythonidae.beast-mcmc.trees", "nexus", tree_offset=200) maculosa_childreni_ages = [] for idx, tree in enumerate(trees): tree.calc_node_ages() node1 = tree.find_node_with_taxon_label(label="Antaresia maculosa") node2 = tree.find_node_with_taxon_label(label="Antaresia childreni") mrca = dendropy.Tree.mrca(node1, node2) maculosa_childreni_ages.append(mrca.age) print("Mean age of MRCA of 'Antaresia maculosa' and 'Antaresia childreni': %s" \ % (float(sum(maculosa_childreni_ages))/len(maculosa_childreni_ages))) ```

## 3.3.3. Pybus-Harvey Gamma¶

The Pybus-Harvey Gamma statistic is given by the pybus_harvey_gamma instance method. The following example iterates through the post-burn-in of an MCMC sample of trees, reporting the mean Pybus-Harvey Gamma statistic:

 ``` 1 2 3 4 5 6 7 8 9 10 11 12``` ```#! /usr/bin/env python import dendropy trees = dendropy.TreeList.get_from_path("pythonidae.beast-mcmc.trees", "nexus", tree_offset=200) pbhg = [] for idx, tree in enumerate(trees): pbhg.append(tree.pybus_harvey_gamma()) print("Mean Pybus-Harvey-Gamma: %s" \ % (float(sum(pbhg))/len(pbhg))) ```

## 3.3.4. Patristic Distances¶

The PatristicDistanceMatrix is the most efficient way to calculate the patristic distances between taxa or leaves on a tree, when doing multiple such calculations. Its constructor takes a Tree object as an argument, and the object return is callable, taking two Taxon objects as arguments and returning the sum of edge lengths between the two. The following example reports the pairwise distances between all taxa on the input tree:

 ``` 1 2 3 4 5 6 7 8 9 10``` ```#! /usr/bin/env python import dendropy from dendropy import treecalc tree = dendropy.Tree.get_from_path("pythonidae.mle.nex", "nexus") pdm = treecalc.PatristicDistanceMatrix(tree) for i, t1 in enumerate(tree.taxon_set): for t2 in tree.taxon_set[i+1:]: print("Distance between '%s' and '%s': %s" % (t1.label, t2.label, pdm(t1, t2))) ```

## 3.3.5. Probability Under the Coalescent Model¶

The coalescent module provides a range of methods for simulations and calculations under Kingman’s coalescent framework and related models. For example:

log_probability_of_coalescent_tree
Given a Tree object as the first argument, and the haploid population size as the second, returns the log probability of the Tree under the neutral coalescent.
kl_divergence_coalescent_trees
Reports the Kullback-Leilber divergence of a list of trees from the theoretical distribution of neutral coalescent trees. Requires the de Hoon statistics package package to be installed.

## 3.3.6. Numbers of Deep Coalescences¶

reconciliation_discordance
Given two Tree objects sharing the same leaf-set, this returns the number of deep coalescences resulting from fitting the first tree (e.g., a gene tree) to the second (e.g., a species tree). This is based on the algorithm described Goodman, et al. (Goodman, et al., 1979. Fitting the gene lineage into its species lineage,a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Zool. 19: 99-113).

Changed in version 3.3.0: Renamed and moved to reconcile module.

monophyletic_partition_discordance
Given a Tree object as the first argument, and a list of lists of Taxon objects representing the expected monophyletic partitioning of the TaxonSet of the Tree as the second argument, this returns the number of deep coalescences found in the relationships implied by the Tree object, conditional on the taxon groupings given by the second argument. This statistic corresponds to the Slatkin and Maddison (1989) s statistic, as described here.

Changed in version 3.3.0: Renamed and moved to reconcile module.

## 3.3.7. Number of Deep Coalescences when Embedding One Tree in Another (e.g. Gene/Species Trees)¶

Imagine we wanted to generate the distribution of the number of deep coalescences under two scenarios: one in which a population underwent sequential or step-wise vicariance, and another when there was simultaneous fragmentation. In this case, the containing tree and the embedded trees have different leaf sets, and there is a many-to-one mapping of embedded tree taxa to containing tree taxa.

The ContainingTree class is designed to allow for counting deep coalescences in cases like this. It requires a TaxonSetMapping object, which provides an association between the embedded taxa and the containing taxa. The easiest way to get a TaxonSetMapping object is to call the special factory function create_contained_taxon_mapping. This will create a new TaxonSet to manage the gene taxa, and create the associations between the gene taxa and the containing tree taxa for you. It takes two arguments: the TaxonSet of the containing tree, and the number of genes you want sampled from each species. If the gene-species associations are more complex, e.g., different numbers of genes per species, we can pass in a list of values as the second argument to ~dendropy.dataobject.taxon.TaxonSetMapping.create_contained_taxon_mapping(). This approach should be used with caution if we cannot be certain of the order of taxa (as is the case with data read in Newick formats). In these case, and in more complex cases, we might need to directly instantiate the TaxonSetMapping object. The API to describe the associations when constructing this object is very similar to that of the TaxonSetPartition object: you can use a function, attribute or dictionary.

The ContainingTree class has its own native contained coalescent simulator, embed_contained_kingman, which simulates and embeds a contained coalescent tree at the same time.

```#! /usr/bin/env python

import dendropy
from dendropy import treesim
from dendropy import reconcile

# simulation parameters and output
num_reps = 10

# population tree descriptions
stepwise_tree_str = "[&R](A:120000,(B:80000,(C:40000,D:40000):40000):40000):100000"
frag_tree_str = "[&R](A:120000,B:120000,C:120000,D:120000):100000"

# taxa and trees
containing_taxa = dendropy.TaxonSet()
stepwise_tree = dendropy.Tree.get_from_string(
stepwise_tree_str,
"newick",
taxon_set=containing_taxa)
frag_tree = dendropy.Tree.get_from_string(
frag_tree_str,
"newick",
taxon_set=containing_taxa)

# taxon set association
genes_to_species = dendropy.TaxonSetMapping.create_contained_taxon_mapping(
containing_taxon_set=containing_taxa,
num_contained=8)

# convert to containing tree
stepwise_tree = reconcile.ContainingTree(stepwise_tree,
contained_taxon_set=genes_to_species.domain_taxon_set,
contained_to_containing_taxon_map=genes_to_species)
frag_tree = reconcile.ContainingTree(frag_tree,
contained_taxon_set=genes_to_species.domain_taxon_set,
contained_to_containing_taxon_map=genes_to_species)

# for each rep
for rep in range(num_reps):
stepwise_tree.embed_contained_kingman(default_pop_size=40000)
frag_tree.embed_contained_kingman(default_pop_size=40000)

# write results

# returns dictionary with contained trees as keys
# and number of deep coalescences as values
stepwise_deep_coals = stepwise_tree.deep_coalescences()
stepwise_out = open("stepwise.txt", "w")
for tree in stepwise_deep_coals:
stepwise_out.write("%d\n" % stepwise_deep_coals[tree])

# returns dictionary with contained trees as keys
# and number of deep coalescences as values
frag_deep_coals = frag_tree.deep_coalescences()
frag_out = open("frag.txt", "w")
for tree in frag_deep_coals:
frag_out.write("%d\n" % frag_deep_coals[tree])
```

If you have used some other method to simulate your trees, you can use embed_tree to embed the trees and count then number of deep coalescences.

```#! /usr/bin/env python

import dendropy
from dendropy import treesim
from dendropy import reconcile

# simulation parameters and output
num_reps = 10

# population tree descriptions
stepwise_tree_str = "[&R](A:120000,(B:80000,(C:40000,D:40000):40000):40000):100000"
frag_tree_str = "[&R](A:120000,B:120000,C:120000,D:120000):100000"

# taxa and trees
containing_taxa = dendropy.TaxonSet()
stepwise_tree = dendropy.Tree.get_from_string(
stepwise_tree_str,
"newick",
taxon_set=containing_taxa)
frag_tree = dendropy.Tree.get_from_string(
frag_tree_str,
"newick",
taxon_set=containing_taxa)

# taxon set association
genes_to_species = dendropy.TaxonSetMapping.create_contained_taxon_mapping(
containing_taxon_set=containing_taxa,
num_contained=8)

# convert to containing tree
stepwise_tree = reconcile.ContainingTree(stepwise_tree,
contained_taxon_set=genes_to_species.domain_taxon_set,
contained_to_containing_taxon_map=genes_to_species)
frag_tree = reconcile.ContainingTree(frag_tree,
contained_taxon_set=genes_to_species.domain_taxon_set,
contained_to_containing_taxon_map=genes_to_species)

# for each rep
for rep in range(num_reps):
gene_tree1 = treesim.contained_coalescent(containing_tree=stepwise_tree,
gene_to_containing_taxon_map=genes_to_species,
default_pop_size=40000)
stepwise_tree.embed_tree(gene_tree1)
gene_tree2 = treesim.contained_coalescent(containing_tree=frag_tree,
gene_to_containing_taxon_map=genes_to_species,
default_pop_size=40000)
frag_tree.embed_tree(gene_tree2)

# write results

# returns dictionary with contained trees as keys
# and number of deep coalescences as values
stepwise_deep_coals = stepwise_tree.deep_coalescences()
stepwise_out = open("stepwise.txt", "w")
for tree in stepwise_deep_coals:
stepwise_out.write("%d\n" % stepwise_deep_coals[tree])

# returns dictionary with contained trees as keys
# and number of deep coalescences as values
frag_deep_coals = frag_tree.deep_coalescences()
frag_out = open("frag.txt", "w")
for tree in frag_deep_coals:
frag_out.write("%d\n" % frag_deep_coals[tree])
```

For more details on simulating contained coalescent trees and counting numbers of deep coalescences on them, see “Contained Coalescent Trees” or “Simulating the Distribution of Number Deep Coalescences Under Different Phylogeographic History Scenarios”.

## 3.3.8. Majority-Rule Consensus Tree from a Collection of Trees¶

To get the majority-rule consensus tree of a TreeList object, you can call the consensus instance method. You can specify the frequency threshold for the consensus tree by the min_freq argument, which default to 0.5 (i.e., a 50% majority rule tree). The following example aggregates the post-burn-in trees from four MCMC samples into a single TreeList object, and prints the 95% majority-rule consensus as a Newick string:

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15``` ```#! /usr/bin/env python import dendropy trees = dendropy.TreeList() for tree_file in ['pythonidae.mb.run1.t', 'pythonidae.mb.run2.t', 'pythonidae.mb.run3.t', 'pythonidae.mb.run4.t']: trees.read_from_path( tree_file, 'nexus', tree_offset=20) con_tree = trees.consensus(min_freq=0.95) print(con_tree.as_string('newick')) ```

## 3.3.9. Frequency of a Split in a Collection of Trees¶

The frequency_of_split method of a TreeList object returns the frequency of occurrence of a single split across all the Tree objects in the TreeList. The split can be specified by passing a split bitmask directly using the split_bitmask keyword argument, as a list of Taxon objects using the taxa keyword argument, or as a list of taxon labels using the labels keyword argument. The following example shows how to calculate the frequency of a split defined by two taxa, “Morelia amethistina” and “Morelia tracyae”, from the post-burn-in trees aggregated across four MCMC samples:

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16``` ```#! /usr/bin/env python import dendropy trees = dendropy.TreeList() for tree_file in ['pythonidae.mb.run1.t', 'pythonidae.mb.run2.t', 'pythonidae.mb.run3.t', 'pythonidae.mb.run4.t']: trees.read_from_path( tree_file, 'nexus', tree_offset=20) split_leaves = ['Morelia amethistina', 'Morelia tracyae'] f = trees.frequency_of_split(labels=split_leaves) print('Frequency of split %s: %s' % (split_leaves, f)) ```

## 3.3.10. Tree Distances¶

### 3.3.10.1. Native Tree Methods¶

New in version 3.2.

The Tree class provides methods for calculating distances between two trees:

symmetric_difference
This method returns the symmetric distance between two trees. The symmetric distance between two trees is the sum of the number of splits found in one of the trees but not the other. It is common to see this statistic called the “Robinson-Foulds distance”, but in DendroPy we reserve this term to apply to the Robinson-Foulds distance in the strict sense, i.e., the weighted symmetric distance (see below).
false_positives_and_negatives
This method returns a tuple pair, with the first element the number of splits in the current Tree object not found in the Tree object to which it is being compared, while the second element is the number of splits in the second Tree object that are not in the current Tree. The sum of these two elements is exactly equal to the value reported by symmetric_distance.
euclidean_distance
This method returns the “branch length distance” of Felsenstein (2004), i.e. the sum of absolute differences in branch lengths for equivalent splits between two trees, with the branch length for a missing split taken to be 0.0.
robinson_foulds_distance
This method returns the Robinsons-Foulds distance between two trees, i.e., the sum of the square of differences in branch lengths for equivalent splits between two trees, with the branch length for a missing split taken to be 0.0.

For example:

```>>> import dendropy
>>> s1 = "((t5:0.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):0.028969):0.065840,t3:0.170221):0.383247)"
>>> s2 = "((t5:2.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):1):0.065840,t3:0.170221):0.383247)"
>>> tree1 = dendropy.Tree.get_from_string(s1, 'newick')
>>> tree2 = dendropy.Tree.get_from_string(s2, 'newick')
>>> tree1.symmetric_difference(tree2)
0
>>> tree1.false_positives_and_negatives(tree2)
(0, 0)
>>> tree1.euclidean_distance(tree2)
2.2232636377544162
>>> tree1.robinson_foulds_distance(tree2)
2.971031
```

### 3.3.10.2. Using the treecalc Module¶

The treecalc module provides for these operations as independent functions that take two Tree objects as arguments. These independent functions require that both trees have the same TaxonSet reference, otherwise an exception is raised:

```>>> import dendropy
>>> from dendropy import treecalc
>>> s1 = "((t5:0.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):0.028969):0.065840,t3:0.170221):0.383247)"
>>> s2 = "((t5:2.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):1):0.065840,t3:0.170221):0.383247)"
>>> tree1 = dendropy.Tree.get_from_string(s1, 'newick')
>>> tree2 = dendropy.Tree.get_from_string(s2, 'newick')
>>> treecalc.symmetric_difference(tree1, tree2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "treecalc.py", line 240, in symmetric_difference
t = false_positives_and_negatives(tree1, tree2)
File "treecalc.py", line 254, in false_positives_and_negatives
% (hex(id(reference_tree.taxon_set)), hex(id(test_tree.taxon_set))))
TypeError: Trees have different TaxonSet objects: 0x10111ec00 vs. 0x10111eaa0
>>> treecalc.euclidean_distance(tree1, tree2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "treecalc.py", line 236, in euclidean_distance
value_type=value_type)
File "treecalc.py", line 160, in splits_distance
% (hex(id(tree1.taxon_set)), hex(id(tree2.taxon_set))))
TypeError: Trees have different TaxonSet objects: 0x10111ec00 vs. 0x10111eaa0
>>> treecalc.robinson_foulds_distance(tree1, tree2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "treecalc.py", line 223, in robinson_foulds_distance
value_type=float)
File "treecalc.py", line 160, in splits_distance
% (hex(id(tree1.taxon_set)), hex(id(tree2.taxon_set))))
TypeError: Trees have different TaxonSet objects: 0x10111ec00 vs. 0x10111eaa0
>>> tree3 = dendropy.Tree.get_from_string(s2, 'newick', taxon_set=tree1.taxon_set)
>>> treecalc.symmetric_difference(tree1, tree3)
0
>>> treecalc.euclidean_distance(tree1, tree3)
2.2232636377544162
>>> treecalc.robinson_foulds_distance(tree1, tree3)
2.971031
```

#### Next topic

3.4. Tree Manipulation and Restructuring

### Announcements

Join the "DendroPy Announcements" group to receive announcements of new releases, updates, changes and other news of interest to DendroPy users and developers.

Enter your e-mail address in the box above and click the "subscribe" button to subscribe to the "dendropy-announce" group, or click here to visit this group page directly.

### Discussion

Join the "DendroPy Users" group to follow and participate in discussion, troubleshooting, help, information, suggestions, etc. on the usage and development of the DendroPy phylogenetic computing library.

Enter your e-mail address in the box above and click the "subscribe" button to subscribe to the "dendropy-users" group, or click here to visit this group page directly.