Comparing two algorithms¶
This is a realistic example of using nestly
to examine the performance of two algorithms. Source code to run it is available in examples/adcl/
.
We will use the min_adcl_tree
subcommand of the rppr
tool from the
pplacer
suite, available from http://matsen.fhcrc.org/pplacer.
This tool chooses k
representative leaves from a phylogenetic tree.
There are two implementations: the Full algorithm solves the problem
exactly, while the PAM algorithm uses a variation on the partitioning
among medoids heuristic to find a solution.
We’d like to compare the two algorithms on a variety of trees, using different values for k
.
Making the nest¶
Setting up the comparison is demonstrated in 00make_nest.py
, which builds
up combinations of (algorithm, tree, k)
:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 | #!/usr/bin/env python
# This example compares runtimes of two implementations of
# an algorithm to minimize the average distance to the closest leaf
# (Matsen et. al., accepted to Systematic Biology).
#
# To run it, you'll need the `rppr` binary on your path, distributed as part of
# the pplacer suite. Source code, or binaries for OS X and 64-bit Linux are
# available from http://matsen.fhcrc.org/pplacer/.
#
# The `rppr min_adcl_tree` subcommand takes a tree, an algorithm name, and
# the number of leaves to keep.
#
# We wish to explore the runtime, over each tree, for various leaf counts.
import glob
from os.path import abspath
from nestly import Nest, stripext
# The `trees` directory contains 5 trees, each with 1000 leaves.
# We want to run each algorithm on all of them.
trees = [abspath(f) for f in glob.glob('trees/*.tre')]
n = Nest()
# We'll try both algorithms
n.add('algorithm', ['full', 'pam'])
# For every tree
n.add('tree', trees, label_func=stripext)
# Store the number of leaves - always 1000 here
n.add('n_leaves', [1000], create_dir=False)
# Now we vary the number of leaves to keep (k)
# Sample between 1 and the total number of leaves.
def k(c):
n_leaves = c['n_leaves']
return range(1, n_leaves, n_leaves // 10)
# Add `k` to the nest.
# This will call k with each combination of (algorithm, tree, n_leaves).
# Each value returned will be used as a possible value for `k`
n.add('k', k)
# Build the nest:
n.build('runs')
|
Running that:
$ ./00make_nest.py
Creates a new directory, runs
.
Within this directory are subdirectories for each algorithm:
runs/full
runs/pam
Each of these contains a directory for each tree used:
$ ls runs/pam
random001 random002 random003 random004 random005
Within each of these subdirectories are directories for each choice of k
.
$ ls runs/pam/random001
1 101 201 301 401 501 601 701 801 901
These directories are leaves. There is a JSON file in each, containing the choices made. For example,
runs/full/random003/401/control.json
contains:
{
"algorithm": "full",
"tree": "/home/cmccoy/development/nestly/examples/adcl/trees/random003.tre",
"n_leaves": 1000,
"k": 401
}
Running the algorithm¶
The nestrun
command-line tool allows you to run a command for each combination of parameters in a nest.
It allows you to substitute parameters chosen by surrounding them in curly brackets, e.g. {algorithm}
.
To see how long, and how much memory each run uses, we’ll use the short shell script time_rppr.sh
:
1 2 3 4 5 6 | #!/bin/sh
export TIME='elapsed,maxmem,exitstatus\n%e,%M,%x'
/usr/bin/time -o time.csv \
rppr min_adcl_tree --algorithm {algorithm} --leaves {k} {tree}
|
Note the placeholders for the parameters to be provided at runtime: k
, tree
, and algorithm
.
Running a script like time_rppr.sh
on every experiment within a nest in parallel is facilitated by the nestrun
script distributed with nestly
:
$ nestrun -j 4 --template-file time_rppr.sh -d runs
(this will take awhile)
This command runs the shell script time_rppr.sh
for each parameter choice, substituting the appropriate parameters.
The -j 4
flag indicates that 4 processors should be used.
Aggregating results¶
Now we have a little CSV file in each leaf directory, containing the running time:
|----------+--------+-------------|
| elapsed | maxmem | exitstatus |
|----------+--------+-------------|
| 17.78 | 471648 | 0 |
|----------+--------+-------------|
To analyze these en-masse, we need to combine them and add information about the parameters used to generate them. The nestagg
script does just this.
$ nestagg delim -d runs -o results.csv time.csv -k algorithm,k,tree
Where -d runs
indicates the directory containing program runs; -o
results.csv
specifies where to write the output; time.csv
gives the name
of the file in each leaf directory, and -k algorithm,k,tree
lists the
parameters to add to each row of the CSV files.
Looking at results.csv
:
|----------+---------+------------+-----------+---------------------------------------+------|
| elapsed | maxmem | exitstatus | algorithm | tree | k |
|----------+---------+------------+-----------+---------------------------------------+------|
| 17.04 | 941328 | 0 | full | .../examples/adcl/trees/random001.tre | 1 |
| 20.86 | 944336 | 0 | full | .../examples/adcl/trees/random001.tre | 101 |
| 31.75 | 944320 | 0 | full | .../examples/adcl/trees/random001.tre | 201 |
| 39.34 | 980048 | 0 | full | .../examples/adcl/trees/random001.tre | 301 |
| 37.84 | 1118960 | 0 | full | .../examples/adcl/trees/random001.tre | 401 |
| 42.15 | 1382000 | 0 | full | .../examples/adcl/trees/random001.tre | 501 |
etc
Now we have something we can look at!
So: PAM is faster for large k
, and always has lower maximum memory use.
(generated by examples/adcl/03analyze.R
)