Comparing two algorithms
========================


This is a realistic example of using ``nestly`` to examine the performance of two algorithms. Source code to run it is available in ``examples/adcl/``.

We will use the ``min_adcl_tree`` subcommand of the ``rppr`` tool from the
``pplacer`` suite, available from http://matsen.fhcrc.org/pplacer.

This tool chooses ``k`` `representative leaves from a phylogenetic tree`_.
There are two implementations: the **Full** algorithm solves the problem
exactly, while the **PAM** algorithm uses a variation on the `partitioning
among medoids`_ heuristic to find a solution.

We'd like to compare the two algorithms on a variety of trees, using different values for ``k``.

Making the nest
---------------

Setting up the comparison is demonstrated in ``00make_nest.py``, which builds
up combinations of ``(algorithm, tree, k)``:

.. literalinclude:: ../examples/adcl/00make_nest.py
   :language: python
   :linenos:

Running that:

.. code-block:: sh

   $ ./00make_nest.py

Creates a new directory, ``runs``.

Within this directory are subdirectories for each algorithm::

   runs/full
   runs/pam

Each of these contains a directory for each tree used:

.. code-block:: sh

   $ ls runs/pam
   random001  random002  random003  random004  random005


Within each of *these* subdirectories are directories for each choice of ``k``.

.. code-block:: sh

   $ ls runs/pam/random001
   1  101  201  301  401  501  601  701  801  901

These directories are leaves. There is a JSON_ file in each, containing the choices made. For example,
``runs/full/random003/401/control.json`` contains:

.. code-block:: javascript

   {
     "algorithm": "full", 
     "tree": "/home/cmccoy/development/nestly/examples/adcl/trees/random003.tre", 
     "n_leaves": 1000, 
     "k": 401
   }

Running the algorithm
---------------------

The ``nestrun`` command-line tool allows you to run a command for each combination of parameters in a nest.
It allows you to substitute parameters chosen by surrounding them in curly brackets, e.g. ``{algorithm}``.

To see how long, and how much memory each run uses, we'll use the short shell script ``time_rppr.sh``:

.. literalinclude:: ../examples/adcl/time_rppr.sh
   :language: sh
   :linenos:

Note the placeholders for the parameters to be provided at runtime: ``k``, ``tree``, and ``algorithm``.

Running a script like ``time_rppr.sh`` on every experiment within a nest in parallel is facilitated by the ``nestrun`` script distributed with ``nestly``:

.. code-block:: sh

   $ nestrun -j 4 --template-file time_rppr.sh -d runs

(this will take awhile)

This command runs the shell script ``time_rppr.sh`` for each parameter choice, substituting the appropriate parameters.
The ``-j 4`` flag indicates that 4 processors should be used.

Aggregating results
-------------------

Now we have a little CSV file in each leaf directory, containing the running time::

    |----------+--------+-------------|
    |  elapsed | maxmem | exitstatus  |
    |----------+--------+-------------|
    |  17.78   | 471648 | 0           |
    |----------+--------+-------------|

To analyze these en-masse, we need to combine them and add information about the parameters used to generate them. The ``nestagg`` script does just this.

.. code-block:: sh

   $ nestagg delim -d runs -o results.csv time.csv -k algorithm,k,tree

Where ``-d runs`` indicates the directory containing program runs; ``-o
results.csv`` specifies where to write the output; ``time.csv`` gives the name
of the file in each leaf directory, and ``-k algorithm,k,tree`` lists the
parameters to add to each row of the CSV files.

Looking at ``results.csv``::

     |----------+---------+------------+-----------+---------------------------------------+------|
     |  elapsed | maxmem  | exitstatus | algorithm | tree                                  | k    |
     |----------+---------+------------+-----------+---------------------------------------+------|
     |  17.04   | 941328  | 0          | full      | .../examples/adcl/trees/random001.tre | 1    |
     |  20.86   | 944336  | 0          | full      | .../examples/adcl/trees/random001.tre | 101  |
     |  31.75   | 944320  | 0          | full      | .../examples/adcl/trees/random001.tre | 201  |
     |  39.34   | 980048  | 0          | full      | .../examples/adcl/trees/random001.tre | 301  |
     |  37.84   | 1118960 | 0          | full      | .../examples/adcl/trees/random001.tre | 401  |
     |  42.15   | 1382000 | 0          | full      | .../examples/adcl/trees/random001.tre | 501  |
     etc

Now we have something we can look at!

.. image:: _static/adcl_plot_result.png

So: PAM is faster for large ``k``, and always has lower maximum memory use.

(generated by ``examples/adcl/03analyze.R``)

.. _`partitioning among medoids`: http://en.wikipedia.org/wiki/K-medoids

.. _JSON: http://www.json.org

.. _`representative leaves from a phylogenetic tree`: http://matsen.fhcrc.org/general/2012/05/31/adcl-paper.html