Pydron - Semi-automatic parallelization
=============================================================

Pydron takes sequential Python code and runs it in parallel on multiple
cores, clusters, or cloud infrastructures.

----------------------
Let's get going!
----------------------

.. code-block:: python

    import pydron

    @pydron.schedule
    def calibration_pipeline(inputs):
        outputs = []
        for input in inputs:
            output = process(input)
            outputs = outputs + [output]
        return outputs

    @pydron.functional
    def process(input):
        ...

This will run all `process` calls in parallel.

.. _api:

----------------------
API
----------------------

.. py:module:: pydron

Pydron's API consists of only two decorators:

.. py:decorator:: pydron.schedule

   Pydron will only parallelize code with this decorator. All other code
   will run as usual. The function may contain arbitrary code, with the
   exception of `try`/`except`, `try`/`finally`, and `with` statements,
   which are not yet implemented.

   While arbitrary code is allowed and should (in theory at least; there
   are bugs) produce correct results, not all code can be made to run in
   parallel. See :ref:`best-practices` for guidelines.

.. py:decorator:: pydron.functional

   Pydron can only run function calls in parallel if the functions have
   this decorator. The decorator defines a contract between the developer
   and Pydron. The developer may decorate a function with `functional` if
   the following conditions are met:

   * The function does not assign global variables or have any other kind
     of side effect.

   * If the function reads global variables, they must have been unchanged
     since the module was loaded. This is typically the case for classes
     and functions defined in a module. What is excluded is access to
     dynamic state stored globally.

   * All arguments passed to the function must be serializable with
     `pickle`.

   * The return value or the thrown exception must be serializable with
     `pickle`.

   * The function does not modify the objects that are passed as
     arguments. Often you can return modified copies instead, as the
     sketch below shows.
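   For instance, a function that needs to "modify" its argument can comply
   with the contract by returning a new object instead. This is a minimal
   sketch; the `scale` function is a made-up illustration (not part of
   Pydron), assuming a `numpy` array or similar is passed in:

   .. code-block:: python

      import pydron

      @pydron.functional
      def scale(data, factor):
          # Complies with the contract: `data` is left untouched and a
          # new object is returned instead.
          return data * factor

      # A variant like this would violate the contract, because it
      # modifies the object passed as an argument:
      #
      #     @pydron.functional
      #     def scale(data, factor):
      #         data *= factor   # in-place modification
      #         return data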
.. _best-practices:

----------------------
Best practices
----------------------

Pydron can run two function calls in parallel if the following conditions
are met:

* The return value of one is not an argument of the other, directly or
  indirectly.

* Both have the :func:`pydron.functional` decorator.

* The code which, in a sequential execution, would run between the two
  calls must be free of `sync-points` (see below).

^^^^^^^^^^^^^^^^^^^^^^
Synchronization points
^^^^^^^^^^^^^^^^^^^^^^

A `sync-point` is an operation that Pydron cannot reason about. It
therefore executes that operation at the same 'time' as it would in a
sequential execution. That is, every operation that comes before must have
finished, and all operations that come afterwards have to wait. A single
`sync-point` inside a loop forces the iterations to run one after the
other, making parallelism impossible. Therefore `sync-points` should be
avoided.

The following operations cause a `sync-point`:

* Calls to functions without the :func:`pydron.functional` decorator.
  Currently, this includes pretty much all built-in functions and
  functions from libraries since I haven't populated the white-lists yet.

* Operations that modify an object. These include:

  * Assigning an attribute: `obj.x = ...`
  * Assigning a subscript: `obj[i] = ...`
  * Augmented assignment: `obj += ...`

The last one might be a bit surprising, since `x += 1` is often assumed to
be identical to `x = x + 1`. But this might not be the case. Some types,
including `list` and `numpy.ndarray`, perform the operation in-place,
modifying the object instead of creating a new one. Pydron currently
treats all augmented assignments as sync-points, even for data types such
as integers.
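This is also why the introductory example builds its result with
`outputs = outputs + [output]`: it creates a new list in every iteration
instead of modifying the existing one. A minimal sketch of the difference,
reusing the `@pydron.functional` `process` function from the introduction:

.. code-block:: python

    import pydron

    @pydron.schedule
    def calibration_pipeline(inputs):
        # `process` is the @pydron.functional function from the
        # introductory example.
        outputs = []
        for input in inputs:
            # Creates a new list, so the iterations can run in parallel:
            outputs = outputs + [process(input)]

            # Either of these would introduce a sync-point and force the
            # iterations to run one after the other:
            #
            #     outputs.append(process(input))  # method without @functional
            #     outputs += [process(input)]     # augmented assignment
        return outputs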
----------------------
Configuration
----------------------

Pydron uses a configuration file to figure out where to run the parallel
tasks. This configuration file must be named `pydron.conf` and is searched
for in the following locations, in order:

* Path stored in the environment variable `PYDRON_CONF`
* The current working directory
* The user's home directory
* `/etc/pydron.conf`

The configuration file is in JSON format.

^^^^^^^^^^^^^^^^^^^^^
Multi-Core Setup
^^^^^^^^^^^^^^^^^^^^^

A typical configuration file to use multiple cores on the local machine
would look like this::

    {
        "workers": [
            {
                "type": "multicore",
                "cores": 4
            }
        ]
    }

This will start four additional Python interpreters on the local machine
when the `@schedule` decorated function is invoked. It will also terminate
them afterwards.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Remote workers with SSH
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To use more than just the cores available locally, we can also connect to
a remote machine::

    {
        "workers": [
            {
                "type": "ssh",
                "hostname": "node1.mydomain.tld",
                "username": "myusername",
                "password": "mypassword",
                "tmp_dir": "/tmp",
                "cores": 4
            }
        ]
    }

This will start four Python interpreters on `node1.mydomain.tld`.

Pydron uses TCP connections to talk to the processes it starts.
Connections are potentially opened from and to the node. If there is a
firewall between your workstation and the nodes, or between nodes, this
will typically cause problems.

To use more than just one remote machine, put multiple entries into the
`workers` configuration section::

    {
        "workers": [
            {
                "type": "ssh",
                "hostname": "node1.mydomain.tld",
                "username": "myusername",
                "password": "mypassword",
                "tmp_dir": "/tmp",
                "cores": 4
            },
            {
                "type": "ssh",
                "hostname": "node2.mydomain.tld",
                "username": "myusername",
                "password": "mypassword",
                "tmp_dir": "/tmp",
                "cores": 4
            },
            {
                "type": "ssh",
                "hostname": "node3.mydomain.tld",
                "username": "myusername",
                "password": "mypassword",
                "tmp_dir": "/tmp",
                "cores": 4
            }
        ]
    }

In this example, 12 cores would be used.

If things go wrong, it can happen that processes keep running on the
remote machine. There are a few safety mechanisms in place to avoid this,
but they are not perfect. I recommend checking occasionally if there are
some processes left over, especially after a run hasn't completed cleanly.

^^^^^^^^^^^^^^^^^^^^^
Cloud Computing
^^^^^^^^^^^^^^^^^^^^^

You can also let Pydron start instances in the cloud. Pydron uses Apache
libcloud, so a wide range of cloud providers should be supported. I have
only tried this with Amazon EC2, though.

There are a few important issues you need to be aware of:

* Pydron cannot guarantee that the instances will be terminated in all
  cases. It will try, but there is always a risk that instances are left
  behind. **THESE WILL STILL COST YOU MONEY**. Check afterwards if
  everything was terminated properly. Accidentally running hundreds of big
  instances over the weekend will get really expensive. You've been
  warned. Don't send the bill to me.

* The instances are started when the `@schedule` decorated function is
  invoked, and a 'best effort' attempt is made to terminate them at the
  end. Depending on the cloud provider, they might not charge you by the
  minute but by the hour or by some other interval. If the execution takes
  only five minutes, you might still end up paying for a longer period.
  Ideally, Pydron would keep the instances running so that you could do
  several runs without starting and stopping the instances every time.
  This has not yet been implemented.

* The same disclaimer about firewalls as with SSH above applies. Since all
  EC2 instances are behind a network-address-translating router and only
  get internal IPs, this is a particular problem. It works fine, though,
  if the machine from which Pydron is used is also an EC2 instance. This
  has the additional benefit of reducing the latency significantly.

The configuration file for EC2 will look like this::

    {
        "workers": [
            {
                "type": "cloud",
                "username": "root",
                "provider": "ec2_us_east",
                "accesskeyid": "ABCDEF.....",
                "accesskey": "AbDef012+Z...",
                "imageid": "ami-abcdef",
                "sizeid": "t1.micro",
                "publickey": "C:\\...\\id_rsa.pub",
                "privatekey": "C:\\...\\id_rsa",
                "tmp_dir": "/tmp",
                "count": 10
            }
        ]
    }

This will start ten `t1.micro` instances with the `ami-abcdef` machine
image. We don't currently provide an image, but any Linux image able to
run Python 2.7 should do. Pydron will connect with SSH, log in as `root`
with the given `privatekey`, and run a command like this::

    /usr/bin/env python -c "..."

It will also upload a file to `tmp_dir` with SFTP.

It's a good idea to have Pydron installed in the image. It's not strictly
required, but it will ensure that the required libraries are in place.
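Because Pydron drives the cloud through libcloud, you can also use
libcloud directly to check for instances that were left behind after a run
that didn't complete cleanly (see the warning above). This is a minimal
sketch, not part of Pydron's API, assuming a recent libcloud where the EC2
driver takes a `region` argument (older versions used per-region providers
such as `Provider.EC2_US_EAST` instead):

.. code-block:: python

    from libcloud.compute.providers import get_driver
    from libcloud.compute.types import Provider

    # Same credentials as in the configuration file above.
    cls = get_driver(Provider.EC2)
    driver = cls("ABCDEF.....", "AbDef012+Z...", region="us-east-1")

    # List all instances visible to these credentials.
    for node in driver.list_nodes():
        print("%s %s" % (node.name, node.state))
        # node.destroy()  # uncomment to terminate a leftover instance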