Author: | Jay Laura, <jlaura@asu.edu> |
---|---|
Revision: | 0.1.2 |
Date: | August 27, 2012 |
Copyright: | This work is in the public domain. |
Abstract: | This document explains how to utilize PyStretch for image analysis and manipulation through a series of processing examples. |
Recent technological advances in High Performance Computing (HPC), and its increased availability on local or cloud-based clusters, have not permeated the scientific community sufficiently to remove the need to process datasets which, if read in their entirety, overwhelm available computing resources. Such a dataset could be a large raster image which cannot be loaded into memory (RAM constrained), or one that requires a complex algorithm which must iterate over an input array multiple times (CPU constrained). Previously, it was necessary to segment or sample data in order to allow desktop machines, often with multiple unused cores, to process it. To that end, PyStretch has been designed to facilitate the analysis and manipulation of raster (image) data which traditionally would have exceeded desktop compute capabilities.
To facilitate processing of large raster images, PyStretch utilizes both a user-defined segmentation scheme and Python’s built-in multiprocessing module. In this way a user avoids two burdens: segmenting their imagery externally (perhaps using GDAL), and waiting through excessively long runtimes as complex filters are applied. While the multiprocessing module may not be the fastest possible implementation of multi-core processing in Python, it does have a few key advantages:
Segmentation is implemented to allow large raster datasets to be internally segmented and loaded into memory piecewise. After the initial segment of the image is processed, the built-in garbage collector handles cleanup and RAM is freed to allow the next segment to be read. In this way PyStretch can handle extremely large images on a small RAM footprint. This is a trade-off, though. As segments get smaller, the overhead of spawning additional processes (multiprocessing) becomes a larger percentage of total compute time. This is especially true on Windows, which has no os.fork() and must therefore start a fresh interpreter for each new process. Additionally, the smaller the segment, the more often the program reads from and writes to disk, which can cause IO bottlenecks due to thrashing. Users are advised to determine empirically which segment sizes work well for their particular data before performing a long run on dozens or hundreds of images.
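The segmentation idea can be illustrated with a minimal sketch. It is not PyStretch's own code: the function name `segment_rows` and the choice to split along rows are assumptions made for illustration. It computes the row ranges that divide an image into a requested number of roughly equal pieces, each of which could then be read and processed on its own.

```python
def segment_rows(n_rows, n_segments):
    """Return (start, stop) row ranges splitting n_rows into n_segments pieces.

    Hypothetical helper for illustration; not part of PyStretch.
    """
    if n_rows < 1 or n_segments < 1:
        raise ValueError("n_rows and n_segments must be positive")
    # Never create more segments than there are rows (avoids empty segments).
    n_segments = min(n_segments, n_rows)
    base, extra = divmod(n_rows, n_segments)
    bounds = []
    start = 0
    for i in range(n_segments):
        # The first `extra` segments each take one additional row.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


if __name__ == "__main__":
    for start, stop in segment_rows(10, 3):
        print("rows %d to %d" % (start, stop - 1))
```

Doubling the number of segments roughly halves the memory needed per segment, but also doubles the number of reads, writes, and process hand-offs, which is the trade-off described above.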
Finally, PyStretch was designed to work with a large array of image formats, data types, and spatial projections. To that end PyStretch strives to fulfill the following tenets:
Support 8- to 32-bit signed and unsigned data types for reading and writing
Support data scaling (default 1 - 255, but user definable)
Propagate a single No Data Value (NDV) if it exists in the input dataset
Allow users to define a NDV in their data
Generate a copy of the input dataset to preserve original data and allow iterative image manipulation
Much of this functionality depends upon the Geospatial Data Abstraction Library (GDAL).
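As a rough illustration of how data scaling and No Data Value propagation can fit together, the sketch below linearly rescales a sequence of pixel values to the default 1 - 255 range while passing any NDV through unchanged. The function name `linear_scale` and its parameters are hypothetical, and plain Python lists stand in for the arrays a real raster library would supply; this is not PyStretch's implementation.

```python
def linear_scale(values, ndv=None, out_min=1, out_max=255):
    """Linearly rescale values to [out_min, out_max], leaving NDV pixels as-is.

    Hypothetical helper for illustration; not part of PyStretch.
    """
    # Statistics are computed from valid pixels only, so the NDV
    # cannot distort the input minimum or maximum.
    valid = [v for v in values if ndv is None or v != ndv]
    if not valid:
        return list(values)
    lo, hi = min(valid), max(valid)
    span = float(hi - lo)
    scaled = []
    for v in values:
        if ndv is not None and v == ndv:
            scaled.append(ndv)       # propagate the No Data Value
        elif span == 0:
            scaled.append(out_min)   # constant image: map everything to out_min
        else:
            fraction = (v - lo) / span
            scaled.append(int(round(out_min + fraction * (out_max - out_min))))
    return scaled


if __name__ == "__main__":
    print(linear_scale([0, 50, 100, -9999], ndv=-9999))
```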
Below are the extended installation instructions for PyStretch. These instructions assume that Python 2.6 or greater is already installed on your machine. Additionally, it is hoped that you are running PyStretch in a 64-bit environment, as this allows the process access to significantly more RAM. In a 32-bit environment, be prepared to further increase the number of segments (thereby reducing the size of each segment), as you are capped at approximately 2GB of RAM per process.
Both 32-bit and 64-bit binaries are available for download via pypi.python.org/PyStretch. Be sure to select the installer (32-bit or 64-bit) that matches the version of Python that you use. If you are unsure:
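To get a feel for how many segments a given memory cap implies, the following sketch estimates a lower bound from the image dimensions. The function name `min_segments` and its parameters are assumptions for illustration, and the estimate covers only the raw pixel data; any working copies made during processing would push the real requirement higher, so a budget well below the cap is prudent.

```python
import math


def min_segments(n_rows, n_cols, n_bands, bytes_per_pixel, budget_bytes):
    """Estimate the fewest segments that keep each one within budget_bytes.

    Hypothetical helper for illustration; not part of PyStretch.
    """
    total_bytes = n_rows * n_cols * n_bands * bytes_per_pixel
    return max(1, int(math.ceil(total_bytes / float(budget_bytes))))


if __name__ == "__main__":
    one_gib = 2 ** 30
    # A single-band 40000 x 40000 image stored as 32-bit values.
    print(min_segments(40000, 40000, 1, 4, one_gib))
```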
Press Start
Type cmd and hit enter
Type:
$ python
Examine the first few lines of output, paying careful attention to whether you are running a 32-bit or a 64-bit version of Python
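If the interpreter banner is ambiguous, the architecture can also be checked from within Python itself. The sketch below relies only on the standard library: the size of a C pointer, as reported by `struct.calcsize("P")`, is 4 bytes on a 32-bit build and 8 bytes on a 64-bit build. The helper name `python_bits` is chosen here for illustration.

```python
import struct
import sys


def python_bits():
    """Return the pointer width, in bits, of the running interpreter."""
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print("Python %s is a %d-bit build" % (sys.version.split()[0], python_bits()))
```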
Download the appropriate installer and install as you would any other package. If the installer is unable to find your python installation in the registry, you either do not have a python installation or installed the wrong architecture binary (32-bit vs. 64-bit).
Finally, the installer places two scripts into a Scripts directory. To allow these scripts to be callable from anywhere, the Scripts directory should be added to your system path and to PYTHONPATH, as described in the steps below. To do this you must have administrative rights on your machine.
Press Start
Right click on Computer and select Properties
In the left hand column select Advanced System Settings
Authenticate
Select Environment Variables
Add the path to the Scripts folder to the end of the existing string of paths, for example:
;C:\Python27\Scripts
Note
Where C: is the drive, Python27 is your version, and Scripts holds the installed scripts
Under System Variables select New...
Set the Variable Name to PYTHONPATH
Set the Variable Value to the path to the Scripts folder, for example:
C:\Python27\Scripts;
Note
Where C: is the drive, Python27 is your version, and Scripts holds the installed scripts
Note
Semi-colons are used to separate items in the path. Make sure not to omit them!
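After editing the environment variables, it can be useful to confirm that the directory really is listed. The sketch below is one way to do so from a newly opened command prompt; the function name `is_on_path` is hypothetical, and the example directory is the one used in the steps above. It splits the path string on the platform separator (a semi-colon on Windows) and compares each entry against the target directory.

```python
import os


def is_on_path(directory, path_value=None, sep=os.pathsep):
    """Return True if directory appears as an entry in a PATH-style string.

    Hypothetical helper for illustration; not part of PyStretch.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    target = os.path.normcase(os.path.normpath(directory))
    # Empty entries arise from leading, trailing, or doubled separators.
    entries = [entry for entry in path_value.split(sep) if entry]
    return any(os.path.normcase(os.path.normpath(entry)) == target
               for entry in entries)


if __name__ == "__main__":
    print(is_on_path(r"C:\Python27\Scripts"))
```

Environment variable changes are only visible to programs started after the change, so open a new command prompt before running the check.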
Installation on OS X or Linux is most easily achieved by utilizing either easy_install or pip. easy_install should ship with the OS X Python installation, and users can simply call the following from a command line to install PyStretch:
$ easy_install pystretch
It is suggested that pip be utilized for package installation as it has a wider range of error handling capabilities. The installation of pip is not PyStretch specific and opens a wide range of available packages to the user. From the command line call the following to install both pip and PyStretch:
$ easy_install pip
$ pip install pystretch
Finally, PyStretch can be installed from source using:
$ python setup.py install
PyStretch attempts to install into the site-packages directory of the default Python installation. That is, the installation of Python which is started when you call python from a command line. Additionally, PyStretch installs two scripts (pystretcher.py and pystretch_test.py) into your bin/ or Scripts/ directory. Assuming these scripts are successfully installed, it is possible to call them via the command line from any directory.
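When several Python installations coexist, it may not be obvious which site-packages and script directories belong to the default one. A minimal way to find out, assuming Python 2.7 or later (where the standard `sysconfig` module is available), is to ask the interpreter that starts when you call python:

```python
import sysconfig

# Installation paths of the interpreter running this script.
paths = sysconfig.get_paths()

if __name__ == "__main__":
    print("site-packages: %s" % paths["purelib"])
    print("scripts:       %s" % paths["scripts"])
```

The reported scripts directory is where script files for that installation are normally placed, and it is the directory that needs to be on your path for them to be callable from anywhere.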