The PyStretch User Manual

Author: Jay Laura, <jlaura@asu.edu>
Revision: 0.1.2
Date: August 27, 2012
Copyright: This work is in the public domain.
Abstract: This document explains how to utilize PyStretch for image analysis and manipulation through a series of processing examples.

Introduction

Recent technological advances and the increased availability of High Performance Computing (HPC), whether in a local cluster or a cloud-based cluster, have not permeated the scientific community sufficiently to eliminate the need to process datasets which, if read in their entirety, overwhelm available computing resources. Such a dataset could be a large raster image which cannot be loaded into memory (RAM constrained), or the workload could be a complex algorithm which must iterate multiple times over an input array (CPU constrained). Previously, it was necessary to segment or sample data in order to allow desktop machines, often with multiple unused cores, to process it. To that end, PyStretch has been designed to facilitate the analysis and manipulation of raster (image) data which traditionally would have exceeded desktop compute capabilities.

Multiprocessing

To facilitate processing of large raster images, PyStretch utilizes both a user-defined segmentation scheme and Python’s built-in multiprocessing module. In this way a user avoids two burdens: externally segmenting their imagery (perhaps using GDAL), and waiting through excessively long runtimes as complex filters are applied. While the multiprocessing module may not be the fastest possible implementation of multi-core processing in Python, it does have a few key advantages:

  • Multiprocessing is built in to Python 2.6 and greater - no additional downloads are necessary
  • Multiprocessing is well documented and tested by the Python community
  • Multiprocessing is able to utilize shared-memory ctypes arrays
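
The shared-memory point above can be illustrated with a minimal sketch. It is not PyStretch's own code: the function names (make_blocks, double_in_parallel) and the doubling operation are hypothetical stand-ins for a real filter. It only shows how a pool of worker processes can operate on one shared ctypes array without copying it to each worker.

```python
import multiprocessing as mp

_shared = None  # set inside each worker process by _init_worker


def _init_worker(shared):
    # Keep the shared buffer in a module-level global so that it is
    # handed to each worker once, rather than pickled with every task.
    global _shared
    _shared = shared


def _double_block(bounds):
    # Hypothetical "filter": double every value in [start, stop).
    start, stop = bounds
    for i in range(start, stop):
        _shared[i] *= 2
    return stop - start


def make_blocks(n_items, n_blocks):
    """Split range(n_items) into at most n_blocks half-open index ranges."""
    step = max(1, -(-n_items // n_blocks))  # ceiling division
    return [(i, min(i + step, n_items)) for i in range(0, n_items, step)]


def double_in_parallel(values, n_workers=2):
    """Double `values` using worker processes and a shared ctypes array."""
    # "d" is the typecode for C doubles. lock=False is acceptable here
    # because no two workers ever write to the same index.
    shared = mp.Array("d", values, lock=False)
    pool = mp.Pool(n_workers, initializer=_init_worker, initargs=(shared,))
    try:
        pool.map(_double_block, make_blocks(len(values), n_workers))
    finally:
        pool.close()
        pool.join()
    return list(shared)
```

When trying this, call double_in_parallel from under an `if __name__ == "__main__":` guard. On Windows the guard is required, because worker processes are started by re-importing the main module.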

Image Segmentation

Segmentation is implemented to allow large raster datasets to be internally segmented and loaded into memory piecewise. After each segment of the image is processed, the built-in garbage collector handles cleanup and RAM is freed to allow the next segment to be read. In this way PyStretch can handle extremely large images with a small RAM footprint. There is a trade-off, though. As segments get smaller, the overhead of spawning additional processes (multiprocessing) becomes a larger percentage of total compute time. This is especially true on Windows, which lacks os.fork() and must therefore start a fresh interpreter for every worker process. Additionally, the smaller the segment, the more often the program reads from and writes to disk, which can cause I/O bottlenecks due to thrashing. Users are advised to determine empirically which segment sizes work well for their particular data before performing a long run on dozens or hundreds of images.
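
As a rough illustration of piecewise loading, the sketch below divides an image into horizontal strips. The function name segment_rows and the row-wise layout are assumptions made for this example; the manual does not state how PyStretch lays out its segments internally.

```python
def segment_rows(n_rows, n_segments):
    """Yield (row_offset, row_count) pairs that together tile n_rows rows."""
    if n_rows < 0 or n_segments < 1:
        raise ValueError("n_rows must be >= 0 and n_segments must be >= 1")
    base, extra = divmod(n_rows, n_segments)
    offset = 0
    for i in range(n_segments):
        # The first `extra` segments each take one additional row.
        count = base + (1 if i < extra else 0)
        if count == 0:
            break  # more segments requested than rows available
        yield offset, count
        offset += count
```

With GDAL's Python bindings, each pair could then be read on its own, for example with band.ReadAsArray(0, row_offset, n_cols, row_count), so that only one strip is held in memory at a time.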

Spatial Data, Formats & No Data Values

Finally, PyStretch was designed to work with a large array of image formats, data types, and spatial projections. To that end PyStretch strives to fulfill the following tenets:

  • Support 8- to 32-bit signed and unsigned data types for reading and writing

  • Support a myriad of input and output data formats including:
    • Geotiff
    • JPEG 2000 - with appropriate plugins to GDAL
    • .cub - USGS’s ISIS3 data cube format
  • Support data scaling (default 1 - 255, but user definable)

  • Propagate a single No Data Value (NDV) if it exists in the input dataset

  • Allow users to define a NDV in their data

  • Generate a copy of the input dataset to preserve original data and allow iterative image manipulation

  • Propagate spatial data contained within the image or image support files (.aux.xml or .tfw for example)
    • Projection
    • Transformation
    • Georeferencing
  • Handle multi and hyper-spectral imagery
    • One caveat: PyStretch currently processes all bands. User-specified band processing is on the TODO list.

Much of this functionality depends upon the Geospatial Data Abstraction Library (GDAL).
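
To make the interaction between data scaling and No Data Values concrete, here is a minimal pure-Python sketch. The function linear_scale is hypothetical and is not PyStretch's API. In practice this arithmetic would be done on NumPy arrays read through GDAL, but the logic is the same: compute the stretch from valid pixels only and pass the NDV through untouched.

```python
def linear_scale(values, ndv=None, out_min=1, out_max=255):
    """Linearly rescale values to [out_min, out_max], propagating ndv."""
    valid = [v for v in values if v != ndv]
    if not valid:
        return list(values)  # only no-data pixels: nothing to scale
    lo, hi = min(valid), max(valid)
    span = float(hi - lo)
    scaled = []
    for v in values:
        if v == ndv:
            scaled.append(ndv)  # the NDV is copied to the output unchanged
        elif span == 0:
            scaled.append(out_min)  # constant image: avoid dividing by zero
        else:
            fraction = (v - lo) / span
            scaled.append(int(round(out_min + fraction * (out_max - out_min))))
    return scaled
```

For example, linear_scale([0, 50, 100, -9999], ndv=-9999) maps the three valid pixels onto the default 1 - 255 range and leaves the -9999 pixel as it was.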

Installation

Below are the extended installation instructions for PyStretch. These instructions assume that Python 2.6 or greater is already installed on your machine. It is also recommended that you run PyStretch in a 64-bit environment, as this gives each process access to significantly more RAM. In a 32-bit environment, be prepared to increase the number of segments (thereby reducing the size of each segment), as you are capped at approximately 2GB of RAM per process.
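
A back-of-the-envelope estimate can help pick a starting segment count for a 32-bit environment. The helper below is a hypothetical sketch, not part of PyStretch. Its working_copies parameter is an assumption that roughly accounts for the input and output arrays both being resident, and real memory use will vary with the filter applied.

```python
def min_segments(n_rows, n_cols, n_bands, bytes_per_pixel,
                 budget_bytes=2 * 1024 ** 3, working_copies=2):
    """Return the smallest segment count whose segments fit in the budget."""
    # Assumed memory model: every pixel is held `working_copies` times.
    total = n_rows * n_cols * n_bands * bytes_per_pixel * working_copies
    # Ceiling division using integers only.
    return max(1, -(-total // budget_bytes))
```

For a single-band 40000 x 40000 image of 32-bit pixels, min_segments(40000, 40000, 1, 4) suggests at least 6 segments under the 2GB cap. Treat this as a lower bound and tune it empirically.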

Windows

Both 32-bit and 64-bit binaries are available for download via pypi.python.org/PyStretch. Be sure to select the installer (32-bit or 64-bit) that matches the version of Python that you use. If you are unsure:

  • Press Start

  • Type cmd and hit enter

  • Type:

    $ python
  • Examine the first few lines of output, paying careful attention to whether you are running a 32-bit or a 64-bit version of Python
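
If the interpreter banner is ambiguous, the architecture can also be checked from within Python. The short sketch below (the function name python_bits is illustrative) uses only the standard library and reports the pointer width of the running interpreter, which is what determines which installer is needed.

```python
import platform
import struct


def python_bits():
    """Return the pointer width of the running interpreter, in bits."""
    # "P" is the struct format code for a C void pointer.
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print("%d-bit Python %s" % (python_bits(), platform.python_version()))
```

Note that this reports the architecture of Python itself, not of the operating system: a 32-bit Python on 64-bit Windows still needs the 32-bit installer.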

Download the appropriate installer and install as you would any other package. If the installer is unable to find your Python installation in the registry, you either do not have a Python installation or downloaded the binary for the wrong architecture (32-bit vs. 64-bit).

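Finally, the installer places two scripts into a Scripts directory. To allow these scripts to be callable from anywhere, add that directory to your PYTHONPATH as described below. Because the Windows command prompt locates programs through the PATH variable rather than PYTHONPATH, the same directory may also need to be appended to PATH. You must have administrative rights on your machine to change system variables.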

  • Press Start

  • Right click on Computer and select Properties

  • In the left hand column select Advanced System Settings

  • Authenticate

  • Select Environment Variables

  • If PYTHONPATH exists in System Variables:
    • Append the path to the Scripts folder to the end of the existing string of paths, for example:

      ;C:\Python27\Scripts

Note

Where C is the drive, Python27 is your version, and Scripts holds the installed scripts

  • Otherwise:
    • Under System Variables select New...

    • Set the Variable Name to PYTHONPATH

    • Set the Variable Value to the path to the Scripts folder, for example:

      C:\Python27\Scripts;

Note

Where C is the drive, Python27 is your version, and Scripts holds the installed scripts

Note

Semi-colons are used to separate items in the path. Make sure not to omit them!
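
After editing the variable, it can be useful to confirm that the entry was parsed as intended. The following sketch is a hypothetical helper, not something shipped with PyStretch. It splits an environment variable on the platform's path separator (a semi-colon on Windows) and checks for a given directory. Open a new command prompt first, since existing prompts keep the old environment.

```python
import os


def dir_in_path_variable(directory, variable="PYTHONPATH", environ=None):
    """Return True if `directory` is one of the entries in `variable`."""
    environ = os.environ if environ is None else environ
    entries = environ.get(variable, "").split(os.pathsep)

    def normalize(path):
        # Make the comparison tolerant of case (on Windows) and of
        # trailing separators.
        return os.path.normcase(os.path.normpath(path))

    target = normalize(directory)
    return any(normalize(entry) == target for entry in entries if entry)
```

For example, dir_in_path_variable(r"C:\Python27\Scripts") checks PYTHONPATH, and passing variable="PATH" checks PATH instead.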

Installing on OS X or Linux

Installation on OS X or Linux is most easily achieved by utilizing either easy_install or pip. easy_install should ship with the OS X Python installation, and users can simply call the following from a command line to install PyStretch:

$ easy_install pystretch

It is suggested that pip be utilized for package installation, as it has a wider range of error handling capabilities. pip is not specific to PyStretch; installing it gives the user access to a wide range of other packages. From the command line, call the following to install both pip and PyStretch:

$ easy_install pip
$ pip install pystretch

Finally, PyStretch can be installed from source using:

$ python setup.py install

Installation directories

PyStretch attempts to install into the site-packages directory of the default Python installation, that is, the installation which is launched when you call python from a command line. Additionally, PyStretch installs two scripts (pystretcher.py and pystretch_test.py) into your bin/ or Scripts/ directory. Assuming these scripts are successfully installed, it is possible to call them via the command line from any directory.
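
To see which directories the default interpreter would use, the standard library can be queried directly. The sketch below is illustrative only (install_dirs is a made-up name) and relies on the sysconfig module, which is available in Python 2.7 and later but not in 2.6.

```python
import sysconfig


def install_dirs():
    """Return the site-packages and scripts directories of this interpreter."""
    paths = sysconfig.get_paths()
    return {"site-packages": paths["purelib"], "scripts": paths["scripts"]}


if __name__ == "__main__":
    for name, path in sorted(install_dirs().items()):
        print("%s: %s" % (name, path))
```

Running this with the same python command used for installation shows where the package and its two scripts should be expected.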