User Manual
Introduction
Geospatial data typically comes in one of two data models:
- rasters, which are similar to images, with a regular grid of pixels whose values represent some spatial phenomenon (e.g. elevation), and
- vectors, which are entities with discrete geometries (e.g. state boundaries).
This software, rasterstats, exists solely to extract information from geospatial raster data based on vector geometries.
Primarily, this involves zonal statistics: a method of summarizing and aggregating the raster values intersecting a vector geometry. For example, zonal statistics provides answers such as the mean precipitation or maximum elevation of an administrative unit. Additionally, functions are provided for point queries, most notably the ability to query a raster at a point and get an interpolated value rather than the simple nearest pixel.
Basic Example
The typical usage of rasterstats functions involves two arguments, a vector and a raster dataset:
>>> from rasterstats import zonal_stats, point_query
>>> stats = zonal_stats('tests/data/polygons.shp', 'tests/data/slope.tif')
>>> pts = point_query('tests/data/points.shp', 'tests/data/slope.tif')
zonal_stats gives us a list of two dictionaries corresponding to each input polygon:
>>> from pprint import pprint
>>> pprint(stats)
[{'count': 75,
  'max': 22.273418426513672,
  'mean': 14.660084635416666,
  'min': 6.575114727020264},
 {'count': 50,
  'max': 82.69043731689453,
  'mean': 56.60576171875,
  'min': 16.940950393676758}]
while point_query gives us a list of raster values corresponding to each input point:
>>> pts
[14.037668283186257, 33.1370268256543, 36.46848854950241]
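The fractional values above come from the default behavior of point_query, which interpolates bilinearly between the surrounding pixel centers. A minimal sketch of asking for the raw nearest-pixel value instead, assuming the interpolate keyword in your installed version:
>>> pts_nearest = point_query('tests/data/points.shp',
...                           'tests/data/slope.tif',
...                           interpolate='nearest')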
Vector Data Sources
The most common use case is having vector data sources in a file such as an ESRI Shapefile or any other format supported by fiona. The path to the file can be passed in directly as the first argument:
>>> zs = zonal_stats('tests/data/polygons.shp', 'tests/data/slope.tif')
If you have multi-layer sources, you can specify the layer by either name or index:
>>> zs = zonal_stats('tests/data', 'tests/data/slope.tif', layer="polygons")
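Layer indexes are 0-based (as in fiona), so assuming the polygons layer is the first layer in that data source, the equivalent call by index would be:
>>> zs = zonal_stats('tests/data', 'tests/data/slope.tif', layer=0)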
In addition to the basic usage above, rasterstats supports other mechanisms of specifying vector geometries. The vector argument can be an iterable of GeoJSON-like features, such as a fiona source:
>>> import fiona
>>> with fiona.open('tests/data/polygons.shp') as src:
...     zs = zonal_stats(src, 'tests/data/slope.tif')
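The iterable does not have to come from a file; plain GeoJSON-like mappings work as well. A minimal sketch using a point geometry built by hand:
>>> geom = {'type': 'Point', 'coordinates': (245000, 1000000)}
>>> vals = point_query([geom], 'tests/data/slope.tif')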
You can also pass in an iterable of Python objects that support the __geo_interface__ (e.g. Shapely, ArcPy, PyShp, GeoDjango):
>>> from shapely.geometry import Point
>>> pt = Point(245000, 1000000)
>>> pt.__geo_interface__
{'type': 'Point', 'coordinates': (245000.0, 1000000.0)}
>>> point_query([pt], 'tests/data/slope.tif')
[21.32739672330894]
Strings in well known text (WKT) and well known binary (WKB) formats are also supported:
>>> pt.wkt
'POINT (245000 1000000)'
>>> point_query([pt.wkt], 'tests/data/slope.tif')
[21.32739672330894]
>>> pt.wkb
'\x01\x01\x00\x00\x00\x00\x00\x00\x00@\xe8\rA\x00\x00\x00\x00\x80\x84.A'
>>> point_query([pt.wkb], 'tests/data/slope.tif')
[21.32739672330894]
Raster Data Sources
Any format that can be read by rasterio is supported by rasterstats. To test if a data source is supported by your installation (this might differ depending on the format support of the underlying GDAL library), use the rio command line tool:
$ rio info raster.tif
You can specify the path to the raster directly:
>>> zs = zonal_stats('tests/data/polygons.shp', 'tests/data/slope.tif')
If the raster contains multiple bands, you can specify which band to use (1-indexed; the first band is used by default):
>>> zs = zonal_stats('tests/data/polygons.shp', 'tests/data/slope.tif', band=1)
Or you can pass a numpy ndarray with an affine transform mapping the array dimensions to a coordinate reference system:
>>> import rasterio
>>> with rasterio.open('tests/data/slope.tif') as src:
...     affine = src.transform
...     array = src.read(1)
>>> zs = zonal_stats('tests/data/polygons.shp', array, affine=affine)
Zonal Statistics
Statistics
By default, the zonal_stats function will return the following statistics:
- min
- max
- mean
- count
Optionally, the following statistics are also available:
- sum
- std
- median
- majority
- minority
- unique
- range
- nodata
- percentile (see note below for details)
You can specify the statistics to calculate using the stats argument:
>>> stats = zonal_stats("tests/data/polygons.shp",
... "tests/data/slope.tif",
... stats=['min', 'max', 'median', 'majority', 'sum'])
You can also specify the statistics as a space-delimited string:
>>> stats = zonal_stats("tests/data/polygons.shp",
... "tests/data/slope.tif",
... stats="min max median majority sum")
Note that certain statistics (majority, minority, and unique) require significantly more processing due to the expensive counting of unique occurrences for each pixel value.
You can also use a percentile statistic by specifying percentile_<q>, where <q> can be a floating point number between 0 and 100.
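For example, to request the median alongside the 90th percentile in a single call (each output dictionary should then carry a median and a percentile_90 key):
>>> stats = zonal_stats("tests/data/polygons.shp",
...                     "tests/data/slope.tif",
...                     stats="median percentile_90")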
User-defined Statistics
You can define your own aggregate functions using the add_stats argument. This is a dictionary with the name(s) of your statistic as keys and the function(s) as values. For example, to reimplement the mean statistic:
>>> from __future__ import division
>>> import numpy as np
>>> def mymean(x):
...     # x is the masked array of pixel values for a single feature
...     return np.ma.mean(x)
then use it in your zonal_stats call like so:
>>> zonal_stats("tests/data/polygons.shp",
... "tests/data/slope.tif",
... stats="count",
... add_stats={'mymean':mymean})
[{'count': 75, 'mymean': 14.660084635416666}, {'count': 50, 'mymean': 56.605761718750003}]
GeoJSON output
If you want to retain the geometries and properties of the input features, you can output a list of GeoJSON features using geojson_out. The features contain the zonal statistics as additional properties:
>>> stats = zonal_stats("tests/data/polygons.shp",
... "tests/data/slope.tif",
... geojson_out=True)
>>> stats[0]['type']
'Feature'
>>> stats[0]['properties'].keys()
[u'id', 'count', 'max', 'mean', 'min']
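Because the result is a list of GeoJSON features, it can be wrapped in a FeatureCollection and written out with the standard json module. A minimal sketch (the output filename is just an example; depending on your versions you may first need to cast numpy scalar values to plain Python numbers):
>>> import json
>>> collection = {'type': 'FeatureCollection', 'features': stats}
>>> with open('slope_stats.geojson', 'w') as dst:  # example output path
...     json.dump(collection, dst)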
Rasterization Strategy
There is no right or wrong way to rasterize a vector. The default strategy is to include all pixels along the line render path (for lines), or cells where the center point is within the polygon (for polygons). Alternatively, you can opt for the all_touched strategy, which rasterizes the geometry by including all pixels that it touches. You can enable this by specifying:
>>> zs = zonal_stats("tests/data/polygons.shp",
... "tests/data/slope.tif",
... all_touched=True)
The figure above illustrates the difference: the default all_touched=False is on the left, while the all_touched=True option is on the right.
Both approaches are valid and there are tradeoffs to consider. Using the default rasterizer may miss polygons that are smaller than your cell size, resulting in None stats for those geometries. Using the all_touched strategy includes many cells along the edges that may not be representative of the geometry and may give severely biased results in some cases.
Working with categorical rasters
You can treat rasters as categorical (i.e. raster values represent discrete classes) if you’re only interested in the counts of unique pixel values.
For example, you may have a raster vegetation dataset and want to summarize vegetation by polygon. Statistics such as mean, median, sum, etc. don't make much sense in this context (what's the sum of oak + grassland?).
Using categorical, the output is a dictionary with the unique raster values as keys and pixel counts as values:
>>> zonal_stats('tests/data/polygons.shp',
... 'tests/data/slope_classes.tif',
... categorical=True)[1]
{1.0: 1, 2.0: 9, 5.0: 40}
rasterstats will report using the pixel values as keys. To associate the pixel values with their appropriate meaning, you can use a category_map:
>>> cmap = {1.0: 'low', 2.0: 'med', 5.0: 'high'}
>>> zonal_stats('tests/data/polygons.shp',
... 'tests/data/slope_classes.tif',
... categorical=True, category_map=cmap)[1]
{'high': 40, 'med': 9, 'low': 1}
“Mini-Rasters”
Internally, we create a masked raster dataset for each feature in order to calculate statistics. Optionally, we can include these data in the output of zonal_stats using the raster_out argument:
>>> zonal_stats('tests/data/polygons.shp',
... 'tests/data/slope_classes.tif',
... stats="count",
... raster_out=True)[0].keys()
['count', 'mini_raster_affine', 'mini_raster_array', 'mini_raster_nodata']
Notice we have three additional keys:
- mini_raster_array: the clipped and masked numpy array
- mini_raster_affine: the transformation as an Affine object
- mini_raster_nodata: the nodata value
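The clipped array and its affine transform can be fed straight back into numpy or other raster tooling. A rough sketch that recomputes a mean by hand (the variable names are just for illustration, and the value should agree with the regular mean statistic):
>>> import numpy as np
>>> result = zonal_stats('tests/data/polygons.shp',
...                      'tests/data/slope.tif',
...                      stats='mean',
...                      raster_out=True)[0]
>>> clipped = result['mini_raster_array']    # masked numpy array for the first feature
>>> recomputed = float(np.ma.mean(clipped))  # should match result['mean']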
Keep in mind that having ndarrays in your stats dictionary means it is more difficult to serialize to JSON and other text formats.
Design Goals
rasterstats aims to do only one thing well: getting information from rasters based on vector geometry. This module doesn't support coordinate reprojection, raster resampling, geometry manipulations, or any other geospatial data transformations, as those are better left to other Python packages. To the extent possible, data input is handled by fiona and rasterio, though there are some wrapper functions for IO to maintain usability. Where interoperability between packages is needed, loose coupling, simple Python data structures, and standard interfaces like GeoJSON are employed to keep the core library lean.
History
This work grew out of a need for a native Python implementation (based on numpy) of zonal statistics. I had been using starspan, a C++ command line tool, as well as GRASS's r.statistics for many years. They were suitable for offline analyses but were rather clunky to deploy in a large Python application. In 2013, I implemented a proof-of-concept zonal stats function which eventually became rasterstats. It has been in production in several large Python web applications ever since, replacing the starspan wrapper madrona.raster_stats.