.. index::
   single: tutorial, PyHIS
   single: PyHIS tutorial

************
PyHIS Basics
************

Introduction
============

This tutorial demonstrates how to retrieve data from the CUAHSI-HIS system
using PyHIS. It covers the basics of data access, the available convenience
functions and some simple scripts using the toolkit.

This tutorial requires PyHIS to be installed on your computer
(http://pypi.python.org/pypi/pyhis). Screenshots and examples in this
tutorial are shown using the IPython (http://ipython.org/) interactive Python
prompt for convenience, but IPython is not required to use PyHIS. A Python
code editor or IDE is recommended for writing scripts or viewing code.

This tutorial follows the convention below. A basic familiarity with Python
is useful, but the examples have been chosen for readability and a
non-Python user should be able to follow the tutorial.

Code snippet to be run in the IPython or Python prompt::

    a = 5
    b = 9
    print a + b

    # Output:
    #
    # In [1]: a = 5
    # In [2]: b = 9
    # In [3]: print a + b
    # 14

Lines starting with a ``#`` are comment lines. The section following
``# Output:`` demonstrates the expected output of the code snippet. You can
follow along by cutting and pasting each code snippet into the IPython
prompt.

Installing PyHIS
================

PyHIS is available on the Python Package Index and can be installed on most
systems using either ``easy_install pyhis`` or ``pip install pyhis``. PyHIS
uses several scientific Python plotting and analysis libraries. Although
easy_install and pip will attempt to install these automatically, it is
better to install a scientific Python distribution such as PythonXY
(http://www.pythonxy.com, Windows only) or EPD Free, the free Enthought
Python Distribution (http://enthought.com/products/epd_free.php,
Windows/Linux/Mac OS X), prior to installing PyHIS.

Initiating PyHIS
================

To start off, let's import pyhis into Python::

    import pyhis

    # Output:
    #
    # In [1]: import pyhis
    # cache initialized with database: sqlite:////var/folders/j0/3qhzgkzn7xgdzh0488y4_b340000gn/T/pyhis_cache.db

This command makes pyhis and all of its abilities available to Python. If
you look at the output above, you can see that by default PyHIS has created
a sqlite database to automatically cache (store locally) any data and
metadata you access with PyHIS. If you would like to store the data
somewhere else, you can provide a path and name for your database::

    pyhis.cache.init_cache(cache_database_uri='mycache.db')

    # Output:
    #
    # In [2]: pyhis.cache.init_cache(cache_database_uri='mycache.db')
    # cache initialized with database: sqlite:///mycache.db

This stores downloaded data in the current directory in a sqlite file called
mycache.db. You can also turn off caching for individual services or use
different database backends; this is not covered here.

Dude, where is my WSDL!
=========================

Accessing CUAHSI-HIS data services requires knowing the WSDL (Web Service
Definition Language) of the service you are interested in. This is a very
complicated way of saying you need to know the special web link that
provides the data. WSDLs look like this:
http://river.sdsc.edu/wateroneflow/NWIS/DailyValues.asmx?WSDL. Clicking on
that link takes you to a page full of gobbledygook XML that is impossible to
understand. This is what PyHIS needs to make the magic happen.
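You normally never have to read a WSDL yourself, but if you are curious you
can load one with the suds library, the SOAP toolkit PyHIS builds on (it
shows up later in this tutorial as the ``suds_client`` attribute of a
service). A minimal sketch, assuming suds is installed::

    # Peek at a WaterOneFlow WSDL with suds; printing the client
    # summarizes the SOAP methods the service exposes. This is purely
    # for curiosity: PyHIS handles all of this for you.
    from suds.client import Client

    client = Client('http://river.sdsc.edu/wateroneflow/NWIS/DailyValues.asmx?WSDL')
    print client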
CUAHSI-HIS maintains a list of registered HIS data services at HIS Central
(http://hiscentral.cuahsi.org/). These can be viewed from within PyHIS by
using the following function::

    list_of_services = pyhis.his_central.services()
    len(list_of_services)  # number of available services
    list_of_services       # to see the list

    # Output:
    # In [3]: list_of_services = pyhis.his_central.services()
    # In [4]: len(list_of_services)
    # 76
    # In [5]: list_of_services
    # [(u'USGS: NWIS Daily Values',
    #   'http://river.sdsc.edu/wateroneflow/NWIS/DailyValues.asmx?WSDL',
    #   'NWISDV'),
    #  (u'USGS: NWIS Instantaneous Irregular Data',
    #   'http://river.sdsc.edu/wateroneflow/NWIS/Data.asmx?WSDL',
    #   'NWISIID'),
    #  (u'USGS: NWIS Unit Values',
    #   'http://river.sdsc.edu/wateroneflow/NWIS/UnitValues.asmx?WSDL',
    #   'NWISUV'),
    #  (u'EPA: EPA STORET',
    #   'http://river.sdsc.edu/wateroneflow/EPA/cuahsi_1_0.asmx?WSDL',
    #   'EPA'),
    #  (u'Chesapeake Bay Information Management System: Chesapeake Bay Information Management System',
    #   'http://eddy.ccny.cuny.edu/CIMS/cuahsi_1_1.asmx?WSDL',
    #   'CIMS'),
    #  ...
    # ]

The above command returns a list of all available services. You can also
request a list of services with data available within a bounding box::

    list_of_texas_services = pyhis.his_central.services(xmin=-106.7, ymin=25.5, xmax=-93.4, ymax=36.6)
    len(list_of_texas_services)  # number of services available within the bounding box
    list_of_texas_services       # to see the list

    # Output:
    # In [6]: list_of_texas_services = pyhis.his_central.services(xmin=-106.7, ymin=25.5, xmax=-93.4, ymax=36.6)
    # In [7]: len(list_of_texas_services)
    # 27
    # In [8]: list_of_texas_services
    # [(u'USGS: NWIS Daily Values',
    #   'http://river.sdsc.edu/wateroneflow/NWIS/DailyValues.asmx?WSDL',
    #   'NWISDV'),
    #  (u'USGS: NWIS Instantaneous Irregular Data',
    #   'http://river.sdsc.edu/wateroneflow/NWIS/Data.asmx?WSDL',
    #   'NWISIID'),
    #  (u'USGS: NWIS Unit Values',
    #   'http://river.sdsc.edu/wateroneflow/NWIS/UnitValues.asmx?WSDL',
    #   'NWISUV'),
    #  (u'EPA: EPA STORET',
    #   'http://river.sdsc.edu/wateroneflow/EPA/cuahsi_1_0.asmx?WSDL',
    #   'EPA'),
    #  (u'USGS: NWIS Ground Water Level',
    #   'http://river.sdsc.edu/wateroneflow/NWIS/Groundwater.asmx?WSDL',
    #   'NWISGW'),
    #  (u'Texas Instream Flow Program: Texas Instream Flow, Lower Sabine',
    #   'http://his.crwr.utexas.edu/SabineBio/cuahsi_1_0.asmx?WSDL',
    #   'TIFP_LowerSabine'),
    #  (u'Texas Instream Flow Program: Texas Instream Flow, Lower San Antonio',
    #   'http://his.crwr.utexas.edu/SanAntonioBio/cuahsi_1_0.asmx?WSDL',
    #   'TIFP_LowerSanAntonio'),
    #  ...
    # ]

The services are returned as a list of tuples, each containing the name of
the service, the WSDL used to access the service and the network name. PyHIS
is not limited to the services registered at HIS Central; any WaterOneFlow
compliant data service can be accessed as long as you know the WSDL url.

Once you have the WSDL of a service you want to access data from, the
service needs to be initialized in pyhis. For the next several examples we
will use the USGS NWIS Unit Values service. Initialize the service::

    wsdl_url = 'http://river.sdsc.edu/wateroneflow/NWIS/UnitValues.asmx?WSDL'
    nwisuv = pyhis.Service(wsdl_url)
    nwisuv  # let's see what this is ...

    # Output:
    #
    # In [9]: wsdl_url = 'http://river.sdsc.edu/wateroneflow/NWIS/UnitValues.asmx?WSDL'
    # In [10]: nwisuv = pyhis.Service(wsdl_url)
    # In [11]: nwisuv
    #

This creates a pyhis ``Service`` object called nwisuv that knows everything
it needs to about the USGS NWIS service.
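The two steps can also be combined: the WSDL can come straight out of the
HIS Central listing retrieved earlier. A small sketch using the tuple layout
shown above::

    # Each entry of the HIS Central list is a (name, wsdl, network) tuple,
    # so a service can be initialized directly from, say, the first entry.
    name, wsdl, network = list_of_services[0]
    service = pyhis.Service(wsdl)
    print name, network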
Alright already! Let's Get Some Data.
========================================

Now that we have a pyhis Service object ``nwisuv``, let's see what functions
are available. If you are using IPython, type ``nwisuv.`` and press the tab
key::

    In [12]: nwisuv.
    nwisuv.default_network       nwisuv.get_site                     nwisuv.sites
    nwisuv.description           nwisuv.get_sites_within_polygon     nwisuv.sites_array
    nwisuv.generate_sites_array  nwisuv.get_sites_within_radius_r    nwisuv.suds_client
    nwisuv.get_all_sites         nwisuv.get_sites_within_shapefile   nwisuv.url

You will see a list of functions and attributes attached to the ``nwisuv``
object. Let's see what some of these do (note: some of these are internal
functions that will be hidden in future versions of PyHIS)::

    nwisuv.url
    # Out: 'http://river.sdsc.edu/wateroneflow/NWIS/UnitValues.asmx?WSDL'
    # i.e. the original url we specified

    nwisuv.sites
    # Out: making GetSites query...
    # Out: {u'01010000': , u'01010070': , u'01010500': , u'01011000': ,
    #       u'01011500': , u'01013500': , u'01014000': , u'01015800': ,
    #       u'01017000': , u'01017060': , u'01017290': , ... }

    len(nwisuv.sites)
    # Out: 11758

Note: The first time you type ``nwisuv.sites``, PyHIS has to connect to the
USGS NWIS Unit Values service and download the list of sites by making a web
service request, hence the message 'making GetSites query'. Retrieving and
parsing this data can take a few minutes for some of the larger datasets
like the USGS. The next time you use ``nwisuv.sites`` or any function that
needs the list of sites, PyHIS uses its automated local cache, so the
response is immediate. This is true even if you close the Python window and
open a new prompt. PyHIS has options to force refreshing of the cache on
demand.

Let's find some sites within the Austin, TX area. PyHIS has three
convenience functions you can use to narrow down the list of sites you are
interested in: ``get_sites_within_polygon``, ``get_sites_within_shapefile``
and ``get_sites_within_radius_r``. These do basically what you would expect
from the name::

    travis_sites = nwisuv.get_sites_within_shapefile('travis_county.shp')
    len(travis_sites)
    # Out: 27

    travis_sites
    # Out:
    # {'08154500/agency=TX071': ,
    #  '08154700': ,
    #  '08154900/agency=TX071': ,
    #  '08155200': ,
    #  '08155240': ,
    #  '08155300': ,
    #  '08155400': ,
    #  '08155500': ,
    #  '08155541': ,
    #  '08156675': ,
    #  '08156800': ,
    #  '08156910': ,
    #  '08158000': ,
    #  '08158030': ,
    #  '08158035': ,
    #  '08158045': ,
    #  '08158200': ,
    #  '08158380': ,
    #  '08158600': ,
    #  '08158827': ,
    #  '08158840': ,
    #  '08158860': ,
    #  '08158920': ,
    #  '08158927': ,
    #  '08158930': ,
    #  '08158970': ,
    #  '08159000': }

Let's look at one of the sites::

    nwisuv.sites['08158000']
    # Out:

This is a PyHIS ``Site`` object describing USGS gage 08158000 on the
Colorado River at Austin, TX. Let's see what we can find out about this
gage::

    mysite = nwisuv.sites['08158000']
    mysite.  # hit tab
    # Out:
    # mysite.code       mysite.latitude   mysite.network  mysite.timeseries
    # mysite.dataframe  mysite.longitude  mysite.service  mysite.id
    # mysite.name       mysite.site_info_response

Let's look at some of these::

    mysite.name
    # Out: u'Colorado Rv at Austin, TX'

    mysite.code
    # Out: u'08158000'

    mysite.latitude
    # Out: 30.244653701782227

    mysite.longitude
    # Out: -97.69445037841797

    mysite.service
    # Out:

    mysite.network
    # Out: u'NWISUV'

So the PyHIS ``Site`` object has pretty much all the metadata you should
need about the site you are accessing.

Yada, yada, yada... where is the actual data?
==============================================

Patience, we are there::

    mysite.timeseries
    # Out: making GetSiteInfo request for "NWISUV:08158000"...
    # Out: {00060: , 00065: }

Ah hah! This USGS gage has two available time series datasets: discharge and
gage height. Also summarized are the available date ranges of the data.
``00060`` and ``00065`` are the USGS parameter codes for these timeseries.
In the background, PyHIS has made a GetSiteInfo web service request to
retrieve this information.
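Since ``mysite.timeseries`` is a plain dictionary keyed by parameter code,
standard dictionary operations work on it. Here is a minimal sketch that
summarizes each available timeseries; the attribute names used
(``begin_datetime``, ``end_datetime``, ``value_count``) are the ones shown
by tab completion in the next snippet::

    # Loop over the available timeseries at this site and print a short
    # summary line for each parameter code.
    for code, ts in mysite.timeseries.items():
        print code, ts.begin_datetime, ts.end_datetime, ts.value_count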
Note: In practice, the reported time periods and value counts are not very
reliable from service to service, depending on how they have been generated
by the data provider. It is better to treat them as estimates.

Let's look at some of the metadata about the discharge timeseries::

    mydischarge = mysite.timeseries['00060']
    mydischarge.  # press tab
    # Out:
    # mydischarge.begin_datetime  mydischarge.quality_control_level  mydischarge.value_count
    # mydischarge.data            mydischarge.quantity               mydischarge.variable
    # mydischarge.end_datetime    mydischarge.series                 mydischarge.method
    # mydischarge.site

    mydischarge.value_count
    # Out: 11520

So this timeseries reports 11520 values. Let's get the discharge data for
the entire time period::

    mydata = mydischarge.data

    # --- OR ---
    # mydata = mysite.timeseries['00060'].data
    # i.e. you can string all the commands together.

    # Out: making timeseries request for "NWISUV:08158000:00060 (None - None)"...
    # /Users/dharhas/work/pyhis/pyhis/waterml.py:128: UserWarning: Unit conversion not available for 00060: UNKNOWN [cfs]
    #   (variable_code, quantity, unit_code))
    # /Users/dharhas/work/pyhis/pyhis/cache.py:816: UserWarning: value_count (11520) doesn't match number of values (11415) for Colorado Rv at Austin, TX:00060
    #   cached_timeseries.variable.code))

The data has been downloaded and placed in a ``pandas`` time series object.
Pandas (http://pandas.sourceforge.net/) is a powerful Python data analysis
toolkit. The data has also been cached: the next time this particular data
is requested, by default only new data will be retrieved from the USGS
service and previously retrieved data will be read from the local cache.

Let's look at the data::

    mydata
    # Out:
    # 2011-07-18 09:00:00    126.0
    # 2011-07-18 09:15:00    126.0
    # 2011-07-18 09:30:00    139.0
    # 2011-07-18 09:45:00    216.0
    # 2011-07-18 10:00:00    372.0
    # 2011-07-18 10:15:00    553.0
    # 2011-07-18 10:30:00    714.0
    # ...
    # 2011-11-15 08:00:00    101.0
    # 2011-11-15 08:15:00    105.0
    # 2011-11-15 08:30:00    105.0
    # 2011-11-15 08:45:00    108.0
    # 2011-11-15 09:00:00    108.0
    # 2011-11-15 09:15:00    108.0
    # 2011-11-15 09:30:00    108.0
    # 2011-11-15 09:45:00    108.0
    # length: 11415
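Note that you don't have to work with the whole series at once; a sub-period
can be pulled out with the standard pandas ``truncate`` method (you can spot
it in the tab-completion list below). A minimal sketch, using dates taken
from the output above::

    from datetime import datetime

    # Trim the series down to August 2011 and compute the mean
    # discharge for just that month.
    august = mydata.truncate(before=datetime(2011, 8, 1),
                             after=datetime(2011, 8, 31))
    august.mean()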
To see everything the series can do, press tab again::

    mydata.  # press tab
    # Out:
    # Display all 131 possibilities? (y or n)
    # mydata.T              mydata.data               mydata.map            mydata.shift
    # mydata.add            mydata.describe           mydata.max            mydata.size
    # mydata.all            mydata.diagonal           mydata.mean           mydata.skew
    # mydata.any            mydata.diff               mydata.median         mydata.sort
    # mydata.append         mydata.div                mydata.merge          mydata.sort_index
    # mydata.apply          mydata.dot                mydata.min            mydata.sortlevel
    # mydata.applymap       mydata.drop               mydata.mul            mydata.squeeze
    # mydata.argmax         mydata.dropna             mydata.nbytes         mydata.std
    # mydata.argmin         mydata.dtype              mydata.ndim           mydata.strides
    # mydata.argsort        mydata.dump               mydata.newbyteorder   mydata.sub
    # mydata.asOf           mydata.dumps              mydata.nonzero        mydata.sum
    # mydata.asfreq         mydata.fill               mydata.order          mydata.swapaxes
    # mydata.asof           mydata.fillna             mydata.plot           mydata.swaplevel
    # mydata.astype         mydata.first_valid_index  mydata.prod           mydata.take
    # mydata.autocorr       mydata.flags              mydata.ptp            mydata.toCSV
    # mydata.base           mydata.flat               mydata.put            mydata.toDict
    # mydata.byteswap       mydata.flatten            mydata.quantile       mydata.toString
    # mydata.choose         mydata.fromValue          mydata.ravel          mydata.to_csv
    # mydata.clip           mydata.get                mydata.real           mydata.to_dict
    # mydata.clip_lower     mydata.getfield           mydata.reindex        mydata.to_sparse
    # mydata.clip_upper     mydata.groupby            mydata.reindex_like   mydata.tofile
    # mydata.combine        mydata.hist               mydata.rename         mydata.tolist
    # mydata.combineFirst   mydata.imag               mydata.repeat         mydata.tostring
    # mydata.combine_first  mydata.index              mydata.reshape        mydata.trace
    # mydata.compress       mydata.interpolate        mydata.resize         mydata.transpose
    # mydata.conj           mydata.item               mydata.round          mydata.truncate
    # mydata.conjugate      mydata.itemset            mydata.save           mydata.unstack
    # mydata.copy           mydata.itemsize           mydata.searchsorted   mydata.valid
    # mydata.corr           mydata.iteritems          mydata.select         mydata.values
    # mydata.count          mydata.ix                 mydata.setasflat      mydata.var
    # mydata.ctypes         mydata.keys               mydata.setfield       mydata.view
    # mydata.cumprod        mydata.last_valid_index   mydata.setflags       mydata.weekday
    # mydata.cumsum         mydata.load               mydata.shape

As you can see, there is a pretty long list of functions available for a
``pandas`` timeseries object. Let's try a few::

    mydata.min()
    # Out: 52.0

    mydata.max()
    # Out: 2850.0

    mydata.median()
    # Out: 548.0

    mydata.std()
    # Out: 774.01376212573985

    # or just look at all the basic stats
    mydata.describe()
    # Out:
    # count    11415.0
    # mean     872.066579063
    # std      774.013762126
    # min      52.0
    # 10%      108.0
    # 50%      548.0
    # 90%      2030.0
    # max      2850.0

    mydata.cumsum()
    # Out:
    # 2011-07-18 09:00:00    126.0
    # 2011-07-18 09:15:00    252.0
    # 2011-07-18 09:30:00    391.0
    # 2011-07-18 09:45:00    607.0
    # 2011-07-18 10:00:00    979.0
    # 2011-07-18 10:15:00    1532.0
    # 2011-07-18 10:30:00    2246.0
    # ...
    # 2011-11-15 08:30:00    9954100.0
    # 2011-11-15 08:45:00    9954208.0
    # 2011-11-15 09:00:00    9954316.0
    # 2011-11-15 09:15:00    9954424.0
    # 2011-11-15 09:30:00    9954532.0
    # 2011-11-15 09:45:00    9954640.0
    # length: 11415

    # plot the data
    mydata.plot()

    # plot the cumulative sum
    mydata.cumsum().plot()

    # save the data to a csv file
    mydata.to_csv('mydata.csv')
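Since these are 15 minute unit values, a common next step is aggregating
them to daily values. Pandas convenience methods for this vary between
versions, but the ``groupby`` method shown in the list above works. A
minimal sketch::

    # Aggregate the 15 minute unit values to daily mean discharge by
    # grouping on the date part of each timestamp in the index.
    daily_mean = mydata.groupby(lambda timestamp: timestamp.date()).mean()
    daily_mean  # a new, much shorter series indexed by day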
Getting more complicated. Let's write a script
================================================

Look at tut_example1.py::

    """
    Example script that returns a list of USGS gages within the Travis
    county area whose discharge is below the 20th percentile.
    """
    # initialize pyhis
    import pyhis

    # import some other python modules for plotting etc.
    import matplotlib.pyplot as plt

    # Connect to the USGS NWIS Unit Values and Daily Values data services.
    # The daily values service will be used to calculate historical
    # percentiles and the unit values service provides 15 minute data.
    nwis_dv = pyhis.Service('http://river.sdsc.edu/wateroneflow/NWIS/DailyValues.asmx?WSDL')
    nwis_uv = pyhis.Service('http://river.sdsc.edu/wateroneflow/NWIS/UnitValues.asmx?WSDL')

    # get sites within Travis county
    travis_sites = nwis_uv.get_sites_within_shapefile('travis_county.shp')

    # target percentile (this could be made into an input parameter)
    target_percentile = 0.2  # (i.e. 20%)

    # start with an empty list
    list_of_critical_gages = []

    # loop through the sites in Travis County
    for site in travis_sites:
        print 'processing site: ', site

        # check if there is discharge data at this site
        if '00060' in nwis_uv.sites[site].timeseries:
            print 'Discharge data available'

            # calculate the target quantile from historical daily value data
            threshold = nwis_dv.sites[site].timeseries['00060/DataType=Average'].data.quantile(target_percentile)
            print 'threshold value: ', threshold

            # if the latest downloaded value is below the threshold, add
            # the gage to the list of critical gages
            if nwis_uv.sites[site].timeseries['00060'].data[-1] < threshold:
                print 'adding gage to critical gage list: ', site
                list_of_critical_gages.append(nwis_uv.sites[site])

    # plot discharge at the critical gages
    plt.figure()
    for gage in list_of_critical_gages:
        gage.timeseries['00060'].data.plot(label=gage.name)
    plt.legend()
    plt.ylabel('Discharge (cfs)')
    plt.title('Critical Gages in Travis County')
    plt.savefig('travis_county_critical_gages.png')
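To try the example, place a shapefile named travis_county.shp (the same one
used in the ``get_sites_within_shapefile`` example earlier) in the same
directory as the script and run it from the command line::

    python tut_example1.py

The script prints its progress as it loops over the gages and saves the
final plot to travis_county_critical_gages.png.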