.. index::
   single: tutorial, PyHIS
   single: PyHIS tutorial

************
PyHIS Basics
************

Introduction
============

This tutorial demonstrates how to retrieve data from the CUAHSI-HIS system
using PyHIS. It covers the basics of data access, the available convenience
functions and some simple scripts using the toolkit.

This tutorial requires PyHIS to be installed on your computer
(http://pypi.python.org/pypi/pyhis). Screenshots and examples in this
tutorial are shown using the IPython (http://ipython.org/) interactive Python
prompt for convenience, but IPython is not required to use PyHIS. A Python
code editor or IDE is recommended for writing scripts or viewing code.

This tutorial follows the convention below. A basic familiarity with Python
is useful, but the examples have been chosen for readability and a
non-Python user should be able to follow the tutorial.

Code snippet to be run in the IPython or Python prompt::

    a = 5
    b = 9
    print a + b

    # Output:
    #
    # In [1]: a = 5
    # In [2]: b = 9
    # In [3]: print a + b
    # 14

Lines starting with a ``#`` are comment lines. The section following
``# Output:`` demonstrates the expected output of the code snippet. You can
follow along by cutting and pasting each code snippet into the IPython
prompt.

Installing PyHIS
================

PyHIS is available on the Python Package Index and can be installed on most
systems using either ``easy_install pyhis`` or ``pip install pyhis``. PyHIS
uses several scientific Python plotting and analysis libraries. Although
easy_install and pip will attempt to install these automatically, it is
better to install a scientific Python distribution such as PythonXY
(http://www.pythonxy.com, Windows only) or EPD Free, the free Enthought
Python Distribution (http://enthought.com/products/epd_free.php,
Windows/Linux/Mac OS X), prior to installing PyHIS.

Initiating PyHIS
================

To start off, let's import pyhis into Python::

    import pyhis

    # Output:
    #
    # In [1]: import pyhis
    # cache initialized with database: sqlite:////var/folders/j0/3qhzgkzn7xgdzh0488y4_b340000gn/T/pyhis_cache.db

This command makes pyhis and all of its abilities available to Python. If
you look at the output above, you can see that by default PyHIS has created
a sqlite database to automatically cache (store locally) any data and
metadata you access with PyHIS. If you would like to store the data
somewhere else, you can provide a path and name for your database::

    pyhis.cache.init_cache(cache_database_uri='mycache.db')

    # Output:
    #
    # In [2]: pyhis.cache.init_cache(cache_database_uri='mycache.db')
    # cache initialized with database: sqlite:///mycache.db

This stores downloaded data in the current directory in a sqlite file called
mycache.db. You can also turn off caching for individual services or use
different database backends; this is not covered here.

Dude, where is my WSDL!
=========================

Accessing CUAHSI-HIS data services requires knowing the WSDL (Web Service
Definition Language) of the service you are interested in. This is a very
complicated way of saying you need to know the special web link that
provides the data. WSDLs look like this:
http://river.sdsc.edu/wateroneflow/NWIS/DailyValues.asmx?WSDL. Clicking on
that link takes you to a page full of gobbledygook XML that is impossible to
understand. This is what PyHIS needs to make the magic happen.
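You normally never have to read a WSDL yourself, but if you are curious you
can load one with the suds library, the SOAP toolkit PyHIS builds on (it
shows up later in this tutorial as the ``suds_client`` attribute of a
service). A minimal sketch, assuming suds is installed::

    # Peek at a WaterOneFlow WSDL with suds; printing the client
    # summarizes the SOAP methods the service exposes. This is purely
    # for curiosity: PyHIS handles all of this for you.
    from suds.client import Client

    client = Client('http://river.sdsc.edu/wateroneflow/NWIS/DailyValues.asmx?WSDL')
    print client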
CUAHSI-HIS maintains a list of registered HIS data services at HIS Central
(http://hiscentral.cuahsi.org/). These can be viewed from within PyHIS by
using the following function::

    list_of_services = pyhis.his_central.services()
    len(list_of_services)  # number of available services
    list_of_services       # to see the list

    # Output:
    # In [3]: list_of_services = pyhis.his_central.services()
    # In [4]: len(list_of_services)
    # 76
    # In [5]: list_of_services
    # [(u'USGS: NWIS Daily Values',
    #   'http://river.sdsc.edu/wateroneflow/NWIS/DailyValues.asmx?WSDL',
    #   'NWISDV'),
    #  (u'USGS: NWIS Instantaneous Irregular Data',
    #   'http://river.sdsc.edu/wateroneflow/NWIS/Data.asmx?WSDL',
    #   'NWISIID'),
    #  (u'USGS: NWIS Unit Values',
    #   'http://river.sdsc.edu/wateroneflow/NWIS/UnitValues.asmx?WSDL',
    #   'NWISUV'),
    #  (u'EPA: EPA STORET',
    #   'http://river.sdsc.edu/wateroneflow/EPA/cuahsi_1_0.asmx?WSDL',
    #   'EPA'),
    #  (u'Chesapeake Bay Information Management System: Chesapeake Bay Information Management System',
    #   'http://eddy.ccny.cuny.edu/CIMS/cuahsi_1_1.asmx?WSDL',
    #   'CIMS'),
    #  ...
    # ]

The above command returns a list of all available services. You can also
request a list of services with data available within a bounding box::

    list_of_texas_services = pyhis.his_central.services(xmin=-106.7, ymin=25.5, xmax=-93.4, ymax=36.6)
    len(list_of_texas_services)  # number of services available within the bounding box
    list_of_texas_services       # to see the list

    # Output:
    # In [6]: list_of_texas_services = pyhis.his_central.services(xmin=-106.7, ymin=25.5, xmax=-93.4, ymax=36.6)
    # In [7]: len(list_of_texas_services)
    # 27
    # In [8]: list_of_texas_services
    # [(u'USGS: NWIS Daily Values',
    #   'http://river.sdsc.edu/wateroneflow/NWIS/DailyValues.asmx?WSDL',
    #   'NWISDV'),
    #  (u'USGS: NWIS Instantaneous Irregular Data',
    #   'http://river.sdsc.edu/wateroneflow/NWIS/Data.asmx?WSDL',
    #   'NWISIID'),
    #  (u'USGS: NWIS Unit Values',
    #   'http://river.sdsc.edu/wateroneflow/NWIS/UnitValues.asmx?WSDL',
    #   'NWISUV'),
    #  (u'EPA: EPA STORET',
    #   'http://river.sdsc.edu/wateroneflow/EPA/cuahsi_1_0.asmx?WSDL',
    #   'EPA'),
    #  (u'USGS: NWIS Ground Water Level',
    #   'http://river.sdsc.edu/wateroneflow/NWIS/Groundwater.asmx?WSDL',
    #   'NWISGW'),
    #  (u'Texas Instream Flow Program: Texas Instream Flow, Lower Sabine',
    #   'http://his.crwr.utexas.edu/SabineBio/cuahsi_1_0.asmx?WSDL',
    #   'TIFP_LowerSabine'),
    #  (u'Texas Instream Flow Program: Texas Instream Flow, Lower San Antonio',
    #   'http://his.crwr.utexas.edu/SanAntonioBio/cuahsi_1_0.asmx?WSDL',
    #   'TIFP_LowerSanAntonio'),
    #  ...
    # ]

The services are returned as a list of tuples, each containing the name of
the service, the WSDL used to access the service and the network name. PyHIS
is not limited to the services registered at HIS Central; any WaterOneFlow
compliant data service can be accessed as long as you know the WSDL url.

Once you have the WSDL of a service you want to access data from, the
service needs to be initialized in pyhis. For the next several examples we
will use the USGS NWIS Unit Values service. Initialize the service::

    wsdl_url = 'http://river.sdsc.edu/wateroneflow/NWIS/UnitValues.asmx?WSDL'
    nwisuv = pyhis.Service(wsdl_url)
    nwisuv  # let's see what this is ...

    # Output:
    #
    # In [9]: wsdl_url = 'http://river.sdsc.edu/wateroneflow/NWIS/UnitValues.asmx?WSDL'
    # In [10]: nwisuv = pyhis.Service(wsdl_url)
    # In [11]: nwisuv
    #

This creates a pyhis ``Service`` object called nwisuv that knows everything
it needs to about the USGS NWIS service.
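The two steps can also be combined: the WSDL can come straight out of the
HIS Central listing retrieved earlier. A small sketch using the tuple layout
shown above::

    # Each entry of the HIS Central list is a (name, wsdl, network) tuple,
    # so a service can be initialized directly from, say, the first entry.
    name, wsdl, network = list_of_services[0]
    service = pyhis.Service(wsdl)
    print name, network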
Alright already! Let's Get Some Data.
========================================

Now that we have a pyhis Service object ``nwisuv``, let's see what functions
are available. If you are using IPython, type ``nwisuv.`` and press the tab
key::

    In [12]: nwisuv.
    nwisuv.default_network       nwisuv.get_site                     nwisuv.sites
    nwisuv.description           nwisuv.get_sites_within_polygon     nwisuv.sites_array
    nwisuv.generate_sites_array  nwisuv.get_sites_within_radius_r    nwisuv.suds_client
    nwisuv.get_all_sites         nwisuv.get_sites_within_shapefile   nwisuv.url

You will see a list of functions and attributes attached to the ``nwisuv``
object. Let's see what some of these do (note: some of these are internal
functions that will be hidden in future versions of PyHIS)::

    nwisuv.url
    # Out: 'http://river.sdsc.edu/wateroneflow/NWIS/UnitValues.asmx?WSDL'
    # i.e. the original url we specified

    nwisuv.sites
    # Out: making GetSites query...
    # Out: {u'01010000': , u'01010070': , u'01010500': , u'01011000': ,
    #       u'01011500': , u'01013500': , u'01014000': , u'01015800': ,
    #       u'01017000': , u'01017060': , u'01017290': , ... }

    len(nwisuv.sites)
    # Out: 11758

Note: The first time you type ``nwisuv.sites``, PyHIS has to connect to the
USGS NWIS Unit Values service and download the list of sites by making a web
service request, hence the message 'making GetSites query'. Retrieving and
parsing this data can take a few minutes for some of the larger datasets
like the USGS. The next time you use ``nwisuv.sites`` or any function that
needs the list of sites, PyHIS uses its automated local cache, so the
response is immediate. This is true even if you close the Python window and
open a new prompt. PyHIS has options to force refreshing of the cache on
demand.

Let's find some sites within the Austin, TX area. PyHIS has three
convenience functions you can use to narrow down the list of sites you are
interested in: ``get_sites_within_polygon``, ``get_sites_within_shapefile``
and ``get_sites_within_radius_r``. These do basically what you would expect
from the name::

    travis_sites = nwisuv.get_sites_within_shapefile('travis_county.shp')
    len(travis_sites)
    # Out: 27

    travis_sites
    # Out:
    # {'08154500/agency=TX071': ,
    #  '08154700': ,
    #  '08154900/agency=TX071': ,
    #  '08155200': ,
    #  '08155240': ,
    #  '08155300': ,
    #  '08155400': ,
    #  '08155500': ,
    #  '08155541': ,
    #  '08156675': ,
    #  '08156800': ,
    #  '08156910': ,
    #  '08158000': ,
    #  '08158030': ,
    #  '08158035': ,
    #  '08158045': ,
    #  '08158200': ,
    #  '08158380': ,
    #  '08158600': ,
    #  '08158827': ,
    #  '08158840': ,
    #  '08158860': ,
    #  '08158920': ,
    #  '08158927': ,
    #  '08158930': ,
    #  '08158970': ,
    #  '08159000': }

Let's look at one of the sites::

    nwisuv.sites['08158000']
    # Out:

This is a PyHIS ``Site`` object describing USGS gage 08158000 on the
Colorado River at Austin, TX. Let's see what we can find out about this
gage::

    mysite = nwisuv.sites['08158000']
    mysite.  # hit tab
    # Out:
    # mysite.code       mysite.latitude   mysite.network  mysite.timeseries
    # mysite.dataframe  mysite.longitude  mysite.service  mysite.id
    # mysite.name       mysite.site_info_response

Let's look at some of these::

    mysite.name
    # Out: u'Colorado Rv at Austin, TX'

    mysite.code
    # Out: u'08158000'

    mysite.latitude
    # Out: 30.244653701782227

    mysite.longitude
    # Out: -97.69445037841797

    mysite.service
    # Out:

    mysite.network
    # Out: u'NWISUV'

So the PyHIS ``Site`` object has pretty much all the metadata you should
need about the site you are accessing.

Yada, yada, yada... where is the actual data?
==============================================

Patience, we are there::

    mysite.timeseries
    # Out: making GetSiteInfo request for "NWISUV:08158000"...
    # Out: {00060: , 00065: }

Ah hah! This USGS gage has two available time series datasets: discharge and
gage height. Also summarized are the available date ranges of the data.
``00060`` and ``00065`` are the USGS parameter codes for these timeseries.
In the background, PyHIS has made a GetSiteInfo web service request to
retrieve this information.
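Since ``mysite.timeseries`` is a plain dictionary keyed by parameter code,
standard dictionary operations work on it. Here is a minimal sketch that
summarizes each available timeseries; the attribute names used
(``begin_datetime``, ``end_datetime``, ``value_count``) are the ones shown
by tab completion in the next snippet::

    # Loop over the available timeseries at this site and print a short
    # summary line for each parameter code.
    for code, ts in mysite.timeseries.items():
        print code, ts.begin_datetime, ts.end_datetime, ts.value_count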
Note: In practice, the reported time periods and value counts are not very
reliable from service to service, depending on how they have been generated
by the data provider. It is better to treat them as estimates.

Let's look at some of the metadata about the discharge timeseries::

    mydischarge = mysite.timeseries['00060']
    mydischarge.  # press tab
    # Out:
    # mydischarge.begin_datetime  mydischarge.quality_control_level  mydischarge.value_count
    # mydischarge.data            mydischarge.quantity               mydischarge.variable
    # mydischarge.end_datetime    mydischarge.series                 mydischarge.method
    # mydischarge.site

    mydischarge.value_count
    # Out: 11520

So this timeseries reports 11520 values. Let's get the discharge data for
the entire time period::

    mydata = mydischarge.data

    # --- OR ---
    # mydata = mysite.timeseries['00060'].data
    # i.e. you can string all the commands together.

    # Out: making timeseries request for "NWISUV:08158000:00060 (None - None)"...
    # /Users/dharhas/work/pyhis/pyhis/waterml.py:128: UserWarning: Unit conversion not available for 00060: UNKNOWN [cfs]
    #   (variable_code, quantity, unit_code))
    # /Users/dharhas/work/pyhis/pyhis/cache.py:816: UserWarning: value_count (11520) doesn't match number of values (11415) for Colorado Rv at Austin, TX:00060
    #   cached_timeseries.variable.code))

The data has been downloaded and placed in a ``pandas`` time series object.
Pandas (http://pandas.sourceforge.net/) is a powerful Python data analysis
toolkit. The data has also been cached: the next time this particular data
is requested, by default only new data will be retrieved from the USGS
service and previously retrieved data will be read from the local cache.

Let's look at the data::

    mydata
    # Out:
    # 2011-07-18 09:00:00    126.0
    # 2011-07-18 09:15:00    126.0
    # 2011-07-18 09:30:00    139.0
    # 2011-07-18 09:45:00    216.0
    # 2011-07-18 10:00:00    372.0
    # 2011-07-18 10:15:00    553.0
    # 2011-07-18 10:30:00    714.0
    # ...
    # 2011-11-15 08:00:00    101.0
    # 2011-11-15 08:15:00    105.0
    # 2011-11-15 08:30:00    105.0
    # 2011-11-15 08:45:00    108.0
    # 2011-11-15 09:00:00    108.0
    # 2011-11-15 09:15:00    108.0
    # 2011-11-15 09:30:00    108.0
    # 2011-11-15 09:45:00    108.0
    # length: 11415
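Note that you don't have to work with the whole series at once; a sub-period
can be pulled out with the standard pandas ``truncate`` method (you can spot
it in the tab-completion list below). A minimal sketch, using dates taken
from the output above::

    from datetime import datetime

    # Trim the series down to August 2011 and compute the mean
    # discharge for just that month.
    august = mydata.truncate(before=datetime(2011, 8, 1),
                             after=datetime(2011, 8, 31))
    august.mean()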
To see everything the series can do, press tab again::

    mydata.  # press tab
    # Out:
    # Display all 131 possibilities? (y or n)
    # mydata.T              mydata.data               mydata.map            mydata.shift
    # mydata.add            mydata.describe           mydata.max            mydata.size
    # mydata.all            mydata.diagonal           mydata.mean           mydata.skew
    # mydata.any            mydata.diff               mydata.median         mydata.sort
    # mydata.append         mydata.div                mydata.merge          mydata.sort_index
    # mydata.apply          mydata.dot                mydata.min            mydata.sortlevel
    # mydata.applymap       mydata.drop               mydata.mul            mydata.squeeze
    # mydata.argmax         mydata.dropna             mydata.nbytes         mydata.std
    # mydata.argmin         mydata.dtype              mydata.ndim           mydata.strides
    # mydata.argsort        mydata.dump               mydata.newbyteorder   mydata.sub
    # mydata.asOf           mydata.dumps              mydata.nonzero        mydata.sum
    # mydata.asfreq         mydata.fill               mydata.order          mydata.swapaxes
    # mydata.asof           mydata.fillna             mydata.plot           mydata.swaplevel
    # mydata.astype         mydata.first_valid_index  mydata.prod           mydata.take
    # mydata.autocorr       mydata.flags              mydata.ptp            mydata.toCSV
    # mydata.base           mydata.flat               mydata.put            mydata.toDict
    # mydata.byteswap       mydata.flatten            mydata.quantile       mydata.toString
    # mydata.choose         mydata.fromValue          mydata.ravel          mydata.to_csv
    # mydata.clip           mydata.get                mydata.real           mydata.to_dict
    # mydata.clip_lower     mydata.getfield           mydata.reindex        mydata.to_sparse
    # mydata.clip_upper     mydata.groupby            mydata.reindex_like   mydata.tofile
    # mydata.combine        mydata.hist               mydata.rename         mydata.tolist
    # mydata.combineFirst   mydata.imag               mydata.repeat         mydata.tostring
    # mydata.combine_first  mydata.index              mydata.reshape        mydata.trace
    # mydata.compress       mydata.interpolate        mydata.resize         mydata.transpose
    # mydata.conj           mydata.item               mydata.round          mydata.truncate
    # mydata.conjugate      mydata.itemset            mydata.save           mydata.unstack
    # mydata.copy           mydata.itemsize           mydata.searchsorted   mydata.valid
    # mydata.corr           mydata.iteritems          mydata.select         mydata.values
    # mydata.count          mydata.ix                 mydata.setasflat      mydata.var
    # mydata.ctypes         mydata.keys               mydata.setfield       mydata.view
    # mydata.cumprod        mydata.last_valid_index   mydata.setflags       mydata.weekday
    # mydata.cumsum         mydata.load               mydata.shape

As you can see, there is a pretty long list of functions available for a
``pandas`` timeseries object. Let's try a few::

    mydata.min()
    # Out: 52.0

    mydata.max()
    # Out: 2850.0

    mydata.median()
    # Out: 548.0

    mydata.std()
    # Out: 774.01376212573985

    # or just look at all the basic stats
    mydata.describe()
    # Out:
    # count    11415.0
    # mean     872.066579063
    # std      774.013762126
    # min      52.0
    # 10%      108.0
    # 50%      548.0
    # 90%      2030.0
    # max      2850.0

    mydata.cumsum()
    # Out:
    # 2011-07-18 09:00:00    126.0
    # 2011-07-18 09:15:00    252.0
    # 2011-07-18 09:30:00    391.0
    # 2011-07-18 09:45:00    607.0
    # 2011-07-18 10:00:00    979.0
    # 2011-07-18 10:15:00    1532.0
    # 2011-07-18 10:30:00    2246.0
    # ...
    # 2011-11-15 08:30:00    9954100.0
    # 2011-11-15 08:45:00    9954208.0
    # 2011-11-15 09:00:00    9954316.0
    # 2011-11-15 09:15:00    9954424.0
    # 2011-11-15 09:30:00    9954532.0
    # 2011-11-15 09:45:00    9954640.0
    # length: 11415

    # plot the data
    mydata.plot()

    # plot the cumulative sum
    mydata.cumsum().plot()

    # save the data to a csv file
    mydata.to_csv('mydata.csv')
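Since these are 15 minute unit values, a common next step is aggregating
them to daily values. Pandas convenience methods for this vary between
versions, but the ``groupby`` method shown in the list above works. A
minimal sketch::

    # Aggregate the 15 minute unit values to daily mean discharge by
    # grouping on the date part of each timestamp in the index.
    daily_mean = mydata.groupby(lambda timestamp: timestamp.date()).mean()
    daily_mean  # a new, much shorter series indexed by day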
Getting more complicated. Let's write a script
================================================

Look at tut_example1.py::

    """
    Example script that returns a list of USGS gages within the Travis
    county area whose discharge is below the 20th percentile.
    """
    # initialize pyhis
    import pyhis

    # import some other python modules for plotting etc.
    import matplotlib.pyplot as plt

    # Connect to the USGS NWIS Unit Values and Daily Values data services.
    # The daily values service will be used to calculate historical
    # percentiles and the unit values service provides 15 minute data.
    nwis_dv = pyhis.Service('http://river.sdsc.edu/wateroneflow/NWIS/DailyValues.asmx?WSDL')
    nwis_uv = pyhis.Service('http://river.sdsc.edu/wateroneflow/NWIS/UnitValues.asmx?WSDL')

    # get sites within Travis county
    travis_sites = nwis_uv.get_sites_within_shapefile('travis_county.shp')

    # target percentile (this could be made into an input parameter)
    target_percentile = 0.2  # (i.e. 20%)

    # start with an empty list
    list_of_critical_gages = []

    # loop through the sites in Travis County
    for site in travis_sites:
        print 'processing site: ', site

        # check if there is discharge data at this site
        if '00060' in nwis_uv.sites[site].timeseries:
            print 'Discharge data available'

            # calculate the target quantile from historical daily value data
            threshold = nwis_dv.sites[site].timeseries['00060/DataType=Average'].data.quantile(target_percentile)
            print 'threshold value: ', threshold

            # if the latest downloaded value is below the threshold, add
            # the gage to the list of critical gages
            if nwis_uv.sites[site].timeseries['00060'].data[-1] < threshold:
                print 'adding gage to critical gage list: ', site
                list_of_critical_gages.append(nwis_uv.sites[site])

    # plot discharge at the critical gages
    plt.figure()
    for gage in list_of_critical_gages:
        gage.timeseries['00060'].data.plot(label=gage.name)
    plt.legend()
    plt.ylabel('Discharge (cfs)')
    plt.title('Critical Gages in Travis County')
    plt.savefig('travis_county_critical_gages.png')
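To try the example, place a shapefile named travis_county.shp (the same one
used in the ``get_sites_within_shapefile`` example earlier) in the same
directory as the script and run it from the command line::

    python tut_example1.py

The script prints its progress as it loops over the gages and saves the
final plot to travis_county_critical_gages.png.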