Examples
========
Parsing many HTML-Files
-----------------------
Download and unzip the `Baby Names dataset`_ (part of the great `Google Python Course`_). From the archive we'll just use the `/babynames` folder (you can leave the .py files and the subfolder in place).
The dataset consists of 10 regular HTML files taken from http://www.ssa.gov/. If you preview the files, you'll find tables embedded in the HTML with three columns (`Rank`, `Male name`, `Female name`) and 1000 rows per year.

::

    >>> import icy
    >>> icy.preview('~/Downloads/google-python-exercises/babynames/')

This results in a flood of data on your screen. Obviously the parser detects more tables than just the names. If you scroll through the preview result you'll find results like this:

::

    File: baby2008.html_baby2008.html_2
    Int64Index: 1002 entries, 0 to 1001
    Data columns (total 3 columns):
    0    1002 non-null object
    1    1001 non-null object
    2    1001 non-null object
    dtypes: object(3)
    memory usage: 31.3+ KB
    COLUMN | first 5 VALUES
    ----------------------------------------
    0      | ['Rank', '1', '2', '3', '4']
    1      | ['Male name', 'Jacob', 'Michael', 'Ethan', 'Joshua']
    2      | ['Female name', 'Emma', 'Isabella', 'Emily', 'Madison']

This looks roughly like what we are aiming for. Passing the parsing arguments **header = 0** and **index_col = 0** takes care of identifying the correct column names and using the `Rank` column as the index.
Since the HTML files contain more tables than the one we are after, there are a number of results we want to ignore. Luckily, you can identify the interesting tables by filtering for tables featuring `summary="Popularity for top 1000"`. The keyword to accomplish this is **attrs = {'summary': 'Popularity for top 1000'}**.
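To see why this attribute filter is enough to isolate the name tables, here is a minimal sketch that uses only the standard library. It is not how icy works internally: the `TableAttrFinder` class and the miniature page are made up for illustration, and the sketch only assumes that table selection boils down to comparing the attributes of each `<table>` tag against the requested ones.

```python
from html.parser import HTMLParser


class TableAttrFinder(HTMLParser):
    """Hypothetical helper: record every <table> tag whose attributes match `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted   # attribute/value pairs a table must carry
        self.total = 0         # number of <table> tags seen so far
        self.matches = []      # attribute dicts of the matching tables

    def handle_starttag(self, tag, attrs):
        if tag != 'table':
            return
        self.total += 1
        found = dict(attrs)
        # Keep the table only if every requested attribute has the requested value.
        if all(found.get(key) == value for key, value in self.wanted.items()):
            self.matches.append(found)


# Made-up miniature page: one layout table and one data table.
page = """
<table width="100%"><tr><td>navigation</td></tr></table>
<table summary="Popularity for top 1000">
  <tr><td>Rank</td><td>Male name</td><td>Female name</td></tr>
  <tr><td>1</td><td>Jacob</td><td>Emma</td></tr>
</table>
"""

finder = TableAttrFinder({'summary': 'Popularity for top 1000'})
finder.feed(page)
print(finder.total, len(finder.matches))
```

Of the two tables on the made-up page, only the one carrying the `summary` attribute is kept, which is the effect the **attrs** argument is meant to have on the real files.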
To get rid of the error reporting that the file `babynames.py` could not be parsed, we can pass **filters = '.html'** so that only the HTML files are read.

::

    >>> import icy
    >>> src = '~/Downloads/google-python-exercises/babynames/'
    >>> icy.preview(src,
    ...             cfg={
    ...                 'filters': '.html',
    ...                 'default': {
    ...                     'header': 0,
    ...                     'index_col': 0,
    ...                     'attrs': {'summary': 'Popularity for top 1000'}
    ...                 }})

This results in a preview of the first rows of all 10 tables. To simplify generating and testing the required parsing arguments, you can provide the location of a YAML-file to the `cfg`-keyword. A `babynames_read.yml` for this example would be:

::

    filters:
      - '.html'
    default:
      attrs: {summary: Popularity for top 1000}
      header: 0
      index_col: 0

Now run the whole thing:

::

    >>> import icy
    >>> src = '~/Downloads/google-python-exercises/babynames/'
    >>> cfg = '~/Downloads/google-python-exercises/babynames_read.yml'
    >>> data = icy.read(src, cfg)

**data** is a dict of pandas.DataFrames with 10 keys and a total memory usage of 234.6 KB.
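If you want to reproduce such a total for a dict of DataFrames yourself, the following sketch shows one way to do it with plain pandas. The helper `total_memory_kb` and the two tiny frames are made up for illustration; whether icy computes its figure the same way (shallow memory usage, 1024 bytes per KB) is an assumption.

```python
import pandas as pd


def total_memory_kb(frames):
    """Hypothetical helper: sum the shallow memory usage of all frames, in KB."""
    total_bytes = sum(int(df.memory_usage(index=True).sum())
                      for df in frames.values())
    return total_bytes / 1024


# Two made-up frames standing in for the dict returned by icy.read.
frames = {
    'a.html': pd.DataFrame({'Male name': ['Jacob'], 'Female name': ['Emma']}),
    'b.html': pd.DataFrame({'Male name': ['Michael'], 'Female name': ['Isabella']}),
}
print(round(total_memory_kb(frames), 1))
```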
**Note:** We used the 'default' key to apply the parsing arguments to every file. For more heterogeneous data, you can specify different parsing arguments by using the filename (e.g. 'baby1990.html') as a key. If you specify both 'default' and file-specific arguments, the 'default' is still applied to all files, but the specific arguments override it.
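The override rule from the note can be sketched in a few lines of plain Python. This only illustrates the described behaviour and is not icy's actual implementation: the function `resolve_args` and the per-file override for 'baby1990.html' are hypothetical.

```python
def resolve_args(cfg, filename):
    """Hypothetical helper: return the parsing arguments that apply to `filename`.

    Start from the 'default' entry, then let a file-specific entry override it.
    """
    args = dict(cfg.get('default', {}))   # copy, so the config stays untouched
    args.update(cfg.get(filename, {}))    # specific arguments win over defaults
    return args


cfg = {
    'filters': ['.html'],
    'default': {'header': 0, 'index_col': 0},
    'baby1990.html': {'index_col': None},   # made-up per-file override
}

print(resolve_args(cfg, 'baby1990.html'))
print(resolve_args(cfg, 'baby2008.html'))
```

The file with its own entry keeps `header` from 'default' but gets its own `index_col`, while every other file receives the defaults unchanged.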
Parsing many compressed CSV-Files
---------------------------------
Download the `Lahman Baseball dataset`_ (from Sean Lahman's `extensive Baseball resources`_) and the `Caterpillar Tube Pricing dataset`_ (from `Kaggle's Caterpillar competition`_). The Lahman dataset consists of 24 CSV files and a `readme.txt`, while the Caterpillar dataset features 21 CSV files.

::

    >>> import icy
    >>> icy.preview('~/Downloads/lahman-csv_2015-01-24.zip')
    # output for lahman-csv_2015-01-24.zip
    >>> icy.preview('~/Downloads/data.zip')
    # output for data.zip

Again a lot of data appears on your screen for each of the two datasets. Most of the results seem quite sensible but we can still do a little better with this `lahman_read.yml` (ignore the readme.txt and parse date-columns):

::

    filters:
      - '.csv'
    Master.csv:
      parse_dates: ['debut', 'finalGame']

and this `cat_read.yml` (parse custom nan- and boolean-values and date-columns):

::

    default:
      na_values: ['NA', 'NONE']
      true_values: ['Yes', 'Y']
      false_values: ['No', 'N']
    test_set.csv:
      parse_dates: ['quote_date']
    train_set.csv:
      parse_dates: ['quote_date']

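The keys in `cat_read.yml` have the same names as arguments of `pandas.read_csv`. Assuming they are handed through to it unchanged (an assumption about icy, not something stated here), their effect can be tried on a small in-memory sample. The sample rows and all column names except `quote_date` are made up for illustration.

```python
import io

import pandas as pd

# Made-up sample in the spirit of the Caterpillar files.
csv_text = """tube_id,quote_date,bracket_pricing,material
TA-1,2013-07-07,Yes,NONE
TA-2,2014-01-15,No,SP-0029
TA-3,2014-02-01,Yes,NA
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    na_values=['NA', 'NONE'],       # treat these strings as missing values
    true_values=['Yes', 'Y'],       # parse these strings as True
    false_values=['No', 'N'],       # parse these strings as False
    parse_dates=['quote_date'],     # convert this column to datetimes
)
print(df.dtypes)
```

With these arguments `bracket_pricing` is read as a boolean column, `quote_date` as datetimes, and the `NONE` and `NA` entries in `material` become missing values.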
Now run the whole thing:

::

    >>> import icy
    >>> src = '~/Downloads/lahman-csv_2015-01-24.zip'
    >>> cfg = '~/Downloads/lahman_read.yml'
    >>> data = icy.read(src, cfg)
    # data for lahman-csv_2015-01-24.zip
    >>> src = '~/Downloads/data.zip'
    >>> cfg = '~/Downloads/cat_read.yml'
    >>> data = icy.read(src, cfg)
    # data for data.zip

In each case **data** is a dict of pandas.DataFrames: 24 keys and a total memory usage of 82.3 MB for the Lahman dataset, and 21 keys and a total memory usage of 11.0 MB for the Caterpillar dataset.
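The `filters` entry in `lahman_read.yml` is what keeps `readme.txt` out of the result. The following standard-library sketch shows the idea on a tiny in-memory archive. The helper `matching_members` is hypothetical, and treating a filter as a plain substring match on the member name is an assumption, not a description of icy's matching rules.

```python
import io
import zipfile


def matching_members(archive, filters):
    """Hypothetical helper: list archive members whose name contains any filter string."""
    with zipfile.ZipFile(archive) as zf:
        return [name for name in zf.namelist()
                if any(token in name for token in filters)]


# Build a tiny in-memory archive as a stand-in for the real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('Master.csv', 'debut,finalGame\n')
    zf.writestr('readme.txt', 'documentation\n')

buffer.seek(0)
print(matching_members(buffer, ['.csv']))
```

Only `Master.csv` passes the `'.csv'` filter, so the text file never reaches the CSV parser.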