Examples
========

Parsing many HTML-Files
-----------------------

Download and unzip the `Baby Names dataset `_ (part of the great `Google Python Course `_). From the archive we'll just use the `/babynames`-folder (you can leave the .py-files and the subfolder in place). The dataset consists of 10 regular html-files taken from http://www.ssa.gov/. If you preview the files, you'll find tables embedded in the html with three columns (`Rank`, `Male name`, `Female name`) and 1000 rows per year.

::

    >>> import icy
    >>> icy.preview('~/Downloads/google-python-exercises/babynames/')

This results in a flood of data on your screen: the parser obviously detects more tables than just the name rankings. If you scroll through the preview, you'll find results like this:

::

    File: baby2008.html_baby2008.html_2
    Int64Index: 1002 entries, 0 to 1001
    Data columns (total 3 columns):
    0    1002 non-null object
    1    1001 non-null object
    2    1001 non-null object
    dtypes: object(3)
    memory usage: 31.3+ KB

    COLUMN | first 5 VALUES
    ----------------------------------------
    0 | ['Rank', '1', '2', '3', '4']
    1 | ['Male name', 'Jacob', 'Michael', 'Ethan', 'Joshua']
    2 | ['Female name', 'Emma', 'Isabella', 'Emily', 'Madison']

This looks roughly like what we are aiming for. Passing the parsing arguments **header = 0** and **index_col = 0** takes care of identifying the correct column names and of using the `Rank` column as the index. Since the html-files contain more tables than just the rankings, we get a number of results we want to ignore. Luckily, the interesting tables can be identified by their attribute `summary="Popularity for top 1000"`; the keyword to accomplish this is **attrs = {'summary': 'Popularity for top 1000'}**. To get rid of the error that the file `babynames.py` could not be parsed, we can pass **filters = '.html'**.

::

    >>> import icy
    >>> src = '~/Downloads/google-python-exercises/babynames/'
    >>> icy.preview(src, cfg = {
    ...     'filters': '.html',
    ...     'default': {
    ...         'header': 0,
    ...         'index_col': 0,
    ...         'attrs': {'summary': 'Popularity for top 1000'}
    ...     }})

This results in a preview of the first rows of all 10 tables. To simplify generating and testing the required parsing arguments, you can provide the location of a YAML-file to the `cfg`-keyword. A `babynames_read.yml` for this example would be:

::

    filters:
    - '.html'
    default:
      attrs: {summary: Popularity for top 1000}
      header: 0
      index_col: 0

Now run the whole thing:

::

    >>> import icy
    >>> src = '~/Downloads/google-python-exercises/babynames/'
    >>> cfg = '~/Downloads/google-python-exercises/babynames_read.yml'
    >>> data = icy.read(src, cfg)

**data** is a dict of pandas.DataFrames with 10 keys and a total memory usage of 234.6 KB.

**Note:** We used the 'default'-key to apply the same parsing arguments to every file. For more heterogeneous data, you can specify different parsing arguments per file by using the filename (e.g. 'baby1990.html') as a key. If you specify both 'default' and file-specific arguments, the 'default' is still applied to all files, but the file-specific arguments override it, as shown in the sketch below.
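For illustration, such a per-file override could look like the following sketch. The `baby1990.html` entry and its **skiprows: 1** are purely hypothetical, only there to show the override mechanics; this dataset doesn't need them:

::

    filters:
    - '.html'
    default:
      attrs: {summary: Popularity for top 1000}
      header: 0
      index_col: 0
    baby1990.html:
      # hypothetical: this file would be parsed with header,
      # index_col and attrs from 'default' plus skiprows
      skiprows: 1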
Parsing many compressed CSV-Files
---------------------------------

Download the `Lahman Baseball dataset `_ (from Sean Lahman's `extensive Baseball resources `_) and the `Caterpillar Tube Pricing dataset `_ (from `Kaggle's Caterpillar competition `_). The Lahman dataset consists of 24 csv-files and a `readme.txt`, while the Caterpillar dataset features 21 csv-files.

::

    >>> import icy
    >>> icy.preview('~/Downloads/lahman-csv_2015-01-24.zip')  # output for lahman-csv_2015-01-24.zip
    >>> icy.preview('~/Downloads/data.zip')  # output for data.zip

Again, a lot of data appears on your screen for each of the two datasets. Most of the results seem quite sensible, but we can still do a little better with this `lahman_read.yml` (ignore the readme.txt and parse date-columns):

::

    filters:
    - '.csv'
    Master.csv:
      parse_dates: ['debut', 'finalGame']

and this `cat_read.yml` (parse custom nan- and boolean-values as well as date-columns):

::

    default:
      na_values: ['NA', 'NONE']
      true_values: ['Yes', 'Y']
      false_values: ['No', 'N']
    test_set.csv:
      parse_dates: ['quote_date']
    train_set.csv:
      parse_dates: ['quote_date']

Now run the whole thing:

::

    >>> import icy
    >>> src = '~/Downloads/lahman-csv_2015-01-24.zip'
    >>> cfg = '~/Downloads/lahman_read.yml'
    >>> data = icy.read(src, cfg)  # data for lahman-csv_2015-01-24.zip
    >>> src = '~/Downloads/data.zip'
    >>> cfg = '~/Downloads/cat_read.yml'
    >>> data = icy.read(src, cfg)  # data for data.zip

**data** is a dict of pandas.DataFrames with 24 keys and a total memory usage of 82.3 MB for the Lahman dataset, or 21 keys and a total memory usage of 11.0 MB for the Caterpillar dataset, respectively.
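If you want to verify numbers like these yourself, you can inspect the returned dict directly with plain pandas. A minimal sketch for the Lahman result, assuming the dict keys are the csv-filenames (as the `Master.csv` key in the config suggests):

::

    >>> src = '~/Downloads/lahman-csv_2015-01-24.zip'
    >>> cfg = '~/Downloads/lahman_read.yml'
    >>> data = icy.read(src, cfg)
    >>> len(data)  # one DataFrame per csv-file
    24
    >>> for name, df in sorted(data.items()):
    ...     print(name, df.shape)
    >>> # approximate total memory usage across all DataFrames, in MB
    >>> sum(df.memory_usage(deep=True).sum() for df in data.values()) / 2 ** 20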