Welcome to icy’s Documentation

See the GitHub repository and download from PyPI.

Examples
- Parsing many HTML-Files
- Parsing many compressed CSV-Files

API Reference

icy.read(path, cfg={}, raise_on_error=False, silent=False, verbose=False, return_errors=False)

Wraps the pandas I/O parsers and odo to create a dictionary of pandas.DataFrames from multiple different sources.

Parameters:

path : str

Location of the file, folder or zip file to be parsed. Can include globbing (e.g. *.csv). Can be remote, using URI notation beginning with e.g. http://, https://, file://, ftp://, s3:// or ssh://. Can be an odo-supported database (SQL, MongoDB, Hadoop, Spark) if the dependencies are available. The parser is selected based on the file extension.

cfg : dict or str, optional

Dictionary of kwargs to be provided to the pandas parser (http://pandas.pydata.org/pandas-docs/stable/api.html#input-output), or a str with the path to a YAML file that will be parsed.

Special keys:

filters : str or list of strings, optional. For a file to be processed, its name must contain at least one of the strings (e.g. [‘.csv’, ‘.tsv’])

default : kwargs to be used for every file

custom_date_parser : strptime-format string (https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior), generates a parser that is used as the date_parser argument

If a file name appears among the keys, the kwargs under that key are used in addition to the default kwargs, overwriting them where both define the same argument.

silent : boolean, optional

If True, doesn’t print to stdout.

verbose : boolean, optional

If True, prints parsing arguments for each file processed to stdout.

raise_on_error : boolean, optional

If True, raises an exception when a file cannot be parsed successfully. If False, only displays a warning.

return_errors : boolean, optional

If True, read() returns (data, errors) tuple instead of only data, with errors as a list of all files that could not be parsed.

Returns:

data : dict

Dictionary of parsed pandas.DataFrames, with file names as keys.
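
The return_errors behaviour described above can be sketched as follows. This is a minimal illustration, not icy's own code: read_with_report and fake_read are hypothetical names, and fake_read is a stand-in that only mimics the documented return shape so the sketch runs without icy installed.

```python
def read_with_report(read, path, cfg=None):
    """Call a read()-like function and summarise what parsed and what failed."""
    # With return_errors=True the documented result is a (data, errors) tuple.
    data, errors = read(path, cfg=cfg or {}, return_errors=True, silent=True)
    summary = {
        "parsed": sorted(data),   # keys of the dict are file names
        "failed": list(errors),   # files that could not be parsed
    }
    return data, summary


# Hypothetical stand-in for icy.read; it copies only the documented signature
# and return shape, not the real parsing behaviour.
def fake_read(path, cfg=None, raise_on_error=False, silent=False,
              verbose=False, return_errors=False):
    data = {"a.csv": [1, 2, 3]}
    errors = ["broken.csv"]
    return (data, errors) if return_errors else data


data, summary = read_with_report(fake_read, "data/*.csv")
```

With icy installed, passing icy.read in place of fake_read should behave the same way, as far as the signature documented above goes.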

Notes

Start with basic cfg and tune until the desired parsing result is achieved.
File extensions determine which parser is used, so make sure they are common ones.
Avoid files named ‘default’ or ‘filters’.
Avoid duplicate file names.
Subfolders and file names beginning with ‘.’ or ‘_’ are ignored.
If an https:// URI isn’t correctly processed, try http:// instead.
To connect to a database or s3-bucket, make sure the required dependencies like sqlalchemy, pymongo, pyspark or boto are available in the active environment.
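
How the special keys in cfg interact can be sketched in plain Python. This is an assumption-laden illustration rather than icy's implementation: the helper names passes_filters and resolve_kwargs, the file names, and the exact matching rules (substring match for filters, exact file name match for per-file keys) are all hypothetical.

```python
from datetime import datetime

# Hypothetical cfg; file names and parser kwargs are illustrative only.
cfg = {
    "filters": [".csv", ".tsv"],
    "default": {"sep": ",", "header": 0},
    "custom_date_parser": "%Y-%m-%d",
    "sales.tsv": {"sep": "\t"},
}

SPECIAL_KEYS = {"filters", "default", "custom_date_parser"}


def passes_filters(filename, cfg):
    """Assumed rule: the file name must contain at least one filter string."""
    filters = cfg.get("filters")
    if filters is None:
        return True
    if isinstance(filters, str):
        filters = [filters]
    return any(f in filename for f in filters)


def resolve_kwargs(filename, cfg):
    """Start from the default kwargs, then let the per-file key override them."""
    kwargs = dict(cfg.get("default", {}))
    if filename not in SPECIAL_KEYS:
        kwargs.update(cfg.get(filename, {}))
    fmt = cfg.get("custom_date_parser")
    if fmt is not None:
        # Build a parser from the strptime format string.
        kwargs["date_parser"] = lambda value: datetime.strptime(value, fmt)
    return kwargs


kwargs = resolve_kwargs("sales.tsv", cfg)
```

The SPECIAL_KEYS check also shows why files named ‘default’ or ‘filters’ are best avoided: their names would collide with the special keys.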

icy.merge(data, cfg=None)

WORK IN PROGRESS

Concatenate, merge, join and drop keys in a dictionary of pandas.DataFrames to produce one pandas.DataFrame (data) and one pandas.Series (labels).

Parameters:

data : dict of pandas.DataFrames

Result of icy.read()

cfg : dict or str, optional

Dictionary of actions to perform on data, or a str with the path to a YAML file that will be parsed.

Returns:

data : pandas.DataFrame

The aggregated dataset

labels : pandas.Series

The target variable for analysis of the dataset, can have fewer samples than the aggregated dataset

icy.mem(data)

Total memory used by data

Parameters:

data : dict of pandas.DataFrames or pandas.DataFrame

Returns:

str : str

Human readable amount of memory used with unit (like KB, MB, GB etc.).
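
The kind of string returned here can be sketched as below. This is a hypothetical helper, not icy's code: the name human_readable, the 1024-based steps and the one-decimal formatting are assumptions. For real DataFrames, the byte count could come from pandas' DataFrame.memory_usage(deep=True).sum().

```python
def human_readable(num_bytes):
    """Format a byte count with a unit, assuming 1024-based steps."""
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units[:-1]:
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    # Anything beyond the last step is reported in the largest unit.
    return f"{size:.1f} {units[-1]}"


def total_bytes(sizes):
    """Sum per-table byte counts, e.g. one entry per DataFrame in a dict."""
    return sum(sizes.values())


# Hypothetical byte counts for two tables.
sizes = {"a.csv": 1024, "b.csv": 512}
label = human_readable(total_bytes(sizes))
```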

icy._path_to_objs(path, include=['*', '.*'], exclude=['.*', '_*'])

Turn a path with optional globbing into a valid list of files, respecting the include and exclude patterns.

Parameters:

path : str

Path to process. Can be the location of a file, a folder or a glob. Can be in URI notation, can be relative or absolute, and can start with ~.

include : list, optional

Globbing patterns to require in result, defaults to [‘*’, ‘.*’].

exclude : list, optional

Globbing patterns to exclude from result, defaults to [‘.*’, ‘_*’].

Returns:

objs : list

List of valid files

Notes

Doesn’t show hidden files starting with ‘.’ by default. To enable hidden files, make sure ‘.*’ is in include and ‘.*’ is not in exclude.
Doesn’t show files starting with ‘_’ by default. To enable these files, make sure ‘_*’ is not in exclude.
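
The include and exclude rules in these notes can be sketched with the standard library. This is an illustration under assumptions, not the function's actual code: select is a hypothetical name, and it uses fnmatch, where ‘*’ already matches names with a leading dot (unlike glob), so ‘.*’ in include is redundant in this sketch.

```python
from fnmatch import fnmatchcase


def select(names, include=("*", ".*"), exclude=(".*", "_*")):
    """Keep names that match any include pattern and no exclude pattern."""
    kept = []
    for name in names:
        included = any(fnmatchcase(name, pattern) for pattern in include)
        excluded = any(fnmatchcase(name, pattern) for pattern in exclude)
        if included and not excluded:
            kept.append(name)
    return kept


# Hypothetical file names.
names = ["a.csv", ".hidden.csv", "_tmp.csv"]
visible = select(names)                            # defaults hide '.' and '_' files
with_hidden = select(names, exclude=["_*"])        # '.*' no longer excluded
```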

icy.run_examples(examples)

Run read() on a number of examples, suppress their output, and generate a summary.

Parameters:

examples : list of tuples of three str elements

Each tuple contains the path and cfg arguments for the read function, as well as the cfg argument for the merge function (TODO), e.g. [(path, read_cfg, merge_cfg), (...)]

Returns:

None

Prints all results to stdout.
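
The expected shape of the examples argument can be sketched as below. The paths and YAML file names are hypothetical, and check_examples is an illustrative helper that is not part of icy.

```python
# Hypothetical examples: (path, read_cfg, merge_cfg), all given as str.
examples = [
    ("data/html/*.html", "cfg/html_read.yml", "cfg/html_merge.yml"),
    ("data/csv.zip", "cfg/csv_read.yml", "cfg/csv_merge.yml"),
]


def check_examples(examples):
    """Verify that every entry is a tuple of three str elements."""
    for entry in examples:
        if not (isinstance(entry, tuple) and len(entry) == 3
                and all(isinstance(item, str) for item in entry)):
            raise ValueError(f"expected (path, read_cfg, merge_cfg), got {entry!r}")
    return len(examples)


count = check_examples(examples)
# With icy installed, the list would then be passed on:
# icy.run_examples(examples)
```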
