Welcome to icy’s Documentation

See the GitHub repository and download from PyPI.

API Reference

icy.read(path, cfg={}, raise_on_error=False, silent=False, verbose=False, return_errors=False)

Wraps pandas.io and odo to create a dictionary of pandas.DataFrames from multiple sources.

Parameters:

path : str

Location of file, folder or zip-file to be parsed. Can include globbing (e.g. *.csv). Can be remote with URI-notation beginning with e.g. http://, https://, file://, ftp://, s3:// and ssh://. Can be odo-supported database (SQL, MongoDB, Hadoop, Spark) if dependencies are available. Parser will be selected based on file extension.

cfg : dict or str, optional

Dictionary of kwargs to be provided to the pandas parser (http://pandas.pydata.org/pandas-docs/stable/api.html#input-output), or a str with the path to a YAML file that will be parsed into such a dictionary.

Special keys:

filters : str or list of str, optional. For a file to be processed, its name must contain at least one of the strings (e.g. ['.csv', '.tsv']).

default : kwargs to be used for every file

custom_date_parser : strptime-format string (https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior), used to generate a parser that is passed as the date_parser argument

If a file name is among the keys, the kwargs under that key are used in addition to the default kwargs and overwrite them where both define the same argument.
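
The way the special keys combine can be sketched in plain Python. This is only an illustration of the rules described above, not icy's actual implementation: the helper names passes_filters and resolve_kwargs are hypothetical, the file name and kwargs in the sample cfg are made up, and placing custom_date_parser at the top level of cfg is an assumption.

```python
from datetime import datetime


def passes_filters(filename, filters):
    """Return True if no filters are given or the name contains one of them."""
    if not filters:
        return True
    if isinstance(filters, str):
        filters = [filters]
    return any(f in filename for f in filters)


def resolve_kwargs(cfg, filename):
    """Hypothetical helper: default kwargs first, per-file kwargs on top."""
    kwargs = dict(cfg.get("default", {}))
    kwargs.update(cfg.get(filename, {}))  # per-file values overwrite defaults
    fmt = cfg.get("custom_date_parser")  # assumed to be a top-level key
    if fmt is not None:
        # Build a parser from the strptime-format string.
        kwargs["date_parser"] = lambda s: datetime.strptime(s, fmt)
    return kwargs


# Illustrative configuration; file name and kwargs are invented for the sketch.
cfg = {
    "filters": [".csv", ".tsv"],
    "default": {"sep": ",", "header": 0},
    "custom_date_parser": "%Y-%m-%d",
    "sales.csv": {"sep": ";"},
}

kwargs = resolve_kwargs(cfg, "sales.csv")
print(kwargs["sep"], kwargs["header"])
```

Here sales.csv inherits header from default but replaces sep with its own value, which mirrors the "in addition to or overwriting" rule.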

silent : boolean, optional

If True, doesn’t print to stdout.

verbose : boolean, optional

If True, prints parsing arguments for each file processed to stdout.

raise_on_error : boolean, optional

If True, raise an exception when a file cannot be parsed successfully. If False, only display a warning.

return_errors : boolean, optional

If True, read() returns (data, errors) tuple instead of only data, with errors as a list of all files that could not be parsed.

Returns:

data : dict

Dictionary of parsed pandas.DataFrames, with file names as keys.
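
The interplay of raise_on_error and return_errors can be illustrated with a small stand-alone sketch. The function read_all and the stand-in parser fake_parse are hypothetical and exist only to show the documented contract (warn or raise, then return data or a (data, errors) tuple); icy's internals may differ.

```python
import warnings


def read_all(paths, parse, raise_on_error=False, return_errors=False):
    """Hypothetical loop mimicking the documented error handling."""
    data, errors = {}, []
    for path in paths:
        try:
            data[path] = parse(path)
        except Exception as exc:
            if raise_on_error:
                raise
            # Otherwise only warn and remember the file that failed.
            warnings.warn(f"could not parse {path}: {exc}")
            errors.append(path)
    return (data, errors) if return_errors else data


def fake_parse(path):
    """Stand-in parser that fails for names starting with 'bad'."""
    if path.startswith("bad"):
        raise ValueError("unparseable")
    return {"rows": 3}


data, errors = read_all(["good.csv", "bad.csv"], fake_parse, return_errors=True)
print(sorted(data), errors)
```

With return_errors left at False the same call would return only the dictionary, so callers that unpack two values need to set the flag explicitly.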

Notes

  • Start with basic cfg and tune until the desired parsing result is achieved.
  • File extensions determine which parser is used, so make sure they are common ones.
  • Avoid files named ‘default’ or ‘filters’.
  • Avoid duplicate file names.
  • Subfolders and file names beginning with ‘.’ or ‘_’ are ignored.
  • If an https:// URI isn’t correctly processed, try http:// instead.
  • To connect to a database or s3-bucket, make sure the required dependencies like sqlalchemy, pymongo, pyspark or boto are available in the active environment.

icy.merge(data, cfg=None)

WORK IN PROGRESS

Concatenates, merges, joins and drops keys in a dictionary of pandas.DataFrames to produce one pandas.DataFrame (data) and one pandas.Series (labels).

Parameters:

data : dict of pandas.DataFrames

Result of icy.read()

cfg : dict or str, optional

Dictionary of actions to perform on data, or a str with the path to a YAML file that will be parsed into such a dictionary.

Returns:

data : pandas.DataFrame

The aggregated dataset

labels : pandas.Series

The target variable for analysis of the dataset. It can have fewer samples than the aggregated dataset.

icy.mem(data)

Total memory used by data

Parameters:

data : dict of pandas.DataFrames or pandas.DataFrame

Returns:

str : str

Human readable amount of memory used with unit (like KB, MB, GB etc.).
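
As a rough idea of what such a string looks like, the sketch below formats a byte count with a unit. The helper human_readable is hypothetical, and both the 1024-based steps and the one-decimal format are assumptions; the source does not state how icy.mem rounds or which base it uses. For a real DataFrame the byte count could come from something like DataFrame.memory_usage(deep=True).sum().

```python
def human_readable(num_bytes):
    """Format a byte count with a unit, assuming 1024-based steps."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if size < 1024 or unit == "TB":
            return f"{size:.1f} {unit}"
        size /= 1024


# Byte counts per table are made up for the illustration.
sizes = {"a.csv": 1024, "b.csv": 512}
print(human_readable(sum(sizes.values())))
```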

icy._path_to_objs(path, include=['*', '.*'], exclude=['.*', '_*'])

Turn a path with optional globbing into a list of valid files, respecting the include and exclude patterns.

Parameters:

path : str

Path to process. Can be the location of a file, a folder or a glob. Can be in URI notation, relative or absolute, or start with ~.

include : list, optional

Globbing patterns to require in result, defaults to [‘*’, ‘.*’].

exclude : list, optional

Globbing patterns to exclude from result, defaults to [‘.*’, ‘_*’].

Returns:

objs : list

List of valid files

Notes

  • Doesn’t show hidden files starting with ‘.’ by default. To enable hidden files, make sure ‘.*’ is in include and ‘.*’ is not in exclude.
  • Doesn’t show files starting with ‘_’ by default. To enable these files, make sure ‘_*’ is not in exclude.
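
The effect of include and exclude on hidden and underscore-prefixed files can be sketched with the standard fnmatch module. The function select is a hypothetical stand-in that works on bare file names; it is not the code behind _path_to_objs and ignores folders, URIs and ~ expansion.

```python
from fnmatch import fnmatchcase


def select(names, include=("*", ".*"), exclude=(".*", "_*")):
    """Keep names matching any include pattern and no exclude pattern."""
    return [
        name
        for name in names
        if any(fnmatchcase(name, pat) for pat in include)
        and not any(fnmatchcase(name, pat) for pat in exclude)
    ]


names = ["a.csv", ".hidden.csv", "_tmp.csv", "b.tsv"]

print(select(names))                   # defaults hide '.' and '_' files
print(select(names, exclude=["_*"]))   # hidden files enabled
print(select(names, exclude=[".*"]))   # underscore files enabled
```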

icy.run_examples(examples)

Run read() on a number of examples, suppress output, generate summary.

Parameters:

examples : list of tuples of three str elements

Each tuple contains the path and cfg arguments for the read function, as well as the cfg argument for the merge function (TODO), e.g. [(path, read_cfg, merge_cfg), (...)]

Returns:

None

Prints all results to stdout.
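
The general pattern (run each example quietly, then print one summary) might look like the sketch below. Everything here is illustrative: run_all and fake_reader are hypothetical names, the paths are invented, and unlike run_examples the sketch returns its summary so it can be inspected. The merge cfg is carried along but unused, matching the TODO above.

```python
import contextlib
import io


def run_all(examples, reader):
    """Hypothetical runner: silence each call, collect one status per example."""
    summary = []
    for path, read_cfg, merge_cfg in examples:  # merge_cfg unused (TODO)
        with contextlib.redirect_stdout(io.StringIO()):
            try:
                result = reader(path, read_cfg)
                status = f"ok ({len(result)} tables)"
            except Exception as exc:
                status = f"failed ({exc})"
        summary.append((path, status))
    return summary


def fake_reader(path, cfg):
    """Stand-in for a reader: prints noise and fails for one path."""
    print("noisy output")
    if path == "missing.csv":
        raise IOError("not found")
    return {"a": 1, "b": 2}


examples = [
    ("data/*.csv", "read.yml", "merge.yml"),
    ("missing.csv", "read.yml", "merge.yml"),
]
summary = run_all(examples, fake_reader)
for path, status in summary:
    print(f"{path}: {status}")
```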
