Welcome to icy’s Documentation
See the GitHub repository and download from PyPI.
API Reference
icy.read(path, cfg={}, raise_on_error=False, silent=False, verbose=False, return_errors=False)
Wraps pandas.IO & odo to create a dictionary of pandas.DataFrames from multiple different sources.
Parameters:
path : str
Location of the file, folder or zip file to be parsed. Can include globbing (e.g. *.csv). Can be remote with URI notation beginning with e.g. http://, https://, file://, ftp://, s3:// and ssh://. Can be an odo-supported database (SQL, MongoDB, Hadoop, Spark) if dependencies are available. The parser is selected based on the file extension.
cfg : dict or str, optional
Dictionary of kwargs to be provided to the pandas parser (http://pandas.pydata.org/pandas-docs/stable/api.html#input-output), or a str with the path to a YAML file that will be parsed.
Special keys:
filters : str or list of strings, optional. For a file to be processed, it must contain one of the strings (e.g. ['.csv', '.tsv']).
default : kwargs to be used for every file.
custom_date_parser : strptime-format string (https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior). Generates a parser that is used as the date_parser argument.
If a filename is among the keys, the kwargs from that key are used in addition to the default kwargs, overwriting them where they overlap.
silent : boolean, optional
If True, doesn’t print to stdout.
verbose : boolean, optional
If True, prints parsing arguments for each file processed to stdout.
raise_on_error : boolean, optional
If True, raises an exception if a file cannot be parsed successfully. Otherwise, only displays a warning.
return_errors : boolean, optional
If True, read() returns (data, errors) tuple instead of only data, with errors as a list of all files that could not be parsed.
Returns:
data : dict
Dictionary of parsed pandas.DataFrames, with file names as keys.
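The custom_date_parser key described under cfg takes a strptime format string and turns it into a parser. The idea can be sketched with the standard library alone. Note that make_date_parser is a hypothetical helper and the format string is an illustrative assumption. This is not icy's actual implementation.

```python
from datetime import datetime


def make_date_parser(fmt):
    """Return a function that parses one date string using the strptime format `fmt`."""
    def parse(value):
        return datetime.strptime(value, fmt)
    return parse


# Hypothetical format for values such as "31.12.2015 23:59"
parser = make_date_parser("%d.%m.%Y %H:%M")
parsed = parser("31.12.2015 23:59")
print(parsed.isoformat())  # 2015-12-31T23:59:00
```

A value that does not match the format raises ValueError, which is a useful signal that the format string in cfg needs tuning.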
Notes
- Start with basic cfg and tune until the desired parsing result is achieved.
- File extensions determine which parser is used, so make sure the files use common extensions.
- Avoid files named ‘default’ or ‘filters’.
- Avoid duplicate file names.
- Subfolders and file names beginning with ‘.’ or ‘_’ are ignored.
- If an https:// URI isn’t correctly processed, try http:// instead.
- To connect to a database or s3-bucket, make sure the required dependencies like sqlalchemy, pymongo, pyspark or boto are available in the active environment.
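The way the special cfg keys interact can be sketched in plain Python. This is only an illustration under stated assumptions: kwargs_for and passes_filters are hypothetical helpers, the file names and kwargs are made up, and filters is assumed to be matched against the file name. icy's internal logic may differ.

```python
def kwargs_for(filename, cfg):
    """Combine the 'default' kwargs with the kwargs stored under the file's own name."""
    kwargs = dict(cfg.get("default", {}))  # applies to every file
    kwargs.update(cfg.get(filename, {}))   # file-specific keys add to or overwrite defaults
    return kwargs


def passes_filters(filename, cfg):
    """True if no filters are set or the name contains at least one filter string."""
    filters = cfg.get("filters")
    if not filters:
        return True
    if isinstance(filters, str):
        filters = [filters]
    return any(f in filename for f in filters)


# Hypothetical cfg; file names and pandas kwargs are illustrative only.
cfg = {
    "filters": [".csv", ".tsv"],
    "default": {"sep": ",", "encoding": "utf-8"},
    "sales.tsv": {"sep": "\t"},
}

for name in ["customers.csv", "sales.tsv", "notes.txt"]:
    if passes_filters(name, cfg):
        print(name, kwargs_for(name, cfg))
```

Because 'default' and 'filters' are reserved keys in the same dictionary as the per-file entries, a file with one of those names would collide with them, which is why the notes advise against such names.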
icy.merge(data, cfg=None)
WORK IN PROGRESS
Concat, merge, join, drop keys in dictionary of pandas.DataFrames into one pandas.DataFrame (data) and a pandas.Series (labels).
Parameters:
data : dict of pandas.DataFrames
Result of icy.read()
cfg : dict or str, optional
Dictionary of actions to perform on data, or a str with the path to a YAML file that will be parsed.
Returns:
data : pandas.DataFrame
The aggregated dataset
labels : pandas.Series
The target variable for analysis of the dataset. Can have fewer samples than the aggregated dataset.
icy.mem(data)
Total memory used by data.
Parameters:
data : dict of pandas.DataFrames or pandas.DataFrame
Returns:
str : str
Human readable amount of memory used with unit (like KB, MB, GB etc.).
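As a rough illustration of what a human-readable memory string looks like, the sketch below formats a raw byte count with a unit. The helper human_readable, the 1024-based units and the one-decimal format are assumptions made for this example. The exact format returned by mem() is not specified here.

```python
def human_readable(num_bytes):
    """Format a byte count with a 1024-based unit, e.g. 1536 -> '1.5 KB'."""
    size = float(num_bytes)
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        # Stop at the first unit that keeps the number below 1024,
        # or at the largest unit available.
        if size < 1024 or unit == "TB":
            return "{:.1f} {}".format(size, unit)
        size /= 1024


print(human_readable(1536))           # 1.5 KB
print(human_readable(5 * 1024 ** 3))  # 5.0 GB
```

For a pandas.DataFrame, a byte count of this kind can be obtained with df.memory_usage(deep=True).sum().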
icy._path_to_objs(path, include=['*', '.*'], exclude=['.*', '_*'])
Turn a path with optional globbing into a valid list of files, respecting include and exclude patterns.
Parameters:
path : str
Path to process. Can be the location of a file, folder or glob. Can be in URI notation, relative or absolute, or start with ~.
include : list, optional
Globbing patterns to require in result, defaults to ['*', '.*'].
exclude : list, optional
Globbing patterns to exclude from result, defaults to ['.*', '_*'].
Returns:
objs : list
List of valid files
Notes
- Doesn’t show hidden files starting with ‘.’ by default. To enable hidden files, make sure ‘.*’ is in include and ‘.*’ is not in exclude.
- Doesn’t show files starting with ‘_’ by default. To enable these files, make sure ‘_*’ is not in exclude.
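The interplay of include and exclude described in the notes can be sketched with the standard library's fnmatch module. The helper select and the file names are hypothetical, and this is not the library's own code. It only shows why hidden files need '.*' removed from exclude.

```python
from fnmatch import fnmatch


def select(names, include=("*", ".*"), exclude=(".*", "_*")):
    """Keep names that match any include pattern and no exclude pattern."""
    return [
        n for n in names
        if any(fnmatch(n, p) for p in include)
        and not any(fnmatch(n, p) for p in exclude)
    ]


names = ["data.csv", ".hidden.csv", "_draft.csv"]
print(select(names))                  # ['data.csv']
print(select(names, exclude=["_*"]))  # ['data.csv', '.hidden.csv']
print(select(names, exclude=[]))      # ['data.csv', '.hidden.csv', '_draft.csv']
```

With the defaults, the exclude patterns win over the include patterns, so both the hidden file and the underscore-prefixed file are dropped.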
icy.run_examples(examples)
Run read() on a number of examples, suppress output, and generate a summary.
Parameters:
examples : list of tuples of three str elements
Tuples contain the path and cfg arguments to the read function, as well as the cfg argument to the merge function (TODO), e.g. [(path, read_cfg, merge_cfg), (...)]
Returns: None
Prints all results to stdout.