This module provides classes and functions for working with datasets.
Number of bins used to discretize data. Specify 0 to indicate that data should not be discretized. default=0
File to read data from. default=None
The text of a dataset included in the config file. default=''
Create a pebl Dataset instance.
A Dataset consists of the following data structures which are all numpy.ndarray instances:
variables, samples: 1D arrays of variable or sample annotations
This class provides a few public methods to manipulate datasets; one can also use numpy functions/methods directly.
Required/Default values:
- The only required argument is observations (a 2D numpy array).
- If missing or interventions are not specified, they are assumed to be all zeros (no missing values and no interventions).
- If variables or samples are not specified, appropriate Variable or Sample annotations are created with only the name attribute.
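The defaulting rules above can be sketched as follows. This is only an illustration: plain nested lists stand in for the numpy arrays that Dataset actually uses, and fill_defaults is a hypothetical helper, not part of pebl.

```python
def fill_defaults(observations, missing=None, interventions=None):
    """Return (observations, missing, interventions) with all-zero defaults.

    Hypothetical helper: nested lists stand in for 2D numpy arrays.
    """
    n_samples = len(observations)
    n_variables = len(observations[0]) if observations else 0

    def zeros():
        # One independent row per sample, so the masks never share storage.
        return [[0] * n_variables for _ in range(n_samples)]

    if missing is None:
        missing = zeros()          # no missing values
    if interventions is None:
        interventions = zeros()    # no interventions
    return observations, missing, interventions


obs = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
_, missing, interventions = fill_defaults(obs)
shape = (len(obs), len(obs[0]))    # (number of samples, number of variables)
```

Rows are samples and columns are variables, matching the shape convention described for the dataset.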
Checks whether the specified arity >= the number of unique observations.
The check is only performed for discrete variables.
If this check fails, the CPT and other data structures would fail, so an error should be raised while loading the data. Fail early and explicitly!
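A minimal sketch of such a check for a single discrete variable, assuming its observations are available as a flat list. check_arity is a hypothetical name, and ValueError stands in for whatever exception the library raises.

```python
def check_arity(values, arity):
    """Raise ValueError if `values` holds more distinct entries than `arity`."""
    unique = set(values)
    if len(unique) > arity:
        raise ValueError(
            "arity %d is smaller than the %d unique observed values"
            % (arity, len(unique))
        )
    return True


check_arity([0, 1, 2, 1, 0], 3)        # passes: 3 unique values, arity 3

try:
    check_arity([0, 1, 2, 3], 3)       # fails: 4 unique values, arity 3
except ValueError as err:
    message = str(err)
```

Running the check at load time means a bad arity surfaces immediately rather than later, when a table sized by that arity is indexed out of range.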
Discretize (or bin) the data in-place.
This method is just an alias for pebl.discretizer.maximum_entropy_discretizer(). See the module documentation for pebl.discretizer for more information.
Whether the dataset has any interventions.
Whether the dataset has any missing values.
The shape of the dataset as (number of samples, number of variables).
Returns a subset of the dataset (and metadata).
Specify the variables and samples for creating a subset of the data. variables and samples should each be a list of ids. If either is not specified, all variables or all samples are used.
Some examples:
- d.subset([3], [4])
- d.subset([3,1,2])
- d.subset(samples=[5,2,7,1])
Note: order matters! d.subset([3,1,2]) != d.subset([1,2,3])
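The ordering behaviour can be illustrated with a small stand-in that selects from a nested list instead of a Dataset. The subset function below is a hypothetical sketch of the semantics described above, not the library's implementation.

```python
def subset(observations, variables=None, samples=None):
    """Pick columns (variables) and rows (samples) by id, in the order given."""
    if samples is None:
        samples = range(len(observations))
    if variables is None:
        variables = range(len(observations[0]))
    return [[observations[s][v] for v in variables] for s in samples]


data = [
    [10, 11, 12, 13],
    [20, 21, 22, 23],
]

a = subset(data, [3, 1, 2])      # columns in the order 3, 1, 2
b = subset(data, [1, 2, 3])      # same columns, different order
c = subset(data, samples=[1])    # all variables, second sample only
```

Here a and b contain the same values but in different column order, which is why the two calls are not interchangeable.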
Returns a subset of the dataset (and metadata).
Same as Dataset.subset() except that variables and samples can be specified by their names.
Some examples:
- d.subset(variables=['shh', 'genex'])
- d.subset(samples=["control%d" % i for i in xrange(10)])
Write the data and metadata to file in a tab-delimited format.
Return the data and metadata as a string in a tab-delimited format.
If variable_header is True, include variable names and type. If sample_header is True, include sample names. Both are True by default.
Parse the string representation of a dataset and return a Dataset instance.
See the documentation for fromfile() for information about file format.
Parse file and return a Dataset instance.
The data file is expected to conform to the following format:
- Comment lines begin with '#' and are ignored.
- The first non-comment line must specify variable annotations separated by tab characters.
- Data lines specify the data values separated by tab characters.
- Data lines can include sample names.
A data value specifies the observed numeric value, whether it is missing, and whether it represents an intervention:
- An 'x' or 'X' indicates that the value is missing.
- A '!' before or after the numeric value indicates an intervention.
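The data-value rules above could be decoded roughly as follows. parse_value is a hypothetical helper; returning None for a missing value is an assumption of this sketch, and the library may store a different placeholder.

```python
def parse_value(token):
    """Return (value, is_missing, is_intervention) for one data field."""
    token = token.strip()
    # A '!' on either side of the value marks an intervention.
    is_intervention = token.startswith("!") or token.endswith("!")
    token = token.strip("!")
    if token in ("x", "X"):
        # Assumption: None is used as the placeholder for a missing value.
        return None, True, is_intervention
    return float(token), False, is_intervention


plain = parse_value("3.5")
before = parse_value("!2")
after = parse_value("2!")
missing = parse_value("X")
```

A full parser would split each data line on tab characters first and then apply a function like this to every field.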
Variable annotations specify the name of the variable and, optionally, the data type.
Examples include:
- Foo : just the variable name
- Foo,continuous : Foo is a continuous variable
- Foo,discrete(3) : Foo is a discrete variable with an arity of 3
- Foo,class(normal,cancer) : Foo is a class variable with an arity of 2 and values of either normal or cancer
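As an illustration of the annotation syntax, the sketch below parses the four forms listed above with a regular expression. parse_annotation and the dictionary it returns are hypothetical; the library represents variables with its own annotation classes.

```python
import re

# name, optionally followed by ",type" and an optional "(arguments)" part
_ANNOTATION = re.compile(
    r"^(?P<name>[^,]+)(?:,(?P<kind>\w+)(?:\((?P<args>[^)]*)\))?)?$"
)


def parse_annotation(text):
    """Parse one variable annotation into a plain dictionary."""
    match = _ANNOTATION.match(text.strip())
    if match is None:
        raise ValueError("cannot parse variable annotation: %r" % text)
    name, kind, args = match.group("name", "kind", "args")
    info = {"name": name.strip(), "type": kind, "arity": None, "labels": None}
    if kind == "discrete" and args:
        info["arity"] = int(args)
    elif kind == "class" and args:
        info["labels"] = [label.strip() for label in args.split(",")]
        info["arity"] = len(info["labels"])   # one value per class label
    return info


header = "Foo\tBar,continuous\tBaz,discrete(3)\tQux,class(normal,cancer)"
parsed = [parse_annotation(field) for field in header.split("\t")]
```

The header line is split on tab characters before each field is parsed, mirroring the file format described above.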
Create a Dataset from the configuration information.
Loads data and discretizes (if requested) based on configuration parameters.
Merges multiple datasets.
datasets should be a list of Dataset objects. axis should be either 'variables' or 'samples' and determines how the datasets are merged.
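One plausible reading of the two axis values is sketched below on nested lists. It assumes that 'samples' stacks the rows of datasets sharing the same variables and that 'variables' places the columns of datasets sharing the same samples side by side. Both the interpretation and the merge_tables name are assumptions of this sketch, not confirmed behaviour.

```python
def merge_tables(tables, axis):
    """Merge 2D nested lists along 'samples' (rows) or 'variables' (columns)."""
    if not tables:
        raise ValueError("need at least one table to merge")
    if axis == "samples":
        # Stack rows: every row must have the same number of variables.
        widths = {len(row) for table in tables for row in table}
        if len(widths) > 1:
            raise ValueError("tables differ in number of variables")
        return [list(row) for table in tables for row in table]
    if axis == "variables":
        # Join columns: every table must have the same number of samples.
        n_rows = len(tables[0])
        if any(len(table) != n_rows for table in tables):
            raise ValueError("tables differ in number of samples")
        return [[v for table in tables for v in table[i]] for i in range(n_rows)]
    raise ValueError("axis should be either 'variables' or 'samples'")


more_samples = merge_tables([[[1, 2]], [[3, 4]]], "samples")
more_variables = merge_tables([[[1, 2], [3, 4]], [[5], [6]]], "variables")
```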
Additional information about a variable.
A variable from a discrete domain.
A variable from a continuous domain.
A labeled, discrete variable.
Additional information about a sample.
Error encountered while parsing an ill-formed datafile.
Error encountered when the datafile specifies an incorrect variable arity.
If variable arity is specified, it should be greater than or equal to the number of unique observation values for the variable.
Error with a class variable.