data – Pebl Dataset

This module provides classes and functions for working with datasets.

Configuration Parameters

data.discretize

Number of bins used to discretize data. Specify 0 to indicate that data should not be discretized. default=0

data.filename

File to read data from. default=None

data.text

The text of a dataset included in the config file. default=''

Dataset class

class pebl.data.Dataset(observations, missing=None, interventions=None, variables=None, samples=None, skip_stats=False)

Create a pebl Dataset instance.

A Dataset consists of the following data structures which are all numpy.ndarray instances:

  • observations: a 2D matrix of observed values.
    • dimension 1 is over samples, dimension 2 is over variables.
    • observations[i,j] is the observed value for the jth variable in the ith sample.
  • missing: a 2D binary mask for missing values
    • missing[i,j] = 1 IFF observations[i,j] is missing
  • interventions: a 2D binary mask for interventions
    • interventions[i,j] = 1 IFF the jth variable was intervened upon in the ith sample.
  • variables, samples: 1D arrays of variable or sample annotations

This class provides a few public methods to manipulate datasets; one can also use numpy functions/methods directly.

Required/Default values:

  • The only required argument is observations (a 2D numpy array).
  • If missing or interventions are not specified, they are assumed to be all zeros (no missing values and no interventions).
  • If variables or samples are not specified, appropriate Variable or Sample annotations are created with only the name attribute.
Note:
If you alter Dataset.interventions or Dataset.missing, you must call Dataset._calc_stats(). This is a terrible hack but it speeds up pebl when used with datasets without interventions or missing values (a common case).
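
A minimal sketch of constructing a Dataset directly from numpy arrays (the values are illustrative):

    import numpy
    from pebl import data

    # Three samples (rows) over two variables (columns); values are illustrative.
    obs = numpy.array([[0, 1],
                       [1, 1],
                       [0, 0]])

    # Mark variable 0 as intervened upon in sample 1.
    interventions = numpy.zeros(obs.shape, dtype=int)
    interventions[1, 0] = 1

    d = data.Dataset(obs, interventions=interventions)
    # d.shape == (3, 2) and d.has_interventions is True

    # If interventions or missing are altered later, recalculate the cached
    # statistics as described in the note above.
    d.interventions[2, 1] = 1
    d._calc_stats()
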
check_arities()

Checks whether the specified arity >= the number of unique observed values.

The check is only performed for discrete variables.

If this check fails, the CPT and other data structures would be invalid, so we raise an error while loading the data. Fail early and explicitly!

discretize(includevars=None, excludevars=[], numbins=3)

Discretize (or bin) the data in-place.

This method is just an alias for pebl.discretizer.maximum_entropy_discretizer(). See the module documentation for pebl.discretizer for more information.
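
A brief sketch, using fromstring() (documented under Functions below) to build a small continuous dataset and bin it (all names and values are illustrative):

    from pebl import data

    # Two continuous variables over five samples (values are illustrative).
    txt = "\n".join([
        "varA,continuous\tvarB,continuous",
        "0.1\t2.5",
        "0.4\t1.9",
        "0.9\t3.2",
        "0.7\t2.1",
        "0.3\t2.8",
    ])
    d = data.fromstring(txt)

    # Bin both variables into 3 bins, in place.
    d.discretize(numbins=3)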

has_interventions

Whether the dataset has any interventions.

has_missing

Whether the dataset has any missing values.

shape

The shape of the dataset as (number of samples, number of variables).

subset(variables=None, samples=None)

Returns a subset of the dataset (and metadata).

Specify the variables and samples for creating a subset of the data. variables and samples should be lists of ids. If either is not specified, all variables or samples are included.

Some examples:

  • d.subset([3], [4])
  • d.subset([3,1,2])
  • d.subset(samples=[5,2,7,1])

Note: order matters! d.subset([3,1,2]) != d.subset([1,2,3])

subset_byname(variables=None, samples=None)

Returns a subset of the dataset (and metadata).

Same as Dataset.subset() except that variables and samples can be specified by their names.

Some examples:

  • d.subset(variables=['shh', 'genex'])
  • d.subset(samples=["control%d" % i for i in xrange(10)])

tofile(filename, *args, **kwargs)

Write the data and metadata to file in a tab-delimited format.

tostring(linesep='\n', variable_header=True, sample_header=True)

Return the data and metadata as a string in a tab-delimited format.

If variable_header is True, include variable names and types. If sample_header is True, include sample names. Both are True by default.
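
A short sketch of both methods (the data values and filename are illustrative):

    import numpy
    from pebl import data

    d = data.Dataset(numpy.array([[0, 1],
                                  [1, 0]]))   # values are illustrative

    # Tab-delimited string with variable and sample headers (the defaults).
    s = d.tostring()

    # Write the same representation to a file.
    d.tofile("mydata.txt")   # hypothetical filename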

Functions

pebl.data.fromstring(stringrep, fieldsep='\t')

Parse the string representation of a dataset and return a Dataset instance.

See the documentation for fromfile() for information about file format.

pebl.data.fromfile(filename)

Parse file and return a Dataset instance.

The data file is expected to conform to the following format:

  • Comment lines begin with '#' and are ignored.
  • The first non-comment line must specify variable annotations separated by tab characters.
  • Data lines specify the data values separated by tab characters.
  • Data lines can include sample names.

A data value specifies the observed numeric value, whether it is missing, and whether it represents an intervention:

  • An 'x' or 'X' indicates that the value is missing
  • A '!' before or after the numeric value indicates an intervention

Variable annotations specify the name of the variable and, optionally, the data type.

Examples include:

  • Foo : just the variable name
  • Foo,continuous : Foo is a continuous variable
  • Foo,discrete(3) : Foo is a discrete variable with an arity of 3
  • Foo,class(normal,cancer) : Foo is a class variable with an arity of 2 and values of either normal or cancer
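
For instance, a small dataset in this format can be parsed with fromstring() (all names and values are illustrative):

    from pebl import data

    # The first non-comment line gives the variable annotations; 'x' marks a
    # missing value and '!' marks an intervention.
    txt = "\n".join([
        "# a toy dataset",
        "shh,discrete(2)\tgenex,continuous",
        "0\t1.2",
        "1!\t0.9",   # shh was intervened upon in this sample
        "x\t1.5",    # the shh observation is missing in this sample
    ])

    d = data.fromstring(txt)
    # d.shape == (3, 2); d.has_missing and d.has_interventions are both True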

pebl.data.fromconfig()

Create a Dataset from the configuration information.

Loads data and discretizes (if requested) based on configuration parameters.
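
A sketch using the configuration parameters listed at the top of this page; setting them programmatically with pebl.config.set() is an assumption about the config module, and the filename is hypothetical:

    from pebl import config, data

    config.set('data.filename', 'mydata.txt')   # hypothetical file
    config.set('data.discretize', 3)            # bin the data into 3 bins on load
    d = data.fromconfig()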

pebl.data.merge(datasets, axis=None)

Merges multiple datasets.

datasets should be a list of Dataset objects. axis should be either 'variables' or 'samples' and determines how the datasets are merged.
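
A minimal sketch of merging two datasets over the same samples (the observations are illustrative):

    import numpy
    from pebl import data

    d1 = data.Dataset(numpy.array([[0, 1],
                                   [1, 0]]))   # 2 samples x 2 variables
    d2 = data.Dataset(numpy.array([[1],
                                   [0]]))      # 2 samples x 1 variable

    # Stack along the variables axis; the merged dataset has shape (2, 3).
    merged = data.merge([d1, d2], axis='variables')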

Variable and sample annotations

class pebl.data.Variable(name, *args)

Additional information about a variable.

class pebl.data.DiscreteVariable(name, param)

A variable from a discrete domain.

class pebl.data.ContinuousVariable(name, param)

A variable from a continuous domain.

class pebl.data.ClassVariable(name, param)

A labeled, discrete variable.

class pebl.data.Sample(name, *args)

Additional information about a sample.

Exceptions

exception pebl.data.ParsingError

Error encountered while parsing an ill-formed datafile.

exception pebl.data.IncorrectArityError(errors)

Error encountered when the datafile specifies an incorrect variable arity.

If variable arity is specified, it should be at least the number of unique observation values for the variable.

exception pebl.data.ClassVariableError

Error with a class variable.