pgmlearner

This module provides tools to generate Bayesian networks that are “learned” from a data set. The learning process involves finding the Bayesian network that most accurately models data given as input – in other words, finding the Bayesian network that makes the data set most likely. There are two major parts of Bayesian network learning: structure learning and parameter learning. Structure learning means finding the graph that most accurately depicts the dependencies detected in the data. Parameter learning means adjusting the parameters of the CPDs in a graph skeleton to most accurately model the data. This module has tools for both of these tasks.

class libpgm.pgmlearner.PGMLearner[source]

This class provides tools for learning Bayesian networks from data. It contains the discrete_mle_estimateparams, lg_mle_estimateparams, discrete_constraint_estimatestruct, lg_constraint_estimatestruct, discrete_condind, discrete_estimatebn, and lg_estimatebn methods.

discrete_mle_estimateparams(graphskeleton, data)[source]

Estimate parameters for a discrete Bayesian network with a structure given by graphskeleton in order to maximize the probability of data given by data. This function takes the following arguments:

  1. graphskeleton – An instance of the GraphSkeleton class containing vertex and edge data.

  2. data – A list of dicts containing samples from the network in {vertex: value} format. Example:

    [
        {
            'Grade': 'B',
            'SAT': 'lowscore',
            ...
        },
        ...
    ]

This function normalizes the distribution of a node’s outcomes over each combination of its parents’ outcomes, creating an estimated tabular conditional probability distribution for each node. It then instantiates a DiscreteBayesianNetwork instance based on the graphskeleton, modifies that instance’s Vdata attribute to reflect the estimated CPDs, and returns the instance.

The Vdata attribute instantiated is in the format seen in the discrete Bayesian network input file example, as described in discretebayesiannetwork.
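For intuition, the maximum-likelihood step amounts to counting and normalizing. A minimal sketch of that idea, assuming discrete samples in the format above (the helper name is ours, not part of libpgm):

from collections import defaultdict

def mle_cpd(data, node, parents):
    # Estimate P(node | parents) by counting outcomes per parent
    # assignment and normalizing -- an illustration only; libpgm's
    # own routine also fills in the Vdata bookkeeping fields.
    counts = defaultdict(lambda: defaultdict(int))
    for sample in data:
        key = tuple(sample[p] for p in parents)   # one parent assignment
        counts[key][sample[node]] += 1
    cpd = {}
    for key, outcomes in counts.items():
        total = float(sum(outcomes.values()))
        cpd[key] = dict((outcome, n / total) for outcome, n in outcomes.items())
    return cpd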

Usage example: this would learn parameters from a set of 200 discrete samples:

import json

from libpgm.nodedata import NodeData
from libpgm.graphskeleton import GraphSkeleton
from libpgm.discretebayesiannetwork import DiscreteBayesianNetwork
from libpgm.pgmlearner import PGMLearner

# generate some data to use
nd = NodeData()
nd.load("../tests/unittestdict.txt")    # an input file
skel = GraphSkeleton()
skel.load("../tests/unittestdict.txt")
skel.toporder()
bn = DiscreteBayesianNetwork(skel, nd)
data = bn.randomsample(200)

# instantiate my learner 
learner = PGMLearner()

# estimate parameters from data and skeleton
result = learner.discrete_mle_estimateparams(skel, data)

# output
print(json.dumps(result.Vdata, indent=2))

lg_mle_estimateparams(graphskeleton, data)[source]

Estimate parameters for a linear Gaussian Bayesian network with a structure given by graphskeleton in order to maximize the probability of data given by data. This function takes the following arguments:

  1. graphskeleton – An instance of the GraphSkeleton class containing vertex and edge data.

  2. data – A list of dicts containing samples from the network in {vertex: value} format. Example:

    [
        {
            'Grade': 74.343,
            'Intelligence': 29.545,
            ...
        },
        ...
    ]

The algorithm used to calculate the linear Gaussian parameters is beyond the scope of this documentation – for a full explanation, cf. Koller et al. 729. After the parameters are calculated, the program instantiates an LGBayesianNetwork instance based on the graphskeleton, modifies that instance’s Vdata attribute to reflect the estimated CPDs, and returns the instance.

The Vdata attribute instantiated is in the format seen in the linear Gaussian Bayesian network input file example, as described in lgbayesiannetwork.
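For intuition, each node is modeled as a linear function of its parents plus Gaussian noise, and the maximum-likelihood coefficients can be recovered by ordinary least squares. A minimal numpy sketch of that idea (ours, not libpgm’s internal routine):

import numpy as np

def lg_params(data, node, parents):
    # Fit node = beta0 + sum(beta_i * parent_i) + Gaussian noise.
    # Illustrative only; libpgm stores the analogous quantities
    # in each node's Vdata entry.
    y = np.array([sample[node] for sample in data])
    columns = [np.ones(len(data))]
    columns += [np.array([sample[p] for sample in data]) for p in parents]
    X = np.column_stack(columns)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    variance = np.mean((y - X.dot(beta)) ** 2)   # MLE noise variance
    return beta[0], list(beta[1:]), variance     # intercept, coefficients, variance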

Usage example: this would learn parameters from a set of 200 linear Gaussian samples:

import json

from libpgm.nodedata import NodeData
from libpgm.graphskeleton import GraphSkeleton
from libpgm.lgbayesiannetwork import LGBayesianNetwork
from libpgm.pgmlearner import PGMLearner

# generate some data to use
nd = NodeData()
nd.load("../tests/unittestlgdict.txt")    # an input file
skel = GraphSkeleton()
skel.load("../tests/unittestdict.txt")
skel.toporder()
lgbn = LGBayesianNetwork(skel, nd)
data = lgbn.randomsample(200)

# instantiate my learner 
learner = PGMLearner()

# estimate parameters
result = learner.lg_mle_estimateparams(skel, data)

# output
print(json.dumps(result.Vdata, indent=2))

discrete_constraint_estimatestruct(data, pvalparam=0.05, indegree=1)[source]

Learn a Bayesian network structure from discrete data given by data, using constraint-based approaches. This function first calculates all the independencies and conditional independencies present between variables in the data. To detect dependencies, it uses the discrete_condind method on each pair of variables, conditioned on sets of other variables of size indegree or smaller, to generate a chi-squared result and a p-value. If this p-value is less than pvalparam, the pair of variables is considered dependent conditioned on that variable set. Once all true dependencies – pairs of variables that are dependent no matter what set they are conditioned on – are found, the algorithm uses these dependencies to construct a directed acyclic graph. It returns this DAG as a GraphSkeleton instance.

Arguments:
  1. data – An array of dicts containing samples from the network in {vertex: value} format. Example:

    [
        {
            'Grade': 'B',
            'SAT': 'lowscore',
            ...
        },
        ...
    ]
  2. pvalparam – (Optional, default is 0.05) The p-value below which to consider something significantly unlikely. A common number used is 0.05. This is passed to discrete_condind when it is called.

  3. indegree – (Optional, default is 1) The upper bound on the size of a witness set (see Koller et al. 85). If this is larger than 1, a huge number of samples in data is required to avoid a divide-by-zero error.
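For intuition, the dependency-detection phase can be pictured as the following loop (a simplified sketch, not libpgm’s actual implementation; orienting the surviving edges into a DAG is a separate step that this sketch omits):

from itertools import combinations

def find_dependencies(learner, data, pvalparam=0.05, indegree=1):
    # Keep the variable pairs that test dependent under every
    # conditioning set of size <= indegree.
    variables = list(data[0].keys())
    dependencies = []
    for X, Y in combinations(variables, 2):
        others = [v for v in variables if v != X and v != Y]
        separated = False
        for size in range(indegree + 1):
            for U in combinations(others, size):
                chi, pval, witness = learner.discrete_condind(data, X, Y, list(U))
                if pval >= pvalparam:    # cannot reject independence given U
                    separated = True
        if not separated:
            dependencies.append((X, Y))  # dependent no matter the witness set
    return dependencies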

Usage example: this would learn structure from a set of 8000 discrete samples:

import json

from libpgm.nodedata import NodeData
from libpgm.graphskeleton import GraphSkeleton
from libpgm.discretebayesiannetwork import DiscreteBayesianNetwork
from libpgm.pgmlearner import PGMLearner

# generate some data to use
nd = NodeData()
nd.load("../tests/unittestdict.txt")    # an input file
skel = GraphSkeleton()
skel.load("../tests/unittestdict.txt")
skel.toporder()
bn = DiscreteBayesianNetwork(skel, nd)
data = bn.randomsample(8000)

# instantiate my learner 
learner = PGMLearner()

# estimate structure
result = learner.discrete_constraint_estimatestruct(data)

# output
print(json.dumps(result.E, indent=2))

lg_constraint_estimatestruct(data, pvalparam=0.05, bins=10, indegree=1)[source]

Learn a Bayesian network structure from linear Gaussian data given by data using constraint-based approaches. This function works by discretizing the linear Gaussian data into bins number of bins, and running the discrete_constraint_estimatestruct method on that discrete data with pvalparam and indegree as arguments. It returns the GraphSkeleton instance that discrete_constraint_estimatestruct produces.

Arguments:
  1. data – An array of dicts containing samples from the network in {vertex: value} format. Example:

    [
        {
            'Grade': 78.3223,
            'SAT': 56.33,
            ...
        },
        ...
    ]
  2. pvalparam – (Optional, default is 0.05) The p-value below which to consider something significantly unlikely. A common number used is 0.05.

  3. bins – (Optional, default is 10) The number of bins to discretize the data into. The method is to find the highest and lowest value, divide that interval uniformly into the given number of bins, and place each data point in its bin. This number must be chosen carefully in light of the number of trials: there must be at least 5 trials in every bin, with more if the indegree is increased. (A sketch of this binning step appears after this list.)

  4. indegree – (Optional, default is 1) The upper bound on the size of a witness set (see Koller et al. 85). If this is larger than 1, a huge number of trials is required to avoid a divide-by-zero error.

The number of bins and indegree must be chosen carefully based on the size and nature of the data set. Too many bins will lead to not enough data per bin, while too few bins will lead to dependencies not getting noticed.
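A minimal sketch of the uniform binning described above (ours, for illustration; libpgm’s internal discretization may differ in details):

def discretize(data, bins=10):
    # Map each continuous value to one of `bins` uniform bins,
    # computed per variable from that variable's observed range.
    keys = list(data[0].keys())
    lo = dict((k, min(sample[k] for sample in data)) for k in keys)
    hi = dict((k, max(sample[k] for sample in data)) for k in keys)
    binned = []
    for sample in data:
        entry = {}
        for k in keys:
            width = (hi[k] - lo[k]) / float(bins)
            if width == 0:
                entry[k] = 0    # constant variable: one bin
            else:
                # values equal to the maximum fall into the top bin
                entry[k] = min(int((sample[k] - lo[k]) / width), bins - 1)
        binned.append(entry)
    return binned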

Usage example: this would learn structure from a set of 8000 linear Gaussian samples:

import json

from libpgm.nodedata import NodeData
from libpgm.graphskeleton import GraphSkeleton
from libpgm.lgbayesiannetwork import LGBayesianNetwork
from libpgm.pgmlearner import PGMLearner

# generate some data to use
nd = NodeData()
nd.load("../tests/unittestdict.txt")    # an input file
skel = GraphSkeleton()
skel.load("../tests/unittestdict.txt")
skel.toporder()
lgbn = LGBayesianNetwork(skel, nd)
data = lgbn.randomsample(8000)

# instantiate my learner 
learner = PGMLearner()

# estimate structure
result = learner.lg_constraint_estimatestruct(data)

# output
print(json.dumps(result.E, indent=2))

discrete_condind(data, X, Y, U)[source]

Test the independence of a variable X and a variable Y in a discrete data set given by data, where the independence is conditioned on a set of variables given by U. This method takes as its null hypothesis that X and Y are conditionally independent given U, and thus that:

\[P(X, Y, U) = P(U) \cdot P(X|U) \cdot P(Y|U) \]

It tests the deviance of the data from this null hypothesis, returning the result of a chi-square test and a p-value.
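In standard Pearson form, such a statistic compares the observed count \(N[x, y, u]\) of each joint outcome against the count \(E[x, y, u]\) expected under the null hypothesis:

\[\chi^2 = \sum_{x, y, u} \frac{(N[x, y, u] - E[x, y, u])^2}{E[x, y, u]} \]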

Arguments:
  1. data – An array of dicts containing samples from the network in {vertex: value} format. Example:

    [
        {
            'Grade': 'B',
            'SAT': 'lowscore',
            ...
        },
        ...
    ]
  2. X – A variable whose dependence on Y we are testing given U.

  3. Y – A variable whose dependence on X we are testing given U.

  4. U – A list of variables that are given.

Returns:
  1. chi – The result of the chi-squared test on the data. This is a measure of the deviance of the actual distribution of X and Y given U from the expected distribution of X and Y given U. Since the null hypothesis is that X and Y are independent given U, the expected distribution is that \(P(X, Y, U) = P(U) \cdot P(X|U) \cdot P(Y|U)\).

  2. pval – The p-value of the test, meaning the probability of attaining a chi-squared result as extreme as or more extreme than the one found, assuming that the null hypothesis is true. (For example, a p-value of .05 means that if X and Y were independent given U, the chance of getting a chi-squared result this high or higher would be .05.)

  3. U – The ‘witness’ of X and Y’s independence. This is the set of variables that, when known, leaves X and Y independent.

For more information see Koller et al. 790.
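Usage example: following the pattern of the examples above, this would test whether 'Grade' and 'SAT' are independent given 'Intelligence' (variable names taken from the example input file; they are illustrative):

from libpgm.nodedata import NodeData
from libpgm.graphskeleton import GraphSkeleton
from libpgm.discretebayesiannetwork import DiscreteBayesianNetwork
from libpgm.pgmlearner import PGMLearner

# generate some data to use
nd = NodeData()
nd.load("../tests/unittestdict.txt")    # an input file
skel = GraphSkeleton()
skel.load("../tests/unittestdict.txt")
skel.toporder()
bn = DiscreteBayesianNetwork(skel, nd)
data = bn.randomsample(8000)

# instantiate my learner
learner = PGMLearner()

# test conditional independence
chi, pval, witness = learner.discrete_condind(data, "Grade", "SAT", ["Intelligence"])

# output
print("chi-squared: " + str(chi))
print("p-value: " + str(pval))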

discrete_estimatebn(data, pvalparam=0.05, indegree=1)[source]

Fully learn a Bayesian network from discrete data given by data. This function combines the discrete_constraint_estimatestruct method (where it passes in the pvalparam and indegree arguments) with the discrete_mle_estimateparams method. It returns a complete DiscreteBayesianNetwork class instance learned from the data.

Arguments:
  1. data – An array of dicts containing samples from the network in {vertex: value} format. Example:

    [
        {
            'Grade': 'B',
            'SAT': 'lowscore',
            ...
        },
        ...
    ]
  2. pvalparam – The p-value below which to consider something significantly unlikely. A common number used is 0.05.

  3. indegree – The upper bound on the size of a witness set (see Koller et al. 85). If this is larger than 1, a huge number of trials is required to avoid a divide-by-zero error.
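Conceptually, the call chains the two methods it combines; roughly (a sketch, assuming data and learner as in the examples above):

# roughly what discrete_estimatebn does (sketch):
skel = learner.discrete_constraint_estimatestruct(data, pvalparam=0.05, indegree=1)
skel.toporder()    # skeletons in this module are topologically ordered before use
result = learner.discrete_mle_estimateparams(skel, data)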

lg_estimatebn(data, pvalparam=0.05, bins=10, indegree=1)[source]

Fully learn a Bayesian network from linear Gaussian data given by data. This function combines the lg_constraint_estimatestruct method (where it passes in the pvalparam, bins, and indegree arguments) with the lg_mle_estimateparams method. It returns a complete LGBayesianNetwork class instance learned from the data.

Arguments:
  1. data – An array of dicts containing samples from the network in {vertex: value} format. Example:

    [
        {
            'Grade': 75.23423,
            'SAT': 873.42342,
            ...
        },
        ...
    ]
  2. pvalparam – The p-value below which to consider something significantly unlikely. A common number used is 0.05.

  3. bins – The number of bins to discretize the data into. This is passed to lg_constraint_estimatestruct when it is called.

  4. indegree – The upper bound on the size of a witness set (see Koller et al. 85). If this is larger than 1, a huge number of trials is required to avoid a divide-by-zero error.

Usage example: this would learn entire Bayesian networks from sets of 8000 data points:

import json

from libpgm.nodedata import NodeData
from libpgm.graphskeleton import GraphSkeleton
from libpgm.lgbayesiannetwork import LGBayesianNetwork
from libpgm.discretebayesiannetwork import DiscreteBayesianNetwork
from libpgm.pgmlearner import PGMLearner

# LINEAR GAUSSIAN

# generate some data to use
nd = NodeData()
nd.load("../tests/unittestlgdict.txt")    # an input file
skel = GraphSkeleton()
skel.load("../tests/unittestdict.txt")
skel.toporder()
lgbn = LGBayesianNetwork(skel, nd)
data = lgbn.randomsample(8000)

# instantiate my learner 
learner = PGMLearner()

# learn bayesian network
result = learner.lg_estimatebn(data)

# output
print(json.dumps(result.E, indent=2))
print(json.dumps(result.Vdata, indent=2))

# DISCRETE

# generate some data to use
nd = NodeData()
nd.load("../tests/unittestdict.txt")    # an input file
skel = GraphSkeleton()
skel.load("../tests/unittestdict.txt")
skel.toporder()
bn = DiscreteBayesianNetwork(skel, nd)
data = bn.randomsample(8000)

# instantiate my learner 
learner = PGMLearner()

# learn bayesian network
result = learner.discrete_estimatebn(data)

# output
print(json.dumps(result.E, indent=2))
print(json.dumps(result.Vdata, indent=2))
