How to define TabularCPD and LinearGaussianCPD

TabularCPD is used for discrete variables inside a DiscreteBayesianNetwork, and LinearGaussianCPD is used for continuous variables inside a LinearGaussianBayesianNetwork.

In this tutorial, we will demonstrate how to define each CPD.

TabularCPD for discrete variables

In a tabular CPD, the probability distribution of a discrete variable is given as a table. Let us start with examples of independent variables.

[1]:
from pgmpy.factors.discrete import TabularCPD

# The key in state_names must match the variable name exactly ("coin").
cpd_coin_fair = TabularCPD(variable="coin", variable_card=2, values=[[0.5], [0.5]], state_names={'coin': ['Head', 'Tail']})
cpd_coin_biased = TabularCPD(variable="coin", variable_card=2, values=[[0.9], [0.1]], state_names={'coin': ['Head', 'Tail']})

A coin flip has two possible outcomes, so we pass variable_card=2. values gives the probability of each outcome, one row per outcome, and the entries must sum to 1. state_names is optional; it assigns a name to each outcome, keyed by the variable name.

[2]:
cpd_smoke = TabularCPD(variable="Smoker", variable_card=2, values=[[0.3], [0.7]], state_names={'Smoker': ['Non-smoker', 'Smoker']})
cpd_pollution = TabularCPD(variable="Pollution", variable_card=3, values=[[0.7], [0.29], [0.01]], state_names={'Pollution': ['Clean', 'Bad', 'Fatal']})

The Pollution and Smoker variables do not depend on other variables, and both take categorical values. variable_card is the number of categories the variable can take. For example, Smoker is binary because variable_card=2, while Pollution has three categories. values is an array holding one probability per category. Note that these probabilities sum to 1.
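The two constraints described above (one row per category, probabilities summing to 1) can be checked by hand before building a CPD. The following is a minimal sketch in plain Python; check_marginal is a hypothetical helper written for illustration and is not part of pgmpy.

```python
import math


def check_marginal(values, variable_card):
    """Return True if `values` looks like a valid table for a parentless variable.

    Hypothetical helper, not a pgmpy function. `values` uses the same shape
    as in the cells above: one single-element row per category.
    """
    if len(values) != variable_card:
        return False  # need exactly one row per category
    if any(len(row) != 1 for row in values):
        return False  # no evidence, so exactly one column
    probs = [row[0] for row in values]
    if any(p < 0 or p > 1 for p in probs):
        return False  # each entry must be a probability
    return math.isclose(sum(probs), 1.0)


smoker_ok = check_marginal([[0.3], [0.7]], variable_card=2)
pollution_ok = check_marginal([[0.7], [0.29], [0.01]], variable_card=3)
bad = check_marginal([[0.7], [0.2]], variable_card=2)  # sums to 0.9
print(smoker_ok, pollution_ok, bad)
```

math.isclose is used instead of == because sums of floating-point probabilities are rarely exactly 1.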

TabularCPD with multiple evidence variables

We next consider another discrete variable Cancer, with a model assumption that it depends on Pollution and Smoker.

[3]:
cpd_cancer = TabularCPD(
    variable="Cancer",
    variable_card=2,
    values=[[0.20, 0.15, 0.03, 0.05, 0.001, 0.02],
            [0.80, 0.85, 0.97, 0.95, 0.999, 0.98]],
    evidence=["Smoker", "Pollution"],
    evidence_card=[2, 3],
)

For the Cancer variable we define another TabularCPD, this time with evidence. evidence lists the parent variables and evidence_card gives the cardinality of each one, so there are 2*3=6 combinations of evidence values. values therefore has 6 columns, one conditional distribution of Cancer per combination, and each column sums to 1. The columns are grouped by the first evidence variable first, with the last evidence variable varying fastest: the first three columns correspond to non-smokers (one per Pollution level) and the last three to smokers.
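To make the column layout concrete, here is a small plain-Python sketch that looks up the column for a given evidence combination. It assumes the state order used earlier for Smoker and Pollution (index 0 is 'Non-smoker' and 'Clean' respectively); column_index and cancer_column are hypothetical helpers, not pgmpy functions.

```python
smoker_states = ["Non-smoker", "Smoker"]
pollution_states = ["Clean", "Bad", "Fatal"]

# Same table as passed to TabularCPD above: rows are Cancer states,
# columns are (Smoker, Pollution) combinations.
values = [
    [0.20, 0.15, 0.03, 0.05, 0.001, 0.02],
    [0.80, 0.85, 0.97, 0.95, 0.999, 0.98],
]


def column_index(smoker, pollution):
    """Column for an evidence combination; the last evidence variable varies fastest."""
    s = smoker_states.index(smoker)
    p = pollution_states.index(pollution)
    return s * len(pollution_states) + p


def cancer_column(smoker, pollution):
    """Conditional distribution of Cancer for one evidence combination."""
    col = column_index(smoker, pollution)
    return [row[col] for row in values]


print(column_index("Smoker", "Clean"))       # 3
print(cancer_column("Non-smoker", "Fatal"))  # [0.03, 0.97]
```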

We next consider another discrete variable D, with a model assumption that it depends on A, B and C.

[4]:
cpd = TabularCPD(
    variable="D",
    variable_card=2,
    values=[[0.20, 0.15, 0.93, 0.05, 0.001, 0.02, 0.10, 0.25 ],
            [0.80, 0.85, 0.07, 0.95, 0.999, 0.98, 0.90, 0.75 ]],
    evidence=["A", "B", "C"],
    evidence_card=[2, 2, 2],
)
[5]:
print(cpd)
+------+------+------+------+------+-------+------+------+------+
| A    | A(0) | A(0) | A(0) | A(0) | A(1)  | A(1) | A(1) | A(1) |
+------+------+------+------+------+-------+------+------+------+
| B    | B(0) | B(0) | B(1) | B(1) | B(0)  | B(0) | B(1) | B(1) |
+------+------+------+------+------+-------+------+------+------+
| C    | C(0) | C(1) | C(0) | C(1) | C(0)  | C(1) | C(0) | C(1) |
+------+------+------+------+------+-------+------+------+------+
| D(0) | 0.2  | 0.15 | 0.93 | 0.05 | 0.001 | 0.02 | 0.1  | 0.25 |
+------+------+------+------+------+-------+------+------+------+
| D(1) | 0.8  | 0.85 | 0.07 | 0.95 | 0.999 | 0.98 | 0.9  | 0.75 |
+------+------+------+------+------+-------+------+------+------+

Each column of values corresponds to a different combination of evidence states. With the evidence variables listed in the order (A, B, C) as above, the columns follow the lexicographic order of their states:

(A(0)B(0)C(0), A(0)B(0)C(1), A(0)B(1)C(0), A(0)B(1)C(1), ...)

It is as though we loop over A, then B, then C in nested loops, with A as the outermost loop and C as the innermost, so C varies fastest.
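The nested-loop ordering can be reproduced with itertools.product from the standard library, whose rightmost argument also varies fastest. This sketch only generates the column labels; it does not call pgmpy.

```python
from itertools import product

evidence = ["A", "B", "C"]
evidence_card = [2, 2, 2]

# product(range(2), range(2), range(2)) behaves like three nested for-loops,
# with the last range as the innermost loop.
columns = list(product(*(range(card) for card in evidence_card)))

labels = [
    "".join(f"{name}({state})" for name, state in zip(evidence, combo))
    for combo in columns
]

for index, label in enumerate(labels):
    print(index, label)
```

The printed labels match the header rows of the table above, column by column.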

CPD with random values

It is also possible to create a random tabular CPD.

[6]:
cpd_coin_random = TabularCPD.get_random(variable="coin", cardinality={'coin': 2}, state_names={'coin': ['Head', 'Tail']})
cpd_coin_random.get_values()
[6]:
array([[0.55203474],
       [0.44796526]])

The probabilities can be retrieved with get_values, and they again sum to 1. If cardinality is omitted, the variable is assumed to be binary.

When there is evidence, the cardinality of each evidence variable must also be specified.

[7]:
cpd_cancer_random = TabularCPD.get_random(
    variable="Cancer",
    cardinality = {"Cancer": 2, "Smoker": 2, "Pollution": 3},
    evidence=["Smoker", "Pollution"],
)
cpd_cancer_random.get_values()
[7]:
array([[0.30404579, 0.99344749, 0.47933983, 0.58564998, 0.54269003,
        0.28164646],
       [0.69595421, 0.00655251, 0.52066017, 0.41435002, 0.45730997,
        0.71835354]])
[8]:
cpd_cancer_random.get_values().sum(axis = 0)
[8]:
array([1., 1., 1., 1., 1., 1.])

Note again that each column of the probability table sums to 1.
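One simple way to obtain a random table whose columns sum to 1 is to draw positive random numbers and divide each column by its sum. The sketch below illustrates that idea in plain Python; random_cpd_values is a hypothetical helper, and this is not claimed to be how get_random is implemented internally.

```python
import math
import random


def random_cpd_values(variable_card, evidence_card, seed=None):
    """Return a random table with one column per evidence combination.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    n_cols = math.prod(evidence_card)  # prod of an empty list is 1
    raw = [[rng.random() for _ in range(n_cols)] for _ in range(variable_card)]
    col_sums = [sum(row[c] for row in raw) for c in range(n_cols)]
    # Normalise each column so that it is a probability distribution.
    return [[row[c] / col_sums[c] for c in range(n_cols)] for row in raw]


table = random_cpd_values(variable_card=2, evidence_card=[2, 3], seed=0)
column_totals = [sum(row[c] for row in table) for c in range(len(table[0]))]
print(len(table), len(table[0]))
print(all(math.isclose(total, 1.0) for total in column_totals))
```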

LinearGaussianCPD for continuous variables

We now define a CPD for each continuous variable. LinearGaussianCPD assumes that each variable is normally distributed, with a mean that is a linear function of its parents. Assume that the Healthy and Wealthy variables have no parents (empty evidence).

[9]:
# Define the CPDs for the variables without parents.
from pgmpy.factors.continuous import LinearGaussianCPD
cpd_healthy = LinearGaussianCPD(variable="Healthy", beta=[4], std=2, evidence=[])
cpd_wealthy = LinearGaussianCPD(variable="Wealthy", beta=[2], std=3, evidence=[])

import pprint
pprint.pp(cpd_healthy)
pprint.pp(cpd_wealthy)
<LinearGaussianCPD: P(Healthy) = N(4; 2) at 0x15735ed20
<LinearGaussianCPD: P(Wealthy) = N(2; 3) at 0x1566c2930

Above, we defined Healthy as a normal distribution with mean 4 and standard deviation 2, and Wealthy as a normal distribution with mean 2 and standard deviation 3. (We assume more variation in people's wealth than in their health, and a larger mean for health than for wealth.)

[10]:
cpd_happy = LinearGaussianCPD(
    variable="Happy",
    beta=[1, 3, 2],
    std=5,
    evidence=["Healthy", "Wealthy"],
)
pprint.pp(cpd_happy)
<LinearGaussianCPD: P(Happy | Healthy, Wealthy) = N(3*Healthy + 2*Wealthy + 1; 5) at 0x154160050

The Happy variable has mean 3*Healthy + 2*Wealthy + 1. This formula is determined by beta and evidence: the first element of beta is the intercept (constant term), and each remaining element is the coefficient of the corresponding variable in evidence, in order. The standard deviation of Happy's normal distribution is set by std.
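To see how beta turns evidence values into a mean, the following plain-Python sketch computes the conditional mean of Happy and draws one joint sample using the parameters from the cells above. linear_gaussian_mean and sample_once are hypothetical helpers for illustration; they do not use pgmpy.

```python
import random


def linear_gaussian_mean(beta, evidence_values):
    """Mean of a linear Gaussian: beta[0] + sum(beta[i] * evidence_values[i-1])."""
    intercept, *coefficients = beta
    return intercept + sum(c * x for c, x in zip(coefficients, evidence_values))


def sample_once(rng):
    """Sample the parents first, then Happy given the sampled parents."""
    healthy = rng.gauss(4, 2)  # mean 4, standard deviation 2
    wealthy = rng.gauss(2, 3)  # mean 2, standard deviation 3
    happy_mean = linear_gaussian_mean([1, 3, 2], [healthy, wealthy])
    happy = rng.gauss(happy_mean, 5)  # standard deviation 5
    return {"Healthy": healthy, "Wealthy": wealthy, "Happy": happy}


# With Healthy=4 and Wealthy=2 the mean is 1 + 3*4 + 2*2 = 17.
mean_at_parent_means = linear_gaussian_mean([1, 3, 2], [4, 2])
print(mean_at_parent_means)

sample = sample_once(random.Random(0))
print(sample)
```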