How to define TabularCPD and LinearGaussianCPD¶
Use TabularCPD to define discrete variables inside a DiscreteBayesianNetwork, and LinearGaussianCPD to define continuous variables inside a LinearGaussianBayesianNetwork.
In this tutorial, we will demonstrate how to define each CPD.
TabularCPD for discrete variables¶
In a tabular CPD, the probabilities of a discrete variable are given as a table. Let us start with examples of variables that do not depend on any other variable.
[1]:
from pgmpy.factors.discrete import TabularCPD
cpd_coin_fair = TabularCPD(variable="coin", variable_card=2, values=[[0.5], [0.5]], state_names={'coin': ['Head', 'Tail']})
cpd_coin_biased = TabularCPD(variable="coin", variable_card=2, values=[[0.9], [0.1]], state_names={'coin': ['Head', 'Tail']})
A coin flip has two possible outcomes, so we pass variable_card=2. values holds the probability of each outcome, one row per outcome, and these probabilities must sum to 1. state_names is optional and assigns a name to each outcome.
[2]:
cpd_smoke = TabularCPD(variable="Smoker", variable_card=2, values=[[0.3], [0.7]], state_names={'Smoker': ['Non-smoker', 'Smoker']})
cpd_pollution = TabularCPD(variable="Pollution", variable_card=3, values=[[0.7], [0.29], [0.01]], state_names={'Pollution': ['Clean', 'Bad', 'Fatal']})
The Pollution and Smoker variables do not depend on other variables, and both take categorical values. variable_card is the number of categories the variable can take; for example, Smoker is binary because variable_card=2. values lists the probability of each category, one row per category. Note that the probabilities sum to 1.
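The requirements on values for a variable without parents can be checked by hand. The following is a minimal sketch in plain Python, independent of pgmpy; check_marginal is a hypothetical helper introduced only for illustration.

```python
import math

def check_marginal(values, variable_card):
    """Return True if `values` is a valid column of marginal probabilities.

    `values` has the same shape as in the cells above:
    one single-element row per state of the variable.
    """
    if len(values) != variable_card:
        return False
    probs = [row[0] for row in values]
    if any(p < 0 for p in probs):
        return False
    # Compare with a tolerance because of floating-point rounding.
    return math.isclose(sum(probs), 1.0)

# The Pollution values from above: three states, probabilities sum to 1.
print(check_marginal([[0.7], [0.29], [0.01]], variable_card=3))
# A deliberately broken example: the probabilities sum to 0.9.
print(check_marginal([[0.3], [0.6]], variable_card=2))
```

The first call reports a valid table and the second does not, because 0.3 + 0.6 falls short of 1.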
Tabular CPD with multiple evidence.¶
Next, we consider another discrete variable, Cancer, under the model assumption that it depends on Pollution and Smoker.
[3]:
cpd_cancer = TabularCPD(
    variable="Cancer",
    variable_card=2,
    values=[[0.20, 0.15, 0.03, 0.05, 0.001, 0.02],
            [0.80, 0.85, 0.97, 0.95, 0.999, 0.98]],
    evidence=["Smoker", "Pollution"],
    evidence_card=[2, 3],
)
For the Cancer variable we define another TabularCPD, this time with evidence. evidence_card gives the cardinality of each evidence variable, in the same order as evidence. There are 2*3=6 combinations of evidence states in total, and each combination gives a different distribution over Cancer, so values has 6 columns of conditional probabilities. Each column of values sums to 1. The columns are grouped by the first evidence variable first: the first three columns correspond to non-smokers (the first state of Smoker), one column per state of Pollution, and the last three columns correspond to smokers.
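To make the column layout concrete, the sketch below looks up one conditional probability by hand. It uses plain Python lists rather than pgmpy; cancer_column is a hypothetical helper, and the state indices are assumed to be 0-based in the order the states are listed.

```python
import math

# Same numbers as the Cancer table above: rows are Cancer states,
# columns are (Smoker, Pollution) combinations.
values = [
    [0.20, 0.15, 0.03, 0.05, 0.001, 0.02],
    [0.80, 0.85, 0.97, 0.95, 0.999, 0.98],
]
evidence_card = [2, 3]  # Smoker has 2 states, Pollution has 3

def cancer_column(smoker_state, pollution_state):
    """Column index for a pair of evidence state indices."""
    # The first evidence variable varies slowest, the last one fastest.
    return smoker_state * evidence_card[1] + pollution_state

col = cancer_column(smoker_state=1, pollution_state=0)
print(col)             # 3
print(values[0][col])  # 0.05
print(values[1][col])  # 0.95

# Every column is a distribution over Cancer, so each one sums to 1.
print(all(math.isclose(values[0][c] + values[1][c], 1.0) for c in range(6)))
```

Column 3 is the first column of the second group of three, which matches the grouping by Smoker described above.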
As a second example, consider a discrete variable D, under the model assumption that it depends on A, B and C.
[4]:
cpd = TabularCPD(
    variable="D",
    variable_card=2,
    values=[[0.20, 0.15, 0.93, 0.05, 0.001, 0.02, 0.10, 0.25],
            [0.80, 0.85, 0.07, 0.95, 0.999, 0.98, 0.90, 0.75]],
    evidence=["A", "B", "C"],
    evidence_card=[2, 2, 2],
)
[5]:
print(cpd)
+------+------+------+------+------+-------+------+------+------+
| A    | A(0) | A(0) | A(0) | A(0) | A(1)  | A(1) | A(1) | A(1) |
+------+------+------+------+------+-------+------+------+------+
| B    | B(0) | B(0) | B(1) | B(1) | B(0)  | B(0) | B(1) | B(1) |
+------+------+------+------+------+-------+------+------+------+
| C    | C(0) | C(1) | C(0) | C(1) | C(0)  | C(1) | C(0) | C(1) |
+------+------+------+------+------+-------+------+------+------+
| D(0) | 0.2  | 0.15 | 0.93 | 0.05 | 0.001 | 0.02 | 0.1  | 0.25 |
+------+------+------+------+------+-------+------+------+------+
| D(1) | 0.8  | 0.85 | 0.07 | 0.95 | 0.999 | 0.98 | 0.9  | 0.75 |
+------+------+------+------+------+-------+------+------+------+
Each column of values corresponds to a different combination of evidence states. The columns follow the order in which the variables are listed in evidence. With evidence (A, B, C) as above, the column order is
(A(0)B(0)C(0), A(0)B(0)C(1), A(0)B(1)C(0), A(0)B(1)C(1), ...)
This is the order produced by nested loops over A, then B, then C, with C in the innermost loop, so the last evidence variable changes fastest.
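The nested-loop order can be reproduced with itertools.product from the standard library, which also varies its last argument fastest. This is only an illustrative sketch of the ordering and does not call pgmpy.

```python
from itertools import product

evidence = ["A", "B", "C"]
evidence_card = [2, 2, 2]

# product() behaves like nested for-loops with the last iterable innermost,
# which matches the column order printed in the table above.
columns = list(product(*(range(card) for card in evidence_card)))

for index, states in enumerate(columns):
    label = "".join(f"{name}({state})" for name, state in zip(evidence, states))
    print(index, label)
```

The printed labels start with A(0)B(0)C(0), A(0)B(0)C(1), A(0)B(1)C(0), and the switch to A(1) happens at column index 4.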
CPD with random values¶
It is also possible to create a random tabular CPD.
[6]:
cpd_coin_random = TabularCPD.get_random(variable="coin", cardinality={'coin': 2}, state_names={'coin': ['Head', 'Tail']})
cpd_coin_random.get_values()
[6]:
array([[0.55203474],
[0.44796526]])
The probabilities can be retrieved with get_values, and they again add up to 1. If cardinality is omitted, the variable is assumed to be binary.
When there is evidence, the cardinality of each evidence variable also needs to be specified.
[7]:
cpd_cancer_random = TabularCPD.get_random(
    variable="Cancer",
    cardinality={"Cancer": 2, "Smoker": 2, "Pollution": 3},
    evidence=["Smoker", "Pollution"],
)
cpd_cancer_random.get_values()
[7]:
array([[0.30404579, 0.99344749, 0.47933983, 0.58564998, 0.54269003,
0.28164646],
[0.69595421, 0.00655251, 0.52066017, 0.41435002, 0.45730997,
0.71835354]])
[8]:
cpd_cancer_random.get_values().sum(axis=0)
[8]:
array([1., 1., 1., 1., 1., 1.])
Note again that each column of the probability table sums to 1.
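One way to picture a random CPD is to draw positive numbers and normalise each column. The sketch below is plain Python and is not pgmpy's implementation; random_cpd_values is a hypothetical helper, and its seed argument is an assumption added here for reproducibility.

```python
import math
import random

def random_cpd_values(variable_card, evidence_card=(), seed=None):
    """Build a table of random conditional probabilities.

    Returns `variable_card` rows and one column per evidence combination
    (a single column when there is no evidence).
    """
    rng = random.Random(seed)
    n_columns = math.prod(evidence_card)  # the product of an empty tuple is 1
    columns = []
    for _ in range(n_columns):
        # 1 - random() lies in (0, 1], so the column total is never zero.
        raw = [1.0 - rng.random() for _ in range(variable_card)]
        total = sum(raw)
        columns.append([x / total for x in raw])
    # Transpose so that rows are the states of the variable.
    return [list(row) for row in zip(*columns)]

# Same shape as the random Cancer CPD: 2 states, 2*3 evidence combinations.
table = random_cpd_values(variable_card=2, evidence_card=(2, 3), seed=0)
column_sums = [sum(col) for col in zip(*table)]
print(len(table), len(table[0]))
print(all(math.isclose(s, 1.0) for s in column_sums))
```

Normalising per column, rather than over the whole table, is what makes every column a valid conditional distribution.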
LinearGaussianCPD for continuous variables¶
LinearGaussianCPD models a continuous variable as normally distributed, with a mean that is a linear function of its parents. We define one CPD per variable. Assume first that the Healthy and Wealthy variables have no parents (empty evidence).
[9]:
import pprint
from pgmpy.factors.continuous import LinearGaussianCPD
# Variables without parents: beta holds only the intercept, which is the mean.
cpd_healthy = LinearGaussianCPD(variable="Healthy", beta=[4], std=2, evidence=[])
cpd_wealthy = LinearGaussianCPD(variable="Wealthy", beta=[2], std=3, evidence=[])
pprint.pp(cpd_healthy)
pprint.pp(cpd_wealthy)
<LinearGaussianCPD: P(Healthy) = N(4; 2) at 0x15735ed20
<LinearGaussianCPD: P(Wealthy) = N(2; 3) at 0x1566c2930
Above, we defined Healthy as normally distributed with mean 4 and standard deviation 2, and Wealthy with mean 2 and standard deviation 3. (The numbers assume that wealth varies more across people than health does, and that the mean of health is larger than the mean of wealth.)
[10]:
cpd_happy = LinearGaussianCPD(
    variable="Happy",
    beta=[1, 3, 2],
    std=5,
    evidence=["Healthy", "Wealthy"],
)
pprint.pp(cpd_happy)
<LinearGaussianCPD: P(Happy | Healthy, Wealthy) = N(3*Healthy + 2*Wealthy + 1; 5) at 0x154160050
The Happy variable has mean 3*Healthy + 2*Wealthy + 1. This formula is determined by beta together with evidence: the first element of beta is the intercept (constant term), and each remaining element is the coefficient of the evidence variable in the same position. The standard deviation of the normal distribution of Happy is set by std.
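The role of beta can be illustrated without pgmpy by computing the conditional mean directly and drawing samples in parent-first order. This is a rough sketch in plain Python; happy_mean and sample_happy are hypothetical helpers, and Healthy and Wealthy are treated as independent variables without parents, as defined above.

```python
import random

def happy_mean(healthy, wealthy, beta=(1, 3, 2)):
    """Conditional mean of Happy: intercept plus one coefficient per parent."""
    intercept, coef_healthy, coef_wealthy = beta
    return intercept + coef_healthy * healthy + coef_wealthy * wealthy

def sample_happy(rng, std=5):
    """Draw one (Healthy, Wealthy, Happy) triple, sampling the parents first."""
    healthy = rng.gauss(4, 2)  # mean 4, standard deviation 2
    wealthy = rng.gauss(2, 3)  # mean 2, standard deviation 3
    happy = rng.gauss(happy_mean(healthy, wealthy), std)
    return healthy, wealthy, happy

# At the parents' means, the conditional mean is 1 + 3*4 + 2*2.
print(happy_mean(healthy=4, wealthy=2))

# The sample average of Happy should land near the same value.
rng = random.Random(0)
samples = [sample_happy(rng)[2] for _ in range(20000)]
print(sum(samples) / len(samples))
```

The first value printed is 17. The sample average only approximates it, since it depends on the random draws.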