Parameter Learning in Discrete Bayesian Networks

In this notebook, we show an example of learning the parameters (CPDs) of a Discrete Bayesian Network given the data and the model structure. pgmpy has three main methods for learning the parameters:

1. Maximum Likelihood Estimator (pgmpy.estimators.MaximumLikelihoodEstimator)
2. Bayesian Estimator (pgmpy.estimators.BayesianEstimator)
3. Expectation Maximization (pgmpy.estimators.ExpectationMaximization)

In the examples, we generate data from a known model and then try to learn the model's parameters back from the generated data.

Step 1: Generate some data

[1]:
# Use the alarm model to generate data.

from pgmpy.utils import get_example_model
from pgmpy.sampling import BayesianModelSampling

alarm_model = get_example_model("alarm")
samples = BayesianModelSampling(alarm_model).forward_sample(size=int(1e5))
samples.head()
Generating for node: CVP: 100%|██████████| 37/37 [00:01<00:00, 24.08it/s]
[1]:
HISTORY CVP PCWP HYPOVOLEMIA LVEDVOLUME LVFAILURE STROKEVOLUME ERRLOWOUTPUT HRBP HREKG ... MINVOLSET VENTMACH VENTTUBE VENTLUNG VENTALV ARTCO2 CATECHOL HR CO BP
0 FALSE NORMAL NORMAL FALSE NORMAL FALSE NORMAL FALSE HIGH HIGH ... NORMAL NORMAL LOW ZERO ZERO HIGH HIGH HIGH HIGH HIGH
1 FALSE NORMAL NORMAL FALSE NORMAL FALSE NORMAL TRUE LOW LOW ... NORMAL NORMAL LOW ZERO ZERO HIGH HIGH LOW LOW LOW
2 FALSE LOW LOW TRUE LOW TRUE LOW FALSE HIGH NORMAL ... NORMAL NORMAL ZERO LOW HIGH LOW HIGH HIGH LOW LOW
3 FALSE NORMAL NORMAL FALSE NORMAL FALSE NORMAL FALSE HIGH HIGH ... NORMAL NORMAL LOW ZERO ZERO HIGH HIGH HIGH HIGH HIGH
4 FALSE HIGH HIGH TRUE HIGH FALSE NORMAL TRUE NORMAL HIGH ... NORMAL NORMAL ZERO HIGH LOW HIGH HIGH HIGH HIGH HIGH

5 rows × 37 columns
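
As a quick sanity check on the generated data (a sketch, not part of the original example): for a root node such as FIO2, the empirical state frequencies in the samples should sit close to the node's marginal CPD in the alarm model.

[ ]:
# Sanity check (sketch): compare the empirical frequencies of a parentless
# node against its true CPD in the generating model.
print(samples["FIO2"].value_counts(normalize=True))
print(alarm_model.get_cpds("FIO2"))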

Step 2: Define a model structure

In this case, since we are trying to learn the model parameters back, we will use the same model structure that was used to generate the data.

[2]:
# Defining the Bayesian Model structure

from pgmpy.models import BayesianNetwork

model_struct = BayesianNetwork(ebunch=alarm_model.edges())
model_struct.nodes()
[2]:
NodeView(('HYPOVOLEMIA', 'LVEDVOLUME', 'STROKEVOLUME', 'CVP', 'PCWP', 'LVFAILURE', 'HISTORY', 'CO', 'ERRLOWOUTPUT', 'HRBP', 'ERRCAUTER', 'HREKG', 'HRSAT', 'INSUFFANESTH', 'CATECHOL', 'ANAPHYLAXIS', 'TPR', 'BP', 'KINKEDTUBE', 'PRESS', 'VENTLUNG', 'FIO2', 'PVSAT', 'SAO2', 'PULMEMBOLUS', 'PAP', 'SHUNT', 'INTUBATION', 'MINVOL', 'VENTALV', 'DISCONNECT', 'VENTTUBE', 'MINVOLSET', 'VENTMACH', 'EXPCO2', 'ARTCO2', 'HR'))
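
Since model_struct was built from alarm_model's edge list, the two graphs should be identical; a quick check (sketch) using only the objects defined above:

[ ]:
# The copied structure should have exactly the same edges as the original.
assert set(model_struct.edges()) == set(alarm_model.edges())
print(len(model_struct.nodes()), "nodes,", len(model_struct.edges()), "edges")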

Step 3: Learning the model parameters

[3]:
# Fitting the model using Maximum Likelihood Estimator

from pgmpy.estimators import MaximumLikelihoodEstimator

mle = MaximumLikelihoodEstimator(model=model_struct, data=samples)

# Estimating the CPD for a single node.
print(mle.estimate_cpd(node="FIO2"))
print(mle.estimate_cpd(node="CVP"))

# Estimating CPDs for all the nodes in the model
mle.get_parameters()[:10]  # Show just the first 10 CPDs in the output
+--------------+---------+
| FIO2(LOW)    | 0.04859 |
+--------------+---------+
| FIO2(NORMAL) | 0.95141 |
+--------------+---------+
+-------------+----------------------+-----------------------+----------------------+
| LVEDVOLUME  | LVEDVOLUME(HIGH)     | LVEDVOLUME(LOW)       | LVEDVOLUME(NORMAL)   |
+-------------+----------------------+-----------------------+----------------------+
| CVP(HIGH)   | 0.702671646078713    | 0.0069145318521877126 | 0.010257212769589711 |
+-------------+----------------------+-----------------------+----------------------+
| CVP(LOW)    | 0.009480034472852629 | 0.9526184538653366    | 0.03999032606840039  |
+-------------+----------------------+-----------------------+----------------------+
| CVP(NORMAL) | 0.28784831944843436  | 0.04046701428247563   | 0.94975246116201     |
+-------------+----------------------+-----------------------+----------------------+
[3]:
[<TabularCPD representing P(HYPOVOLEMIA:2) at 0x7f472f27b880>,
 <TabularCPD representing P(LVEDVOLUME:3 | HYPOVOLEMIA:2, LVFAILURE:2) at 0x7f472f27b8b0>,
 <TabularCPD representing P(STROKEVOLUME:3 | HYPOVOLEMIA:2, LVFAILURE:2) at 0x7f472f285460>,
 <TabularCPD representing P(CVP:3 | LVEDVOLUME:3) at 0x7f472f2c8520>,
 <TabularCPD representing P(PCWP:3 | LVEDVOLUME:3) at 0x7f472f2c8280>,
 <TabularCPD representing P(LVFAILURE:2) at 0x7f472f2d4400>,
 <TabularCPD representing P(HISTORY:2 | LVFAILURE:2) at 0x7f472f2ec280>,
 <TabularCPD representing P(CO:3 | HR:3, STROKEVOLUME:3) at 0x7f472f2c0a30>,
 <TabularCPD representing P(ERRLOWOUTPUT:2) at 0x7f472f2ec3d0>,
 <TabularCPD representing P(HRBP:3 | ERRLOWOUTPUT:2, HR:3) at 0x7f472f2d46a0>]
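
To see what MaximumLikelihoodEstimator is doing under the hood, the estimates above can be reproduced directly with pandas (a sketch, not part of the original example): for a parentless node the MLE is the relative frequency of each state, and for a node with parents it is the relative frequency within each parent configuration.

[ ]:
# MLE by hand (sketch): normalized counts reproduce the CPDs printed above.
# Parentless node: P(FIO2) is just the relative frequency of each state.
print(samples["FIO2"].value_counts(normalize=True).sort_index())

# Node with a parent: P(CVP | LVEDVOLUME) conditions on each parent state.
print(samples.groupby("LVEDVOLUME")["CVP"].value_counts(normalize=True))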
[4]:
# Verifying that the learned parameters are close to the original ones.
import numpy as np

np.allclose(
    alarm_model.get_cpds("FIO2").values, mle.estimate_cpd("FIO2").values, atol=0.01
)
[4]:
True
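
The same comparison can be extended to every node. One caveat (an assumption worth flagging): the learned CPDs take their state orderings from the sorted data values, which need not match the orderings stored in the original model, so the sketch below compares entries by state name via TabularCPD.get_value rather than by raw array position. With 1e5 samples, rarely observed parent configurations may still fall outside a tight tolerance.

[ ]:
from itertools import product


def cpds_match(cpd_a, cpd_b, atol=0.05):
    # Compare CPD entries by state name so that differing state/parent
    # orderings between the two CPDs do not matter.
    state_lists = [cpd_a.state_names[var] for var in cpd_a.variables]
    return all(
        abs(
            cpd_a.get_value(**dict(zip(cpd_a.variables, states)))
            - cpd_b.get_value(**dict(zip(cpd_a.variables, states)))
        )
        <= atol
        for states in product(*state_lists)
    )


mismatches = [
    node
    for node in alarm_model.nodes()
    if not cpds_match(alarm_model.get_cpds(node), mle.estimate_cpd(node))
]
print("Nodes outside tolerance:", mismatches)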
[5]:
# Fitting the model using the Bayesian Estimator
from pgmpy.estimators import BayesianEstimator

best = BayesianEstimator(model=model_struct, data=samples)

print(best.estimate_cpd(node="FIO2", prior_type="BDeu", equivalent_sample_size=1000))
# A uniform pseudo count is added to each state. An array of the same
# shape as the CPD is also accepted.
print(best.estimate_cpd(node="CVP", prior_type="dirichlet", pseudo_counts=100))

# Learning CPDs for all the nodes in the model. The BDeu prior only needs an
# equivalent_sample_size; with a dirichlet prior, a dict of pseudo_counts
# (one entry per node) would need to be provided.
best.get_parameters(prior_type="BDeu", equivalent_sample_size=1000)[:10]
+--------------+-----------+
| FIO2(LOW)    | 0.0530594 |
+--------------+-----------+
| FIO2(NORMAL) | 0.946941  |
+--------------+-----------+
+-------------+----------------------+----------------------+----------------------+
| LVEDVOLUME  | LVEDVOLUME(HIGH)     | LVEDVOLUME(LOW)      | LVEDVOLUME(NORMAL)   |
+-------------+----------------------+----------------------+----------------------+
| CVP(HIGH)   | 0.6974417067875012   | 0.017649638237228676 | 0.011630213055303717 |
+-------------+----------------------+----------------------+----------------------+
| CVP(LOW)    | 0.014065892570565468 | 0.9322516991887744   | 0.041236967361740706 |
+-------------+----------------------+----------------------+----------------------+
| CVP(NORMAL) | 0.2884924006419334   | 0.05009866257399693  | 0.9471328195829556   |
+-------------+----------------------+----------------------+----------------------+
[5]:
[<TabularCPD representing P(HYPOVOLEMIA:2) at 0x7f472f296910>,
 <TabularCPD representing P(LVEDVOLUME:3 | HYPOVOLEMIA:2, LVFAILURE:2) at 0x7f472f28ec10>,
 <TabularCPD representing P(STROKEVOLUME:3 | HYPOVOLEMIA:2, LVFAILURE:2) at 0x7f472f27b280>,
 <TabularCPD representing P(CVP:3 | LVEDVOLUME:3) at 0x7f472f296880>,
 <TabularCPD representing P(PCWP:3 | LVEDVOLUME:3) at 0x7f472f28ee20>,
 <TabularCPD representing P(LVFAILURE:2) at 0x7f472f2e2b50>,
 <TabularCPD representing P(HISTORY:2 | LVFAILURE:2) at 0x7f472fb7dbe0>,
 <TabularCPD representing P(CO:3 | HR:3, STROKEVOLUME:3) at 0x7f472f03ddc0>,
 <TabularCPD representing P(ERRLOWOUTPUT:2) at 0x7f472f2dae20>,
 <TabularCPD representing P(HRBP:3 | ERRLOWOUTPUT:2, HR:3) at 0x7f472f296ee0>]
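
To make the comment about array-valued pseudo counts concrete, here is a sketch with an explicit pseudo-count array for FIO2. Since FIO2 has two states and no parents, the CPD (and hence the array) has shape (2, 1); the counts themselves are arbitrary illustration values.

[ ]:
import numpy as np

# Dirichlet prior with an explicit pseudo-count array (sketch). The shape
# must be (node cardinality, product of parent cardinalities) = (2, 1) here;
# the values 100 and 900 are arbitrary illustration values.
print(
    best.estimate_cpd(
        node="FIO2", prior_type="dirichlet", pseudo_counts=np.array([[100], [900]])
    )
)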
[6]:
# Shortcut for learning all the parameters and adding the CPDs to the model.

model_struct = BayesianNetwork(ebunch=alarm_model.edges())
model_struct.fit(data=samples, estimator=MaximumLikelihoodEstimator)
print(model_struct.get_cpds("FIO2"))

model_struct = BayesianNetwork(ebunch=alarm_model.edges())
model_struct.fit(
    data=samples,
    estimator=BayesianEstimator,
    prior_type="BDeu",
    equivalent_sample_size=1000,
)
print(model_struct.get_cpds("FIO2"))
+--------------+---------+
| FIO2(LOW)    | 0.04859 |
+--------------+---------+
| FIO2(NORMAL) | 0.95141 |
+--------------+---------+
+--------------+-----------+
| FIO2(LOW)    | 0.0530594 |
+--------------+-----------+
| FIO2(NORMAL) | 0.946941  |
+--------------+-----------+
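
Once fit has attached the learned CPDs, check_model can be used to validate the result; it verifies that every node has a CPD and that each CPD sums to one over every parent configuration.

[ ]:
# Validate the fitted model: returns True if all CPDs are present and
# consistent with the network structure, raises an error otherwise.
print(model_struct.check_model())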

The Expectation Maximization (EM) algorithm can also learn the parameters when the model contains latent variables.

[ ]:
from pgmpy.estimators import ExpectationMaximization as EM

# Define a model structure with latent variables
model_latent = BayesianNetwork(
    ebunch=alarm_model.edges(), latents=["HYPOVOLEMIA", "LVEDVOLUME", "STROKEVOLUME"]
)

# Data for the latent model: drop the columns of the latent variables
samples_latent = samples.drop(model_latent.latents, axis=1)

model_latent.fit(samples_latent, estimator=EM)
 11%|█         | 11/100 [28:03<3:46:14, 152.52s/it]
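
EM is iterative and far slower than the closed-form estimators: the progress bar above shows roughly 2.5 minutes per iteration on this 37-node model. For quicker experimentation, one option (a sketch, assuming the max_iter and seed keyword arguments of ExpectationMaximization.get_parameters in recent pgmpy versions) is to run EM on a subsample with a capped iteration count:

[ ]:
# Quicker EM run (sketch): fewer samples and a hard iteration cap.
# max_iter and seed are assumed to be available on
# ExpectationMaximization.get_parameters in recent pgmpy versions.
small = samples_latent.sample(n=5000, random_state=42)
em = EM(model=model_latent, data=small)
cpds = em.get_parameters(max_iter=10, seed=42)
cpds[:3]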