Parameter Learning in Discrete Bayesian Networks¶
In this notebook, we show an example of learning the parameters (CPDs) of a Discrete Bayesian Network given the data and the model structure. pgmpy has three main methods for learning the parameters:
1. Maximum Likelihood Estimator (pgmpy.estimators.MaximumLikelihoodEstimator)
2. Bayesian Estimator (pgmpy.estimators.BayesianEstimator)
3. Expectation Maximization (pgmpy.estimators.ExpectationMaximization)
In the examples, we first generate data from a given model and then learn the model parameters back from the generated data.
Step 1: Generate some data¶
[1]:
# Use the alarm model to generate data from it.
from pgmpy.utils import get_example_model
from pgmpy.sampling import BayesianModelSampling
alarm_model = get_example_model("alarm")
samples = BayesianModelSampling(alarm_model).forward_sample(size=int(1e5))
samples.head()
Generating for node: CVP: 100%|██████████| 37/37 [00:01<00:00, 24.08it/s]
[1]:
| | HISTORY | CVP | PCWP | HYPOVOLEMIA | LVEDVOLUME | LVFAILURE | STROKEVOLUME | ERRLOWOUTPUT | HRBP | HREKG | ... | MINVOLSET | VENTMACH | VENTTUBE | VENTLUNG | VENTALV | ARTCO2 | CATECHOL | HR | CO | BP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FALSE | NORMAL | NORMAL | FALSE | NORMAL | FALSE | NORMAL | FALSE | HIGH | HIGH | ... | NORMAL | NORMAL | LOW | ZERO | ZERO | HIGH | HIGH | HIGH | HIGH | HIGH |
| 1 | FALSE | NORMAL | NORMAL | FALSE | NORMAL | FALSE | NORMAL | TRUE | LOW | LOW | ... | NORMAL | NORMAL | LOW | ZERO | ZERO | HIGH | HIGH | LOW | LOW | LOW |
| 2 | FALSE | LOW | LOW | TRUE | LOW | TRUE | LOW | FALSE | HIGH | NORMAL | ... | NORMAL | NORMAL | ZERO | LOW | HIGH | LOW | HIGH | HIGH | LOW | LOW |
| 3 | FALSE | NORMAL | NORMAL | FALSE | NORMAL | FALSE | NORMAL | FALSE | HIGH | HIGH | ... | NORMAL | NORMAL | LOW | ZERO | ZERO | HIGH | HIGH | HIGH | HIGH | HIGH |
| 4 | FALSE | HIGH | HIGH | TRUE | HIGH | FALSE | NORMAL | TRUE | NORMAL | HIGH | ... | NORMAL | NORMAL | ZERO | HIGH | LOW | HIGH | HIGH | HIGH | HIGH | HIGH |
5 rows × 37 columns
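Forward sampling draws each node in topological order: root nodes are sampled from their marginal CPDs, and every other node is sampled from its CPD conditioned on the already-sampled parent values. A minimal sketch of this idea using only NumPy, on a hypothetical two-node network A → B with made-up probabilities (not pgmpy's implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical CPDs for a toy network A -> B.
p_a = np.array([0.3, 0.7])            # P(A=0), P(A=1)
p_b_given_a = np.array([[0.9, 0.2],   # P(B=0 | A=0), P(B=0 | A=1)
                        [0.1, 0.8]])  # P(B=1 | A=0), P(B=1 | A=1)

# Sample A first (it has no parents), then B conditioned on each sampled A.
a = rng.choice(2, size=n, p=p_a)
u = rng.random(n)
b = (u < p_b_given_a[1, a]).astype(int)  # B=1 with probability P(B=1 | A=a)

# The empirical frequencies should be close to the true parameters.
print(a.mean())          # ~0.7
print(b[a == 1].mean())  # ~0.8
```

Because the samples reproduce the model's joint distribution, frequency counts on them recover the CPDs, which is exactly what the estimators below exploit.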
Step 2: Define a model structure¶
Since we are trying to learn the model parameters back, we use the same model structure that was used to generate the data.
[2]:
# Defining the Bayesian Model structure
from pgmpy.models import BayesianNetwork
model_struct = BayesianNetwork(ebunch=alarm_model.edges())
model_struct.nodes()
[2]:
NodeView(('HYPOVOLEMIA', 'LVEDVOLUME', 'STROKEVOLUME', 'CVP', 'PCWP', 'LVFAILURE', 'HISTORY', 'CO', 'ERRLOWOUTPUT', 'HRBP', 'ERRCAUTER', 'HREKG', 'HRSAT', 'INSUFFANESTH', 'CATECHOL', 'ANAPHYLAXIS', 'TPR', 'BP', 'KINKEDTUBE', 'PRESS', 'VENTLUNG', 'FIO2', 'PVSAT', 'SAO2', 'PULMEMBOLUS', 'PAP', 'SHUNT', 'INTUBATION', 'MINVOL', 'VENTALV', 'DISCONNECT', 'VENTTUBE', 'MINVOLSET', 'VENTMACH', 'EXPCO2', 'ARTCO2', 'HR'))
Step 3: Learning the model parameters¶
[3]:
# Fitting the model using Maximum Likelihood Estimator
from pgmpy.estimators import MaximumLikelihoodEstimator
mle = MaximumLikelihoodEstimator(model=model_struct, data=samples)
# Estimating the CPD for a single node.
print(mle.estimate_cpd(node="FIO2"))
print(mle.estimate_cpd(node="CVP"))
# Estimating CPDs for all the nodes in the model
mle.get_parameters()[:10] # Show just the first 10 CPDs in the output
+--------------+---------+
| FIO2(LOW) | 0.04859 |
+--------------+---------+
| FIO2(NORMAL) | 0.95141 |
+--------------+---------+
+-------------+----------------------+-----------------------+----------------------+
| LVEDVOLUME | LVEDVOLUME(HIGH) | LVEDVOLUME(LOW) | LVEDVOLUME(NORMAL) |
+-------------+----------------------+-----------------------+----------------------+
| CVP(HIGH) | 0.702671646078713 | 0.0069145318521877126 | 0.010257212769589711 |
+-------------+----------------------+-----------------------+----------------------+
| CVP(LOW) | 0.009480034472852629 | 0.9526184538653366 | 0.03999032606840039 |
+-------------+----------------------+-----------------------+----------------------+
| CVP(NORMAL) | 0.28784831944843436 | 0.04046701428247563 | 0.94975246116201 |
+-------------+----------------------+-----------------------+----------------------+
[3]:
[<TabularCPD representing P(HYPOVOLEMIA:2) at 0x7f472f27b880>,
<TabularCPD representing P(LVEDVOLUME:3 | HYPOVOLEMIA:2, LVFAILURE:2) at 0x7f472f27b8b0>,
<TabularCPD representing P(STROKEVOLUME:3 | HYPOVOLEMIA:2, LVFAILURE:2) at 0x7f472f285460>,
<TabularCPD representing P(CVP:3 | LVEDVOLUME:3) at 0x7f472f2c8520>,
<TabularCPD representing P(PCWP:3 | LVEDVOLUME:3) at 0x7f472f2c8280>,
<TabularCPD representing P(LVFAILURE:2) at 0x7f472f2d4400>,
<TabularCPD representing P(HISTORY:2 | LVFAILURE:2) at 0x7f472f2ec280>,
<TabularCPD representing P(CO:3 | HR:3, STROKEVOLUME:3) at 0x7f472f2c0a30>,
<TabularCPD representing P(ERRLOWOUTPUT:2) at 0x7f472f2ec3d0>,
<TabularCPD representing P(HRBP:3 | ERRLOWOUTPUT:2, HR:3) at 0x7f472f2d46a0>]
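For a discrete Bayesian network, maximum-likelihood estimation reduces to frequency counting: each column of a CPD is the relative frequency of the child's states among the rows matching that parent configuration. A self-contained sketch of this computation with pandas, on toy data with a hypothetical parent P and child C (not the alarm samples):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Toy data: C depends on P with P(C=yes | P=low) = 0.9, P(C=yes | P=high) = 0.3.
p = rng.choice(["low", "high"], size=n, p=[0.4, 0.6])
c = np.where(rng.random(n) < np.where(p == "low", 0.9, 0.3), "yes", "no")
data = pd.DataFrame({"P": p, "C": c})

# MLE of the CPD P(C | P): count joint occurrences, then normalize each
# parent column so it sums to one.
counts = pd.crosstab(data["C"], data["P"])
cpd = counts / counts.sum(axis=0)
print(cpd.round(3))
```

The estimated probabilities come out close to the generating ones; `estimate_cpd` above performs the same counting and normalization for each node given its parents in the graph.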
[4]:
# Verifying that the learned parameters are almost equal.
import numpy as np

np.allclose(
alarm_model.get_cpds("FIO2").values, mle.estimate_cpd("FIO2").values, atol=0.01
)
[4]:
True
[5]:
# Fitting the model using the Bayesian Estimator
from pgmpy.estimators import BayesianEstimator
best = BayesianEstimator(model=model_struct, data=samples)
print(best.estimate_cpd(node="FIO2", prior_type="BDeu", equivalent_sample_size=1000))
# Uniform pseudo count for each state; can also accept an array of the same size as the CPD.
print(best.estimate_cpd(node="CVP", prior_type="dirichlet", pseudo_counts=100))
# Learning CPDs for all the nodes in the model. For learning all parameters with a
# dirichlet prior, a dict of pseudo_counts needs to be provided.
best.get_parameters(prior_type="BDeu", equivalent_sample_size=1000)[:10]
+--------------+-----------+
| FIO2(LOW) | 0.0530594 |
+--------------+-----------+
| FIO2(NORMAL) | 0.946941 |
+--------------+-----------+
+-------------+----------------------+----------------------+----------------------+
| LVEDVOLUME | LVEDVOLUME(HIGH) | LVEDVOLUME(LOW) | LVEDVOLUME(NORMAL) |
+-------------+----------------------+----------------------+----------------------+
| CVP(HIGH) | 0.6974417067875012 | 0.017649638237228676 | 0.011630213055303717 |
+-------------+----------------------+----------------------+----------------------+
| CVP(LOW) | 0.014065892570565468 | 0.9322516991887744 | 0.041236967361740706 |
+-------------+----------------------+----------------------+----------------------+
| CVP(NORMAL) | 0.2884924006419334 | 0.05009866257399693 | 0.9471328195829556 |
+-------------+----------------------+----------------------+----------------------+
[5]:
[<TabularCPD representing P(HYPOVOLEMIA:2) at 0x7f472f296910>,
<TabularCPD representing P(LVEDVOLUME:3 | HYPOVOLEMIA:2, LVFAILURE:2) at 0x7f472f28ec10>,
<TabularCPD representing P(STROKEVOLUME:3 | HYPOVOLEMIA:2, LVFAILURE:2) at 0x7f472f27b280>,
<TabularCPD representing P(CVP:3 | LVEDVOLUME:3) at 0x7f472f296880>,
<TabularCPD representing P(PCWP:3 | LVEDVOLUME:3) at 0x7f472f28ee20>,
<TabularCPD representing P(LVFAILURE:2) at 0x7f472f2e2b50>,
<TabularCPD representing P(HISTORY:2 | LVFAILURE:2) at 0x7f472fb7dbe0>,
<TabularCPD representing P(CO:3 | HR:3, STROKEVOLUME:3) at 0x7f472f03ddc0>,
<TabularCPD representing P(ERRLOWOUTPUT:2) at 0x7f472f2dae20>,
<TabularCPD representing P(HRBP:3 | ERRLOWOUTPUT:2, HR:3) at 0x7f472f296ee0>]
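Conceptually, the Bayesian estimator just adds pseudo-counts to the observed counts before normalizing. With a BDeu prior, the equivalent sample size is spread uniformly over the CPD's cells, so each cell's pseudo-count is equivalent_sample_size divided by (cardinality of the variable × number of parent configurations). A sketch of that arithmetic with NumPy, using made-up counts (not pgmpy's internals):

```python
import numpy as np

# Hypothetical observed counts for a 2-state variable with 3 parent
# configurations (rows: child states, columns: parent configurations).
counts = np.array([[90.0, 5.0, 0.0],
                   [10.0, 95.0, 0.0]])

# MLE: normalize each column. The all-zero column is undefined (0/0).
with np.errstate(invalid="ignore"):
    mle = counts / counts.sum(axis=0)

# BDeu prior: spread the equivalent sample size uniformly over all cells.
ess = 12
pseudo = ess / counts.size  # 12 / (2 states * 3 parent configs) = 2 per cell
bayes = (counts + pseudo) / (counts + pseudo).sum(axis=0)

print(mle[:, 2])    # [nan nan] -- unseen parent configuration
print(bayes[:, 2])  # [0.5 0.5] -- the prior fills the gap
```

This also shows why a prior helps: parent configurations that never occur in the data get a well-defined (uniform) distribution instead of an undefined one, and rare configurations are smoothed toward uniform.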
[6]:
# Shortcut for learning all the parameters and adding the CPDs to the model.
model_struct = BayesianNetwork(ebunch=alarm_model.edges())
model_struct.fit(data=samples, estimator=MaximumLikelihoodEstimator)
print(model_struct.get_cpds("FIO2"))
model_struct = BayesianNetwork(ebunch=alarm_model.edges())
model_struct.fit(
data=samples,
estimator=BayesianEstimator,
prior_type="BDeu",
equivalent_sample_size=1000,
)
print(model_struct.get_cpds("FIO2"))
+--------------+---------+
| FIO2(LOW) | 0.04859 |
+--------------+---------+
| FIO2(NORMAL) | 0.95141 |
+--------------+---------+
+--------------+-----------+
| FIO2(LOW) | 0.0530594 |
+--------------+-----------+
| FIO2(NORMAL) | 0.946941 |
+--------------+-----------+
The Expectation Maximization (EM) algorithm can also learn the parameters when we have some latent variables in the model.
[ ]:
from pgmpy.estimators import ExpectationMaximization as EM
# Define a model structure with latent variables
model_latent = BayesianNetwork(
ebunch=alarm_model.edges(), latents=["HYPOVOLEMIA", "LVEDVOLUME", "STROKEVOLUME"]
)
# Dataset for the latent model, which doesn't have values for the latent variables
samples_latent = samples.drop(model_latent.latents, axis=1)
model_latent.fit(samples_latent, estimator=EM)
11%|█ | 11/100 [28:03<3:46:14, 152.52s/it]