Simulating Data From Bayesian Networks
pgmpy implements the BayesianNetwork.simulate method, which lets users simulate data from a fully defined Bayesian Network under various conditions. These conditions can be any combination of:
1. Virtual Evidence
2. Hard Evidence
3. Virtual Intervention
4. Hard Intervention
Lastly, users can also provide data for some of the variables in the model (partial samples) to the simulation method. This fixes the values of those variables to the provided values.
[1]:
# A helper function to compute probability distributions from simulated samples.
def get_distribution(samples, variables=None):
    """
    For marginal distribution, P(A): get_distribution(samples, variables=['A'])
    For joint distribution, P(A, B): get_distribution(samples, variables=['A', 'B'])
    """
    if variables is None:
        raise ValueError("variables must be specified")
    return samples.groupby(variables).size() / samples.shape[0]
[2]:
# Do not print warnings
import logging
from pgmpy.global_vars import logger
logger.setLevel(logging.ERROR)
# Specify the model to simulate data from.
from pgmpy.factors.discrete import TabularCPD
from pgmpy.utils import get_example_model
alarm = get_example_model("alarm")
1. Standard simulation
Without any specified conditions, the simulate method draws samples from the joint distribution of the model.
[3]:
samples = alarm.simulate(n_samples=int(1e4))
samples.head()
[3]:
| | TPR | PAP | MINVOL | HREKG | EXPCO2 | DISCONNECT | VENTMACH | VENTLUNG | LVEDVOLUME | HR | ... | SHUNT | VENTTUBE | MINVOLSET | LVFAILURE | ERRLOWOUTPUT | HRBP | FIO2 | BP | HISTORY | STROKEVOLUME |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LOW | NORMAL | ZERO | NORMAL | LOW | FALSE | NORMAL | ZERO | NORMAL | HIGH | ... | NORMAL | LOW | NORMAL | FALSE | FALSE | HIGH | NORMAL | LOW | FALSE | NORMAL |
1 | HIGH | NORMAL | ZERO | HIGH | LOW | TRUE | NORMAL | ZERO | NORMAL | HIGH | ... | NORMAL | ZERO | NORMAL | FALSE | FALSE | HIGH | NORMAL | HIGH | FALSE | NORMAL |
2 | LOW | NORMAL | ZERO | HIGH | LOW | FALSE | NORMAL | ZERO | NORMAL | HIGH | ... | NORMAL | LOW | NORMAL | FALSE | FALSE | HIGH | NORMAL | LOW | FALSE | NORMAL |
3 | LOW | NORMAL | ZERO | HIGH | LOW | FALSE | NORMAL | ZERO | NORMAL | HIGH | ... | NORMAL | LOW | NORMAL | FALSE | FALSE | HIGH | NORMAL | LOW | FALSE | NORMAL |
4 | NORMAL | HIGH | ZERO | HIGH | LOW | FALSE | NORMAL | ZERO | NORMAL | HIGH | ... | NORMAL | LOW | NORMAL | FALSE | FALSE | HIGH | NORMAL | NORMAL | FALSE | NORMAL |
5 rows × 37 columns
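To make the idea of drawing from the joint distribution concrete, here is a minimal sketch of forward (ancestral) sampling on a hypothetical two-node network A → B. The network, its probabilities, and the helper names draw and forward_sample are made up for illustration; this is not pgmpy's implementation, only the general technique of sampling each variable after its parents.

```python
import random

# Hypothetical toy network A -> B; all probabilities are made up.
P_A = {"yes": 0.3, "no": 0.7}
P_B_GIVEN_A = {
    "yes": {"high": 0.9, "low": 0.1},
    "no": {"high": 0.2, "low": 0.8},
}


def draw(dist, rng):
    """Draw one state from a {state: probability} mapping."""
    states = list(dist)
    return rng.choices(states, weights=[dist[s] for s in states], k=1)[0]


def forward_sample(n_samples, seed=0):
    """Sample every variable after its parents (topological order)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        a = draw(P_A, rng)             # A has no parents
        b = draw(P_B_GIVEN_A[a], rng)  # B is drawn given the sampled A
        rows.append({"A": a, "B": b})
    return rows


toy_samples = forward_sample(20_000)
freq_b_high = sum(r["B"] == "high" for r in toy_samples) / len(toy_samples)
# Exact marginal for comparison: 0.3 * 0.9 + 0.7 * 0.2 = 0.41
exact_b_high = sum(P_A[a] * P_B_GIVEN_A[a]["high"] for a in P_A)
print(exact_b_high, freq_b_high)
```

With enough samples the empirical frequency should be close to the exact marginal, which is the same check that get_distribution performs on the simulated alarm data.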
2. Simulation under specified evidence
Specifying hard evidence for some variables fixes their values to the specified evidence value during simulation.
[4]:
evidence = {"CVP": "NORMAL", "HR": "HIGH"}
samples = alarm.simulate(n_samples=int(1e4), evidence=evidence)
samples.head()
[4]:
| | TPR | PAP | MINVOL | HREKG | EXPCO2 | DISCONNECT | VENTMACH | VENTLUNG | LVEDVOLUME | HR | ... | SHUNT | VENTTUBE | MINVOLSET | LVFAILURE | ERRLOWOUTPUT | HRBP | FIO2 | BP | HISTORY | STROKEVOLUME |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NORMAL | NORMAL | ZERO | HIGH | LOW | FALSE | NORMAL | ZERO | NORMAL | HIGH | ... | NORMAL | LOW | NORMAL | FALSE | FALSE | HIGH | NORMAL | LOW | FALSE | LOW |
1 | LOW | NORMAL | ZERO | HIGH | LOW | FALSE | NORMAL | ZERO | NORMAL | HIGH | ... | NORMAL | LOW | NORMAL | FALSE | FALSE | HIGH | NORMAL | LOW | FALSE | LOW |
2 | LOW | NORMAL | ZERO | HIGH | LOW | FALSE | NORMAL | ZERO | NORMAL | HIGH | ... | NORMAL | LOW | NORMAL | FALSE | FALSE | HIGH | LOW | LOW | FALSE | NORMAL |
3 | LOW | NORMAL | ZERO | HIGH | LOW | FALSE | NORMAL | ZERO | NORMAL | HIGH | ... | NORMAL | LOW | NORMAL | FALSE | FALSE | HIGH | NORMAL | LOW | FALSE | NORMAL |
4 | LOW | NORMAL | NORMAL | NORMAL | LOW | FALSE | NORMAL | ZERO | NORMAL | HIGH | ... | HIGH | LOW | NORMAL | FALSE | FALSE | HIGH | NORMAL | LOW | FALSE | NORMAL |
5 rows × 37 columns
[5]:
# All values of HR and CVP should be set to HIGH and NORMAL respectively.
print(all(samples.HR == "HIGH"))
print(all(samples.CVP == "NORMAL"))
True
True
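One simple way to obtain samples that agree with hard evidence is rejection sampling: draw from the joint distribution and keep only the draws that match the evidence. The sketch below does this on a hypothetical network A → B. All numbers and names are illustrative assumptions, and the sketch is not meant to describe how simulate works internally.

```python
import random

# Hypothetical toy network A -> B; all probabilities are made up.
P_A = {"yes": 0.3, "no": 0.7}
P_B_GIVEN_A = {
    "yes": {"high": 0.9, "low": 0.1},
    "no": {"high": 0.2, "low": 0.8},
}


def draw(dist, rng):
    """Draw one state from a {state: probability} mapping."""
    states = list(dist)
    return rng.choices(states, weights=[dist[s] for s in states], k=1)[0]


def rejection_sample(n_samples, evidence_b, seed=0):
    """Keep only joint draws whose B matches the hard evidence."""
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < n_samples:
        a = draw(P_A, rng)
        b = draw(P_B_GIVEN_A[a], rng)
        if b == evidence_b:  # discard draws that contradict the evidence
            accepted.append({"A": a, "B": b})
    return accepted


cond_samples = rejection_sample(20_000, "high")
freq_a_yes = sum(r["A"] == "yes" for r in cond_samples) / len(cond_samples)
# Exact posterior by Bayes' rule: 0.27 / 0.41
exact_a_yes = (P_A["yes"] * P_B_GIVEN_A["yes"]["high"]) / sum(
    P_A[a] * P_B_GIVEN_A[a]["high"] for a in P_A
)
print(exact_a_yes, freq_a_yes)
```

Every accepted row has B fixed to the evidence value, while A follows its conditional distribution given that evidence.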
3. Simulation under soft/virtual evidence
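Unlike hard evidence, which fixes the specified variables to the given values, virtual (soft) evidence expresses uncertainty about a variable's state as a distribution over its states. This evidence is combined with the model's own distribution rather than replacing it, so the marginal of the variable in the simulated samples need not equal the specified values, as the CVP output below shows.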
[6]:
# The virtual evidence is specified using TabularCPDs. The values follow the order of state_names: LOW = 0.2, NORMAL = 0.3, and HIGH = 0.5.
cvp_evidence = TabularCPD(variable="CVP",
                          variable_card=3,
                          values=[[0.2], [0.3], [0.5]],
                          state_names={"CVP": ["LOW", "NORMAL", "HIGH"]})
samples = alarm.simulate(n_samples=int(1e4), virtual_evidence=[cvp_evidence])
[7]:
# Check the marginal distribution of CVP
get_distribution(samples, variables=['CVP'])
[7]:
CVP
HIGH 0.2414
LOW 0.0692
NORMAL 0.6894
dtype: float64
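The sampled marginal above differs from the supplied values (0.2, 0.3, 0.5). One common reading of virtual evidence, assumed here, treats the supplied values as likelihoods that reweight the model's own marginal: P'(x) ∝ P(x) · L(x). The sketch below works through that arithmetic with a made-up prior for CVP. The prior is hypothetical and is not the marginal of the alarm model, and apply_virtual_evidence is an illustrative helper, not a pgmpy function.

```python
def apply_virtual_evidence(prior, likelihood):
    """Reweight a prior by per-state likelihoods and renormalize."""
    unnormalized = {state: prior[state] * likelihood[state] for state in prior}
    total = sum(unnormalized.values())
    return {state: weight / total for state, weight in unnormalized.items()}


# Hypothetical prior marginal of CVP (made up for this example).
prior_cvp = {"LOW": 0.1, "NORMAL": 0.7, "HIGH": 0.2}
# Virtual evidence values, in the same state order as the cell above.
likelihood_cvp = {"LOW": 0.2, "NORMAL": 0.3, "HIGH": 0.5}

posterior_cvp = apply_virtual_evidence(prior_cvp, likelihood_cvp)
# Products are 0.02, 0.21, 0.10 (sum 0.33), so the posterior is
# approximately LOW 0.061, NORMAL 0.636, HIGH 0.303.
print({state: round(p, 3) for state, p in posterior_cvp.items()})
```

Under this reading, a state that the model already considers likely (NORMAL here) can stay the most probable one even when the evidence favours another state.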
4. Simulation under specified intervention
Using the do argument, users can specify interventions on the model. The values of the intervened variables are set to the specified values, and all incoming edges to these variables are removed from the model.
[8]:
samples = alarm.simulate(n_samples=int(1e4), do={"CVP": "NORMAL", "HR": "HIGH"})
samples.head()
[8]:
| | TPR | PAP | MINVOL | HREKG | EXPCO2 | DISCONNECT | VENTMACH | VENTLUNG | LVEDVOLUME | HR | ... | SHUNT | VENTTUBE | MINVOLSET | LVFAILURE | ERRLOWOUTPUT | HRBP | FIO2 | BP | HISTORY | STROKEVOLUME |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HIGH | NORMAL | ZERO | HIGH | LOW | FALSE | NORMAL | ZERO | LOW | HIGH | ... | NORMAL | LOW | NORMAL | FALSE | FALSE | NORMAL | NORMAL | HIGH | FALSE | NORMAL |
1 | NORMAL | NORMAL | ZERO | HIGH | LOW | FALSE | NORMAL | ZERO | LOW | HIGH | ... | NORMAL | LOW | NORMAL | TRUE | FALSE | HIGH | NORMAL | LOW | FALSE | LOW |
2 | NORMAL | NORMAL | ZERO | HIGH | LOW | FALSE | NORMAL | ZERO | LOW | HIGH | ... | NORMAL | LOW | NORMAL | TRUE | FALSE | HIGH | NORMAL | LOW | TRUE | LOW |
3 | LOW | NORMAL | ZERO | HIGH | LOW | FALSE | NORMAL | ZERO | NORMAL | HIGH | ... | NORMAL | LOW | NORMAL | FALSE | FALSE | HIGH | LOW | LOW | FALSE | NORMAL |
4 | NORMAL | NORMAL | ZERO | HIGH | HIGH | FALSE | NORMAL | ZERO | HIGH | HIGH | ... | NORMAL | LOW | NORMAL | FALSE | FALSE | HIGH | NORMAL | NORMAL | FALSE | LOW |
5 rows × 37 columns
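The difference between evidence and intervention shows up for the ancestors of the fixed variable. In the hypothetical network A → B below (all numbers are made up), observing B = high changes the distribution of A through Bayes' rule, whereas do(B = high) cuts the edge A → B and leaves A at its prior. The helper names are illustrative and not part of pgmpy.

```python
# Hypothetical toy network A -> B; all probabilities are made up.
P_A = {"yes": 0.3, "no": 0.7}
P_B_GIVEN_A = {
    "yes": {"high": 0.9, "low": 0.1},
    "no": {"high": 0.2, "low": 0.8},
}


def p_a_given_evidence(b):
    """P(A | B = b): conditioning propagates information back to A."""
    joint = {a: P_A[a] * P_B_GIVEN_A[a][b] for a in P_A}
    total = sum(joint.values())
    return {a: p / total for a, p in joint.items()}


def p_a_under_do(b):
    """P(A | do(B = b)): the edge A -> B is removed, so A keeps its prior."""
    return dict(P_A)


observed = p_a_given_evidence("high")
intervened = p_a_under_do("high")
print(observed["yes"])    # 0.27 / 0.41, roughly 0.659
print(intervened["yes"])  # 0.3
```

In both cases B itself is fixed to high in every sample; only the distribution of the variables upstream of B differs.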
5. Simulation under soft/virtual intervention
Similar to virtual evidence, users can also specify virtual interventions as TabularCPDs through the virtual_intervention argument.
[9]:
cvp_intervention = TabularCPD(variable="CVP",
                              variable_card=3,
                              values=[[0.2], [0.3], [0.5]],
                              state_names={"CVP": ["LOW", "NORMAL", "HIGH"]})
samples = alarm.simulate(n_samples=int(1e4), virtual_intervention=[cvp_intervention])
get_distribution(samples, variables=["CVP"]) # P(CVP)
[9]:
CVP
HIGH 0.3814
LOW 0.2110
NORMAL 0.4076
dtype: float64
6. Partial samples
Lastly, users can also pass previously generated data for some variables (for example, from another simulation) to the simulation. This is equivalent to separately specifying evidence for each sample that is generated.
[10]:
import numpy as np
import pandas as pd

# Generate some data on CVP.
partial_cvp = pd.DataFrame(np.random.choice(["LOW", "NORMAL", "HIGH"], int(1e4)), columns=['CVP'])
samples = alarm.simulate(n_samples=int(1e4), partial_samples=partial_cvp)
[ ]:
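# Each simulated CVP value should match the provided partial sample in the same row.
print(all(samples.CVP.to_numpy() == partial_cvp.CVP.to_numpy()))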
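The per-row view of partial samples can be sketched on a hypothetical network A → B: the values of A are supplied, and B is drawn for each row given that row's A. The network, the probabilities, and the helper complete_partial_samples are illustrative assumptions, not part of pgmpy.

```python
import random

# Hypothetical conditional distribution P(B | A); all numbers are made up.
P_B_GIVEN_A = {
    "yes": {"high": 0.9, "low": 0.1},
    "no": {"high": 0.2, "low": 0.8},
}


def draw(dist, rng):
    """Draw one state from a {state: probability} mapping."""
    states = list(dist)
    return rng.choices(states, weights=[dist[s] for s in states], k=1)[0]


def complete_partial_samples(partial_a, seed=0):
    """Keep each supplied A value and sample B conditional on it, row by row."""
    rng = random.Random(seed)
    return [{"A": a, "B": draw(P_B_GIVEN_A[a], rng)} for a in partial_a]


# Pre-generated values for A, e.g. produced by some other simulation.
rng = random.Random(1)
partial_a = [rng.choice(["yes", "no"]) for _ in range(1_000)]

completed = complete_partial_samples(partial_a)
# The supplied column is reproduced unchanged, in the same row order.
print([row["A"] for row in completed] == partial_a)
```

Each row behaves like a separate simulation with hard evidence on A, which is why the supplied column can be compared row by row with the output.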