Expectation Maximization (EM)

class pgmpy.estimators.ExpectationMaximization(model, data, **kwargs)

Class used to compute parameters for a model using Expectation Maximization (EM).

EM is an iterative algorithm commonly used for parameter estimation when the model contains latent variables. Each iteration updates the parameter estimates so that the likelihood of the given data does not decrease.
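
To make the iteration concrete, here is a minimal, self-contained sketch of EM on a toy problem that is unrelated to pgmpy's internals: two coins with unknown biases, where each trial uses one coin chosen with equal probability and the coin's identity is the latent variable. The function name `em_two_coins`, the data, and the starting values are all assumptions made for illustration.

```python
import math


def em_two_coins(heads, n_flips, theta_a, theta_b, n_iter=50):
    """Toy EM: estimate the heads probability of two coins when the coin
    used in each trial is unobserved (equal prior on both coins)."""
    log_liks = []
    for _ in range(n_iter):
        heads_a = flips_a = heads_b = flips_b = 0.0
        log_lik = 0.0
        for h in heads:
            # Likelihood of this trial under each coin (the binomial
            # coefficient is a constant factor and is omitted).
            like_a = theta_a ** h * (1 - theta_a) ** (n_flips - h)
            like_b = theta_b ** h * (1 - theta_b) ** (n_flips - h)
            log_lik += math.log(0.5 * like_a + 0.5 * like_b)
            # E-step: posterior probability that coin A produced the trial.
            w_a = like_a / (like_a + like_b)
            heads_a += w_a * h
            flips_a += w_a * n_flips
            heads_b += (1 - w_a) * h
            flips_b += (1 - w_a) * n_flips
        log_liks.append(log_lik)
        # M-step: weighted maximum-likelihood update of both biases.
        theta_a = heads_a / flips_a
        theta_b = heads_b / flips_b
    return theta_a, theta_b, log_liks


# Heads observed in five trials of 10 flips each (made-up data).
theta_a, theta_b, log_liks = em_two_coins([5, 9, 8, 4, 7], 10, 0.6, 0.5)
print(theta_a, theta_b)
```

The recorded `log_liks` values never decrease from one iteration to the next, which is the property the description above refers to. EM only guarantees a local optimum, so the result depends on the starting values.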

Parameters:
  • model (A pgmpy.models.BayesianNetwork instance) – The model whose parameters are to be estimated.

  • data (pandas DataFrame object) – DataFrame whose column names are identical to the variable names of the network. Missing values must be set to numpy.nan. Note that pandas converts each column containing numpy.nan values to dtype float.

  • state_names (dict (optional)) – A dict indicating, for each variable, the discrete set of states that the variable can take. If unspecified, the observed values in the data set are taken to be the only possible states.
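
The two data-related notes above can be checked with pandas alone. The sketch below uses a made-up DataFrame; the dictionary it builds only illustrates what "observed values" means for each column and is not pgmpy's internal code.

```python
import numpy as np
import pandas as pd

# Column "A" has one missing cell, column "C" is complete (made-up data).
data = pd.DataFrame({"A": [0, 1, np.nan, 0], "C": [1, 0, 1, 1]})

# The NaN forces "A" to a float dtype, while "C" stays an integer column.
print(data.dtypes)

# States that actually occur in each column, ignoring missing cells.
observed_states = {
    col: sorted(data[col].dropna().unique().tolist()) for col in data.columns
}
print(observed_states)
```

Because of the float conversion, the states observed in "A" are floats rather than integers. If a state never occurs in the data, it will be absent from such a mapping, which is the situation where passing state_names explicitly matters.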

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.estimators import ExpectationMaximization
>>> data = pd.DataFrame(np.random.randint(low=0, high=2, size=(1000, 5)),
...                       columns=['A', 'B', 'C', 'D', 'E'])
>>> model = BayesianNetwork([('A', 'B'), ('C', 'B'), ('C', 'D'), ('B', 'E')])
>>> estimator = ExpectationMaximization(model, data)

get_parameters(latent_card=None, max_iter=100, atol=1e-08, n_jobs=1, batch_size=1000, seed=None, init_cpds={}, show_progress=True)

Method to estimate all model parameters (CPDs) using Expectation Maximization.

Parameters:
  • latent_card (dict (default: None)) – A dictionary of the form {latent_var: cardinality} specifying the cardinality (number of states) of each latent variable. If None, assumes 2 states for each latent variable.

  • max_iter (int (default: 100)) – The maximum number of iterations the algorithm is allowed to run. If max_iter is reached, the most recent parameter estimates are returned.

  • atol (float (default: 1e-08)) – The absolute tolerance used to check convergence. If the parameters change by less than atol in an iteration, the algorithm exits.

  • n_jobs (int (default: 1)) – Number of jobs to run in parallel. Using n_jobs > 1 for small models or datasets might be slower.

  • batch_size (int (default: 1000)) – Number of data points (rows) used to compute weights in a batch.

  • seed (int) – The random seed to use for generating the initial values.

  • init_cpds (dict) – A dictionary of the form {variable: instance of TabularCPD} specifying the initial CPD values for the EM optimizer to start with. If not specified, CPDs involving latent variables are initialized randomly, and CPDs involving only observed variables are initialized with their MLE estimates.

  • show_progress (boolean (default: True)) – Whether to show a progress bar for iterations.
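
The interaction between max_iter and atol described above can be sketched as a generic stopping rule. This is an illustration of the documented behaviour under simplifying assumptions, not pgmpy's actual convergence check: the helper `iterate_until_converged` and the stand-in `toy_update` are hypothetical, and parameters are represented as a flat list of floats.

```python
def iterate_until_converged(update, params, max_iter=100, atol=1e-8):
    """Apply `update` repeatedly. Stop early once the largest absolute
    change of any parameter is below `atol`; otherwise stop at `max_iter`
    and return the most recent parameters."""
    for n_iter in range(1, max_iter + 1):
        new_params = update(params)
        change = max(abs(new - old) for new, old in zip(new_params, params))
        params = new_params
        if change < atol:
            return params, n_iter, True
    return params, max_iter, False


def toy_update(params):
    # Stand-in for one EM iteration: a contraction with fixed point 0.2.
    return [0.5 * p + 0.1 for p in params]


params, n_iter, converged = iterate_until_converged(toy_update, [0.9, 0.3])
capped, capped_iter, capped_ok = iterate_until_converged(
    toy_update, [0.9, 0.3], max_iter=3
)
print(converged, n_iter, capped_ok, capped_iter)
```

With the default settings the toy update stops early because the change drops below atol. With max_iter=3 it stops at the cap and still returns the latest values, mirroring the max_iter description above.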

Returns:

Estimated parameters (CPDs) – A list of estimated CPDs for the model.

Return type:

list

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from pgmpy.models import BayesianNetwork
>>> from pgmpy.estimators import ExpectationMaximization as EM
>>> data = pd.DataFrame(np.random.randint(low=0, high=2, size=(1000, 3)),
...                       columns=['A', 'C', 'D'])
>>> model = BayesianNetwork([('A', 'B'), ('C', 'B'), ('C', 'D')], latents={'B'})
>>> estimator = EM(model, data)
>>> estimator.get_parameters(latent_card={'B': 3})
[<TabularCPD representing P(C:2) at 0x7f7b534251d0>,
<TabularCPD representing P(B:3 | C:2, A:2) at 0x7f7b4dfd4da0>,
<TabularCPD representing P(A:2) at 0x7f7b4dfd4fd0>,
<TabularCPD representing P(D:2 | C:2) at 0x7f7b4df822b0>]