Structural Equation Models (SEM)¶

class pgmpy.models.SEM.SEM(syntax, **kwargs)[source]¶

Class for representing Structural Equation Models. This class is a wrapper over SEMGraph and SEMAlg to provide a consistent API over the different representations.

model¶

A graphical representation of the model.

Type: SEMGraph instance

fit()[source]¶

Estimates the CPD for each variable based on a given data set.

Parameters:

data (pandas DataFrame object) – DataFrame object with column names identical to the variable names of the network. If some values in the data are missing, the corresponding cells should be set to numpy.nan. Note that pandas converts each column containing numpy.nan values to dtype float.
estimator (Estimator class) – One of:
- MaximumLikelihoodEstimator (default)
- BayesianEstimator: In this case, pass ‘prior_type’ and either ‘pseudo_counts’ or ‘equivalent_sample_size’ as additional keyword arguments. See BayesianEstimator.get_parameters() for usage.
- ExpectationMaximization
state_names (dict (optional)) – A dict indicating, for each variable, the discrete set of states that the variable can take. If unspecified, the observed values in the data set are taken to be the only possible states.
n_jobs (int (default: 1)) – Number of threads/processes to use for estimation. Using n_jobs > 1 for small models or datasets might be slower.

Returns:

Fitted Model – Returns a DiscreteBayesianNetwork object with learned CPDs. The DAG structure is preserved, and parameters (CPDs) are added. This allows the DAG to represent both the structure and the parameters of a Bayesian Network.

Return type:

DiscreteBayesianNetwork
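As a rough illustration of what maximum-likelihood estimation of a CPD amounts to for discrete data, the sketch below computes relative frequencies per parent configuration on a three-row toy data set. It is plain Python, not pgmpy's estimator, and the helper name `mle_cpd` is hypothetical.

```python
from collections import Counter, defaultdict


def mle_cpd(rows, child, parents):
    """Relative-frequency estimate of P(child | parents) from a list of dict rows."""
    counts = defaultdict(Counter)
    for row in rows:
        key = tuple(row[p] for p in parents)  # observed parent configuration
        counts[key][row[child]] += 1
    return {
        key: {state: n / sum(c.values()) for state, n in c.items()}
        for key, c in counts.items()
    }


rows = [
    {"A": 0, "B": 0, "C": 1},
    {"A": 0, "B": 1, "C": 1},
    {"A": 1, "B": 0, "C": 0},
]
cpd_c = mle_cpd(rows, "C", ["A", "B"])  # P(C | A, B)
cpd_a = mle_cpd(rows, "A", [])          # P(A); the only key is the empty tuple
```

Parent configurations that never occur in the data (here A=1, B=1) get no entry in this sketch, and only states that were actually observed appear, which mirrors the default behaviour described for state_names.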

Examples

>>> import pandas as pd
>>> from pgmpy.models import DiscreteBayesianNetwork
>>> from pgmpy.base import DAG
>>> data = pd.DataFrame(data={"A": [0, 0, 1], "B": [0, 1, 0], "C": [1, 1, 0]})
>>> model = DAG([("A", "C"), ("B", "C")])
>>> fitted_model = model.fit(data)
>>> fitted_model.get_cpds()
[<TabularCPD representing P(A:2) at 0x17945372c30>,
<TabularCPD representing P(B:2) at 0x17945a19760>,
<TabularCPD representing P(C:2 | A:2, B:2) at 0x17944f42690>]

classmethod from_RAM(variables, B, zeta, observed=None, wedge_y=None, fixed_values=None)[source]¶

Initializes a SEM instance using Reticular Action Model (RAM) notation. The model is defined as:

$$
\begin{aligned}
\mathbf{\eta} &= \mathbf{B \eta} + \mathbf{\epsilon} \\
\mathbf{y} &= \wedge_y \mathbf{\eta} \\
\zeta &= COV(\mathbf{\epsilon})
\end{aligned}
$$

where $\mathbf{\eta}$ is the set of variables (both latent and observed), $\mathbf{\epsilon}$ are the error terms, $\zeta$ is the covariance matrix of the error terms, $\mathbf{y}$ is the set of observed variables, and $\wedge_y$ is a boolean array of shape (number of observed variables, total number of variables).
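Solving the first equation for $\mathbf{\eta}$ gives $\mathbf{\eta} = (I - B)^{-1} \mathbf{\epsilon}$, so the model-implied covariance of the observed variables is $\wedge_y (I - B)^{-1} \zeta (I - B)^{-T} \wedge_y^T$. The NumPy sketch below evaluates this for a made-up three-variable chain. The numeric values are assumptions for illustration only, and numeric matrices are used here where the parameters above expect boolean masks.

```python
import numpy as np

# Hypothetical chain v0 -> v1 -> v2, with v1 treated as latent.
B = np.array([
    [0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],  # v1 = 0.5 * v0 + eps1
    [0.0, 2.0, 0.0],  # v2 = 2.0 * v1 + eps2
])
zeta = np.eye(3)  # uncorrelated error terms with unit variance

# Each row selects one observed variable (v0 and v2) from all variables.
wedge_y = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
])

inv = np.linalg.inv(np.eye(3) - B)     # (I - B)^{-1}
cov_eta = inv @ zeta @ inv.T           # covariance of all variables
cov_y = wedge_y @ cov_eta @ wedge_y.T  # covariance of the observed variables
```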

Parameters:

variables (list, array-like) – List of variables (both latent and observed) in the model.
B (2-D boolean array (shape: len(variables) x len(variables))) – The non-zero parameters in the $B$ matrix. See the model definition above for details.
zeta (2-D boolean array (shape: len(variables) x len(variables))) – The non-zero parameters in the $\zeta$ (error covariance) matrix. See the model definition above for details.
observed (list, array-like (optional: Either observed or wedge_y needs to be specified)) – List of observed variables in the model.
wedge_y (2-D array (shape: no. observed x total vars) (optional: Either observed or wedge_y)) – The $\wedge_y$ matrix. See the model definition above for details.
fixed_values (dict (optional)) – If specified, these parameter values are fixed and are not changed during estimation. A dict with the keys B and zeta.

Returns:

An instance of the object with initialized values.

Return type:

pgmpy.models.SEM instance

Examples

>>> from pgmpy.models import SEM
>>> SEM.from_RAM  # TODO: Finish this

classmethod from_graph(ebunch, latents=[], err_corr=[], err_var={})[source]¶

Initializes a SEM instance using graphical structure.

Parameters:

ebunch (list/array-like) –
List of edges in the form of tuples. Each tuple can take one of two forms:
1. (u, v): Adds an edge from u to v without setting any parameter for the edge.
2. (u, v, parameter): Adds an edge from u to v and sets the edge’s parameter to parameter.
latents (list/array-like) – List of nodes which are latent. All other variables are considered observed.
err_corr (list/array-like) –
List of tuples representing edges between error terms. Each tuple can take one of two forms:
1. (u, v): Adds a correlation between the error terms of u and v without setting any variance or covariance values.
2. (u, v, covar): Adds a correlation between the error terms of u and v and sets the parameter to covar.
err_var (dict) – Dict of the form {var: variance}.
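The two accepted tuple forms can be normalized into a single edge-to-parameter mapping. The sketch below is one way to do that in plain Python. The helper name `split_ebunch` is hypothetical, and using NaN as the marker for an unset parameter is an assumption of this sketch rather than a statement about pgmpy's internals.

```python
import math


def split_ebunch(ebunch):
    """Map each edge (u, v) to its parameter, using NaN when none was given."""
    params = {}
    for edge in ebunch:
        if len(edge) == 2:
            u, v = edge
            params[(u, v)] = math.nan  # parameter left free
        elif len(edge) == 3:
            u, v, weight = edge
            params[(u, v)] = float(weight)
        else:
            raise ValueError(f"Expected (u, v) or (u, v, parameter), got {edge!r}")
    return params


params = split_ebunch([("age", "deferenc"), ("intelligence", "academic", 0.8)])
```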

Examples

Defining a model (Union sentiment model [1]) without setting any parameters.

>>> from pgmpy.models import SEM
>>> sem = SEM.from_graph(
...     ebunch=[
...         ("deferenc", "unionsen"),
...         ("laboract", "unionsen"),
...         ("yrsmill", "unionsen"),
...         ("age", "deferenc"),
...         ("age", "laboract"),
...         ("deferenc", "laboract"),
...     ],
...     latents=[],
...     err_corr=[("yrsmill", "age")],
...     err_var={},
... )

Defining a model (Education [2]) with all the parameters set. To leave a parameter unset, pass np.nan explicitly.

>>> sem_edu = SEM.from_graph(
...     ebunch=[
...         ("intelligence", "academic", 0.8),
...         ("intelligence", "scale_1", 0.7),
...         ("intelligence", "scale_2", 0.64),
...         ("intelligence", "scale_3", 0.73),
...         ("intelligence", "scale_4", 0.82),
...         ("academic", "SAT_score", 0.98),
...         ("academic", "High_school_gpa", 0.75),
...         ("academic", "ACT_score", 0.87),
...     ],
...     latents=["intelligence", "academic"],
...     err_corr=[],
...     err_var={},
... )

References

[1] McDonald, J. A., & Clelland, D. A. (1984). Textile Workers and Union Sentiment. Social Forces, 63(2), 502–521.
[2] https://en.wikipedia.org/wiki/Structural_equation_modeling#/media/File:Example_Structural_equation_model.svg

classmethod from_lavaan(string=None, filename=None)[source]¶

Initializes a SEM instance using lavaan syntax.

Parameters:

string (str (default: None)) – A lavaan-style multiline set of regression equations representing the model. Refer to http://lavaan.ugent.be/tutorial/syntax1.html for details.
filename (str (default: None)) – The filename of the file containing the model in lavaan syntax.
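To give a feel for how a lavaan-style model string relates to the graph-based arguments (ebunch, latents, err_corr), the sketch below parses a very small subset of the syntax: `=~` (latent measured by indicators), `~` (regression), and `~~` (covariance), without modifiers such as `0.5*x`. It is a simplified illustration and not the parser pgmpy uses; the function name `parse_lavaan` and the variable names in the model string are hypothetical.

```python
def parse_lavaan(text):
    """Parse a tiny subset of lavaan syntax into (ebunch, latents, err_corr)."""
    ebunch, latents, err_corr = [], [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        # Check the two-character operators before the plain "~".
        if "=~" in line:
            lhs, rhs = line.split("=~")
            latent = lhs.strip()
            latents.append(latent)
            ebunch += [(latent, ind.strip()) for ind in rhs.split("+")]
        elif "~~" in line:
            lhs, rhs = line.split("~~")
            err_corr += [(lhs.strip(), v.strip()) for v in rhs.split("+")]
        elif "~" in line:
            lhs, rhs = line.split("~")
            # "y ~ x" means x is a predictor of y, i.e. an edge x -> y.
            ebunch += [(pred.strip(), lhs.strip()) for pred in rhs.split("+")]
        else:
            raise ValueError(f"Unrecognized line: {raw!r}")
    return ebunch, latents, err_corr


model_text = """
# measurement model
ability =~ test1 + test2
# regression
grade ~ ability + age
# residual covariance
test1 ~~ test2
"""
ebunch, latents, err_corr = parse_lavaan(model_text)
```

Variance statements of the form `x ~~ x` are not treated specially here; a fuller parser would route them to err_var instead.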

Examples

classmethod from_lisrel(var_names, params, fixed_masks=None)[source]¶

Initializes a SEM instance using LISREL notation. The LISREL notation is defined as:

$$
\begin{aligned}
\mathbf{\eta} &= \mathbf{B \eta} + \mathbf{\Gamma \xi} + \mathbf{\zeta} \\
\mathbf{y} &= \mathbf{\wedge_y \eta} + \mathbf{\epsilon} \\
\mathbf{x} &= \mathbf{\wedge_x \xi} + \mathbf{\delta}
\end{aligned}
$$

where $\mathbf{\eta}$ is the set of endogenous variables, $\mathbf{\xi}$ is the set of exogenous variables, and $\mathbf{y}$ and $\mathbf{x}$ are the sets of measurement variables for $\mathbf{\eta}$ and $\mathbf{\xi}$ respectively. $\mathbf{\zeta}$, $\mathbf{\epsilon}$, and $\mathbf{\delta}$ are the error terms for $\mathbf{\eta}$, $\mathbf{y}$, and $\mathbf{x}$ respectively.
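The three LISREL equations can be written as a single RAM-style system by stacking all variables into one vector. The NumPy sketch below shows one common way to do this, assuming the variable order (eta, xi, y, x). That ordering, the helper name `lisrel_to_ram_B`, and the numeric values are assumptions of this sketch and may differ from what pgmpy does internally.

```python
import numpy as np


def lisrel_to_ram_B(B, gamma, wedge_y, wedge_x):
    """Stack LISREL coefficient matrices into one RAM-style coefficient matrix.

    Variable order assumed by this sketch: eta, xi, y, x.
    """
    m, n = gamma.shape                          # number of eta and xi variables
    p, q = wedge_y.shape[0], wedge_x.shape[0]   # number of y and x variables
    Z = np.zeros
    return np.block([
        [B,         gamma,     Z((m, p)), Z((m, q))],  # eta = B eta + Gamma xi
        [Z((n, m)), Z((n, n)), Z((n, p)), Z((n, q))],  # xi has no parents
        [wedge_y,   Z((p, n)), Z((p, p)), Z((p, q))],  # y = wedge_y eta
        [Z((q, m)), wedge_x,   Z((q, p)), Z((q, q))],  # x = wedge_x xi
    ])


# One eta, one xi, two indicators each (values are made up).
A = lisrel_to_ram_B(
    B=np.array([[0.0]]),
    gamma=np.array([[0.4]]),
    wedge_y=np.array([[1.0], [0.8]]),
    wedge_x=np.array([[1.0], [0.6]]),
)
```

The error terms zeta, epsilon, and delta, together with xi itself, then play the role of the RAM error vector.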

Parameters:

str_model (str (default: None)) –
A lavaan-style multiline set of regression equations representing the model. Refer to http://lavaan.ugent.be/tutorial/syntax1.html for details.

If None, var_names and params must be specified.
var_names (dict (default: None)) – A dict with the keys eta, xi, y, and x. Each key should map to a list of variable names.
params (dict (default: None)) –
A dict of the non-zero parameters in the LISREL representation. Must contain the following keys: B, gamma, wedge_y, wedge_x, phi, theta_e, theta_del, and psi.

If None, str_model must be specified.
fixed_params (dict (default: None)) –
A dict of fixed values for parameters. The shape of the parameters should be the same as in params.

If None, all the parameters are learnable.

Returns:

An instance of the object with initialized values.

Return type:

pgmpy.models.SEM instance

Examples

>>> from pgmpy.models import SEMAlg
# TODO: Finish this example

class pgmpy.models.SEM.SEMAlg(eta=None, B=None, zeta=None, wedge_y=None, fixed_values=None)[source]¶

Base class for algebraic representation of Structural Equation Models (SEMs). The model is represented using the Reticular Action Model (RAM).

generate_samples(n_samples=100)[source]¶

Generates random samples from the model.

Parameters: n_samples (int) – The number of samples to generate.
Returns: The generated samples.
Return type: pd.DataFrame
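A minimal sketch of how samples can be drawn from a RAM model with known parameters: draw the error terms and push them through $(I - B)^{-1}$. Gaussian error terms are an assumption of this sketch (the text above does not state the distribution), the helper name `sample_ram` is hypothetical, and the result is a plain array rather than a DataFrame.

```python
import numpy as np


def sample_ram(B, zeta, n_samples=100, seed=None):
    """Draw samples of eta from eta = B eta + eps, with eps ~ N(0, zeta)."""
    rng = np.random.default_rng(seed)
    k = B.shape[0]
    eps = rng.multivariate_normal(np.zeros(k), zeta, size=n_samples)
    # eta = (I - B)^{-1} eps, applied to each sampled row of error terms.
    return eps @ np.linalg.inv(np.eye(k) - B).T


B = np.array([[0.0, 0.0], [0.5, 0.0]])  # v1 = 0.5 * v0 + eps1
zeta = np.eye(2)
samples = sample_ram(B, zeta, n_samples=50_000, seed=0)
sample_cov = np.cov(samples, rowvar=False)
```

With enough samples, `sample_cov` should be close to the implied covariance $(I - B)^{-1} \zeta (I - B)^{-T}$.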

set_params(B, zeta)[source]¶

Sets the fixed parameters of the model.

Parameters:

B (2D array) – The B matrix.
zeta (2D array) – The covariance matrix.

to_SEMGraph()[source]¶

Creates a graph structure from the LISREL representation.

Returns: A path model of the model.
Return type: pgmpy.models.SEMGraph instance

Examples

>>> from pgmpy.models import SEMAlg
>>> model = SEMAlg()
# TODO: Finish this example

class pgmpy.models.SEM.SEMGraph(ebunch=[], latents=[], err_corr=[], err_var={})[source]¶

Base class for graphical representation of Structural Equation Models (SEMs).

Every variable is assumed by default to have an associated latent error variable, so error terms do not need to be specified.

Parameters:

ebunch (list/array-like) –
List of edges in the form of tuples. Each tuple can take one of two forms:
1. (u, v): Adds an edge from u to v without setting any parameter for the edge.
2. (u, v, parameter): Adds an edge from u to v and sets the edge’s parameter to parameter.
latents (list/array-like) – List of nodes which are latent. All other variables are considered observed.
err_corr (list/array-like) –
List of tuples representing edges between error terms. Each tuple can take one of two forms:
1. (u, v): Adds a correlation between the error terms of u and v without setting any variance or covariance values.
2. (u, v, covar): Adds a correlation between the error terms of u and v and sets the parameter to covar.
err_var (dict (variable: variance)) – Sets the variance of the error terms in the model.

Examples

Defining a model (Union sentiment model [1]) without setting any parameters:

>>> from pgmpy.models import SEMGraph
>>> sem = SEMGraph(
...     ebunch=[
...         ("deferenc", "unionsen"),
...         ("laboract", "unionsen"),
...         ("yrsmill", "unionsen"),
...         ("age", "deferenc"),
...         ("age", "laboract"),
...         ("deferenc", "laboract"),
...     ],
...     latents=[],
...     err_corr=[("yrsmill", "age")],
...     err_var={},
... )

Defining a model (Education [2]) with all the parameters set. To leave a parameter unset, pass np.nan explicitly.

>>> sem_edu = SEMGraph(
...     ebunch=[
...         ("intelligence", "academic", 0.8),
...         ("intelligence", "scale_1", 0.7),
...         ("intelligence", "scale_2", 0.64),
...         ("intelligence", "scale_3", 0.73),
...         ("intelligence", "scale_4", 0.82),
...         ("academic", "SAT_score", 0.98),
...         ("academic", "High_school_gpa", 0.75),
...         ("academic", "ACT_score", 0.87),
...     ],
...     latents=["intelligence", "academic"],
...     err_corr=[],
...     err_var={"intelligence": 1},
... )

References

[1] McDonald, J. A., & Clelland, D. A. (1984). Textile Workers and Union Sentiment. Social Forces, 63(2), 502–521.
[2] https://en.wikipedia.org/wiki/Structural_equation_modeling#/media/File:Example_Structural_equation_model.svg

latents¶

List of all the latent variables in the model except the error terms.

Type: list

observed¶

List of all the observed variables in the model.

Type: list

graph¶

The graphical structure of the latent and observed variables except the error terms. The parameters are stored in the weight attribute of each edge.

Type: nx.DiGraph

err_graph¶

An undirected graph representing the relations between the error terms of the model. Each node has the same name as its variable but represents that variable’s error term. Variances are stored in the weight attribute of the nodes, and covariances are stored in the weight attribute of the edges.

Type: nx.Graph

full_graph_struct¶

Represents the full graph structure. The names of error terms start with “.”, and a new node whose name starts with “..” is added for each correlation.

Type: nx.DiGraph

active_trail_nodes(variables, observed=[], avoid_nodes=[], struct='full')[source]¶

Finds all the observed variables that are d-connected to variables in the graph structure given by struct, assuming that the variables listed in observed are observed.

Parameters:

variables (str or array like) – Observed variables whose d-connected variables are to be found.
observed (list/array-like) – If given the active trails would be computed assuming these nodes to be observed.
avoid_nodes (list/array-like) – If specified, the algorithm doesn’t account for paths that have influence flowing through the avoid nodes.
struct (str or nx.DiGraph instance) – If “full”, considers correlations between error terms when computing d-connection. If “non_error”, doesn’t consider error correlations when computing d-connection. If an instance of nx.DiGraph, finds d-connected variables on the given graph.

Examples

>>> from pgmpy.models import SEMGraph
>>> model = SEMGraph(
...     ebunch=[
...         ("yrsmill", "unionsen"),
...         ("age", "laboract"),
...         ("age", "deferenc"),
...         ("deferenc", "laboract"),
...         ("deferenc", "unionsen"),
...         ("laboract", "unionsen"),
...     ],
...     latents=[],
...     err_corr=[("yrsmill", "age")],
... )
>>> model.active_trail_nodes("age")

Returns: dict – A dict with variables as the keys and a list of d-connected variables as the values.
Return type: {str: list}

References

Details of the algorithm can be found in ‘Probabilistic Graphical Models: Principles and Techniques’ by Koller and Friedman, page 75, Algorithm 3.1.
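The reachability procedure referenced above can be sketched in plain Python for a DAG given as (parent, child) pairs. This is an illustrative version in the spirit of that algorithm, not pgmpy's implementation. It ignores avoid_nodes and error correlations, and the function signature is hypothetical.

```python
def active_trail_nodes(edges, start, observed=()):
    """Nodes d-connected to `start` given `observed`, for (parent, child) edges."""
    parents, children = {}, {}
    for u, v in edges:
        children.setdefault(u, set()).add(v)
        parents.setdefault(v, set()).add(u)

    observed = set(observed)
    # Phase 1: collect the observed nodes and all of their ancestors.
    ancestors, stack = set(), list(observed)
    while stack:
        node = stack.pop()
        if node not in ancestors:
            ancestors.add(node)
            stack.extend(parents.get(node, ()))

    # Phase 2: traverse (node, direction) pairs along active trails.
    # "up" means the node was reached from a child, "down" from a parent.
    to_visit, visited, reachable = {(start, "up")}, set(), set()
    while to_visit:
        node, direction = to_visit.pop()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node not in observed:
            reachable.add(node)
        if direction == "up" and node not in observed:
            to_visit |= {(p, "up") for p in parents.get(node, ())}
            to_visit |= {(c, "down") for c in children.get(node, ())}
        elif direction == "down":
            if node not in observed:
                to_visit |= {(c, "down") for c in children.get(node, ())}
            if node in ancestors:
                # A v-structure is active when the collider or one of its
                # descendants is observed.
                to_visit |= {(p, "up") for p in parents.get(node, ())}
    return reachable


chain = [("A", "B"), ("B", "C")]
collider = [("A", "C"), ("B", "C")]
```

For example, `active_trail_nodes(chain, "A", observed=["B"])` blocks the chain at B, while `active_trail_nodes(collider, "A", observed=["C"])` opens the path from A to B. As in the returned dict described above, the start node is included among its own d-connected nodes.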

get_scaling_indicators()[source]¶

Returns a scaling indicator for each of the latent variables in the model. The scaling indicator is chosen randomly among the observed measurement variables of the latent variable.

Examples

>>> from pgmpy.models import SEMGraph
>>> model = SEMGraph(
...     ebunch=[
...         ("xi1", "eta1"),
...         ("xi1", "x1"),
...         ("xi1", "x2"),
...         ("eta1", "y1"),
...         ("eta1", "y2"),
...     ],
...     latents=["xi1", "eta1"],
... )
>>> model.get_scaling_indicators()
{'xi1': 'x1', 'eta1': 'y1'}

Returns: A dict with latent variables as the keys and their scaling indicators as the values.
Return type: dict
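The selection can be sketched in plain Python as follows. For reproducibility this sketch takes the first observed child of each latent variable in edge order, whereas the description above says the choice is random. The function name `scaling_indicators` is hypothetical.

```python
def scaling_indicators(ebunch, latents):
    """Pick one observed child per latent variable as its scaling indicator."""
    latents = set(latents)
    indicators = {}
    for edge in ebunch:
        u, v = edge[0], edge[1]  # ignore an optional third element (the parameter)
        if u in latents and v not in latents and u not in indicators:
            indicators[u] = v
    return indicators


result = scaling_indicators(
    ebunch=[
        ("xi1", "eta1"),
        ("xi1", "x1"),
        ("xi1", "x2"),
        ("eta1", "y1"),
        ("eta1", "y2"),
    ],
    latents=["xi1", "eta1"],
)
```

The edge from xi1 to eta1 is skipped because eta1 is itself latent and so cannot serve as a measurement variable.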

moralize(graph='full')[source]¶

TODO: This needs to go to a parent class. Removes all the immoralities in the DirectedGraph and creates a moral graph (UndirectedGraph).

A v-structure X->Z<-Y is an immorality if there is no directed edge between X and Y.

Parameters:: graph
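Moralization can be illustrated on a DAG given as (parent, child) pairs: connect every pair of parents that share a child, then drop edge directions. The sketch below is plain Python and returns a set of undirected edges; it is an illustration of the technique, not the method's actual implementation.

```python
from itertools import combinations


def moralize(edges):
    """Return the undirected edges of the moral graph of a DAG."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)

    # Drop directions on the existing edges.
    moral = {frozenset((u, v)) for u, v in edges}
    # "Marry" every pair of parents that share a child.
    for pars in parents.values():
        moral |= {frozenset(pair) for pair in combinations(sorted(pars), 2)}
    return moral


moral_edges = moralize([("X", "Z"), ("Y", "Z")])
```

For the v-structure X->Z<-Y, the result contains the two original edges plus the added edge between X and Y.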

Examples

to_lisrel()[source]¶

Converts the model from a graphical representation to an equivalent algebraic representation. The result is a Reticular Action Model (RAM) representation, which is implemented by the pgmpy.models.SEMAlg class.

Returns: Instance of SEMAlg representing the model.
Return type: SEMAlg instance
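Conceptually, the conversion turns edges into non-zero positions of the $B$ matrix and error correlations into non-zero positions of the $\zeta$ matrix. The plain-Python sketch below builds such boolean masks under the convention $\mathbf{\eta} = \mathbf{B \eta} + \mathbf{\epsilon}$, so an edge u -> v sets the entry in row v, column u. The alphabetical variable ordering and the helper name `graph_to_ram_masks` are assumptions of this sketch, not pgmpy's behaviour.

```python
def graph_to_ram_masks(ebunch, err_corr):
    """Build boolean masks for B and zeta from edges and error correlations."""
    items = list(ebunch) + list(err_corr)
    variables = sorted({name for item in items for name in item[:2]})
    index = {name: i for i, name in enumerate(variables)}
    k = len(variables)

    # B[i][j] is True when variable j has a directed edge into variable i.
    B = [[False] * k for _ in range(k)]
    for edge in ebunch:
        u, v = edge[0], edge[1]
        B[index[v]][index[u]] = True

    # zeta is symmetric: each error term has a variance (diagonal) and each
    # err_corr entry marks a covariance (off-diagonal).
    zeta = [[i == j for j in range(k)] for i in range(k)]
    for pair in err_corr:
        i, j = index[pair[0]], index[pair[1]]
        zeta[i][j] = zeta[j][i] = True
    return variables, B, zeta


variables, B, zeta = graph_to_ram_masks(
    ebunch=[
        ("deferenc", "unionsen"),
        ("laboract", "unionsen"),
        ("yrsmill", "unionsen"),
        ("age", "deferenc"),
        ("age", "laboract"),
        ("deferenc", "laboract"),
    ],
    err_corr=[("yrsmill", "age")],
)
```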

Examples

>>> from pgmpy.models import SEM
>>> sem = SEM.from_graph(
...     ebunch=[
...         ("deferenc", "unionsen"),
...         ("laboract", "unionsen"),
...         ("yrsmill", "unionsen"),
...         ("age", "deferenc"),
...         ("age", "laboract"),
...         ("deferenc", "laboract"),
...     ],
...     latents=[],
...     err_corr=[("yrsmill", "age")],
...     err_var={},
... )
>>> sem.to_lisrel()
# TODO: Complete this.
