Class for constraint-based estimation of DAGs using the PC algorithm
from a given data set. Identifies (conditional) dependencies in the data
set using statistical independence tests and estimates a DAG pattern
that satisfies the identified dependencies. The DAG pattern can then be
completed to a faithful DAG, if possible.
When used with expert knowledge, the following flowchart can help you figure
out the expected results based on different choices of parameters and the
structure learned from the data.
[Flowchart, partially recoverable from the source]
Expert knowledge specified? If no: normal PC run.
Box: 1) Forbidden edges are removed from the skeleton.
2) Required edges will be present in the final model (but direction is not guaranteed).
Box: Conflicts with learned structure (opposite edge orientations)? Branches: Yes / No.
data (pandas DataFrame object) – DataFrame in which each column represents one variable. (If some
values in the data are missing, the data cells should be set to
numpy.nan. Note that pandas converts each column containing
numpy.nan values to dtype float.)
Estimates a DAG/PDAG from the given dataset using the PC algorithm which
is a constraint-based structure learning algorithm [1]. The independencies
in the dataset are identified using statistical independence tests. This
method returns a DAG/PDAG structure which is faithful to the independencies
implied by the dataset.
Parameters:
variant (str (one of "orig", "stable", "parallel")) –
The variant of the PC algorithm to run.
"orig": The original PC algorithm. Might not give the same
results in different runs but performs fewer independence
tests than "stable".
"stable": Gives the same result in every run but needs to
perform more statistical independence tests.
"parallel": Parallel version of PC Stable. Can run on multiple
cores with the same result on each run.
ci_test (str or function) –
The statistical test to use for testing conditional independence in
the dataset. If a str, the value should be one of:
"independence_match": If using this option, an additional parameter
independencies must be specified.
"chi_square": Uses the Chi-Square independence test. This works
only for discrete datasets.
"pearsonr": Uses partial correlation based on the Pearson
correlation coefficient to test independence. This works
only for continuous datasets.
"g_sq": G-test. Works only for discrete datasets.
"log_likelihood": Log-likelihood test. Works only for discrete datasets.
"freeman_tuckey": Freeman-Tukey test. Works only for discrete datasets.
"modified_log_likelihood": Modified log-likelihood test. Works only for discrete datasets.
"neyman": Neyman test. Works only for discrete datasets.
"cressie_read": Cressie-Read test. Works only for discrete datasets.
return_type (str (one of "dag", "cpdag", "pdag", "skeleton")) –
The type of structure to return.
If return_type="pdag" or return_type="cpdag", a partially directed structure
is returned.
If return_type="dag", a fully directed structure is returned if it
is possible to orient all the edges.
If return_type="skeleton", returns an undirected graph along
with the separating sets.
significance_level (float (default: 0.01)) –
The statistical tests compare the p-value of the test against this
value to decide whether the tested variables are independent or
not. Different tests can treat this parameter differently:
Chi-Square: If p-value >= significance_level, it assumes that the
independence condition is satisfied in the data.
pearsonr: If p-value >= significance_level, it assumes that the
independence condition is satisfied in the data.
max_cond_vars (int (default: 5)) – The maximum number of variables to condition on while testing
independence.
expert_knowledge (pgmpy.estimators.ExpertKnowledge instance) – Expert knowledge to be used with the algorithm. Expert knowledge
includes required/forbidden edges in the final graph, temporal
information about the variables, etc. Please refer to the
pgmpy.estimators.ExpertKnowledge class for more details.
enforce_expert_knowledge (bool) – If True, the algorithm modifies the search space according to the
edges specified in the expert knowledge object. This implies the following:
For every edge (u, v) specified in forbidden_edges, there will
be no edge between u and v.
For every edge (u, v) specified in required_edges, one of the
following would be present in the final model: u -> v, u <-
v, or u - v (if CPDAG is returned).
If False, the algorithm attempts to make the edge orientations as
specified by expert knowledge after learning the skeleton. This
implies the following:
For every edge (u, v) specified in forbidden_edges, the final
graph would have either u <- v or no edge, except if u -> v is part
of a collider structure in the learned skeleton.
For every edge (u, v) specified in required_edges, the final graph
would have either u -> v or no edge, except if u <- v is part of a
collider structure in the learned skeleton.
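The search-space restriction described for the enforcing (True) case can be illustrated with plain Python sets. This is a minimal sketch and not pgmpy's implementation: `apply_edge_constraints` is a hypothetical helper, and undirected edges are represented as frozensets purely for illustration.

```python
def apply_edge_constraints(skeleton_edges, forbidden_edges, required_edges):
    """Return undirected skeleton edges after applying edge constraints.

    Hypothetical helper for illustration only; not part of pgmpy.
    """
    # Store undirected edges as frozensets so that (u, v) equals (v, u).
    edges = {frozenset(e) for e in skeleton_edges}
    # A forbidden edge (u, v) removes any adjacency between u and v.
    edges -= {frozenset(e) for e in forbidden_edges}
    # A required edge (u, v) guarantees adjacency; direction is decided later.
    edges |= {frozenset(e) for e in required_edges}
    return edges


learned = [("A", "B"), ("B", "C"), ("C", "D")]
constrained = apply_edge_constraints(
    learned, forbidden_edges=[("C", "B")], required_edges=[("A", "D")]
)
print(sorted(tuple(sorted(e)) for e in constrained))
```

Note that the forbidden edge is given as (C, B) while the learned edge is (B, C); in this sketch the adjacency is removed either way, matching the statement that there will be no edge between u and v.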
n_jobs (int (default: -1)) – The number of jobs to run in parallel.
show_progress (bool (default: True)) – If True, shows a progress bar while running the algorithm.
Returns:
Estimated model –
The estimated model structure:
Partially Directed Graph (PDAG) if return_type="pdag" or return_type="cpdag".
Directed Acyclic Graph (DAG) if return_type="dag".
(nx.Graph, separating sets) if return_type="skeleton".
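The skeleton search that produces the undirected graph and the separating sets can be sketched with an independence oracle standing in for a statistical test. This is an illustrative sketch under stated assumptions, not pgmpy's code: `build_skeleton`, `find_separating_set`, and `oracle` are hypothetical names, edges are returned as a set of frozensets rather than an nx.Graph, and neighbour sets are frozen per conditioning-set size in the spirit of the "stable" variant.

```python
from itertools import combinations


def find_separating_set(x, y, frozen, size, is_independent):
    """Search neighbours of x, then of y, for a separating set of given size."""
    for base in (frozen[x] - {y}, frozen[y] - {x}):
        for z in combinations(sorted(base), size):
            if is_independent(x, y, set(z)):
                return set(z)
    return None


def build_skeleton(nodes, is_independent, max_cond_vars=5):
    """Start from a complete graph and remove edges found to be independent."""
    adj = {n: set(nodes) - {n} for n in nodes}
    separating_sets = {}
    for size in range(max_cond_vars + 1):
        # Freeze neighbour sets for this level so the outcome does not
        # depend on the order in which pairs are tested.
        frozen = {n: set(neigh) for n, neigh in adj.items()}
        for x, y in combinations(nodes, 2):
            if y not in adj[x]:
                continue
            sep = find_separating_set(x, y, frozen, size, is_independent)
            if sep is not None:
                adj[x].discard(y)
                adj[y].discard(x)
                separating_sets[frozenset((x, y))] = sep
    edges = {frozenset((x, y)) for x in adj for y in adj[x]}
    return edges, separating_sets


def oracle(x, y, z):
    """Independencies of the toy DAG A -> C <- B, C -> D (assumed example)."""
    pair = frozenset((x, y))
    if pair == frozenset("AB"):
        return len(z) == 0
    if pair in (frozenset("AD"), frozenset("BD")):
        return "C" in z
    return False


edges, separating_sets = build_skeleton(["A", "B", "C", "D"], oracle)
print(sorted("".join(sorted(e)) for e in edges))
```

With a real dataset, the oracle would be replaced by one of the conditional independence tests compared against significance_level.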
Orients the edges that form v-structures in a graph skeleton
based on information from separating_sets to form a DAG pattern (PDAG).
Parameters:
skeleton (nx.Graph) – An undirected graph skeleton as e.g. produced by the
estimate_skeleton method.
separating_sets (dict) – A dict containing for each pair of not directly connected nodes a
separating set (“witnessing set”) of variables that makes them
conditionally independent.
Returns:
Model after edge orientation – An estimate for the DAG pattern of the BN underlying the data. The
graph might contain some pairs of nodes with edges in both directions (X->Y and Y->X).
Any completion (removing one of the two directed edges for each such
pair) results in an I-equivalent Bayesian network DAG.
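The v-structure orientation step can be sketched as follows. This is a minimal illustration rather than the library's implementation: `orient_colliders` is a hypothetical function, the skeleton is a set of frozenset pairs instead of an nx.Graph, and every non-adjacent pair is assumed to have an entry in `separating_sets`.

```python
from itertools import combinations


def orient_colliders(edges, separating_sets):
    """Orient v-structures x -> z <- y in an undirected skeleton.

    Returns a set of directed (u, v) tuples; an edge that stays undirected
    is kept in both directions, (u, v) and (v, u).
    """
    nodes = sorted(set().union(*edges))
    neighbours = {
        n: {m for e in edges if n in e for m in e if m != n} for n in nodes
    }
    # Start with every skeleton edge present in both directions.
    directed = {(u, v) for e in edges for u in e for v in e if u != v}
    for x, y in combinations(nodes, 2):
        if y in neighbours[x]:
            continue  # only non-adjacent pairs can form a v-structure
        for z in neighbours[x] & neighbours[y]:
            if z not in separating_sets[frozenset((x, y))]:
                # z does not separate x and y, so it is a collider:
                # drop z -> x and z -> y, keeping x -> z <- y.
                directed.discard((z, x))
                directed.discard((z, y))
    return directed


skeleton = {frozenset("AC"), frozenset("BC"), frozenset("CD")}
sepsets = {
    frozenset("AB"): set(),
    frozenset("AD"): {"C"},
    frozenset("BD"): {"C"},
}
pdag_edges = orient_colliders(skeleton, sepsets)
print(sorted(pdag_edges))
```

In this toy input, A -> C and B -> C become directed, while C and D remain connected in both directions, which corresponds to the both-way edges mentioned in the return description.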
name (str) – The name of the test (case-insensitive).
data_types (list of str) – List of data types this test supports (e.g., [‘continuous’, ‘discrete’]).
is_default (bool, default=False) – If True, sets this test as the default for the provided data types.
pgmpy.estimators.CITests.chi_square(X, Y, Z, data, boolean=True, **kwargs)
Perform the Chi-square conditional independence test.
Tests the null hypothesis that X is independent of Y given Z.
Parameters:
X (int, string, hashable object) – A variable name contained in the data set
Y (int, string, hashable object) – A variable name contained in the data set, different from X
Z (list, array-like) – A list of variable names contained in the data set, different from X and Y.
This is the separating set that (potentially) makes X and Y independent.
Default: []
data (pandas.DataFrame) – The dataset on which to test the independence condition.
boolean (bool) – If boolean=True, an additional argument significance_level must
be specified. If the p_value of the test is greater than or equal to
significance_level, returns True. Otherwise returns False.
If boolean=False, returns the chi statistic, the p_value, and the degrees of freedom of the test.
Returns:
result – If boolean=False, returns (chi, p_value, dof).
If boolean=True, returns True if p_value >= significance_level, else False.
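To make the (chi, p_value, dof) return value concrete, here is a stdlib-only sketch of a stratified chi-square test of X ⟂ Y | Z. It is an assumption-laden illustration, not pgmpy's implementation: `chi_square_ci` and `chi2_sf` are hypothetical helpers, data is a list of dicts rather than a DataFrame, no continuity correction is applied, and results need not match the library numerically.

```python
import math
from collections import Counter, defaultdict


def chi2_sf(x, dof):
    """Upper-tail probability of the chi-square distribution (integer dof >= 1)."""
    if x <= 0:
        return 1.0
    if dof % 2 == 0:
        # Closed form for even dof: exp(-x/2) * sum_k (x/2)^k / k!
        term, total = 1.0, 1.0
        for r in range(1, dof // 2):
            term *= x / (2 * r)
            total += term
        return math.exp(-x / 2) * total
    # Odd dof: normal tail plus a finite series.
    chi = math.sqrt(x)
    term = chi * math.sqrt(2 / math.pi) * math.exp(-x / 2)
    total = 0.0
    for r in range(1, (dof - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return math.erfc(chi / math.sqrt(2)) + total


def chi_square_ci(rows, x, y, z):
    """Return (chi, p_value, dof), summing statistics over the strata of Z."""
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[c] for c in z)].append((row[x], row[y]))
    chi, dof = 0.0, 0
    for pairs in strata.values():
        n = len(pairs)
        joint = Counter(pairs)
        x_counts = Counter(a for a, _ in pairs)
        y_counts = Counter(b for _, b in pairs)
        for a, na in x_counts.items():
            for b, nb in y_counts.items():
                expected = na * nb / n
                chi += (joint[(a, b)] - expected) ** 2 / expected
        dof += (len(x_counts) - 1) * (len(y_counts) - 1)
    p_value = chi2_sf(chi, dof) if dof > 0 else 1.0
    return chi, p_value, dof


# Toy data in which Y always equals X, so the two are strongly dependent.
dependent = [{"X": v, "Y": v} for v in (0, 1) for _ in range(10)]
chi, p_value, dof = chi_square_ci(dependent, "X", "Y", [])
significance_level = 0.01
independent = p_value >= significance_level  # the boolean=True decision rule
print(chi, dof, independent)
```

The last lines mirror the boolean=True behaviour: the variables are declared independent only when the p-value is at least significance_level.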
pgmpy.estimators.CITests.g_sq(X, Y, Z, data, boolean=True, **kwargs)
G squared test for conditional independence. Also commonly known as G-test,
likelihood-ratio or maximum likelihood statistical significance test.
Tests the null hypothesis that X is independent of Y given Zs.
Parameters:
X (int, string, hashable object) – A variable name contained in the data set
Y (int, string, hashable object) – A variable name contained in the data set, different from X
Z (list (array-like)) – A list of variable names contained in the data set, different from X and Y.
This is the separating set that (potentially) makes X and Y independent.
Default: []
data (pandas.DataFrame) – The dataset on which to test the independence condition.
boolean (bool) – If boolean=True, an additional argument significance_level must be
specified. If the p_value of the test is greater than or equal to
significance_level, returns True. Otherwise returns False. If
boolean=False, returns the chi statistic, the p_value, and the degrees of freedom of the test.
Returns:
result – If boolean=False, returns (chi, p_value, dof).
If boolean=True, returns True if p_value >= significance_level, else False.
pgmpy.estimators.CITests.gcm(X, Y, Z, data, boolean=True, **kwargs)
The Generalized Covariance Measure (GCM) test for conditional independence (CI).
It performs linear regressions on the conditioning variables and then tests
for a vanishing covariance between the resulting residuals. Details of the
method can be found in [1].
Parameters:
X (str) – The first variable for testing the independence condition X ⟂ Y | Z
Y (str) – The second variable for testing the independence condition X ⟂ Y | Z
Z (list/array-like) – A list of conditional variables for testing the condition X ⟂ Y | Z
data (pandas.DataFrame) – The dataset in which to test the independence condition.
boolean (bool) –
If boolean=True, an additional argument significance_level must
be specified. If the p_value of the test is greater than or equal to
significance_level, returns True. Otherwise returns False.
If boolean=False, returns the Pearson correlation coefficient and p_value
of the test.
Returns:
CI Test results – If boolean=True, returns True if p-value >= significance_level, else False. If
boolean=False, returns a tuple of (Pearson’s correlation Coefficient, p-value)
Return type:
tuple or bool
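The idea of testing for a vanishing covariance between regression residuals can be sketched in a few lines. The code below is an illustration under assumptions, not pgmpy's implementation: it handles a single conditioning variable with a simple least-squares fit, uses the normalized-mean-of-residual-products statistic commonly given for the univariate GCM, and `residuals` and `gcm_statistic` are hypothetical names.

```python
import math


def residuals(target, predictor):
    """Residuals of a least-squares fit target ~ a + b * predictor."""
    n = len(target)
    mt, mp = sum(target) / n, sum(predictor) / n
    sxy = sum((p - mp) * (t - mt) for p, t in zip(predictor, target))
    sxx = sum((p - mp) ** 2 for p in predictor)
    slope = sxy / sxx
    return [(t - mt) - slope * (p - mp) for p, t in zip(predictor, target)]


def gcm_statistic(x, y, z):
    """Return (statistic, p_value) from products of the two residual series."""
    rx, ry = residuals(x, z), residuals(y, z)
    products = [a * b for a, b in zip(rx, ry)]
    n = len(products)
    mean = sum(products) / n
    var = sum(p * p for p in products) / n - mean**2
    stat = math.sqrt(n) * mean / math.sqrt(var)
    # Two-sided p-value under a standard normal reference distribution.
    p_value = math.erfc(abs(stat) / math.sqrt(2))
    return stat, p_value


# Toy data: X and Y both depend on Z plus unrelated alternating noise.
z = list(range(8))
x = [zi + e for zi, e in zip(z, [1, -1, 1, -1, 1, -1, 1, -1])]
y = [zi + e for zi, e in zip(z, [1, 1, -1, -1, 1, 1, -1, -1])]
stat, p_value = gcm_statistic(x, y, z)
print(stat, p_value, p_value >= 0.05)
```

Eight observations are far too few for the normal approximation to be reliable; the toy data only shows the mechanics.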
References
pgmpy.estimators.CITests.independence_match(X, Y, Z, independencies, **kwargs)
Check if X ⟂ Y | Z is in independencies.
This method is implemented to have a uniform API when the independencies
are provided explicitly instead of being inferred from data.
Parameters:
X (str) – The first variable for testing the independence condition X ⟂ Y | Z.
Y (str) – The second variable for testing the independence condition X ⟂ Y | Z.
Z (list or array-like) – A list of conditional variables for testing the condition X ⟂ Y | Z.
independencies (pgmpy.independencies.Independencies) – The object containing the known independencies.
Returns:
True if the independence assertion is present in independencies, else False.
Return type:
bool
pgmpy.estimators.CITests.log_likelihood(X, Y, Z, data, boolean=True, **kwargs)
Log likelihood ratio test for conditional independence. Also commonly known
as G-test, G-squared test or maximum likelihood statistical significance
test. Tests the null hypothesis that X is independent of Y given Zs.
Parameters:
X (int, string, hashable object) – A variable name contained in the data set
Y (int, string, hashable object) – A variable name contained in the data set, different from X
Z (list (array-like)) – A list of variable names contained in the data set, different from X and Y.
This is the separating set that (potentially) makes X and Y independent.
Default: []
data (pandas.DataFrame) – The dataset on which to test the independence condition.
boolean (bool) – If boolean=True, an additional argument significance_level must be
specified. If the p_value of the test is greater than or equal to
significance_level, returns True. Otherwise returns False. If
boolean=False, returns the chi statistic, the p_value, and the degrees of freedom of the test.
Returns:
CI Test Results – If boolean=False, returns a tuple (chi, p_value, dof). chi is the
chi-squared test statistic. p_value is the
probability of observing the computed chi-square statistic (or an even
higher value), given the null hypothesis that X ⟂ Y | Zs is true. dof is the
number of degrees of freedom.
If boolean=True, returns True if the p_value of the test is greater
than or equal to significance_level, else returns False.
pgmpy.estimators.CITests.modified_log_likelihood(X, Y, Z, data, boolean=True, **kwargs)
Modified log likelihood ratio test for conditional independence.
Tests the null hypothesis that X is independent of Y given Zs.
Parameters:
X (int, string, hashable object) – A variable name contained in the data set
Y (int, string, hashable object) – A variable name contained in the data set, different from X
Z (list (array-like)) – A list of variable names contained in the data set, different from X and Y.
This is the separating set that (potentially) makes X and Y independent.
Default: []
data (pandas.DataFrame) – The dataset on which to test the independence condition.
boolean (bool) – If boolean=True, an additional argument significance_level must be
specified. If the p_value of the test is greater than or equal to
significance_level, returns True. Otherwise returns False.
If boolean=False, returns the chi statistic, the p_value, and the degrees of freedom of the test.
Returns:
CI Test Results – If boolean=False, returns a tuple (chi, p_value, dof). chi is the
chi-squared test statistic. p_value is the
probability of observing the computed chi-square statistic (or an even
higher value), given the null hypothesis that X ⟂ Y | Zs is true. dof is the
number of degrees of freedom.
If boolean=True, returns True if the p_value of the test is greater
than or equal to significance_level, else returns False.
pgmpy.estimators.CITests.pearsonr(X, Y, Z, data, boolean=True, **kwargs)
Computes the Pearson correlation coefficient and p-value for testing non-correlation.
Should be used only on continuous data. When Z is non-empty (Z ≠ ∅), uses
linear regression and computes the Pearson coefficient on the residuals.
Parameters:
X (str) – The first variable for testing the independence condition X ⟂ Y | Z.
Y (str) – The second variable for testing the independence condition X ⟂ Y | Z.
Z (list or array-like) – A list of conditional variables for testing the condition X ⟂ Y | Z.
data (pandas.DataFrame) – The dataset in which to test the independence condition.
boolean (bool, default=True) – If True, returns a boolean indicating independence (based on significance_level).
If False, returns the test statistic and p-value.
**kwargs – Additional arguments. Must contain significance_level if boolean=True.
Returns:
result – If boolean=True, returns True if p-value >= significance_level, else False.
If boolean=False, returns a tuple of (Pearson’s correlation Coefficient, p-value).
Return type:
bool or tuple
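The residual-based computation described for a non-empty Z can be illustrated with the standard library alone. This is a sketch under assumptions, not the library's code: it conditions on a single variable, omits the p-value, and `residuals` and `pearson` are hypothetical helper names.

```python
import math


def residuals(target, predictor):
    """Residuals of a least-squares fit target ~ a + b * predictor."""
    n = len(target)
    mt, mp = sum(target) / n, sum(predictor) / n
    sxy = sum((p - mp) * (t - mt) for p, t in zip(predictor, target))
    sxx = sum((p - mp) ** 2 for p in predictor)
    slope = sxy / sxx
    return [(t - mt) - slope * (p - mp) for p, t in zip(predictor, target)]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


# Toy data: X and Y are correlated only through their common cause Z.
z = list(range(8))
x = [zi + e for zi, e in zip(z, [1, -1, 1, -1, 1, -1, 1, -1])]
y = [zi + e for zi, e in zip(z, [1, 1, -1, -1, 1, 1, -1, -1])]

marginal = pearson(x, y)                              # Z = ∅
partial = pearson(residuals(x, z), residuals(y, z))   # Z = [z]
print(round(marginal, 3), round(partial, 3))
```

The marginal correlation is strong, while the correlation of the residuals is close to zero, which is the pattern expected when X ⟂ Y | Z holds.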
pgmpy.estimators.CITests.pillai_trace(X, Y, Z, data, boolean=True, **kwargs)
A mixed-data residualization based conditional independence test[1].
Uses XGBoost estimator to compute LS residuals[2], and then does an
association test (Pillai’s Trace) on the residuals.
Parameters:
X (str) – The first variable for testing the independence condition X ⟂ Y | Z
Y (str) – The second variable for testing the independence condition X ⟂ Y | Z
Z (list/array-like) – A list of conditional variables for testing the condition X ⟂ Y | Z
data (pandas.DataFrame) – The dataset in which to test the independence condition.
boolean (bool) –
If boolean=True, an additional argument significance_level must
be specified. If the p_value of the test is greater than or equal to
significance_level, returns True. Otherwise returns False.
If boolean=False, returns the Pearson correlation coefficient and p_value
of the test.
Returns:
CI Test results – If boolean=True, returns True if p-value >= significance_level, else False. If
boolean=False, returns a tuple of (Pearson’s correlation Coefficient, p-value)
Return type:
tuple or bool
References
pgmpy.estimators.CITests.power_divergence(X, Y, Z, data, boolean=True, lambda_='cressie-read', **kwargs)
Computes the Cressie-Read power divergence statistic [1]. The null hypothesis
for the test is that X is independent of Y given Z. Many frequency-comparison
statistics (e.g. chi-square, G-test) belong to the power divergence family
and are special cases of this test.
Parameters:
X (int, string, hashable object) – A variable name contained in the data set
Y (int, string, hashable object) – A variable name contained in the data set, different from X
Z (list, array-like) – A list of variable names contained in the data set, different from X and Y.
This is the separating set that (potentially) makes X and Y independent.
Default: []
data (pandas.DataFrame) – The dataset on which to test the independence condition.
lambda_ (float or string) –
The lambda parameter for the power_divergence statistic. Some values of
lambda_ result in other well-known tests:
Name                    Value   Test
"pearson"               1       Chi-squared test
"log-likelihood"        0       G-test or log-likelihood
"freeman-tuckey"        -1/2    Freeman-Tukey statistic
"mod-log-likelihood"    -1      Modified log-likelihood
"neyman"                -2      Neyman's statistic
"cressie-read"          2/3     The value recommended in the paper [1]
boolean (bool) –
If boolean=True, an additional argument significance_level must
be specified. If the p_value of the test is greater than or equal to
significance_level, returns True. Otherwise returns False.
If boolean=False, returns the chi statistic, the p_value, and the degrees of freedom of the test.
**kwargs – Must contain significance_level if boolean=True.
Returns:
result – If boolean=False, returns (chi, p_value, dof).
If boolean=True, returns True if p_value >= significance_level, else False.
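How the lambda_ values listed in the parameter description select different members of the power divergence family can be shown with a small stdlib sketch. It is illustrative only and not the library's code: `power_divergence_stat` and `LAMBDA_ALIASES` are hypothetical names, the function works on flat lists of observed and expected counts, and it assumes strictly positive counts for negative lambda values.

```python
import math

# String aliases and their numeric values, as listed for lambda_.
LAMBDA_ALIASES = {
    "pearson": 1.0,
    "log-likelihood": 0.0,
    "freeman-tuckey": -0.5,
    "mod-log-likelihood": -1.0,
    "neyman": -2.0,
    "cressie-read": 2 / 3,
}


def power_divergence_stat(observed, expected, lambda_="cressie-read"):
    """Cressie-Read power divergence statistic for observed vs expected counts."""
    lam = LAMBDA_ALIASES[lambda_] if isinstance(lambda_, str) else float(lambda_)
    cells = list(zip(observed, expected))
    if lam == 0:
        # Limit lambda -> 0: the G-test (log-likelihood ratio) statistic.
        return 2 * sum(o * math.log(o / e) for o, e in cells if o > 0)
    if lam == -1:
        # Limit lambda -> -1: the modified log-likelihood statistic.
        return 2 * sum(e * math.log(e / o) for o, e in cells)
    return 2 / (lam * (lam + 1)) * sum(o * ((o / e) ** lam - 1) for o, e in cells)


observed = [10, 20, 30, 40]
expected = [25, 25, 25, 25]
for name in LAMBDA_ALIASES:
    print(name, round(power_divergence_stat(observed, expected, name), 4))
```

With lambda_="pearson" the formula reduces to the familiar sum of (O - E)^2 / E, and with lambda_="neyman" to the sum of (O - E)^2 / O, provided the observed and expected totals agree.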