Conditional Independence Tests for PC algorithm

pgmpy.estimators.CITests.chi_square(X, Y, Z, data, boolean=True, **kwargs)
Chi-square conditional independence test. Tests the null hypothesis that X is independent of Y given Zs.
This is done by comparing the observed frequencies with the frequencies expected if X and Y were conditionally independent, using a chi-square deviance statistic. Under conditional independence, the expected frequencies follow P(X, Y, Zs) = P(X | Zs) * P(Y | Zs) * P(Zs). The latter term can be computed as P(X, Zs) * P(Y, Zs) / P(Zs).
 Parameters
X (int, string, hashable object) – A variable name contained in the data set
Y (int, string, hashable object) – A variable name contained in the data set, different from X
Z (list, array-like) – A list of variable names contained in the data set, different from X and Y. This is the separating set that (potentially) makes X and Y independent. Default: []
data (pandas.DataFrame) – The dataset on which to test the independence condition.
boolean (bool) –
 If boolean=True, an additional argument significance_level must be specified. If the p_value of the test is greater than or equal to significance_level, returns True. Otherwise returns False.
 If boolean=False, returns the chi-square statistic, the p_value, and the degrees of freedom of the test.
 Returns
If boolean=False, returns 3 values –
 chi: float
The chi-square test statistic.
 p_value: float
The p_value, i.e. the probability of observing the computed chi-square statistic (or an even higher value), given the null hypothesis that X ⟂ Y | Zs.
 dof: int
The degrees of freedom of the test.
If boolean=True, returns –
 independent: boolean
True if the p_value of the test is greater than or equal to significance_level, False otherwise.
References
[1] https://en.wikipedia.org/wiki/Chi-squared_test
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.estimators.CITests import chi_square
>>> data = pd.DataFrame(np.random.randint(0, 2, size=(50000, 4)), columns=list('ABCD'))
>>> data['E'] = data['A'] + data['B'] + data['C']
>>> chi_square(X='A', Y='C', Z=[], data=data, boolean=True, significance_level=0.05)
True
>>> chi_square(X='A', Y='B', Z=['D'], data=data, boolean=True, significance_level=0.05)
True
>>> chi_square(X='A', Y='B', Z=['D', 'E'], data=data, boolean=True, significance_level=0.05)
False
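The comparison of observed and expected frequencies described for chi_square can be sketched in plain Python. This is only an illustration, not pgmpy's implementation: the function name conditional_chi_square and the (x, y, z) row format are hypothetical, and the sketch assumes that the per-stratum statistics and degrees of freedom are summed over the observed states of Z.

```python
from collections import Counter, defaultdict


def conditional_chi_square(rows):
    """Pearson chi-square statistic for 'X independent of Y given Z'.

    rows: iterable of (x, y, z) tuples, where z is a hashable conditioning state.
    Returns (chi, dof), both summed over the observed states of z (an assumption
    of this sketch).
    """
    strata = defaultdict(Counter)
    for x, y, z in rows:
        strata[z][(x, y)] += 1

    chi, dof = 0.0, 0
    for counts in strata.values():
        n_z = sum(counts.values())          # N(Zs)
        n_xz, n_yz = Counter(), Counter()   # N(X, Zs) and N(Y, Zs)
        for (x, y), n in counts.items():
            n_xz[x] += n
            n_yz[y] += n
        for x in n_xz:
            for y in n_yz:
                # Expected count under conditional independence:
                # N(X, Zs) * N(Y, Zs) / N(Zs)
                expected = n_xz[x] * n_yz[y] / n_z
                observed = counts.get((x, y), 0)
                chi += (observed - expected) ** 2 / expected
        dof += (len(n_xz) - 1) * (len(n_yz) - 1)
    return chi, dof


# x and y are perfectly balanced within each state of z.
independent = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)] * 5
# y always equals x, so observed and expected counts disagree strongly.
dependent = [(x, x, z) for x in (0, 1) for z in (0, 1)] * 10

print(conditional_chi_square(independent))  # (0.0, 2)
print(conditional_chi_square(dependent))    # (40.0, 2)
```

Turning the statistic into a p_value additionally requires the chi-square distribution with dof degrees of freedom, which this sketch leaves out.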

pgmpy.estimators.CITests.cressie_read(X, Y, Z, data, boolean=True, **kwargs)
Cressie-Read statistic for conditional independence [1]. Tests the null hypothesis that X is independent of Y given Zs.
 Parameters
X (int, string, hashable object) – A variable name contained in the data set
Y (int, string, hashable object) – A variable name contained in the data set, different from X
Z (list, array-like) – A list of variable names contained in the data set, different from X and Y. This is the separating set that (potentially) makes X and Y independent. Default: []
data (pandas.DataFrame) – The dataset on which to test the independence condition.
boolean (bool) –
 If boolean=True, an additional argument significance_level must be specified. If the p_value of the test is greater than or equal to significance_level, returns True. Otherwise returns False.
 If boolean=False, returns the chi-square statistic, the p_value, and the degrees of freedom of the test.
 Returns
If boolean=False, returns 3 values –
 chi: float
The chi-square test statistic.
 p_value: float
The p_value, i.e. the probability of observing the computed chi-square statistic (or an even higher value), given the null hypothesis that X ⟂ Y | Zs.
 dof: int
The degrees of freedom of the test.
If boolean=True, returns –
 independent: boolean
True if the p_value of the test is greater than or equal to significance_level, False otherwise.
References
[1] Cressie, Noel, and Timothy RC Read. “Multinomial goodness-of-fit tests.” Journal of the Royal Statistical Society: Series B (Methodological) 46.3 (1984): 440-464.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.estimators.CITests import cressie_read
>>> data = pd.DataFrame(np.random.randint(0, 2, size=(50000, 4)), columns=list('ABCD'))
>>> data['E'] = data['A'] + data['B'] + data['C']
>>> cressie_read(X='A', Y='C', Z=[], data=data, boolean=True, significance_level=0.05)
True
>>> cressie_read(X='A', Y='B', Z=['D'], data=data, boolean=True, significance_level=0.05)
True
>>> cressie_read(X='A', Y='B', Z=['D', 'E'], data=data, boolean=True, significance_level=0.05)
False

pgmpy.estimators.CITests.freeman_tuckey(X, Y, Z, data, boolean=True, **kwargs)
Freeman-Tukey test for conditional independence [1]. Tests the null hypothesis that X is independent of Y given Zs.
 Parameters
X (int, string, hashable object) – A variable name contained in the data set
Y (int, string, hashable object) – A variable name contained in the data set, different from X
Z (list, array-like) – A list of variable names contained in the data set, different from X and Y. This is the separating set that (potentially) makes X and Y independent. Default: []
data (pandas.DataFrame) – The dataset on which to test the independence condition.
boolean (bool) –
 If boolean=True, an additional argument significance_level must be specified. If the p_value of the test is greater than or equal to significance_level, returns True. Otherwise returns False.
 If boolean=False, returns the chi-square statistic, the p_value, and the degrees of freedom of the test.
 Returns
If boolean=False, returns 3 values –
 chi: float
The chi-square test statistic.
 p_value: float
The p_value, i.e. the probability of observing the computed chi-square statistic (or an even higher value), given the null hypothesis that X ⟂ Y | Zs.
 dof: int
The degrees of freedom of the test.
If boolean=True, returns –
 independent: boolean
True if the p_value of the test is greater than or equal to significance_level, False otherwise.
References
[1] Read, Campbell B. “Freeman-Tukey chi-squared goodness-of-fit statistics.” Statistics & probability letters 18.4 (1993): 271-278.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.estimators.CITests import freeman_tuckey
>>> data = pd.DataFrame(np.random.randint(0, 2, size=(50000, 4)), columns=list('ABCD'))
>>> data['E'] = data['A'] + data['B'] + data['C']
>>> freeman_tuckey(X='A', Y='C', Z=[], data=data, boolean=True, significance_level=0.05)
True
>>> freeman_tuckey(X='A', Y='B', Z=['D'], data=data, boolean=True, significance_level=0.05)
True
>>> freeman_tuckey(X='A', Y='B', Z=['D', 'E'], data=data, boolean=True, significance_level=0.05)
False
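As a rough illustration of what the Freeman-Tukey statistic measures, the sketch below uses the closed form 4 * sum((sqrt(O) - sqrt(E))^2), which is the power divergence member with lambda = -1/2. The function name freeman_tukey_statistic and the flat lists of counts are hypothetical; it assumes the expected counts under independence have already been computed and is not taken from pgmpy.

```python
import math


def freeman_tukey_statistic(observed, expected):
    """Freeman-Tukey statistic in the form 4 * sum((sqrt(O) - sqrt(E)) ** 2)."""
    return 4.0 * sum((math.sqrt(o) - math.sqrt(e)) ** 2
                     for o, e in zip(observed, expected))


observed = [30, 20, 20, 30]   # cell counts of a 2x2 table, flattened
expected = [25, 25, 25, 25]   # counts expected under independence
print(freeman_tukey_statistic(observed, expected))
```

The statistic is zero when observed and expected counts coincide and grows as they diverge.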

pgmpy.estimators.CITests.g_sq(X, Y, Z, data, boolean=True, **kwargs)
G-squared test for conditional independence. Also commonly known as the G-test, likelihood-ratio or maximum likelihood statistical significance test. Tests the null hypothesis that X is independent of Y given Zs.
 Parameters
X (int, string, hashable object) – A variable name contained in the data set
Y (int, string, hashable object) – A variable name contained in the data set, different from X
Z (list, array-like) – A list of variable names contained in the data set, different from X and Y. This is the separating set that (potentially) makes X and Y independent. Default: []
data (pandas.DataFrame) – The dataset on which to test the independence condition.
boolean (bool) –
 If boolean=True, an additional argument significance_level must be specified. If the p_value of the test is greater than or equal to significance_level, returns True. Otherwise returns False.
 If boolean=False, returns the chi-square statistic, the p_value, and the degrees of freedom of the test.
 Returns
If boolean=False, returns 3 values –
 chi: float
The chi-square test statistic.
 p_value: float
The p_value, i.e. the probability of observing the computed chi-square statistic (or an even higher value), given the null hypothesis that X ⟂ Y | Zs.
 dof: int
The degrees of freedom of the test.
If boolean=True, returns –
 independent: boolean
True if the p_value of the test is greater than or equal to significance_level, False otherwise.
References
[1] https://en.wikipedia.org/wiki/G-test
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.estimators.CITests import g_sq
>>> data = pd.DataFrame(np.random.randint(0, 2, size=(50000, 4)), columns=list('ABCD'))
>>> data['E'] = data['A'] + data['B'] + data['C']
>>> g_sq(X='A', Y='C', Z=[], data=data, boolean=True, significance_level=0.05)
True
>>> g_sq(X='A', Y='B', Z=['D'], data=data, boolean=True, significance_level=0.05)
True
>>> g_sq(X='A', Y='B', Z=['D', 'E'], data=data, boolean=True, significance_level=0.05)
False
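For the unconditional case (Z=[]) with two binary variables, the G statistic and the resulting decision can be sketched with the standard library alone. The names g_statistic and p_value_one_dof are hypothetical and the code is not pgmpy's implementation. The p_value shortcut is valid only for 1 degree of freedom, which is what a 2x2 table has.

```python
import math


def g_statistic(table):
    """G statistic, 2 * sum(O * ln(O / E)), for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # a cell with O = 0 contributes nothing
                expected = row_totals[i] * col_totals[j] / n
                g += observed * math.log(observed / expected)
    return 2.0 * g


def p_value_one_dof(statistic):
    """Upper tail of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


table = [[30, 20], [20, 30]]   # rows: states of X, columns: states of Y
g = g_statistic(table)
p_value = p_value_one_dof(g)
significance_level = 0.05
# Same decision rule as boolean=True: independent if p_value >= significance_level.
print(g, p_value, p_value >= significance_level)
```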

pgmpy.estimators.CITests.independence_match(X, Y, Z, independencies, **kwargs)
Checks if X ⟂ Y | Z is in independencies. This method is implemented to provide a uniform API when the independencies are provided instead of data.
 Parameters
X (str) – The first variable for testing the independence condition X ⟂ Y | Z
Y (str) – The second variable for testing the independence condition X ⟂ Y | Z
Z (list/array-like) – A list of conditioning variables for testing the condition X ⟂ Y | Z
data (pandas.DataFrame) – The dataset in which to test the independence condition.
 Returns
p-value
 Return type
float (Fixed to 0 since it is always confident)
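The idea of looking up an assertion instead of testing data can be sketched as a set membership check. Everything here is hypothetical: make_assertion, in_independencies, and the tuple-of-frozensets representation are stand-ins for illustration and are not pgmpy's classes. The sketch returns a plain boolean and does not reproduce the documented return value.

```python
def make_assertion(x, y, z=()):
    """Canonical form of 'x is independent of y given z' (symmetric in x and y)."""
    return (frozenset([x, y]), frozenset(z))


def in_independencies(x, y, z, independencies):
    """True if the assertion 'x independent of y given z' was declared."""
    return make_assertion(x, y, z) in independencies


known = {make_assertion("A", "C"), make_assertion("A", "B", ["D"])}

print(in_independencies("B", "A", ["D"], known))       # True: symmetric in X and Y
print(in_independencies("A", "B", ["D", "E"], known))  # False: not declared
```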

pgmpy.estimators.CITests.log_likelihood(X, Y, Z, data, boolean=True, **kwargs)
Log-likelihood ratio test for conditional independence. Also commonly known as the G-test, G-squared test or maximum likelihood statistical significance test. Tests the null hypothesis that X is independent of Y given Zs.
 Parameters
X (int, string, hashable object) – A variable name contained in the data set
Y (int, string, hashable object) – A variable name contained in the data set, different from X
Z (list, array-like) – A list of variable names contained in the data set, different from X and Y. This is the separating set that (potentially) makes X and Y independent. Default: []
data (pandas.DataFrame) – The dataset on which to test the independence condition.
boolean (bool) –
 If boolean=True, an additional argument significance_level must be specified. If the p_value of the test is greater than or equal to significance_level, returns True. Otherwise returns False.
 If boolean=False, returns the chi-square statistic, the p_value, and the degrees of freedom of the test.
 Returns
If boolean=False, returns 3 values –
 chi: float
The chi-square test statistic.
 p_value: float
The p_value, i.e. the probability of observing the computed chi-square statistic (or an even higher value), given the null hypothesis that X ⟂ Y | Zs.
 dof: int
The degrees of freedom of the test.
If boolean=True, returns –
 independent: boolean
True if the p_value of the test is greater than or equal to significance_level, False otherwise.
References
[1] https://en.wikipedia.org/wiki/G-test
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.estimators.CITests import log_likelihood
>>> data = pd.DataFrame(np.random.randint(0, 2, size=(50000, 4)), columns=list('ABCD'))
>>> data['E'] = data['A'] + data['B'] + data['C']
>>> log_likelihood(X='A', Y='C', Z=[], data=data, boolean=True, significance_level=0.05)
True
>>> log_likelihood(X='A', Y='B', Z=['D'], data=data, boolean=True, significance_level=0.05)
True
>>> log_likelihood(X='A', Y='B', Z=['D', 'E'], data=data, boolean=True, significance_level=0.05)
False

pgmpy.estimators.CITests.modified_log_likelihood(X, Y, Z, data, boolean=True, **kwargs)
Modified log-likelihood ratio test for conditional independence. Tests the null hypothesis that X is independent of Y given Zs.
 Parameters
X (int, string, hashable object) – A variable name contained in the data set
Y (int, string, hashable object) – A variable name contained in the data set, different from X
Z (list, array-like) – A list of variable names contained in the data set, different from X and Y. This is the separating set that (potentially) makes X and Y independent. Default: []
data (pandas.DataFrame) – The dataset on which to test the independence condition.
boolean (bool) –
 If boolean=True, an additional argument significance_level must be specified. If the p_value of the test is greater than or equal to significance_level, returns True. Otherwise returns False.
 If boolean=False, returns the chi-square statistic, the p_value, and the degrees of freedom of the test.
 Returns
If boolean=False, returns 3 values –
 chi: float
The chi-square test statistic.
 p_value: float
The p_value, i.e. the probability of observing the computed chi-square statistic (or an even higher value), given the null hypothesis that X ⟂ Y | Zs.
 dof: int
The degrees of freedom of the test.
If boolean=True, returns –
 independent: boolean
True if the p_value of the test is greater than or equal to significance_level, False otherwise.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.estimators.CITests import modified_log_likelihood
>>> data = pd.DataFrame(np.random.randint(0, 2, size=(50000, 4)), columns=list('ABCD'))
>>> data['E'] = data['A'] + data['B'] + data['C']
>>> modified_log_likelihood(X='A', Y='C', Z=[], data=data, boolean=True, significance_level=0.05)
True
>>> modified_log_likelihood(X='A', Y='B', Z=['D'], data=data, boolean=True, significance_level=0.05)
True
>>> modified_log_likelihood(X='A', Y='B', Z=['D', 'E'], data=data, boolean=True, significance_level=0.05)
False

pgmpy.estimators.CITests.neyman(X, Y, Z, data, boolean=True, **kwargs)
Neyman’s test for conditional independence [1]. Tests the null hypothesis that X is independent of Y given Zs.
 Parameters
X (int, string, hashable object) – A variable name contained in the data set
Y (int, string, hashable object) – A variable name contained in the data set, different from X
Z (list, array-like) – A list of variable names contained in the data set, different from X and Y. This is the separating set that (potentially) makes X and Y independent. Default: []
data (pandas.DataFrame) – The dataset on which to test the independence condition.
boolean (bool) –
 If boolean=True, an additional argument significance_level must be specified. If the p_value of the test is greater than or equal to significance_level, returns True. Otherwise returns False.
 If boolean=False, returns the chi-square statistic, the p_value, and the degrees of freedom of the test.
 Returns
If boolean=False, returns 3 values –
 chi: float
The chi-square test statistic.
 p_value: float
The p_value, i.e. the probability of observing the computed chi-square statistic (or an even higher value), given the null hypothesis that X ⟂ Y | Zs.
 dof: int
The degrees of freedom of the test.
If boolean=True, returns –
 independent: boolean
True if the p_value of the test is greater than or equal to significance_level, False otherwise.
References
[1] https://en.wikipedia.org/wiki/Neyman%E2%80%93Pearson_lemma
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.estimators.CITests import neyman
>>> data = pd.DataFrame(np.random.randint(0, 2, size=(50000, 4)), columns=list('ABCD'))
>>> data['E'] = data['A'] + data['B'] + data['C']
>>> neyman(X='A', Y='C', Z=[], data=data, boolean=True, significance_level=0.05)
True
>>> neyman(X='A', Y='B', Z=['D'], data=data, boolean=True, significance_level=0.05)
True
>>> neyman(X='A', Y='B', Z=['D', 'E'], data=data, boolean=True, significance_level=0.05)
False

pgmpy.estimators.CITests.pearsonr(X, Y, Z, data, boolean=True, **kwargs)
Computes the Pearson correlation coefficient and p-value for testing non-correlation. Should be used only on continuous data. When Z is non-empty, uses linear regression and computes the Pearson coefficient on the residuals.
 Parameters
X (str) – The first variable for testing the independence condition X ⟂ Y | Z
Y (str) – The second variable for testing the independence condition X ⟂ Y | Z
Z (list/array-like) – A list of conditioning variables for testing the condition X ⟂ Y | Z
data (pandas.DataFrame) – The dataset in which to test the independence condition.
boolean (bool) –
 If boolean=True, an additional argument significance_level must be specified. If the p_value of the test is greater than or equal to significance_level, returns True. Otherwise returns False.
 If boolean=False, returns the Pearson correlation coefficient and the p_value of the test.
 Returns
Pearson’s correlation coefficient (float)
p-value (float)
References
[1] https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
[2] https://en.wikipedia.org/wiki/Partial_correlation#Using_linear_regression
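The residual-based approach mentioned for pearsonr (regress X and Y on Z, then correlate the residuals) can be sketched as follows. This is a minimal illustration that assumes a single conditioning variable and omits the p-value. The helper names residuals, pearson, and partial_correlation are hypothetical and are not pgmpy functions.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def residuals(target, predictor):
    """Residuals of the least-squares fit target ~ intercept + slope * predictor."""
    m_t, m_p = _mean(target), _mean(predictor)
    slope = (sum((p - m_p) * (t - m_t) for p, t in zip(predictor, target))
             / sum((p - m_p) ** 2 for p in predictor))
    intercept = m_t - slope * m_p
    return [t - (intercept + slope * p) for t, p in zip(target, predictor)]


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    m_a, m_b = _mean(a), _mean(b)
    cov = sum((u - m_a) * (v - m_b) for u, v in zip(a, b))
    return cov / math.sqrt(sum((u - m_a) ** 2 for u in a)
                           * sum((v - m_b) ** 2 for v in b))


def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z from both."""
    return pearson(residuals(x, z), residuals(y, z))


# x and y both depend on z, plus noise terms that are uncorrelated with
# each other and with z.
z = [1, 2, 3, 4, 5, 6, 7, 8]
noise_x = [1, -1, -1, 1, 1, -1, -1, 1]
noise_y = [1, -1, 1, -1, -1, 1, -1, 1]
x = [2 * zi + e for zi, e in zip(z, noise_x)]
y = [-zi + e for zi, e in zip(z, noise_y)]

print(pearson(x, y))                 # strongly negative: z drives both
print(partial_correlation(x, y, z))  # close to 0 once z is accounted for
```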

pgmpy.estimators.CITests.power_divergence(X, Y, Z, data, boolean=True, lambda_='cressie-read', **kwargs)
Computes the Cressie-Read power divergence statistic [1]. The null hypothesis for the test is that X is independent of Y given Z. Many of the frequency-comparison-based statistics (e.g. chi-square, G-test) belong to the power divergence family and are special cases of this test.
 Parameters
X (int, string, hashable object) – A variable name contained in the data set
Y (int, string, hashable object) – A variable name contained in the data set, different from X
Z (list, array-like) – A list of variable names contained in the data set, different from X and Y. This is the separating set that (potentially) makes X and Y independent. Default: []
data (pandas.DataFrame) – The dataset on which to test the independence condition.
lambda_ (float or string) –
The lambda parameter for the power_divergence statistic. Some values of lambda_ result in other well-known tests:
 String                  Value   Test
 "pearson"               1       Chi-squared test
 "log-likelihood"        0       G-test or log-likelihood
 "freeman-tukey"         -1/2    Freeman-Tukey statistic
 "mod-log-likelihood"    -1      Modified log-likelihood
 "neyman"                -2      Neyman’s statistic
 "cressie-read"          2/3     The value recommended in the paper [1]
boolean (bool) –
 If boolean=True, an additional argument significance_level must be specified. If the p_value of the test is greater than or equal to significance_level, returns True. Otherwise returns False.
 If boolean=False, returns the chi-square statistic, the p_value, and the degrees of freedom of the test.
 Returns
If boolean=False, returns 3 values –
 chi: float
The chi-square test statistic.
 p_value: float
The p_value, i.e. the probability of observing the computed chi-square statistic (or an even higher value), given the null hypothesis that X ⟂ Y | Zs.
 dof: int
The degrees of freedom of the test.
If boolean=True, returns –
 independent: boolean
True if the p_value of the test is greater than or equal to significance_level, False otherwise.
References
[1] Cressie, Noel, and Timothy RC Read. “Multinomial goodness-of-fit tests.” Journal of the Royal Statistical Society: Series B (Methodological) 46.3 (1984): 440-464.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pgmpy.estimators.CITests import power_divergence
>>> data = pd.DataFrame(np.random.randint(0, 2, size=(50000, 4)), columns=list('ABCD'))
>>> data['E'] = data['A'] + data['B'] + data['C']
>>> power_divergence(X='A', Y='C', Z=[], data=data, boolean=True, lambda_='pearson', significance_level=0.05)
True
>>> power_divergence(X='A', Y='B', Z=['D'], data=data, boolean=True, lambda_='pearson', significance_level=0.05)
True
>>> power_divergence(X='A', Y='B', Z=['D', 'E'], data=data, boolean=True, lambda_='pearson', significance_level=0.05)
False
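To see how one lambda parameter covers several classical tests, the sketch below evaluates the Cressie-Read family 2 / (lambda * (lambda + 1)) * sum(O * ((O / E)^lambda - 1)) on flat lists of observed and expected counts. The function name power_divergence_statistic is hypothetical and this is not pgmpy's implementation. It assumes all counts are positive and that observed and expected counts have the same total.

```python
import math


def power_divergence_statistic(observed, expected, lambda_):
    """Cressie-Read power divergence statistic for a numeric lambda_.

    The limits lambda_ -> 0 and lambda_ -> -1 are handled as separate cases.
    """
    pairs = list(zip(observed, expected))
    if lambda_ == 0:    # G-test / log-likelihood
        return 2.0 * sum(o * math.log(o / e) for o, e in pairs)
    if lambda_ == -1:   # modified log-likelihood
        return 2.0 * sum(e * math.log(e / o) for o, e in pairs)
    scale = 2.0 / (lambda_ * (lambda_ + 1))
    return scale * sum(o * ((o / e) ** lambda_ - 1) for o, e in pairs)


observed = [30, 20, 20, 30]
expected = [25, 25, 25, 25]

# lambda_ = 1 reproduces the Pearson chi-squared statistic sum((O - E)^2 / E).
pearson_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(power_divergence_statistic(observed, expected, 1), pearson_stat)

# Other members of the family give similar but not identical values.
for lam in (0, -0.5, -1, -2, 2 / 3):
    print(lam, power_divergence_statistic(observed, expected, lam))
```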