PillaiTrace

class pgmpy.ci_tests.PillaiTrace(data: DataFrame, seed=None)

Bases: _BaseCITest

Pillai’s trace test for conditional independence with mixed data [1].

This test first residualizes \(X\) and \(Y\) with respect to \([1, Z]\) using an estimator (XGBoost by default). For a continuous target \(T\), the residual is

\[r_T = T - \hat{T}(Z).\]

For a categorical target \(T\) with \(K\) categories, let \(D_T \in \{0, 1\}^{n \times K}\) be the dummy-encoded matrix of \(T\), and let \(\hat{D}_T(Z)\) denote the predicted class probabilities from the classifier. The residual matrix [2], with its last column dropped to avoid collinearity, is defined as:

\[R_T = \operatorname{drop\_last}\left(D_T - \hat{D}_T(Z)\right).\]
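The two residual definitions above can be sketched with NumPy as follows. This is a minimal illustration, not pgmpy's implementation: ordinary least squares on \([1, Z]\) stands in for the XGBoost regressor, the class probabilities are supplied by the caller (here, marginal class frequencies as a placeholder for a classifier), and the helper names `residualize_continuous` and `residualize_categorical` are hypothetical.

```python
import numpy as np


def residualize_continuous(t, Z):
    """r_T = T - T_hat(Z), with T_hat fitted by least squares on [1, Z].

    Assumption: least squares is a stand-in for the estimator used by the test.
    """
    design = np.column_stack([np.ones(len(t)), Z])
    coef, *_ = np.linalg.lstsq(design, t, rcond=None)
    return t - design @ coef


def residualize_categorical(labels, probs):
    """R_T = drop_last(D_T - D_hat_T(Z)).

    `probs` is an n x K matrix of predicted class probabilities whose columns
    follow the sorted order of the category labels.
    """
    classes = np.unique(labels)  # sorted category labels
    dummies = (labels[:, None] == classes[None, :]).astype(float)  # n x K
    full = dummies - probs
    return full[:, :-1]  # drop the last column


rng = np.random.default_rng(0)
n = 200
Z = rng.normal(size=(n, 2))
x = Z @ np.array([1.0, -0.5]) + rng.normal(size=n)  # continuous target
labels = np.arange(n) % 3                            # categorical target, K = 3

r_x = residualize_continuous(x, Z)

# Placeholder "classifier": every row gets the marginal class frequencies.
freq = np.bincount(labels, minlength=3) / n
probs = np.tile(freq, (n, 1))
R_y = residualize_categorical(labels, probs)

print(r_x.shape, R_y.shape)  # (200,) (200, 2)
```

Each row of \(D_T - \hat{D}_T(Z)\) sums to zero whenever the predicted probabilities sum to one, so one column is redundant; dropping the last column removes that redundancy.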

Let \(R_X \in \mathbb{R}^{n \times p}\) and \(R_Y \in \mathbb{R}^{n \times q}\) be the residuals of \(X\) and \(Y\), and let \(\rho_1, \ldots, \rho_s\) be the canonical correlations between them, where \(s = \min(p, q)\). Pillai’s trace statistic is:

\[V = \sum_{i=1}^{s} \rho_i^2.\]
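As a rough sketch of how \(V\) can be obtained from two residual matrices, the snippet below computes the canonical correlations as the singular values of \(Q_X^\top Q_Y\), where \(Q_X\) and \(Q_Y\) are orthonormal bases from QR decompositions. The function name `pillai_trace` is hypothetical, the columns are centered as in the usual definition of canonical correlation, and both matrices are assumed to have full column rank; pgmpy may compute the statistic differently.

```python
import numpy as np


def pillai_trace(R_x, R_y):
    """Return (V, s): the sum of squared canonical correlations and their count."""
    R_x = np.asarray(R_x, dtype=float)
    R_y = np.asarray(R_y, dtype=float)
    R_x = R_x.reshape(len(R_x), -1)  # treat a 1-D residual as a single column
    R_y = R_y.reshape(len(R_y), -1)

    # Center each column (assumption: standard canonical correlation setup).
    R_x = R_x - R_x.mean(axis=0)
    R_y = R_y - R_y.mean(axis=0)

    # Orthonormal bases for the two column spaces.
    Q_x, _ = np.linalg.qr(R_x)
    Q_y, _ = np.linalg.qr(R_y)

    # The singular values of Q_x^T Q_y are the canonical correlations.
    rho = np.linalg.svd(Q_x.T @ Q_y, compute_uv=False)
    rho = np.clip(rho, 0.0, 1.0)  # guard against rounding slightly above 1
    return float(np.sum(rho ** 2)), rho.size


rng = np.random.default_rng(1)
A = rng.normal(size=(500, 2))
B = rng.normal(size=(500, 3))

V_indep, s = pillai_trace(A, B)  # unrelated matrices: V close to 0, s = 2
V_same, _ = pillai_trace(A, A)   # identical matrices: every rho_i = 1, so V = 2
```

The statistic always lies between 0 and \(s\); values near 0 indicate that the residuals are nearly uncorrelated.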

The p-value is computed using an \(F\)-approximation, where \(p\) and \(q\) are the numbers of columns in the residual matrices:

\[F = \frac{V / (pq)}{(s - V) / \left[s (n - 1 + s - p - q)\right]} = \frac{V}{pq} \cdot \frac{s (n - 1 + s - p - q)}{s - V},\]

with numerator degrees of freedom \(df_1 = pq\) and denominator degrees of freedom \(df_2 = s (n - 1 + s - p - q)\), where \(n\) is the sample size.
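The arithmetic of the approximation can be checked with a small pure-Python sketch. The function name `pillai_f_approximation` is hypothetical, and \(s\) is taken to be \(\min(p, q)\), the number of canonical correlations for full-rank residual matrices; this is an assumption of the sketch rather than a statement about pgmpy's internals.

```python
def pillai_f_approximation(V, n, p, q):
    """Return (F, df1, df2) for Pillai's trace V with n samples.

    p and q are the numbers of columns of the two residual matrices.
    """
    s = min(p, q)                   # assumed number of canonical correlations
    df1 = p * q                     # numerator degrees of freedom
    df2 = s * (n - 1 + s - p - q)   # denominator degrees of freedom
    if df2 <= 0:
        raise ValueError("sample size too small for the F-approximation")
    if not 0 <= V < s:
        raise ValueError("V must satisfy 0 <= V < s")
    F = (V / df1) * (df2 / (s - V))
    return F, df1, df2


# Example: V = 0.2 from n = 100 samples with p = 2 and q = 3 residual columns.
F, df1, df2 = pillai_f_approximation(0.2, n=100, p=2, q=3)
print(round(F, 3), df1, df2)  # 3.556 6 192
```

The p-value is then the upper-tail probability of an \(F(df_1, df_2)\) distribution at the observed \(F\), which can be evaluated with, for example, `scipy.stats.f.sf(F, df1, df2)`.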

Parameters:
data : pandas.DataFrame

The dataset in which to test the independence condition.

seed : int, optional

Random seed used for the underlying XGBoost models.

Attributes:
statistic_ : float

Pillai’s trace statistic \(V\). Set after calling the test.

p_value_ : float

The p-value for the test, computed via F-approximation. Set after calling the test.

References

[1]

Ankan, Ankur, and Johannes Textor. “A simple unified approach to testing high-dimensional conditional independences for categorical and ordinal data.” Proceedings of the AAAI Conference on Artificial Intelligence.

[2]

Li, C.; and Shepherd, B. E. 2010. Test of Association Between Two Ordinal Variables While Adjusting for Covariates. Journal of the American Statistical Association.

[3]

Muller, K. E.; and Peterson, B. L. 1984. Practical Methods for Computing Power in Testing the Multivariate General Linear Hypothesis. Computational Statistics & Data Analysis.

run_test(X: str, Y: str, Z: list)

Compute Pillai’s trace statistic and p-value.

Sets self.statistic_ (Pillai’s trace) and self.p_value_.