PillaiTrace
- class pgmpy.ci_tests.PillaiTrace(data: DataFrame, seed=None)
Bases: _BaseCITest

Pillai’s trace test for conditional independence with mixed data [1].
This test first residualizes \(X\) and \(Y\) with respect to \([1, Z]\) using an estimator (XGBoost by default). For a continuous target \(T\), the residual is
\[r_T = T - \hat{T}(Z).\]

For a categorical target \(T\) with \(K\) categories, let \(D_T \in \{0, 1\}^{n \times K}\) be the dummy-encoded matrix of \(T\), and let \(\hat{D}_T(Z)\) denote the predicted class probabilities from the classifier. The residual matrix [2] is defined as (the last column is dropped to avoid collinearity):
\[R_T = \operatorname{drop\_last}\left(D_T - \hat{D}_T(Z)\right).\]

Let \(R_X \in \mathbb{R}^{n \times p}\) and \(R_Y \in \mathbb{R}^{n \times q}\) be the residuals of \(X\) and \(Y\), and let \(\rho_1, \ldots, \rho_s\) be the canonical correlations between them. Pillai’s trace statistic is:
\[V = \sum_{i=1}^{s} \rho_i^2.\]

The p-value is computed using an \(F\)-approximation [3], where \(p\) and \(q\) are the numbers of columns of the residual matrices:
\[F = \frac{V / (pq)}{(s - V) / \left[s (n - 1 + s - p - q)\right]} = \frac{V}{pq} \cdot \frac{s (n - 1 + s - p - q)}{s - V},\]with numerator degrees of freedom \(df_1 = pq\) and denominator degrees of freedom \(df_2 = s (n - 1 + s - p - q)\), where \(n\) is the sample size.
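The computation above can be sketched directly in NumPy/SciPy, assuming the residual matrices \(R_X\) and \(R_Y\) are already available. This is an illustration of the formulas, not pgmpy’s implementation, and the function name is hypothetical:

```python
import numpy as np
from scipy import stats


def pillai_trace_pvalue(R_X, R_Y):
    """Pillai's trace statistic V and F-approximation p-value.

    R_X: (n, p) residual matrix of X; R_Y: (n, q) residual matrix of Y.
    """
    n, p = R_X.shape
    q = R_Y.shape[1]
    # Center the columns, then obtain canonical correlations as the
    # singular values of Qx^T Qy, where Qx, Qy are orthonormal bases
    # of the column spaces (principal angles between subspaces).
    R_X = R_X - R_X.mean(axis=0)
    R_Y = R_Y - R_Y.mean(axis=0)
    Qx, _ = np.linalg.qr(R_X)
    Qy, _ = np.linalg.qr(R_Y)
    s = min(p, q)
    rho = np.clip(np.linalg.svd(Qx.T @ Qy, compute_uv=False)[:s], 0.0, 1.0)

    V = np.sum(rho**2)  # Pillai's trace: sum of squared canonical correlations
    df1 = p * q
    df2 = s * (n - 1 + s - p - q)
    F = (V / df1) * (df2 / (s - V))  # F-approximation from the formula above
    p_value = stats.f.sf(F, df1, df2)
    return V, p_value
```

For two independent residual matrices the p-value should be roughly uniform, while introducing a shared column drives it toward zero. pgmpy additionally performs the residualization step (XGBoost by default) before this computation.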
- Parameters:
- data: pandas.DataFrame
The dataset in which to test the independence condition.
- seed: int, optional
Random seed used for the underlying XGBoost models.
- Attributes:
- statistic_: float
Pillai’s trace statistic \(V\). Set after calling the test.
- p_value_: float
The p-value for the test, computed via F-approximation. Set after calling the test.
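The dummy-residual construction for a categorical target described above can be illustrated as follows. This is a sketch that uses scikit-learn’s LogisticRegression as a stand-in for the XGBoost classifier the class uses by default; the simulated data and variable names are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 300
Z = rng.normal(size=(n, 1))  # conditioning variable

# Simulate a categorical target T with K = 3 levels that depends on Z.
logits = np.column_stack([Z[:, 0], -Z[:, 0], np.zeros(n)])
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
T = np.array([rng.choice(3, p=row) for row in probs])

D_T = pd.get_dummies(T).to_numpy(dtype=float)  # n x K dummy-encoded matrix
clf = LogisticRegression(max_iter=1000).fit(Z, T)
D_hat = clf.predict_proba(Z)  # predicted class probabilities, n x K

full_residual = D_T - D_hat  # each row sums to zero (both terms sum to 1) ...
R_T = full_residual[:, :-1]  # ... so the last column is dropped (collinearity)
```

Because every row of \(D_T\) and of \(\hat{D}_T(Z)\) sums to one, the rows of the full residual matrix sum to zero, which is exactly why one column is redundant and dropped.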
References
[1] Ankan, A.; and Textor, J. 2023. A Simple Unified Approach to Testing High-Dimensional Conditional Independences for Categorical and Ordinal Data. Proceedings of the AAAI Conference on Artificial Intelligence.
[2] Li, C.; and Shepherd, B. E. 2010. Test of Association Between Two Ordinal Variables While Adjusting for Covariates. Journal of the American Statistical Association.
[3] Muller, K. E.; and Peterson, B. L. 1984. Practical Methods for Computing Power in Testing the Multivariate General Linear Hypothesis. Computational Statistics & Data Analysis.