5

I am trying to predict a binary (categorical) target from many continuous features, and would like to narrow my feature space before heading into model fitting. I noticed that the SelectKBest class from SKLearn's feature selection package has the following example on the iris dataset (which also involves predicting a categorical target from continuous features):

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)

The example uses the chi2 test to determine which features should be used in the model. However, it is my understanding that the chi2 test is strictly meant for situations where categorical features predict a categorical outcome. I did not think the chi2 test could be used for scenarios like this. Is my understanding wrong? Can the chi2 test be used to test whether a categorical variable is dependent on a continuous variable?

vanchman

4 Answers

5

The SelectKBest class with the chi2 test only works with categorical data. In fact, the result of the test only has real meaning if the features contain nothing but 1's and 0's.

If you inspect the implementation of chi2 a little, you will see that the code only applies a sum across each feature, which means the function expects just binary values. Also, the parameters that the chi2 function receives indicate the following:

def chi2(X, y):
...

X : {array-like, sparse matrix}, shape = (n_samples, n_features_in)
    Sample vectors.
y : array-like, shape = (n_samples,)
    Target vector (class labels).

This means the function expects to receive the feature vectors with all of their samples. But later, when the expected values are calculated, you will see:

feature_count = X.sum(axis=0).reshape(1, -1)    # sum of each feature over all samples (a count only if X is 0/1)
class_prob = Y.mean(axis=0).reshape(1, -1)      # proportion of samples in each class (Y is the binarized label matrix)
expected = np.dot(class_prob.T, feature_count)  # expected per-class sums under independence

And these lines of code only make sense if X and Y contain nothing but 1's and 0's.
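
To make this concrete, here is a minimal sketch with made-up binary data showing what those lines compute (note that, just before them, the implementation also builds an observed matrix as Y.T @ X, where Y is the binarized label matrix):

import numpy as np

# Hypothetical example: 6 samples, 2 binary features, and a binary target
# already expanded to one column per class (as chi2 does internally via
# LabelBinarizer).
X = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [1, 1],
              [0, 0],
              [0, 1]])
Y = np.array([[1, 0],
              [1, 0],
              [1, 0],
              [0, 1],
              [0, 1],
              [0, 1]])

observed = Y.T @ X                               # per-class counts of 1's in each feature
feature_count = X.sum(axis=0).reshape(1, -1)     # total number of 1's per feature
class_prob = Y.mean(axis=0).reshape(1, -1)       # fraction of samples in each class
expected = class_prob.T @ feature_count          # expected counts if feature and class were independent

print(feature_count)   # [[3 2]] -- a genuine count only because X is 0/1
print(expected)        # expected count for each (class, feature) cell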

lalfab
  • But this is indeed confusing given the official doc - https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection – LYu Feb 05 '21 at 05:09
3

I agree with @lalfab; however, it's not clear to me why sklearn provides an example of using chi2 on the iris dataset, which has all continuous variables. The current example in the SelectKBest documentation (https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html) instead uses the digits dataset, whose features are non-negative integer pixel values:

>>> from sklearn.datasets import load_digits
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> X, y = load_digits(return_X_y=True)
>>> X.shape
(1797, 64)
>>> X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
>>> X_new.shape
(1797, 20)
xyzzy
0

My understanding of this is that when using chi2 for feature selection, the dependent variable has to be categorical, but the independent variables can be either categorical or continuous, as long as they are non-negative. What the algorithm tries to do is first build a contingency table in matrix format that reveals the multivariate frequency distribution of the variables. It then tries to find the dependence structure underlying the variables using this contingency table. Chi2 is one way to measure that dependency.

From the Wikipedia on contingency table (https://en.wikipedia.org/wiki/Contingency_table, 2020-07-04):

Standard contents of a contingency table

  • Multiple columns (historically, they were designed to use up all the white space of a printed page). Where each row refers to a specific sub-group in the population (in this case men or women), the columns are sometimes referred to as banner points or cuts (and the rows are sometimes referred to as stubs).
  • Significance tests. Typically, either column comparisons, which test for differences between columns and display these results using letters, or, cell comparisons, which use color or arrows to identify a cell in a table that stands out in some way.
  • Nets or netts which are sub-totals.
  • One or more of: percentages, row percentages, column percentages, indexes or averages.
  • Unweighted sample sizes (counts).

Based on this, purely binary features can easily be summed up as counts, which is how people usually conduct the chi2 test. But as long as the features are non-negative, one can always accumulate them in the contingency table in a "meaningful" way. In the sklearn implementation, they are summed up as feature_count = X.sum(axis=0) and then combined with the class proportions class_prob to form the expected values.
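
To see that accumulation in action, here is a sketch (assuming the implementation quoted in the answer above) that reproduces sklearn's chi2 scores on the iris data by hand, using the per-class feature sums as the "observed" table:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelBinarizer

X, y = load_iris(return_X_y=True)

# One-hot encode the class labels, as chi2 does internally.
Y = LabelBinarizer().fit_transform(y)              # shape (150, 3)

observed = Y.T @ X                                 # per-class sums of each feature
feature_count = X.sum(axis=0).reshape(1, -1)       # total sum per feature
class_prob = Y.mean(axis=0).reshape(1, -1)         # proportion of samples per class
expected = class_prob.T @ feature_count            # "expected" sums under independence

manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
sklearn_scores, _ = chi2(X, y)
print(np.allclose(manual_scores, sklearn_scores))  # True: the "counts" here are just feature sums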

EricX
  • It sounds like you're saying the continuous variable is binned to create a category which then has a frequency count like you'd calculate for a histogram. If this is the case, couldn't one calculate a Chi^2 test statistic and p-value by binning two continuous variables? Of course you'd be losing information by binning and would be better off using Pearson's correlations, but is there any way to quantify what's being lost by the binning process? – DrRaspberry Mar 23 '21 at 02:53
0

In my understanding, you cannot use chi-square (chi2) for continuous variables. The chi2 calculation requires building a contingency table, where you count the occurrences of each category of the variables of interest. As the cells in that R×C table correspond to particular categories, I cannot see how such a table could be built from continuous variables without significant preprocessing.
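
For contrast, a textbook chi2 test of independence builds the full R×C table of category counts first, for example with scipy.stats.chi2_contingency (the categorical data below is made up purely for illustration):

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical data: a categorical feature with 3 levels and a binary target.
rng = np.random.default_rng(0)
feature = rng.integers(0, 3, size=200)
target = rng.integers(0, 2, size=200)

# Full R x C contingency table: one cell per (feature level, class) pair.
table = np.zeros((3, 2), dtype=int)
for f, t in zip(feature, target):
    table[f, t] += 1

stat, p, dof, expected = chi2_contingency(table)
print(stat, p, dof)
print(expected)   # where the "expected count >= 5" condition mentioned below would be checked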

So, the iris example which you quote, in my view, is an example of incorrect usage.

But there are more problems with the existing implementation of chi2 feature reduction in Scikit-learn. First, as @lalfab wrote, the implementation requires binary features, but the documentation is not clear about this. This has led to a common perception in the community that SelectKBest can be used for categorical features, while in fact it cannot. Second, the Scikit-learn implementation fails to enforce the chi2 condition (at least 80% of the cells of the R×C table need to have an expected count >= 5), which leads to incorrect results when some categorical features have many possible values. All in all, in my view this method should be used neither for continuous nor for categorical features (except binary ones). I wrote more about this below:

Here is the Scikit-learn bug report #21455: and here are the article and the alternative implementation:

Data Man
  • Good catch, @Data Man. Another problem with sklearn's chi2 implementation is that it could output sparse data, as pointed out [here](https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/feature_selection/_univariate_selection.py#L228). I'm going to check your references. – Maurício Collaça Sep 15 '22 at 23:55