
I'm experimenting with χ² (chi-squared) feature selection for some text classification tasks. I understand that the χ² test checks for dependence between two categorical variables, so if we perform χ² feature selection for a binary text classification problem with a binary BOW (bag-of-words) vector representation, each χ² test on each (feature, class) pair is a very straightforward χ² test with 1 degree of freedom.
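For concreteness, here is the kind of per-feature test I have in mind, sketched with scipy on made-up counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up 2x2 contingency table for one binary BOW feature vs. a binary class:
# rows = class 0 / class 1, columns = term absent / term present.
table = np.array([[30, 10],
                  [ 5, 25]])

stat, p, dof, expected = chi2_contingency(table)
print(dof)      # 1, i.e. (2 - 1) * (2 - 1) degrees of freedom
print(stat, p)  # the test statistic and its p-value
```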

Quoting from the documentation (http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2):

This score can be used to select the n_features features with the highest values for the χ² (chi-square) statistic from X, which must contain booleans or frequencies (e.g., term counts in document classification), relative to the classes.

It seems to me that we can also perform χ² feature selection on a DF (word count) vector representation. My first question is: how does sklearn discretize integer-valued features into categorical ones?

My second question is similar to the first. From the demo code here: http://scikit-learn.sourceforge.net/dev/auto_examples/document_classification_20newsgroups.html

It seems to me that we can also perform χ² feature selection on a tf-idf vector representation. How does sklearn perform χ² feature selection on real-valued features?

Thank you in advance for your kind advice!

Moses Xu

1 Answer


The χ² feature selection code builds a contingency table from its inputs X (feature values) and y (class labels). Each entry (i, j) corresponds to some feature i and some class j, and holds the sum of the i-th feature's values across all samples belonging to class j. It then computes the χ² test statistic against expected frequencies derived from the empirical distribution over classes (just their relative frequencies in y) and the observed total count of each feature.
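Roughly, the computation looks like this in numpy (toy counts invented for illustration; the result should agree with sklearn.feature_selection.chi2):

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelBinarizer

# Toy data: 4 documents x 3 terms (raw counts), binary classes.
X = np.array([[1, 2, 0],
              [0, 1, 3],
              [2, 0, 1],
              [1, 1, 0]], dtype=float)
y = np.array([0, 1, 1, 0])

# One-hot class indicator matrix; for binary y, LabelBinarizer yields a
# single column, so stack its complement to get one column per class.
Y = LabelBinarizer().fit_transform(y)
if Y.shape[1] == 1:
    Y = np.hstack([1 - Y, Y])

observed = Y.T @ X             # entry (j, i): sum of feature i over class j
class_prob = Y.mean(axis=0)    # empirical class frequencies
feature_count = X.sum(axis=0)  # total count of each feature
expected = np.outer(class_prob, feature_count)

manual = ((observed - expected) ** 2 / expected).sum(axis=0)
print(manual)          # one χ² statistic per feature
print(chi2(X, y)[0])   # sklearn's statistics, for comparison
```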

This works when the feature values are frequencies (of terms, for example) because the sum will be the total frequency of a feature (term) in that class. There's no discretization going on.

It also works quite well in practice when the values are tf-idf values, since those are just weighted/scaled frequencies.
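As a usage sketch (toy corpus and labels invented for illustration), selection on tf-idf vectors looks the same as on counts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["the cat sat on the mat",
        "dogs chase the cat",
        "stocks fell sharply today",
        "markets and stocks rallied"]
labels = [0, 0, 1, 1]  # toy labels: 0 = pets, 1 = finance

X = TfidfVectorizer().fit_transform(docs)         # non-negative tf-idf weights
selector = SelectKBest(chi2, k=5).fit(X, labels)
X_selected = selector.transform(X)                # keep the 5 highest-scoring terms
```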

Fred Foo
  • Thank you very much for your kind answer @larsmans. I understand the **values** in a contingency table -- the cells can take any non-negative real values. What I'm confused about is the **column names** of the contingency table. For example, if the contingency table for feature "X" is built from binary BOW feature vectors, the column names would be "X is present in the document" and "X is absent from the document". What would they be if the underlying feature vectors are integer- or real-valued? Something like "X appears in the document 0-5 times", "X appears in the document 6-10 times", etc.? – Moses Xu Jan 31 '13 at 05:41
  • The columns correspond to the terms directly. As I tried to explain, cell (i, j) contains the total frequency of feature i in class j; no discretization is performed. – Fred Foo Jan 31 '13 at 08:41
  • 1
    Thank you @larsmans for your patient explanation -- I now get how it's calculated. I was thinking about it the wrong way before. The NULL hypothesis really is "document class has no influence over feature frequency". – Moses Xu Jan 31 '13 at 13:09