Feature extraction using Chi2 with Quanteda

Question

I have a data frame df with this structure:

Rank Review
5    good film
8    very good film
..

Then I tried to create a document-term matrix using the quanteda package:

mydfm <- dfm(df$Review, remove = stopwords("english"), stem = TRUE)

I would like to calculate, for each feature (term), its chi2 value with respect to the documents, in order to extract the best features in terms of chi2 value.

Can you help me to resolve this problem please?

EDIT :

> head(mydfm[, 5:10])
Document-feature matrix of: 63,023 documents, 6 features (92.3% sparse).
(showing first 6 documents and first 6 features)
       features
docs    bon accueil conseillèr efficac écout répond
  text1   0       0          0       0     0      0
  text2   1       1          1       1     1      1
  text3   0       0          0       0     0      0
  text4   0       0          0       0     0      0
  text5   0       0          1       0     0      0
  text6   0       0          0       0     1      0
  ...
  text60300 0     0          1       1     1      1

This is my dfm. I then create my tf-idf matrix:

tfidf <- tfidf(mydfm)[, 5:10]

I would like to determine the chi2 value between these features and the documents (here I have 60300 documents):

textstat_keyness(mydfm, target = 2)

But since I have 60300 targets, I don't know how to do this automatically. I see in the quanteda manual that the groups option of the dfm function may solve this problem, but I don't see how to use it.

EDIT 2 :

Rank Review
10   always good
1    nice film
3    fine as usual

Here I try to group the documents with dfm:

 mydfm <- dfm(Review, remove = stopwords("english"), stem = TRUE, groups = Rank)

But it fails to group the documents.

Can you please help me resolve this problem?

Thank you.

score 1 · Accepted Answer · answered Jun 01 '17 at 15:45

See ?textstat_keyness. The default measure is chi-squared. You can change the target argument to set a particular document's frequencies against all other frequencies, e.g.

textstat_keyness(mydfm, target = 1)

for the first document against the frequencies of all others, or

textstat_keyness(mydfm, target = 2)

for the second against all others, etc.
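To make the statistic concrete, here is a minimal base R sketch of the chi-squared value for a single term, built from a 2x2 table of term counts in the target versus everything else. The counts are invented for illustration, and the variable names are hypothetical. textstat_keyness() may additionally apply a continuity correction and attach a sign to the value, so its output need not match this number exactly.

```r
# Hypothetical counts, invented for illustration only
target_term  <- 30    # occurrences of the term in the target document
target_other <- 970   # all other tokens in the target document
ref_term     <- 50    # occurrences of the term in all other documents
ref_other    <- 8950  # all other tokens in all other documents

# 2x2 contingency table (matrix() fills column by column)
tab <- matrix(c(target_term, ref_term, target_other, ref_other),
              nrow = 2,
              dimnames = list(c("target", "reference"), c("term", "other")))

# Pearson chi-squared by hand: sum((observed - expected)^2 / expected)
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
chi2_manual <- sum((tab - expected)^2 / expected)

# The same statistic from base R, without the continuity correction
chi2_base <- unname(chisq.test(tab, correct = FALSE)$statistic)

print(c(manual = chi2_manual, base = chi2_base))
```

A large value means the term is distributed very differently in the target than in the rest of the collection.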

If you want to compare categories of frequencies that group documents, you would need to use the groups = option in dfm() for a supplied variable or one in the docvars. See the example in ?textstat_keyness.
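As a rough sketch of that grouping idea with more recent quanteda releases (assuming version 3 or later, where grouping is done with dfm_group() and textstat_keyness() is provided by the separate quanteda.textstats package), something like the following may work. The toy reviews are invented to mirror the question's data frame; the object names corp, toks, grouped and key are hypothetical.

```r
library(quanteda)
library(quanteda.textstats)  # assumed home of textstat_keyness() in quanteda >= 3

# Toy data in the shape of the question's data frame (invented reviews)
df <- data.frame(
  Rank   = c(10, 1, 3, 10, 1),
  Review = c("always good", "nice film", "fine as usual",
             "very good film", "bad film"),
  stringsAsFactors = FALSE
)

# Build the dfm: tokenize, drop stopwords, stem
corp  <- corpus(df, text_field = "Review")
toks  <- tokens(corp, remove_punct = TRUE)
toks  <- tokens_remove(toks, stopwords("english"))
toks  <- tokens_wordstem(toks)
mydfm <- dfm(toks)

# One row per Rank value instead of one row per review
grouped <- dfm_group(mydfm, groups = docvars(mydfm, "Rank"))

# Chi-squared keyness of the Rank 10 group against all other ranks
key <- textstat_keyness(grouped,
                        target = docnames(grouped) == "10",
                        measure = "chi2")
head(key)
```

Repeating the last call once per distinct Rank value gives one keyness table per group, which is far fewer comparisons than one per document.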

Ken Benoit

OK, thank you, I understand, it's very clear. But in case I have 200000 documents, how can I run it for all of them at once? – dr.nasri84 Jun 01 '17 at 17:17
You probably want to group documents then. For instance, if half are coded "positive" and half "negative", you could compute keyness for the terms in one group versus the other. Otherwise you are right, computing this for 2e5 documents is not useful. – Ken Benoit Jun 01 '17 at 17:40
I just edited my post; can you please help me resolve my question? Thank you. – dr.nasri84 Jun 02 '17 at 07:48
Try `groups = df$Rank` in the call to `dfm()`. – Ken Benoit Jun 02 '17 at 21:54
