Document-Term Matrix with Quanteda

Question

I have a data frame df with this structure:

Rank Review
5    good film
8    very goood film
..

Then I tried to create a DocumentTermMatrix using the quanteda package:

temp.tf <- df$Review %>% tokens(ngrams = 1:1) %>% # generate tokens
  dfm %>% # generate dfm
  convert(to = "tm")

I get this matrix:

> inspect(temp.tf)
<<DocumentTermMatrix (documents: 63023, terms: 23892)>>
Non-/sparse entries: 520634/1505224882
Sparsity           : 100%
Maximal term length: 77
Weighting          : term frequency (tf)
Sample             :

With this structure:

           Terms
Docs        good very film my excellent heart David plus always so
  text14670    1    0    0  0         1     0     0    0      2  0
  text19951    3    0    0  0         0     0     0    1      1  1
  text24305    7    0    2  1         0     0     0    2      0  0
  text26985    6    0    0  0         0     0     0    4      0  1
  text29518    4    0    1  0         1     0     0    3      0  1
  text34547    5    2    0  0         0     0     2    3      1  3
  text3781     3    0    1  4         0     0     0    3      0  0
  text5272     4    0    0  4         0     5     0    3      1  2
  text5367     3    0    1  3         0     0     1    4      0  1
  text6001     3    0    9  1         0     6     0    1      0  1

So I think it is good, but I think that text6001, text5367, text5272 ... refer to the documents' names. My question is: are the rows in this matrix ordered, or are they placed in the matrix at random?

Thank you

EDIT:

I created a document-term frequency matrix:

mydfm <- dfm(df$Review, remove = stopwords("french"), stem = TRUE)

Then I created a tf-idf matrix:

tfidf <- tfidf(mydfm)[, 5:10]

Then I would like to merge the tf-idf matrix with the Rank column to get something like this:

           features
Docs        good very film my excellent heart David plus always so Rank
  text14670    1    0    0  0         1     0     0    0      2  0    3
  text19951    3    0    0  0         0     0     0    1      1  1    2
  text24305    7    0    2  1         0     0     0    2      0  0    4
  text26985    6    0    0  0         0     0     0    4      0  1    5

Can you help me do this merge?

Thank you

score 1 · Accepted Answer · answered Jun 01 '17 at 09:55

The rows (documents) are alphabetically ordered, which is why text14670 comes before text19951. It is possible that the conversion has reordered the documents, but you can test this using

sum(rownames(temp.tf) != sort(rownames(temp.tf)))

If that is not 0, then they are not alphabetically ordered.
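As a rough illustration of what "alphabetically ordered" means for names like these, here is a small base R sketch. The `docs` vector is made up for the example and is not taken from your data. Character sorting compares the names one character at a time, so `text3781` lands after `text14670` even though 3781 is the smaller number.

```r
# Hypothetical document names (not from the original data)
docs <- paste0("text", c(6001, 3781, 14670, 5272))

# Sorting character strings compares them character by character,
# not by the numeric value embedded in the name
sorted_docs <- sort(docs)
print(sorted_docs)

# Number of positions where the current order differs from the sorted order;
# 0 would mean the names are already in alphabetical order
n_mismatch <- sum(docs != sorted_docs)
print(n_mismatch)

# Base R shortcut: TRUE when the vector is not already sorted
print(is.unsorted(docs))
```

The same comparison should carry over to `rownames(temp.tf)`, assuming the converted matrix exposes its document names that way.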

The feature ordering, at least in the quanteda dfm, comes from the order in which the features are found in the texts. You can re-sort both documents and features using dfm_sort().
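A minimal sketch of `dfm_sort()` on toy data, assuming a recent quanteda release in which `dfm()` is built from a `tokens` object. The `reviews` vector is invented for illustration and stands in for `df$Review`.

```r
library(quanteda)

# Toy reviews standing in for df$Review (hypothetical data)
reviews <- c("good film", "very good good film", "excellent film")

# Tokenize first, then build the document-feature matrix
toy_dfm <- dfm(tokens(reviews))

# Feature names in their current order
print(featnames(toy_dfm))

# Sort by total frequency along both margins (documents and features)
sorted_dfm <- dfm_sort(toy_dfm, decreasing = TRUE, margin = "both")
print(featnames(sorted_dfm))
print(docnames(sorted_dfm))
```

Sorting only reorders rows and columns; the counts themselves are unchanged.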

In your code, the tokens(ngrams = 1:1) is unnecessary since dfm() does that and ngrams = 1 is the default.

Also, do you need to convert this to a tm object? Probably most of what you need can be done in quanteda.
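For instance, weighting can stay inside quanteda. The sketch below assumes a quanteda version that provides `dfm_tfidf()` (older releases exposed this as `tfidf()`); the named `reviews` vector is hypothetical.

```r
library(quanteda)

# Hypothetical named reviews; the names become the document names
reviews <- c(text1 = "good film", text2 = "very good film", text3 = "excellent plot")

# Build the document-feature matrix from tokens
toy_dfm <- dfm(tokens(reviews))

# Apply tf-idf weighting without leaving quanteda
toy_tfidf <- dfm_tfidf(toy_dfm)

# Convert to a plain matrix only if another package needs one
tfidf_matrix <- as.matrix(toy_tfidf)
print(dim(tfidf_matrix))
```

The result is still a dfm, so the usual quanteda functions such as `dfm_sort()` keep working on it.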

Ken Benoit

Thank you, last question: is it possible with quanteda to create a tf-idf matrix? Best – dr.nasri84 Jun 01 '17 at 10:10
I just edited my post; can you help me resolve my problem, please? Thanks – dr.nasri84 Jun 01 '17 at 12:42
It's not clear what you are asking - what is `Rank`? _tf-idf_ is cell-specific, so you cannot convert that into a document-level feature (with one value per document). I suggest starting a new SO question and making it clearer what output you expect, with a reproducible example. – Ken Benoit Jun 01 '17 at 16:02