The quanteda package in R has a function, tf(), for calculating the term frequency (tf) of a document-feature matrix (dfm). According to ?tf, it takes four arguments, and my question is about the scheme argument. I don't understand how to use the maxCount option, that is, how to use the maximum feature count per document as the divisor when normalizing the tf. Under 'Usage', the only options listed for scheme are "count", "prop", "propmax", "boolean", "log", "augmented" and "logave". So how can I use the maxCount option?
1 Answer
The short answer is that this is a "bug" in the documentation (for quanteda 0.9.8.0-0.9.8.2): the maxCount option was removed from the function but not from the documentation. Its replacement is scheme = "propmax". To illustrate:
txt <- c(doc1 = "This is a simple, simple, simple document.",
doc2 = "This document is a second document.")
(myDfm <- dfm(txt, verbose = FALSE))
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
## features
## docs this is a simple document second
## doc1 1 1 1 3 1 0
## doc2 1 1 1 0 2 1
Applying the weights:
tf(myDfm, scheme = "prop")
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
## features
## docs this is a simple document second
## doc1 0.1428571 0.1428571 0.1428571 0.4285714 0.1428571 0
## doc2 0.1666667 0.1666667 0.1666667 0 0.3333333 0.1666667
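As a cross-check on the "prop" output above, the same proportions can be reproduced in base R. This is only a sketch: the plain matrix `counts` is a hand-typed stand-in for the dfm (an assumption for illustration), and no quanteda functions are called.

```r
# Hand-typed copy of the counts shown for myDfm (a plain matrix, not a dfm)
counts <- matrix(c(1, 1, 1, 3, 1, 0,
                   1, 1, 1, 0, 2, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(docs = c("doc1", "doc2"),
                                 features = c("this", "is", "a",
                                              "simple", "document", "second")))

# "prop": divide each count by the total count of its document (row).
# The length-2 vector rowSums(counts) recycles down the columns, so every
# entry in row i is divided by the total of row i.
prop <- counts / rowSums(counts)
round(prop, 7)
```

Each row of `prop` sums to 1, and doc1's entry for "simple" is 3/7.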
The "propmax" scheme is supposed to compute the proportion of each count relative to the most frequent count within the document. For doc1, for instance, the maximum feature count is 3, so each term count in that document should be divided by 3. However, in quanteda <= 0.9.8.2 a bug caused it to compute this wrongly:
tf(myDfm, scheme = "propmax")
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
## features
## docs this is a simple document second
## doc1 1.0000000 1.0000000 1.0000000 3 1.0000000 0
## doc2 0.3333333 0.3333333 0.3333333 0 0.6666667 0.3333333
In quanteda v0.9.8.3, this is fixed:
tf(myDfm, scheme = "propmax")
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
## features
## docs this is a simple document second
## doc1 0.3333333 0.3333333 0.3333333 1 0.3333333 0
## doc2 0.5000000 0.5000000 0.5000000 0 1.0000000 0.5
Note: Fixed in 0.9.8.3.
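The corrected "propmax" values can also be verified by hand in base R. Again this is a sketch under the assumption that a plain matrix, `counts`, stands in for the dfm; it does not use or reproduce quanteda's internal implementation.

```r
# Hand-typed copy of the counts shown for myDfm (a plain matrix, not a dfm)
counts <- matrix(c(1, 1, 1, 3, 1, 0,
                   1, 1, 1, 0, 2, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(docs = c("doc1", "doc2"),
                                 features = c("this", "is", "a",
                                              "simple", "document", "second")))

# Maximum feature count within each document: 3 for doc1, 2 for doc2
max_count <- apply(counts, 1, max)

# "propmax": divide each count by the maximum count of its document (row)
propmax <- counts / max_count
round(propmax, 7)
```

The most frequent feature in each document gets a weight of exactly 1, matching the v0.9.8.3 output.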

Ken Benoit
- Thanks for the quick answer! I do have another question, though: is there a way to calculate the 'idf' in quanteda? I only see the 'tf' and 'tfidf' functions but no 'idf'. – csmontt Oct 17 '16 at 13:42
- See ?docfreq. You can always transform that (log, inverse, etc.) into idf. – Ken Benoit Oct 17 '16 at 23:35
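To make the last comment concrete, here is a rough base R sketch of turning document frequencies into an idf. The matrix `counts` is a hand-typed stand-in for the dfm, and the choice of a base-10 log of N / df is an assumption made for illustration; it is one common idf definition and not necessarily what quanteda's own functions use by default.

```r
# Hand-typed copy of the counts shown for myDfm (a plain matrix, not a dfm)
counts <- matrix(c(1, 1, 1, 3, 1, 0,
                   1, 1, 1, 0, 2, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(docs = c("doc1", "doc2"),
                                 features = c("this", "is", "a",
                                              "simple", "document", "second")))

# Document frequency: number of documents in which each feature occurs
doc_freq <- colSums(counts > 0)

# One common idf (assumed here): log10(number of documents / document frequency)
idf <- log10(nrow(counts) / doc_freq)

# Weight the raw counts column by column to get a simple tf-idf
tfidf <- sweep(counts, 2, idf, `*`)
round(tfidf, 5)
```

Features that occur in both documents get an idf of 0 under this definition, so only "simple" and "second" keep non-zero weights.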