I have a term-document matrix. I want to subset it and keep only the terms that appear more than a certain number of times, i.e. the row sum should be greater than a specific number. Is there a quick way to achieve this? By the way, the matrix is huge.
2 Answers
Yes. If you are using the tm package, there is a findFreqTerms function, where you can specify the lowfreq (minimum frequency) you want:
tdm # your term document matrix
your_terms <- findFreqTerms(tdm, lowfreq = [...])
If you want to reduce the tdm to just those frequent terms, you can subset it by them:
tdm[your_terms, ]
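For a concrete picture, here is a minimal sketch on a toy corpus. The documents and the names docs, min_total, keep_terms and tdm_small are made up for illustration. Two things worth knowing: as far as I can tell findFreqTerms() treats lowfreq as an inclusive bound, so "more than n times" means lowfreq = n + 1; and computing the totals with slam::row_sums() works on the sparse representation directly, which should matter for a huge tdm.

```r
library(tm)
library(slam)

# Toy corpus (illustrative only): apple x4, banana x2, cherry x2, date x1
docs <- c("apple banana apple",
          "banana cherry apple",
          "cherry apple date")
tdm <- TermDocumentMatrix(VCorpus(VectorSource(docs)))

min_total <- 2  # hypothetical threshold: keep terms appearing MORE than this

# Option 1: sparse row sums, strict "greater than" comparison
totals     <- row_sums(tdm)
keep_terms <- names(totals)[totals > min_total]
tdm_small  <- tdm[keep_terms, ]

# Option 2: findFreqTerms(); lowfreq is inclusive, hence the + 1
tdm_small2 <- tdm[findFreqTerms(tdm, lowfreq = min_total + 1), ]

Terms(tdm_small)
```

Both options should leave the same rows; the result is still a TermDocumentMatrix, so it can be passed on to other tm functions.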
Hope this helps.

Codutie
- I am actually not looking for the terms. I want to subset the tdm, so that only the terms returned by findFreqTerms() remain in the tdm. – NinjaR Mar 03 '17 at 08:16
- I've just updated the answer. Is this what you mean? – Codutie Mar 03 '17 at 08:27
In the quanteda package:
require(quanteda)
myDfm <- dfm(data_char_ukimmig2010, remove_punct = TRUE)
myDfm
## Document-feature matrix of: 9 documents, 1,644 features (81.9% sparse).
# remove infrequent terms
dfm_trim(myDfm, min_count = 10, verbose = TRUE)
## Removing features occurring:
## - fewer than 10 times: 1,567
## Total features removed: 1,567 (95.3%).
## Document-feature matrix of: 9 documents, 77 features (32.5% sparse).
Other options exist for removing features based on document frequency, and "sparsity" (a relative measure) as defined in the tm package.
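As a rough sketch of those options, assuming a recent quanteda release: there, as far as I know, the term-count argument is called min_termfreq (the min_count used above is the older name), and a dfm is built from tokens() rather than directly from text. The toy documents and the names toks, d, by_count and by_docs are made up for illustration.

```r
library(quanteda)

# Toy documents (illustrative only): apple x4, banana x2, cherry x2, date x1
docs <- c(d1 = "apple banana apple",
          d2 = "banana cherry apple",
          d3 = "cherry apple date")

toks <- tokens(docs, remove_punct = TRUE)
d    <- dfm(toks)

# Keep features whose total count across all documents is at least 3
by_count <- dfm_trim(d, min_termfreq = 3)

# Keep features that occur in at least 2 documents (document frequency)
by_docs <- dfm_trim(d, min_docfreq = 2)

featnames(by_count)
featnames(by_docs)
```

Note that both thresholds are inclusive lower bounds, so "more than n times" corresponds to min_termfreq = n + 1.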

Ken Benoit