How to filter term document matrix based on frequency of occurrence of each term

Question

I have a term-document matrix. I want to subset it, keeping only the terms that appear more than a certain number of times, i.e., those whose row sum exceeds a given threshold. Is there a quick way to do this? Note that the matrix is huge.

Codutie · Answer 1 · 2017-03-03T08:20:11.723

1

Yes. If you are using the tm package, the findFreqTerms function returns the terms whose total frequency is at least lowfreq, so you can set the threshold there:

tdm # your term document matrix
your_terms <- findFreqTerms(tdm, lowfreq = [...])

To reduce the tdm to just those frequent terms, index its rows with them:

tdm[your_terms, ]
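
For a concrete picture, here is a minimal sketch of the two steps above on a toy corpus. The documents, the threshold name `min_count`, and the result names are made up for illustration. Because lowfreq is an inclusive lower bound, "more than n times" becomes lowfreq = n + 1. The second variant filters on row sums directly with slam::row_sums (slam is the sparse-matrix package tm builds on), which avoids converting a large tdm to a dense matrix.

```r
library(tm)

# Toy corpus (hypothetical data, for illustration only)
docs <- c("apple banana apple",
          "banana cherry apple",
          "cherry apple durian")
tdm <- TermDocumentMatrix(VCorpus(VectorSource(docs)))

# Hypothetical threshold: keep terms occurring MORE than min_count times
min_count <- 2

# Variant 1: findFreqTerms, then subset the rows by term name.
# lowfreq is inclusive, hence min_count + 1.
your_terms <- findFreqTerms(tdm, lowfreq = min_count + 1)
tdm_small <- tdm[your_terms, ]

# Variant 2: filter on sparse row sums directly
# (no as.matrix(), so it stays cheap on a huge tdm).
tdm_small2 <- tdm[slam::row_sums(tdm) > min_count, ]

Terms(tdm_small)
Terms(tdm_small2)
```

Both variants should leave the same set of terms; the remaining object is still a TermDocumentMatrix, so it can be passed on to other tm functions.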

hope this helps

edited Mar 03 '17 at 08:20

answered Mar 03 '17 at 08:09


I am actually not looking for the terms. I want to subset the tdm, so that only the terms appearing through findFreqTerms() remain in tdm. – NinjaR Mar 03 '17 at 08:16
I've just updated the answer.. is this what you mean? – Codutie Mar 03 '17 at 08:27

score 1 · Answer 2 · answered Jun 14 '17 at 22:27

In the quanteda package:

require(quanteda)

myDfm <- dfm(data_char_ukimmig2010, remove_punct = TRUE)
myDfm
## Document-feature matrix of: 9 documents, 1,644 features (81.9% sparse).

# remove infrequent terms
dfm_trim(myDfm, min_count = 10, verbose = TRUE)
## Removing features occurring: 
##   - fewer than 10 times: 1,567
##   Total features removed: 1,567 (95.3%).
## Document-feature matrix of: 9 documents, 77 features (32.5% sparse).

dfm_trim can also remove features based on document frequency (the min_docfreq and max_docfreq arguments) and on "sparsity" (a relative measure), as defined in the tm package.
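
The call above reflects the quanteda version current when this answer was written. As far as I know, later releases renamed min_count to min_termfreq and expect dfm() to be given a tokens object rather than raw text, so treat the following as a sketch for a recent quanteda under those assumptions. The result names `dfm_freq` and `dfm_doc` are only illustrative.

```r
library(quanteda)

# Tokenise first, then build the document-feature matrix
toks <- tokens(data_char_ukimmig2010, remove_punct = TRUE)
myDfm <- dfm(toks)

# Keep features occurring at least 10 times across the whole corpus
dfm_freq <- dfm_trim(myDfm, min_termfreq = 10)

# Keep features that appear in at least 3 documents
dfm_doc <- dfm_trim(myDfm, min_docfreq = 3)

# Compare the number of features before and after trimming
nfeat(myDfm)
nfeat(dfm_freq)
nfeat(dfm_doc)
```

The exact feature counts depend on the tokeniser defaults of the installed version, so they may not match the output shown earlier.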
