
I have a term-document matrix. I want to subset it to keep only the terms that appear more than a certain number of times, i.e. the terms whose row sum is greater than a given threshold. Is there a quick way to do this? Note that the matrix is huge.

NinjaR

2 Answers


Yes. If you are using the tm package, the findFreqTerms function returns the terms that occur at least lowfreq times, where lowfreq is the lower bound you specify:

tdm # your term document matrix
your_terms <- findFreqTerms(tdm, lowfreq = [...]) 

To reduce the tdm to only those frequent terms, subset its rows by the returned terms:

tdm[your_terms, ] 
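Because findFreqTerms treats lowfreq as an inclusive bound ("at least"), a strict "more than" filter can also be sketched directly on the row sums. The example below is a minimal illustration under stated assumptions, not the only way to do it: the toy corpus and the min_count threshold are made up for the example, and slam::row_sums is used because it works on the sparse representation without converting the (possibly huge) matrix to a dense one.

```r
library(tm)
library(slam)

# Toy corpus, invented for illustration only
docs <- c("apple banana apple", "banana cherry", "apple banana")
tdm <- TermDocumentMatrix(VCorpus(VectorSource(docs)))

# Hypothetical threshold: keep terms occurring MORE than this many times
min_count <- 2

# Total frequency of each term, computed on the sparse matrix
term_totals <- slam::row_sums(tdm)

# Keep only the rows (terms) whose total exceeds the threshold
tdm_small <- tdm[term_totals > min_count, ]
```

With findFreqTerms, the same strict filter would correspond to lowfreq = min_count + 1 for integer counts.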

Hope this helps.

Codutie
  • I am actually not looking for the terms. I want to subset the tdm, so that only the terms appearing through findFreqTerms() remain in tdm. – NinjaR Mar 03 '17 at 08:16
  • I've just updated the answer. Is this what you mean? – Codutie Mar 03 '17 at 08:27

In the quanteda package:

require(quanteda)

myDfm <- dfm(data_char_ukimmig2010, remove_punct = TRUE)
myDfm
## Document-feature matrix of: 9 documents, 1,644 features (81.9% sparse).

# remove infrequent terms
dfm_trim(myDfm, min_count = 10, verbose = TRUE)
## Removing features occurring: 
##   - fewer than 10 times: 1,567
##   Total features removed: 1,567 (95.3%).
## Document-feature matrix of: 9 documents, 77 features (32.5% sparse).

Other options exist for removing features based on document frequency and on "sparsity" (a relative measure), as defined in the tm package.
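As a hedged sketch for more recent quanteda releases: to the best of my knowledge, the min_count argument was later renamed min_termfreq, and dfm() is now meant to be called on a tokens object rather than directly on a character vector. Check the documentation for your installed version before relying on this. The min_docfreq value below is an arbitrary illustration of the document-frequency option, not a recommendation.

```r
library(quanteda)

# Tokenize first, then build the document-feature matrix
toks <- tokens(data_char_ukimmig2010, remove_punct = TRUE)
myDfm <- dfm(toks)

# Keep features that occur at least 10 times across all documents
trimmed_by_count <- dfm_trim(myDfm, min_termfreq = 10)

# Alternatively, keep features that appear in at least 3 documents
# (3 is an arbitrary value chosen for illustration)
trimmed_by_docs <- dfm_trim(myDfm, min_docfreq = 3)
```

Note that these thresholds are inclusive, so a strict "more than n" filter on counts corresponds to min_termfreq = n + 1.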

Ken Benoit