
I'm sorry for asking another question, but I'm a newbie in text mining and need advice from the pros. After a long struggle with `content_transformer` I now have a clean corpus. My next question:

1. How can I select from the `dtm` the words with the lowest frequencies, so that their frequencies sum to no more than 1% of all words?

For example, I need output in this format:

x 0.5% of all words in the dataset
y 0.2%
z 0.3%

Here the frequencies sum to 1%. How can I do this?

fenton

1 Answer


You can take a look at the `DocumentTermMatrix` function of the tm package. The matrix it returns holds the number of occurrences of each word per document. Summing these counts over the whole corpus should get you where you want to be.

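library(tm)
dtm <- DocumentTermMatrix(corpus)
# word counts over the complete corpus (one entry per term)
counts <- colSums(as.matrix(dtm))

# total number of word occurrences in the corpus
total <- sum(counts)
# relative frequency of each term, as a share of all words
freqs <- counts / total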
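
To then pick the rare words whose combined share is at most 1%, one option is to sort the per-term counts from rarest to most common and keep terms while the running total stays within the limit. The following is only a minimal sketch under some assumptions: `select_rare_terms` is a hypothetical helper (not a tm function), and the toy `counts` vector stands in for the named vector you get from `colSums(as.matrix(dtm))`.

```r
# Hypothetical helper (not part of tm): keep the rarest terms whose
# combined share of all word occurrences does not exceed `max_share`.
select_rare_terms <- function(counts, max_share = 0.01) {
  total  <- sum(counts)                         # all word occurrences
  sorted <- sort(counts)                        # rarest terms first
  keep   <- cumsum(sorted) <= max_share * total # running total within limit
  data.frame(term  = names(sorted)[keep],
             count = unname(sorted[keep]),
             share = unname(sorted[keep]) / total,
             stringsAsFactors = FALSE)
}

# Toy counts standing in for colSums(as.matrix(dtm)) (assumed data)
counts <- c(the = 500, data = 300, text = 190, x = 5, y = 2, z = 3)

rare <- select_rare_terms(counts, max_share = 0.01)
print(rare)
print(sum(rare$share))
```

The result is an ordinary data frame, so it can be used as the new dataset or written out with `write.csv`. Note that this greedy approach stops at the first term that would push the total over the limit, so on a real corpus the selected shares will usually sum to somewhat less than 1% rather than exactly 1%.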
PinkFluffyUnicorn
  • Could you show the code, please, and how to select the words with the lowest frequencies so that their frequencies sum to no more than 1%? Thank you. – fenton Feb 10 '17 at 13:49
  • Thank you, that's good. But how do I find the words whose total frequency sums to 1% and write them to a new dataset? Can you show me the code? – fenton Feb 10 '17 at 16:58