Arrange the words of the Document Term Matrix by frequency in R

Question

I'm sorry for asking another question, but I'm a newbie in text mining and need advice from the pros. After a long struggle with content_transformer, I now have a clean corpus. My next question is:

1. How can I select from `dtm` the words with the lowest frequencies, so that their frequencies sum to no more than 1%?

For example, I need this format:

x 0.5% of all words in the dataset
y 0.2%
z 0.3%

So here the frequencies sum to 1%. How can I do this?

PinkFluffyUnicorn · Answer 1 · 2017-02-14T16:05:36.190

You can take a look at the `DocumentTermMatrix` function of the tm package. It counts the occurrences of each word per document. Summing these counts over the whole corpus should get you where you want to be.

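library(tm)

dtm <- DocumentTermMatrix(corpus)
# word counts summed over the complete corpus (one entry per term)
counts <- colSums(as.matrix(dtm))

# total number of word occurrences in the corpus
total <- sum(counts)
# relative frequency of each term, as a share of all words
freqs <- counts / total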
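
To pick the rarest words whose frequencies together stay within 1%, one possible approach is to sort the relative frequencies in ascending order and keep terms while the running total is at most the threshold. The sketch below is only an illustration: `rare_terms` is a hypothetical helper, not part of tm, and the counts are made-up values standing in for `colSums(as.matrix(dtm))`.

```r
# Hypothetical helper (not part of tm): returns the rarest terms whose
# cumulative share of all word occurrences does not exceed `threshold`.
rare_terms <- function(counts, threshold = 0.01) {
  freqs  <- counts / sum(counts)   # share of each term among all words
  sorted <- sort(freqs)            # ascending: rarest terms first
  # small tolerance so that sums landing exactly on the threshold
  # are not lost to floating-point rounding
  keep <- cumsum(sorted) <= threshold + 1e-12
  data.frame(
    term    = names(sorted)[keep],
    percent = 100 * unname(sorted[keep]),
    stringsAsFactors = FALSE
  )
}

# Made-up counts for illustration; with a real corpus you would use
# counts <- colSums(as.matrix(dtm))
counts <- c(x = 5, y = 2, z = 3, common = 990)

rare <- rare_terms(counts, threshold = 0.01)
print(rare)
```

The result is an ordinary data frame, so it can be used as a new dataset directly or saved with `write.csv`. Note that this takes the rarest terms first; if several terms share the same frequency at the cutoff, which of them are kept depends on the sort order.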

edited Feb 14 '17 at 16:05

answered Feb 10 '17 at 13:25

PinkFluffyUnicorn

Could you show the code, please, and how to select the words with low frequencies, so that their frequencies sum to no more than 1%? Thank you – fenton Feb 10 '17 at 13:49
Thank you, it's good. But how do I find the words whose total frequency sums to 1% and write them to a new dataset? Can you show me the code? – fenton Feb 10 '17 at 16:58
