
I am using the following tm + RWeka code to extract the most frequent ngrams from a set of texts:

library("RWeka")
library("tm")

text <- c('I am good person','I am bad person','You are great','You are more great','todo learn english','He is ok')
BigramTokenizer <- function(x) NGramTokenizer(x,Weka_control(min=2,max=2))
corpus <- Corpus(VectorSource(text))
tdm <- TermDocumentMatrix(corpus,control = list(tokenize = BigramTokenizer))

DF <- data.frame(inspect(tdm))  # densifies the full term-document matrix
DF$sums <- DF$X1 + DF$X2 + DF$X3 + DF$X4 + DF$X5 + DF$X6
MostFreqNgrams <- rownames(head(DF[with(DF, order(-sums)), ]))

It works fine, but what if the data is much bigger? Is there a computationally more efficient way? Furthermore, if there are more variables (e.g. 100), how can I write the `DF$sums` line? Surely there is something more elegant than the following:

DF$sums <- DF$X1+DF$X2+DF$X3+DF$X4+DF$X5+DF$X6+...+DF$X99+DF$X100

Thank you

EDIT: I am wondering if there is a way to extract the most frequent ngrams from the `tdm` TermDocumentMatrix directly and only then create a data frame from those values. What I am doing now is creating a data frame with all the ngrams and then taking the most frequent ones, which does not seem to be the best choice.

    You can use either `Reduce('+', DF)` or `rowSums(DF)` – akrun Nov 08 '15 at 18:19
  • Related, possibly duplicate: [CPU-and-memory efficient NGram extraction with R](http://stackoverflow.com/questions/31424687/cpu-and-memory-efficient-ngram-extraction-with-r) – smci Jul 18 '16 at 21:57
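
A quick illustration of akrun's comment above (this sketch is not from the question itself), assuming `DF` holds only the numeric count columns `X1` … `Xn`:

DF$sums <- rowSums(DF)        # one call, regardless of how many columns
# DF$sums <- Reduce(`+`, DF)  # equivalent: folds `+` over the list of columns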

2 Answers


There is an easier and more efficient way, using the quanteda package for text analysis.

> require(quanteda)
> dtm <- dfm(text, ngrams = 2)
Creating a dfm from a character vector ...
   ... lowercasing
   ... tokenizing
   ... indexing documents: 6 documents
   ... indexing features: 13 feature types
   ... created a 6 x 13 sparse dfm
   ... complete. 
Elapsed time: 0.007 seconds.
> topfeatures(dtm, n = 10)
       i_am     you_are     am_good good_person      am_bad  bad_person   are_great    are_more 
          2           2           1           1           1           1           1           1 
 more_great  todo_learn 
          1           1 

The resulting matrix is sparse and very efficient. In the GitHub version, the `ngrams()` function (which is called by `dfm()`) is implemented in C++, so it's even faster.

Ken Benoit
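
Note that the `dfm(text, ngrams = 2)` call above reflects the quanteda API of 2015; in current releases (v2 and later) the `ngrams` argument has moved out of `dfm()`. A sketch of the equivalent modern pipeline, assuming quanteda v2+:

library(quanteda)
toks <- tokens(text)                     # tokenize the character vector first
dtm  <- dfm(tokens_ngrams(toks, n = 2))  # form bigrams, then build the sparse dfm
topfeatures(dtm, n = 10)                 # most frequent bigrams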

Based on your edit, you could use the following:

my_matrix <- as.matrix(tdm[findFreqTerms(tdm, lowfreq = 2), ])  # densify only the ngrams occurring at least twice
DF <- data.frame(my_matrix, sums = rowSums(my_matrix))
DF
        X1 X2 X3 X4 X5 X6 sums
i am     1  1  0  0  0  0    2
you are  0  0  1  1  0  0    2
phiver
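
A closing note on the efficiency question: tm stores a TermDocumentMatrix as a sparse slam simple_triplet_matrix, so the row sums and the top ngrams can be computed without densifying anything. A minimal sketch, assuming the slam package (a dependency of tm) is installed:

library(slam)
freqs <- sort(row_sums(tdm), decreasing = TRUE)  # sparse row sums, named by ngram
MostFreqNgrams <- names(head(freqs, 10))         # ten most frequent ngrams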