
I am using the "tm" package in R to create a document-term matrix, and "RWeka" to extract trigrams, as shown in the code below:

library(tm)
library(RWeka)

myCorpus <- VCorpus(VectorSource(reddata$Tweet))

# create a trigram tokenizer function
TriTok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm <- DocumentTermMatrix(myCorpus, control = list(tokenize = TriTok))

The problem is that RWeka seemingly just goes through the list of terms and splits after every three words to get trigrams. For example, the sentence

 On hot summer days I enjoy eating ice cream 

would be split into

"On hot summer"    "days I enjoy"    "eating ice cream"

But a phrase such as

"hot summer days"

would be ignored. Is there a way to get RWeka to include all trigrams or is there another option?

Thanks in advance!

Sebastian
  • When I apply your `TriTok` function to your phrase, it returns `"On hot summer" "hot summer days" "summer days I" "days I enjoy" "I enjoy eating" "enjoy eating ice" "eating ice cream"`, so I'm wondering if the problem is elsewhere. Do you have example data? – Luke C May 29 '17 at 21:52
  • Thanks very much for pointing this out. Went through my code again. I was just too stupid to realise that I had passed the wrong argument to DocumentTermMatrix( control = ). Instead of the TriTok generated by RWeka I was using a custom function. – Sebastian May 30 '17 at 09:21
  • Oh believe me, I've been there. Glad you sorted it! – Luke C May 30 '17 at 18:01
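
For reference, the overlapping behaviour described in the comments can be reproduced without RWeka. The following is only a minimal base R sketch: `ngrams_sliding` is a hypothetical helper, not part of tm or RWeka, and it assumes plain whitespace-separated text. It can serve as a quick sanity check of what a sliding-window trigram tokenizer should return for the example sentence.

```r
# Hypothetical helper (not part of tm or RWeka): sliding-window n-grams in base R.
ngrams_sliding <- function(sentence, n = 3) {
  # Split on runs of whitespace; assumes plain whitespace-separated text.
  words <- strsplit(trimws(sentence), "\\s+")[[1]]
  # Too few words to form even one n-gram.
  if (length(words) < n) return(character(0))
  # One n-gram starts at every word position that leaves room for n words.
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

trigrams <- ngrams_sliding("On hot summer days I enjoy eating ice cream")
print(trigrams)
```

Comparing this output with what the tokenizer passed in `control = list(tokenize = ...)` returns on the same sentence should show whether the tokenizer or the argument passed to `DocumentTermMatrix` is at fault.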
