Extracting all possible ngrams in R tm document term matrix

Asked May 29 '17 at 20:05

Active May 31 '17 at 13:15

I am using the "tm" package in R to create a document-term matrix. I then use "RWeka" to extract trigrams, as shown in the code below:

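library(tm)
library(RWeka)

myCorpus <- VCorpus(VectorSource(reddata$Tweet))

# Tokenizer that emits only trigrams (min = max = 3)
TriTok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Build the document-term matrix with the trigram tokenizer
tdm <- DocumentTermMatrix(myCorpus, control = list(tokenize = TriTok))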

The problem is that RWeka seemingly just goes through the list of terms and splits after every three words to get trigrams. For example, the sentence:

 On hot summer days I enjoy eating ice cream

would be split into

"On hot summer"    "days I enjoy"    "eating ice cream"

But, for example, the phrase

"hot summer days"

would be ignored. Is there a way to get RWeka to include all trigrams, or is there another option?

Thanks in advance!

Sebastian

When I apply your `TriTok` function to your phrase, it returns `"On hot summer" "hot summer days" "summer days I" "days I enjoy" "I enjoy eating" "enjoy eating ice" "eating ice cream"` so I'm wondering if the problem is elsewhere. Do you have example data? – Luke C May 29 '17 at 21:52
Thanks very much for pointing this out. Went through my code again. I was just too stupid to realise that I had passed the wrong argument to DocumentTermMatrix( control = ). Instead of the TriTok generated by RWeka I was using a custom function. – Sebastian May 30 '17 at 09:21
Oh believe me, I've been there. Glad you sorted it! – Luke C May 30 '17 at 18:01
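
For anyone who ends up here with a custom tokenizer that drops overlapping trigrams: one way to sanity-check it is to compare its output against a plain sliding window over the words. The snippet below is only a minimal base-R sketch of that idea. `sliding_ngrams` is a hypothetical helper name, it splits on whitespace only, and it is not how RWeka implements `NGramTokenizer`.

```r
# Hypothetical helper (not part of tm or RWeka): return every overlapping
# word n-gram of a single string, using a simple whitespace split.
sliding_ngrams <- function(text, n = 3) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  # Fewer than n words means there is no complete n-gram
  if (length(words) < n) return(character(0))
  # One window starts at each position that still leaves room for n words
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

trigrams <- sliding_ngrams("On hot summer days I enjoy eating ice cream")
print(trigrams)
```

For the example sentence this yields the seven overlapping trigrams listed in the first comment, including "hot summer days". If a tokenizer passed via `control = list(tokenize = ...)` returns fewer, the tokenizer itself (or the argument being passed) is the likely culprit.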
