I tokenized a corpus using a trigram tokenizer built on the RWeka package:
> TriGramTokenizer <- function(x){NGramTokenizer(x, Weka_control(min=3, max=3))}
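For context, the term-document matrix was then built roughly like this (a sketch of what I did; the directory path is a placeholder and the exact call may have differed):

> library(tm)
> library(RWeka)
> corpus <- Corpus(DirSource("final/en_US"))   # folder holding the *.capped.txt files -- path is a guess
> tdm_trigram <- TermDocumentMatrix(corpus, control = list(tokenize = TriGramTokenizer))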
Inspecting the resulting term-document matrix shows that the trigrams look like this:
> inspect(tdm_trigram[1:10, 1:3])
A term-document matrix (10 terms, 3 documents)
Non-/sparse entries: 10/20
Sparsity : 67%
Maximal term length: 17
Weighting : term frequency (tf)
                            Docs
Terms                        en_US.blogs.capped.txt en_US.news.capped.txt
  \u0097 age believe                              0                     1
  \u0095 all tradeable                            0                     1
  \u0093 amazing feat\u0094                       0                     1
  \u0097 appear poised                            0                     1
  \u0096 areas muslim                             0                     1
What is the \u0097 character? I preprocessed my corpus with the usual methods from the tm library (stripWhitespace, removePunctuation and so on).
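Concretely, the cleaning steps were along these lines (the tolower and removeNumbers steps are the "and so on" part; the exact order is from memory):

> corpus <- tm_map(corpus, content_transformer(tolower))
> corpus <- tm_map(corpus, removePunctuation)
> corpus <- tm_map(corpus, removeNumbers)
> corpus <- tm_map(corpus, stripWhitespace)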
Should I perhaps read the files in using a different encoding?
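Something like this is what I had in mind, i.e. declaring the encoding when building the corpus ("UTF-8" here is just an example value, not something I have verified for these files):

> corpus <- Corpus(DirSource("final/en_US", encoding = "UTF-8"))
> # or convert the raw text after reading, e.g. iconv(x, from = "WINDOWS-1252", to = "UTF-8")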