What's the first element in my trigrams?

Question

Using a trigram tokenizer from the RWeka package

> TriGramTokenizer <- function(x){NGramTokenizer(x, Weka_control(min=3, max=3))}

I tokenized a corpus. Inspection shows that the trigrams look like this:

> inspect(tdm_trigram[1:10, 1:3])
A term-document matrix (10 terms, 3 documents)

Non-/sparse entries: 10/20
Sparsity           : 67%
Maximal term length: 17 
Weighting          : term frequency (tf)

                           Docs
Terms                       en_US.blogs.capped.txt en_US.news.capped.txt
  \u0097 age believe                             0                     1
  \u0095 all tradeable                           0                     1
  \u0093 amazing feat\u0094                      0                     1
  \u0097 appear poised                           0                     1
  \u0096 areas muslim                            0                     1

What is the \u0097? I preprocessed my corpus with the usual methods from the tm library (stripWhitespace, removePunctuation and so on).

Should I perhaps read the files in using a different encoding?

Henry · Answer 1 · 2016-11-28T16:17:15.530

2

These are Unicode control characters (from the C1 control block, U+0080 to U+009F) that your tokenizer has treated as words.

In older versions of Unicode

U+0097 was END OF GUARDED AREA
U+0095 was MESSAGE WAITING
U+0093 was SET TRANSMIT STATE
U+0096 was START OF GUARDED AREA

You may want to strip them out before building your trigrams.
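One way to do that stripping is sketched below in base R. It is a minimal sketch and not necessarily what the answer had in mind: the helper name `strip_c1_controls` is hypothetical, and it assumes the text is already held as UTF-8 strings in which the stray characters appear as code points U+0080 to U+009F. Each such code point is replaced with a space rather than deleted, so that neighbouring words are not glued together.

```r
# Hypothetical helper: replace C1 control characters (U+0080 to U+009F)
# with a space in every element of a character vector.
strip_c1_controls <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    cp <- utf8ToInt(enc2utf8(s))            # string -> integer code points
    cp[cp >= 0x80L & cp <= 0x9FL] <- 0x20L  # C1 control -> space
    intToUtf8(cp)                           # code points -> string
  }, character(1), USE.NAMES = FALSE)
}

# Two of the terms from the question, used here as sample input
terms <- c("\u0097 age believe", "\u0093 amazing feat\u0094")
cleaned <- strip_c1_controls(terms)
```

With tm, a function like this could presumably be applied before tokenizing via `tm_map(corpus, content_transformer(strip_c1_controls))`, where `corpus` stands for your own corpus object, followed by `stripWhitespace` to collapse the extra spaces. On the encoding question: in Windows-1252 the bytes 0x93 to 0x97 are curly quotes, a bullet and dashes, so it is possible the files are Windows-1252 text that was read as Latin-1; if so, declaring that encoding when reading may be the cleaner fix.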

edited Nov 28 '16 at 16:17

answered Jul 18 '15 at 10:42
