
I am doing some text processing and would like to keep the one- and two-letter words in a TermDocumentMatrix.

The issue is that it seems to include only words of three or more letters.

    library(tm)
    library(RWeka)

    test<-'This is a test.'

    testmyCorpus<-Corpus(VectorSource(test))
    testTDF<-TermDocumentMatrix(testmyCorpus, control=list(tokenize=AlphabeticTokenizer))
    inspect(testTDF)

Only the words "this" and "test" are displayed. Any ideas?

Thanks a lot for your help!

Robert

1 Answer


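There is already an answer to almost exactly this problem. In short, add the option wordLengths=c(1,Inf) to the control list passed to TermDocumentMatrix, i.e. control=list(wordLengths=c(1,Inf)). By default, wordLengths is c(3,Inf), so terms shorter than three characters are dropped. With your code, that would be control=list(tokenize=AlphabeticTokenizer, wordLengths=c(1,Inf)).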
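As a minimal sketch of that fix, here is the example from the question with wordLengths set. Note the assumptions: it uses VCorpus and tm's default tokenizer instead of RWeka's AlphabeticTokenizer (to avoid the Java dependency), and it adds removePunctuation=TRUE so that "test." is counted as "test". The variable names are only illustrative.

```r
library(tm)

test <- "This is a test."

# VCorpus is used here (an assumption, not the question's Corpus() call)
# so that all control options are honoured by TermDocumentMatrix.
corpus <- VCorpus(VectorSource(test))

# wordLengths = c(1, Inf) keeps terms of any length;
# the default, c(3, Inf), drops one- and two-letter words.
tdm <- TermDocumentMatrix(
  corpus,
  control = list(
    wordLengths = c(1, Inf),
    removePunctuation = TRUE  # assumption: strip the trailing period
  )
)

inspect(tdm)
```

The short words "a" and "is" should now appear alongside "this" and "test". If you keep your own tokenizer, pass it in the same control list next to wordLengths.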

Nikita Astrakhantsev
  • Hi @Robert if this or any answer has solved your question please consider [accepting it](http://meta.stackexchange.com/q/5234/179419) by clicking the check-mark. This indicates to the wider community that you've found a solution and gives some reputation to both the answerer and yourself. There is no obligation to do this. – Nikita Astrakhantsev Mar 12 '15 at 22:01