
I am trying to do some text mining in R without removing any special characters. For example, in the following, "LKC" and "LKC_" should be counted as different words; instead tm drops the "_" and treats them as the same word. How can I accomplish this?

library(tm)

special <- c("OLAC_ LA LAC LAC_ LAC_E AC AC_ AC_E AC_ET",
             ")LK )LKC )LKC- LK LKC LKC-",
             "LAC_ LAC_E LKC LKC-")

bagOfWords <- Corpus(VectorSource(special))

mydocsDTM <- DocumentTermMatrix(bagOfWords, control = list(
  removePunctuation = FALSE,
  preserve_intra_word_contractions = FALSE,
  preserve_intra_word_dashes = FALSE,
  removeNumbers = FALSE,
  stopwords = FALSE,
  stemming = FALSE
))

inspect(mydocsDTM)
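
(For what it's worth, the collapsing appears to come from tm's default tokenizer and its default minimum term length of 3, not from the options above. A minimal sketch of a workaround within tm itself, supplying a whitespace tokenizer through the tokenize control option; whitespaceTokenizer is just a helper defined here:)

# Sketch: split on whitespace only, so "_" and "-" stay inside tokens.
whitespaceTokenizer <- function(x) unlist(strsplit(as.character(x), "\\s+"))

mydocsDTM2 <- DocumentTermMatrix(bagOfWords, control = list(
  tokenize    = whitespaceTokenizer,
  tolower     = FALSE,        # keep case, as in the answer below
  wordLengths = c(1, Inf)     # keep short terms such as "LA" and "AC"
))
inspect(mydocsDTM2)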

1 Answer


This is easily done with the quanteda package, after which you can convert the result to a DocumentTermMatrix, or just keep using quanteda.

library("quanteda")
# "fasterword" tokenizes on whitespace only, so "_" and "-" stay inside
# tokens; tolower = FALSE preserves case.
qdfm <- dfm(special, tolower = FALSE, what = "fasterword")
qdfm
# Document-feature matrix of: 3 documents, 15 features (57.8% sparse).
# 3 x 15 sparse Matrix of class "dfm"
#        features
# docs    OLAC_ LA LAC LAC_ LAC_E AC AC_ AC_E AC_ET )LK )LKC )LKC- LK LKC LKC-
#   text1     1  1   1    1     1  1   1    1     1   0    0     0  0   0    0
#   text2     0  0   0    0     0  0   0    0     0   1    1     1  1   1    1
#   text3     0  0   0    1     1  0   0    0     0   0    0     0  0   1    1

convert(qdfm, to = "tm")
# <<DocumentTermMatrix (documents: 3, terms: 15)>>
# Non-/sparse entries: 19/26
# Sparsity           : 58%
# Maximal term length: 5
# Weighting          : term frequency (tf)
Ken Benoit
  • Thank you, that works perfectly. The quanteda package seems to have many efficiency improvements. I did notice I had to remove the `what = "fasterword"` option to produce the matrix on my larger documents. My feature count is up to 61,000 -- frankly a data issue I will need to work on. – jz_ Feb 11 '18 at 12:56
  • Correction -- it appears the error I am hitting occurs at 10,000 documents – jz_ Feb 11 '18 at 15:34
  • Hmmm, we batch process tokens in 10,000-document chunks, and it sounds like this could be related. Try `tokens(x, what = "fasterword") %>% dfm(tolower = FALSE)` and see if the first part creates the error (sketched after these comments). If it does, please file an issue. – Ken Benoit Feb 11 '18 at 15:36
  • I opened this issue: https://github.com/quanteda/quanteda/issues/1225. Feel free to add your experiences to that. – Ken Benoit Feb 11 '18 at 16:06
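
As a quick illustration of the two-step pipeline suggested in the comments above (a sketch assuming quanteda v1.x, where dfm() accepts a tokens object):

library("quanteda")

# Tokenize first (whitespace-only splitting keeps "_" and "-"), then
# build the document-feature matrix without lowercasing. If the failure
# is in tokenization, the first step will reproduce it on its own.
toks  <- tokens(special, what = "fasterword")
qdfm2 <- dfm(toks, tolower = FALSE)
qdfm2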