Empty term document matrix

Question

I run into a problem whenever I try to inspect my frequent words and associations.

When I make the tdm I get this info: TermDocumentMatrix

I can see I have plenty of terms to use, in plenty of documents. However!

When I try to inspect the content of "tdm", I get this info: Inspecting the TDM

How come the tdm is suddenly empty?

Hope someone can help

library(twitteR)  # provides userTimeline() and twListToDF(); assumes an authenticated session
tweets <- userTimeline("RDataMining", n = 1000)

(n.tweet <- length(tweets))
tweets[1:3]

#convert tweets to a data frame
tweets.df <- twListToDF(tweets)
dim(tweets.df)


##Text cleaning
library(tm)
#build a corpus and specify the source to be a character vector
myCorpus <- Corpus(VectorSource(tweets.df$text))

#convert to lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower)) 

#remove URLs
removeURL <- function(x) gsub ("http[^[:space:]]*","",x) 
myCorpus <- tm_map(myCorpus,content_transformer(removeURL))

#remove anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*","",x)
myCorpus <- tm_map(myCorpus,content_transformer(removeNumPunct))

#remove stopwords + 2
myStopwords <- c(stopwords('english'),"available","via")
#remove "r" and "big" from stopwords
myStopwords <- setdiff(myStopwords, c("r","big"))
#remove stopwords from corpus
myCorpus <- tm_map(myCorpus,removeWords,myStopwords)
#remove extra whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)

#keep a copy of corpus to use later as a dictionary for stem completion
myCorpusCopy <- myCorpus

#stem words
library(SnowballC)
myCorpus <- tm_map(myCorpus,stemDocument)
stemCompletion2 <- function(x,dictionary) {
  x <- unlist(strsplit(as.character(x),""))
  #because stemCompletion completes an empty string to a word in dict. Remove empty string to avoid this
  x <- x[x !=""]
  x <- stemCompletion(x, dictionary = dictionary)
  x <- paste (x,sep = "",collapse = "")
  PlainTextDocument(stripWhitespace(x))
}

myCorpus <- lapply(myCorpus, stemCompletion2, dictionary = myCorpusCopy)
myCorpus <- Corpus(VectorSource(myCorpus))

#count freq of "mining"
miningCases <- lapply(myCorpusCopy,
                  function(x) {grep(as.character(x),pattern = "\\<mining")})
sum(unlist(miningCases))

#count freq of "miner"
miningCases <- lapply(myCorpusCopy,
                  function(x) {grep(as.character(x),pattern = "\\<miner")})
sum(unlist(miningCases))

#count freq of "r"
miningCases <- lapply(myCorpusCopy,
                  function(x) {grep(as.character(x),pattern = "\\<r")})
sum(unlist(miningCases))

#replace "miner" with "mining"
myCorpus <- tm_map(myCorpus,content_transformer(gsub),
               pattern = "miner", replacement = "mining")

tdm <- TermDocumentMatrix(myCorpus, control = list(removePunctuation = TRUE, stopwords = TRUE))
tdm

##Freq words and associations
idx <- which(dimnames(tdm)$Terms == "r")
inspect(tdm[idx + (0:5), 101:110])

#inspect frequent words
(freq.terms <- findFreqTerms(tdm, lowfreq = 15))
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq,term.freq >= 15)
df <- data.frame(term = names(term.freq), freq = term.freq)

If I use inspect(tdm), I get a long list of strings. – Theis Abildgaard Rasmussen, May 28 '16 at 10:46

score 0 · Answer 1 · answered May 29 '16 at 14:57

I've been using the following Twitter query to test your code:

tweets = searchTwitter("r data mining", n=10)

and I think the problem is with your function stemCompletion2, which should look something like this:

stemCompletion2 <- function(x,dictionary) {
  x <- unlist(strsplit(as.character(x)," "))
  print("before:")
  print(x)

  #because stemCompletion completes an empty string to a word in dict. Remove empty string to avoid this
  x <- x[x !=""]
  x <- stemCompletion(x, dictionary = dictionary)
  print("after:")
  print(x)
  x <- paste(x, sep = " ")
  PlainTextDocument(stripWhitespace(x))
}

The modifications are as follows: before you had

x <- unlist(strsplit(as.character(x),""))

which was creating a vector of all the individual characters in each document, and I've modified it to

x <- unlist(strsplit(as.character(x)," "))

to create a vector of words. Similarly, when recomposing your documents, you were doing

x <- paste (x,sep = "",collapse = "")

which was creating the long strings you mention in your post, and I've modified it to:

x <- paste(x, sep = " ")

to recompose the words.
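
To see the two changes in isolation, here is a minimal base R sketch on a made-up toy string (the text and variable names are illustrative, not from the original data). It also shows a detail of paste() that may matter here: with a single vector, sep leaves the vector as one element per word, while collapse joins it into one string. If one string per document is what you want, collapse = " " is probably the argument to use; that is an observation about paste(), not something confirmed in the answer above.

```r
# Toy document; the text is made up for illustration.
doc <- "r data mining"

# Splitting on "" yields one element per character (spaces included).
chars <- strsplit(doc, "")[[1]]
# Splitting on " " yields one element per word.
words <- strsplit(doc, " ")[[1]]

length(chars)  # 13
length(words)  # 3

# sep only matters when paste() receives several arguments; with a
# single vector it returns that vector unchanged.
unchanged <- paste(words, sep = " ")
# collapse joins the elements of one vector into a single string.
joined <- paste(words, collapse = " ")

length(unchanged)  # 3
joined             # "r data mining"
```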

Here is one example of the completions for my data:

[1] "before:"
 [1] "rt"             "ebookdealalert" "r"              "datamin"        "project"        "learn"          "data"           "mine"          
 [9] "realworld"      "project"        "book"           "solv"           "predict"        "model"         
[1] "after:"
               rt    ebookdealalert                 r           datamin           project             learn              data              mine 
             "rt" "ebookdealalerts"               "r"      "datamining"        "projects"           "learn"            "data"                "" 
        realworld           project              book              solv           predict             model 
      "realworld"        "projects"            "book"           "solve"      "predictive"        "modeling"

After that step, you may be able to work with TermDocumentMatrix as expected.
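
If the inspected subset still comes back empty, one more thing may be worth checking. This is an assumption about the setup in the question, not something confirmed there: tm drops terms shorter than three characters by default (wordLengths = c(3, Inf)), so "r" may never make it into the matrix. In that case which(...) returns integer(0), and indexing the tdm with idx + (0:5) selects no rows. A small sketch with made-up toy documents, using VCorpus for illustration:

```r
library(tm)

# Toy documents, made up for illustration.
docs <- c("r data mining project", "data mining with r", "text mining tutorial")
corpus <- VCorpus(VectorSource(docs))

# Default control: terms shorter than 3 characters are dropped.
tdm_default <- TermDocumentMatrix(corpus)
# Keep terms of any length, including the one-letter term "r".
tdm_all <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))

"r" %in% Terms(tdm_default)  # should be FALSE with the default wordLengths
"r" %in% Terms(tdm_all)      # should be TRUE

# When the term is missing, the index is empty, and so is the row selection.
idx <- which(Terms(tdm_default) == "r")
idx + (0:5)                  # integer(0)

# Inspect the full matrix and list terms that occur at least twice.
inspect(tdm_all)
findFreqTerms(tdm_all, lowfreq = 2)
```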

Hope it helps.
