(Fast) word frequency matrix in R

Question

I am writing an R program that involves analyzing a large amount of unstructured text data and creating a word-frequency matrix. I've been using the wfm and wfdf functions from the qdap package, but have noticed that this is a bit slow for my needs. It appears that the production of the word-frequency matrix is the bottleneck.

The code for my function is as follows.

library(qdap)
liwcr <- function(inputText, dict) {
  if(!file.exists(dict)) 
    stop("Dictionary file does not exist.")

  # Read in dictionary categories
  # Start by figuring out where the category list begins and ends
  dictionaryText <- readLines(dict)
  if(!length(grep("%", dictionaryText))==2)
    stop("Dictionary is not properly formatted. Make sure category list is correctly partitioned (using '%').")

  catStart <- grep("%", dictionaryText)[1]
  catStop <- grep("%", dictionaryText)[2]
  dictLength <- length(dictionaryText)

  dictionaryCategories <- read.table(dict, header=F, sep="\t", skip=catStart, nrows=(catStop-2))

  wordCount <- word_count(inputText)

  outputFrame <- dictionaryCategories
  outputFrame["count"] <- 0

  # Now read in dictionary words

  no_col <- max(count.fields(dict, sep = "\t"), na.rm=T)
  dictionaryWords <- read.table(dict, header=F, sep="\t", skip=catStop, nrows=(dictLength-catStop), fill=TRUE, quote="\"", col.names=1:no_col)

  workingMatrix <- wfdf(inputText)
  for (i in workingMatrix[,1]) {
    if (i %in% dictionaryWords[, 1]) {
      occurrences <- 0
      foundWord <- dictionaryWords[dictionaryWords$X1 == i,]
      foundCategories <- foundWord[1,2:no_col]
      for (w in foundCategories) {
        if (!is.na(w) & (!w=="")) {
          existingCount <- outputFrame[outputFrame$V1 == w,]$count
          outputFrame[outputFrame$V1 == w,]$count <- existingCount + workingMatrix[workingMatrix$Words == i,]$all
        }
      }
    }
  }
  return(outputFrame)
}

I realize the for loop is inefficient, so in an effort to locate the bottleneck, I tested it without this portion of the code (simply reading in each text file and producing the word-frequency matrix), and seen very little in the way of speed improvements. Example:

library(qdap)
fn <- reports::folder(delete_me)
n <- 10000

lapply(1:n, function(i) {
    out <- paste(sample(key.syl[[1]], 30, T), collapse = " ")
    cat(out, file=file.path(fn, sprintf("tweet%s.txt", i)))
})

filename <- sprintf("tweet%s.txt", 1:n)

for(i in 1:length(filename)){
  print(filename[i])
  text <- readLines(paste0("/toshi/twitter_en/", filename[i]))
  freq <- wfm(text)
}

The input files are Twitter and Facebook status postings.

Is there any way to improve the speed for this code?

EDIT2: Due to institutional restrictions, I can't post any of the raw data. However, just to give an idea of what I'm dealing with: 25k text files, each with all the available tweets from an individual Twitter user. There are also an additional 100k files with Facebook status updates, structured in the same way.

It looks like your question might be closed soon. Perhaps add a sample data set and the code to analyze it. Then ask how to improve that code. That might keep your question open. — Mark Miller, Jun 22 '14 at 04:09
I think you may want to create a `TermDocumentMatrix` with the `tm` package. They are essentially the same but `tm` is optimized for larger data sets (qdap is designed for smaller more structured data sets of transcripts). If you really want the `wfm` then coerce the `TermDocumentMatrix` using `as.wfm`. See the [qdap-tm compatibility vignette](http://cran.r-project.org/web/packages/qdap/vignettes/tm_package_compatibility.pdf) for more. — Tyler Rinker, Jun 22 '14 at 04:21
Thanks @MarkMiller. The problem is that I'm designing a package that I'm hoping to submit to CRAN, so it's not specific to a single dataset. I've updated to include the code for my function. — Sean C. Rife, Jun 22 '14 at 04:58
You haven't provided data, this may stop a 5th and final person from closing. What I am looking at is a ton of `for` loops. I'm guessing you could do a bunch of this in a much more vectorized format. I'm also guessing you could read all the data in at once and store as a data.frame with different file identifiers as this is how qdap is intended to be used. So you'd have one column for `text.var` and one column for `grouping.var` (the later is the `filename[i]`) You could do this with `lapply`, `setNames` and `qdapTools::list2df`. But supply 3 sample tweets for help. — Tyler Rinker, Jun 22 '14 at 12:38
I added a reproducible example to help keep this from being closed. — Tyler Rinker, Jun 22 '14 at 12:48
Many thanks, @TylerRinker! I'll give it a shot. Unfortunately, I can't post the actual data files (or anything from them) due to institutional restrictions (silly, I know, as it's public data - but I'm in academia, and the rules are a bit strict). — Sean C. Rife, Jun 22 '14 at 13:35
Removed explicit request for package recommendation. The question should now conform to on-topic criteria. — Sean C. Rife, Jun 22 '14 at 14:13
@Sean You have updated the text, but you were still not asking a question ;-) I have added one. — Jan Doggen, Jun 27 '14 at 13:55

Tyler Rinker · Accepted Answer · 2014-06-22T13:34:14.453

Here is a qdap approach and a mixed qdap/tm approach that is faster. I provide the code and then the timings on each. Basically I read everything in at once and operator on the entire data set. You could then split it back apart if you wanted with split.

A MWE that you should provide with questions

library(qdap)
fn <- reports::folder(delete_me)
n <- 10000

lapply(1:n, function(i) {
    out <- paste(sample(key.syl[[1]], 30, T), collapse = " ")
    cat(out, file=file.path(fn, sprintf("tweet%s.txt", i)))
})

filename <- sprintf("tweet%s.txt", 1:n)

The qdap approach

tic <- Sys.time() ## time it

dat <- list2df(setNames(lapply(filename, function(x){
    readLines(file.path(fn, x))
}), tools::file_path_sans_ext(filename)), "text", "tweet")

difftime(Sys.time(), tic) ## time to read in

the_wfm <- with(dat, wfm(text, tweet))

difftime(Sys.time(), tic)  ## time to make wfm

Timing qdap approach

> tic <- Sys.time() ## time it
> 
> dat <- list2df(setNames(lapply(filename, function(x){
+     readLines(file.path(fn, x))
+ }), tools::file_path_sans_ext(filename)), "text", "tweet")
There were 50 or more warnings (use warnings() to see the first 50)
> 
> difftime(Sys.time(), tic) ## time to read in
Time difference of 2.97617 secs
> 
> the_wfm <- with(dat, wfm(text, tweet))
> 
> difftime(Sys.time(), tic)  ## time to make wfm
Time difference of 48.9238 secs

The qdap-tm combined approach

tic <- Sys.time() ## time it

dat <- list2df(setNames(lapply(filename, function(x){
    readLines(file.path(fn, x))
}), tools::file_path_sans_ext(filename)), "text", "tweet")

difftime(Sys.time(), tic) ## time to read in


tweet_corpus <- with(dat, as.Corpus(text, tweet))

tdm <- tm::TermDocumentMatrix(tweet_corpus,
    control = list(removePunctuation = TRUE,
    stopwords = FALSE))

difftime(Sys.time(), tic)  ## time to make TermDocumentMatrix

Timing qdap-tm combined approach

> tic <- Sys.time() ## time it
> 
> dat <- list2df(setNames(lapply(filename, function(x){
+     readLines(file.path(fn, x))
+ }), tools::file_path_sans_ext(filename)), "text", "tweet")
There were 50 or more warnings (use warnings() to see the first 50)
> 
> difftime(Sys.time(), tic) ## time to read in
Time difference of 3.108177 secs
> 
> 
> tweet_corpus <- with(dat, as.Corpus(text, tweet))
> 
> tdm <- tm::TermDocumentMatrix(tweet_corpus,
+     control = list(removePunctuation = TRUE,
+     stopwords = FALSE))
> 
> difftime(Sys.time(), tic)  ## time to make TermDocumentMatrix
Time difference of 13.52377 secs

There is a qdap-tm Package Compatibility (-CLICK HERE-) to help users move between qdap and tm. As you can see on 10000 tweets the combined approach is ~3.5 x faster. A purely tm approach may be faster still. Also if you want the wfm use as.wfm(tdm) to coerce the TermDocumentMatrix.

Your code though is slower either way because it's not the R way to do things. I'd recommend reading some additional info on R to get better at writing faster code. I'm currently working through Hadley Wickham's Advanced R that I'd recommend.

Note that as of **qdap** version >= 2.1.1 `wfm` incorporates more **tm** package as a back end. This speeds up `wfm` greatly. — Tyler Rinker, Jun 23 '14 at 21:45

(Fast) word frequency matrix in R

1 Answers1