tm package: Output of findAssocs() in a matrix instead of a list in R

Question

Consider the following list:

library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
a <- findAssocs(tdm, c("oil", "opec", "xyz"), c(0.7, 0.75, 0.1))

How can I get a data frame with all the terms associated with these 3 words as columns, showing:

The corresponding correlation coefficient (if it exists)
NA if it does not exist for this word (for example, the pair (oil, they) would show NA)

Updated the solution based on your new `a`. – akrun Sep 24 '14 at 08:05

score 2 · Accepted Answer · answered Sep 24 '14 at 04:27

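Here's a solution using reshape2 to help reshape the data:

library(reshape2)

# Stack one (xterm, cor, term) block per searched term; a term with no
# associations (e.g. "xyz") contributes a single all-NA row
aa <- do.call(rbind, Map(function(d, n)
    cbind.data.frame(
      xterm = if (length(d) > 0) names(d) else NA,
      cor = if (length(d) > 0) d else NA,
      term = n),
    a, names(a))
)

# Cast to wide form: one row per searched term, one column per associated term
dcast(aa, term ~ xterm, value.var = "cor")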
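
If an actual matrix (as in the title) is preferred, a base R sketch is also possible. It assumes `findAssocs()` returns a list of named numeric vectors, which is what the code above relies on. The `a` here is a hand-built stand-in with illustrative values, so the snippet runs without tm; `mat` and `all_terms` are names made up for this example.

```r
# Stand-in for the findAssocs() result (illustrative values, not real output)
a <- list(oil  = c(above = 0.76, analysts = 0.79),
          opec = c(analysts = 0.85, they = 0.80),
          xyz  = numeric(0))

# Every associated term that appears for at least one searched word
all_terms <- sort(unique(unlist(lapply(a, names))))

# Start from an all-NA matrix: one row per searched word, one column per term
mat <- matrix(NA_real_, nrow = length(a), ncol = length(all_terms),
              dimnames = list(names(a), all_terms))

# Fill in the correlations that exist; everything else stays NA
for (term in names(a)) {
  d <- a[[term]]
  if (length(d) > 0) mat[term, names(d)] <- d
}

mat
```

Unlike the `dcast()` result, a word with no associations (here "xyz") keeps its own all-NA row and no extra `NA` column is created.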

MrFlick

I'm a bit confused as to why `a <- findAssocs(tdm, c("oil", "opec", "xyz"), c(0.7, 0.75, 0.1))` works and `a <- findAssocs(tdm, c("oil", "opec", "xyz"), 0.7)` works too, but `a <- findAssocs(tdm, tdm$dimnames$Terms, 0.7)` does not. After all, isn't `tdm$dimnames$Terms` a vector of the same form as `c("")`? – Steven Beaupré Sep 24 '14 at 21:36

It doesn't work because `findAssocs(tdm, c("oil", "opec", "xyz"), 0.7)` compares the terms `c("oil", "opec", "xyz")` to all the *remaining* terms in the corpus. It does not compare "oil" to "opec", for example; it does not calculate the pairwise correlations of the terms in the vector you pass in. So when you pass in every term in the corpus, there's nothing left to compare them to, and you get no results. – MrFlick Sep 24 '14 at 21:40
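
The "requested terms versus remaining terms" behaviour described in this comment can be mimicked on a toy matrix in base R. This is only a sketch of the idea, not tm's actual implementation: the matrix values are made up, and `assoc_vs_rest()` is a hypothetical helper written for this example.

```r
# Toy term-document matrix (rows = terms, columns = documents); values are made up
m <- matrix(c(1, 2, 3, 4,
              2, 4, 6, 8,
              4, 3, 2, 1,
              1, 0, 1, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("oil", "opec", "price", "said"),
                            paste0("doc", 1:4)))

# Hypothetical helper: correlate the requested terms with the remaining terms only
assoc_vs_rest <- function(m, terms) {
  rest <- setdiff(rownames(m), terms)
  if (length(rest) == 0) {
    # Every term was requested, so there is nothing left to compare against
    return(matrix(numeric(0), nrow = length(terms), ncol = 0,
                  dimnames = list(terms, NULL)))
  }
  cor(t(m[terms, , drop = FALSE]), t(m[rest, , drop = FALSE]))
}

some <- assoc_vs_rest(m, c("oil", "opec"))  # 2 x 2: oil/opec against price/said
none <- assoc_vs_rest(m, rownames(m))       # 4 x 0: no remaining terms

# If all pairwise correlations are what is wanted, cor() on the transposed
# dense matrix gives them directly
all_pairs <- cor(t(m))
```

With a real `tdm`, the same `cor(t(...))` idea would need `as.matrix(tdm)` first, which can be memory-hungry for a large corpus.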

akrun · Answer 2 · 2014-09-24T08:05:20.573

Or you could use dplyr and tidyr

 library(dplyr)
 # tidyr is now on CRAN; when this was written it had to be installed with
 # devtools::install_github('hadley/tidyr')
 library(tidyr)

 a1 <- unnest(lapply(a, function(x) data.frame(xterm=names(x),
                cor=x, stringsAsFactors=FALSE)), term)


  a1 %>% 
     spread(xterm, cor) #here it removed terms without any `cor` for the `xterm`
  #  term 15.8 ability above agreement analysts buyers clearly emergency fixed
  #1  oil 0.87      NA  0.76      0.71     0.79   0.70     0.8      0.75  0.73
  #2 opec 0.85     0.8  0.82      0.76     0.85   0.83      NA      0.87    NA
  #  late market meeting prices prices. said that they trying who winter
  #1  0.8   0.75    0.77   0.72      NA 0.78 0.73   NA    0.8 0.8    0.8
  #2   NA     NA    0.88     NA    0.79 0.82   NA  0.8     NA  NA     NA
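
The `unnest(list, term)` and `spread()` calls above reflect the 2014 tidyr interface and may not run unchanged on current releases. A rough equivalent with `dplyr::bind_rows()` and `tidyr::pivot_wider()` (tidyr 1.0.0 or later) might look like the sketch below. The `a` is a hand-built stand-in with illustrative values so the snippet runs without tm, and `long` and `wide` are names chosen for this example.

```r
library(dplyr)
library(tidyr)

# Stand-in for the findAssocs() result (illustrative values, not real output)
a <- list(oil  = c(above = 0.76, analysts = 0.79),
          opec = c(analysts = 0.85, they = 0.80),
          xyz  = numeric(0))

# Long form: one row per (searched term, associated term) pair;
# .id = "term" stores the list names in a column called "term"
long <- bind_rows(
  lapply(a, function(x) data.frame(xterm = as.character(names(x)),
                                   cor = unname(x),
                                   stringsAsFactors = FALSE)),
  .id = "term"
)

# Wide form: one column per associated term, NA where no correlation was returned
wide <- pivot_wider(long, names_from = xterm, values_from = cor)
wide
```

As with `spread()`, a searched term that has no associations at all ("xyz" here) contributes no rows and so does not appear in `wide`.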

Update

 aNew <- sapply(tdm$dimnames$Terms, function(i) findAssocs(tdm, i, corlimit=0.95))
 aNew2 <- aNew[!!sapply(aNew, function(x) length(dim(x)))]
 aNew3 <- unnest(lapply(aNew2, function(x) data.frame(xterm=rownames(x), 
                     cor=x[,1], stringsAsFactors=FALSE)[1:3,]), term)
  res <- aNew3 %>% 
              spread(xterm, cor) 

  dim(res)
  #[1] 1021  160

   res[1:3,1:5]
    #     term ... 100,000 10.8 1.1
    #1     ...  NA      NA   NA  NA
    #2 100,000  NA      NA   NA   1
    #3    10.8  NA      NA   NA  NA

What if the `a` was: `a <- sapply(tdm$dimnames$Terms, function(i) findAssocs(tdm, i, corlimit=0.95))` — Steven Beaupré, Sep 24 '14 at 05:26
@Steven Beaupre The `a` is taken from your post `a <- findAssocs(tdm,...)` . Your new code `a <- sapply(...)` takes a long time to run. — akrun, Sep 24 '14 at 05:27
