Consider the following list:

library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
a <- findAssocs(tdm, c("oil", "opec", "xyz"), c(0.7, 0.75, 0.1))

How can I build a data frame that has all the terms associated with these 3 words as columns, showing:

  1. The corresponding correlation coefficient (if it exists)
  2. NA if it does not exist for this word (for example, the pair (oil, they) would show NA)
Steven Beaupré

2 Answers

Here's a solution that uses reshape2 to reshape the data:

library(reshape2)
aa<-do.call(rbind, Map(function(d, n) 
    cbind.data.frame(
      xterm=if (length(d)>0) names(d) else NA, 
      cor=if(length(d)>0) d else NA, 
      term=n),
    a, names(a))
)

dcast(aa, term~xterm, value.var="cor")
MrFlick
  • I'm a bit confused as to why `a <- findAssocs(tdm, c("oil", "opec", "xyz"), c(0.7, 0.75, 0.1))` works and `a <- findAssocs(tdm, c("oil", "opec", "xyz"), 0.7)` works too, but `a <- findAssocs(tdm, tdm$dimnames$Terms, 0.7)` does not. After all, isn't `tdm$dimnames$Terms` a vector of the same form as `c("")`? – Steven Beaupré Sep 24 '14 at 21:36
  • It doesn't work because `findAssocs(tdm, c("oil", "opec", "xyz"), 0.7)` compares the terms `c("oil", "opec", "xyz")` to all the *remaining* terms in the corpus. It does not compare "oil" to "opec", for example, and it does not calculate the pairwise correlations of the terms in the vector you pass in. So when you pass in every term in the corpus, there is nothing left to compare them to, and you get no results. – MrFlick Sep 24 '14 at 21:40
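
The behaviour described in this comment can be illustrated with plain `cor()` on a toy matrix. This is only a sketch of the idea, not the tm implementation: the matrix `m`, its counts, and the names `query`, `partial`, and `none_left` are made up for illustration.

```r
# Toy document-term counts: rows are documents, columns are terms (made-up data).
m <- matrix(c(1, 0, 2, 3,
              2, 1, 4, 1,
              0, 3, 1, 2,
              3, 2, 6, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4),
                            c("oil", "opec", "price", "said")))

# Query terms are correlated against the remaining columns only,
# never against each other.
query <- c("oil", "opec")
j <- match(query, colnames(m))
partial <- cor(m[, j, drop = FALSE], m[, -j, drop = FALSE])
dim(partial)        # 2 query terms by 2 remaining terms
colnames(partial)   # "price" "said"

# Querying every term leaves no remaining columns to compare against.
all_j <- seq_len(ncol(m))
none_left <- m[, -all_j, drop = FALSE]
ncol(none_left)     # 0
```

Looping over the terms one at a time avoids the problem, because each single query still has every other term left to compare against.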
Or you could use dplyr and tidyr:

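 library(dplyr)
 # tidyr is installed from GitHub here
 library('devtools')
 install_github('hadley/tidyr')

 library(tidyr)

 # Build one data frame per query term and stack them with a `term` id column
 a1 <- unnest(lapply(a, function(x) data.frame(xterm=names(x),
                cor=x, stringsAsFactors=FALSE)), term)

 # Spread the associated terms into columns
 a1 %>%
   spread(xterm, cor) # here it removed terms without any `cor` for the `xterm`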
  #  term 15.8 ability above agreement analysts buyers clearly emergency fixed
  #1  oil 0.87      NA  0.76      0.71     0.79   0.70     0.8      0.75  0.73
  #2 opec 0.85     0.8  0.82      0.76     0.85   0.83      NA      0.87    NA
  #  late market meeting prices prices. said that they trying who winter
  #1  0.8   0.75    0.77   0.72      NA 0.78 0.73   NA    0.8 0.8    0.8
  #2   NA     NA    0.88     NA    0.79 0.82   NA  0.8     NA  NA     NA
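
With current releases of dplyr and tidyr, the same reshaping can also be sketched with `bind_rows()` and `pivot_wider()`. This assumes `a` is a named list of named numeric vectors, as in the question. The list below is a small stand-in with made-up values so that the snippet runs without tm.

```r
library(dplyr)
library(tidyr)

# Stand-in for the list returned by findAssocs(); the values are made up.
a <- list(
  oil  = c(above = 0.76, meeting = 0.77),
  opec = c(meeting = 0.88, they = 0.80),
  xyz  = numeric(0)
)

# Long format: one row per (query term, associated term) pair.
# as.character() turns the NULL names of an empty vector into character(0).
long <- bind_rows(
  lapply(a, function(x) tibble(xterm = as.character(names(x)),
                               cor = unname(x))),
  .id = "term"
)

# Wide format: one column per associated term, NA where no association exists.
wide <- pivot_wider(long, names_from = xterm, values_from = cor)
wide
```

As with `spread()` above, a query term with no associations (`xyz` here) contributes no rows and so does not appear in the result.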

Update

 # Query every term in the matrix individually at a high correlation limit
 aNew <- sapply(tdm$dimnames$Terms, function(i) findAssocs(tdm, i, corlimit=0.95))
 # Keep only the results that have a dim attribute
 aNew2 <- aNew[!!sapply(aNew, function(x) length(dim(x)))]
 # Take the first three rows of each result and stack them with a `term` id column
 aNew3 <- unnest(lapply(aNew2, function(x) data.frame(xterm=rownames(x),
                     cor=x[,1], stringsAsFactors=FALSE)[1:3,]), term)
 res <- aNew3 %>%
   spread(xterm, cor)

 dim(res)
 #[1] 1021  160

 res[1:3,1:5]
 #     term ... 100,000 10.8 1.1
 #1     ...  NA      NA   NA  NA
 #2 100,000  NA      NA   NA   1
 #3    10.8  NA      NA   NA  NA
akrun