3

I am generating a matrix using the lsa package in R. After the matrix is created, I would like to calculate the cosine similarity between specific pairs of documents (columns) in the matrix.

Currently, I am doing this with nested for-loops, and it is monstrously slow. In the code below, there are 150 sourceIDs and 6413 targetIDs, for a total of 961.950 comparisons. After an hour and a half on my number-crunching machine, it has only gotten through ~300k of them.

For more info, sourceIDs and targetIDs are vectors of column names, loaded in from two files containing those names. I want to apply the cosine function between all of the source->target pairs. The columns are indexed by the document name, which is a string.

I am sure there is a much faster way of doing this with apply, but I just cant wrap my head around it.

library(lsa)

# tf function
real_tf <- function(m)
{
    return (sweep(m, MARGIN=2, apply(m, 2, max), "/"))
}

#idf function
real_idf <- function(m)
{
    df = rowSums(lw_bintf(m), na.rm=TRUE)
    return (log(ncol(m)/df))
}

#load corpus
lsa.documents <- textmatrix(args[1], minWordLength=1, minDocFreq=0)

# compute tf-idf
lsa.weighted_documents <- real_tf(lsa.documents) * real_idf(lsa.documents)

# compute svd
lsa.nspace <- lsa(lsa.weighted_documents, dims = as.integer(args[5]))
lsa.matrix <- diag(lsa.nspace$sk) %*% t(lsa.nspace$dk)

# compute similarities
lsa.sourceIDs <- scan(args[2], what = character())
lsa.targetIDs <- scan(args[3], what = character())
lsa.similarities <- data.frame(SourceID=character(), TargetID=character(), Score=numeric(), stringsAsFactors=FALSE)
k <- 1
for (i in lsa.sourceIDs)
{
    for (j in lsa.targetIDs)
    {
        lsa.similarities[k,] <- c(i, j, cosine(lsa.matrix[,i], lsa.matrix[,j]))
        k <- k + 1
    }
}
lsa.ranklist <- lsa.similarities[order(lsa.similarities$Score, decreasing=TRUE),]

# save ranklist
write.table(lsa.ranklist, args[4], sep="\t", quote=FALSE, col.names=FALSE, row.names=FALSE)

Edit: Reproducible example

# cosine function from lsa package
cosine <- function( x, y )
{
    return ( crossprod(x,y) / sqrt( crossprod(x)*crossprod(y) ) )
}

theMatrix <- structure(c(-0.0264639232505822, -0.0141165039351167, -0.0280459775632757, 
-0.041211247161448, -0.00331565717239375, -0.0291161345945683, 
-0.0451167802746869, -0.0116214407383401, -0.0381080747718958, 
-1.36693644389599, 0.274747343110076, 0.128100677705483, -0.401760905661056, 
-1.24876927957167, 0.368479552862631, -0.459711112157286, -0.544344448332346, 
-0.765378939625159, -1.28612431910459, 0.293455499695499, 0.025167452173962
), .Dim = c(3L, 7L), .Dimnames = list(NULL, c("doc1", "doc2", "doc3", 
"doc4", "doc5", "doc6", "doc7")))

sources <- c("doc1", "doc2", "doc3")
targets <- c("doc4", "doc5", "doc6", "doc7")

similarities <- data.frame(SourceID=character(), TargetID=character(), Score=numeric(), stringsAsFactors=FALSE)
k <- 1

for (i in sources)
{
    for (j in targets)
    {
        similarities[k,] <- c(i, j, cosine(theMatrix[,i], theMatrix[,j]))
        k <- k + 1
    }
}

ranklist <- similarities[order(similarities$Score, decreasing=TRUE),]
write.table(ranklist, "C:\\Temp\\outputfile.txt", sep="\t", quote=FALSE, col.names=FALSE, row.names=FALSE)

Which produces (outputfile.txt):

doc1    doc6    0.962195242094352
doc3    doc6    0.893461576046585
doc2    doc6    0.813856201398669
doc2    doc7    0.768837903803964
doc2    doc4    0.730093288388069
doc3    doc7    0.675640649189972
doc3    doc4    0.635982900340315
doc1    doc7    0.53871688669971
doc1    doc4    0.499235059782688
doc1    doc5    0.320383772495164
doc3    doc5    0.226751624753921
doc2    doc5    0.144680489733846
E. Moritz
  • 51
  • 1
  • 6
  • 1
    It would be easier if you were to provide a more reproducible example, as even after installing `lsa` I apparently need to install something else (Java?). I'd just give a very basic sample data frame with `dput` and the result expected. I imagine the lsa package itself is irrelevant to the problem of matrix population. – Maxim.K Apr 21 '13 at 12:36
  • From what I could understand in your code, you have two vectors of equal length containing some values. As output you want to have the cosine values for all combinations of the elements of these initial vectors. If this is correct, than `outer()` will probably help you. – Maxim.K Apr 21 '13 at 12:49
  • @Maxim, if that's the answer then the OP is really bad at jeopardy! – flodel Apr 21 '13 at 12:51
  • I've added a reproducible example to the original post. – E. Moritz Apr 21 '13 at 13:02
  • It is slow because you did not pre-allocate `lsa.similarities`. – flodel Apr 21 '13 at 13:16

1 Answers1

5

Ok, thanks for the reproducible example. Here is a possible solution. Let's first split your theMatrix into source and target matrices. We do not need to use the names here, as we will not use loops:

matrix1 <- theMatrix[,1:3]
matrix2 <- theMatrix[,4:7]

Then we will create a function to loop through every column of matrix2, keeping a single column from matrix1 constant:

cycleM2 <- function(x) {
    # x is a vector from matrix1 
    apply(matrix2,2,cosine,x)
}

Finally, we will supply this function to every column of matrix1:

(mydata <- apply(matrix1,2,cycleM2))

#      doc1      doc2      doc3
# doc4 0.4992351 0.7300933 0.6359829
# doc5 0.3203838 0.1446805 0.2267516
# doc6 0.9621952 0.8138562 0.8934616
# doc7 0.5387169 0.7688379 0.6756406

Finally, if you really need your original data format:

require(reshape2)
melt(mydata)

This should speed up your code nicely. Also, as @flodel has noticed, when you use loops, pre-allocate your (empty) target object in memory, filling it e.g. with NA. Memory allocations are the most costly in terms of time, and that is why your original loop was so slow.

EDIT:

A better form using pure function would probably be:

pairwiseCosine <- function(matrix1,matrix2) {
    apply(matrix1,2,function(x){
        apply(matrix2,2,cosine,x)
    })
}

pairwiseCosine(theMatrix[,1:3],theMatrix[,4:7])
Maxim.K
  • 4,120
  • 1
  • 26
  • 43