Calculate similarity score for cells with different dimensions in R

Question

Each cell in my columns holds a list of values, and the lists differ in length from cell to cell. I want a similarity score for each pair of rows. How can I accomplish this? Right now, I'm thinking:

Step 1: Find all the unique values in a specific column.
   For example, a column might have 100 unique values (arity = 100).

Step 2: For each cell, create a small data frame whose
   row names are all the unique values (nrow = arity = 100).
   Set a row to 1 if its value appears in that cell, and to 0 otherwise.

Step 3: Calculate the cosine similarity for each pair of rows.

For example, my data looks like this. All the unique values are [a,b,c,d]:

    var_1     
    [a,b] 
    [b,c,d] 
    [a] 
    ..... (> 10,000 rows)

For Step 2, I will ultimately change the cell to:

    var_1     
    [1,1,0,0] <- in the order [a,b,c,d]; the 1st row has "a" and "b"
    [0,1,1,1] 
    [1,0,0,0] 
    ....
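
The encoding in Step 2 could be sketched in base R roughly as follows. This is only an illustration, assuming each cell is stored as a string such as "[a,b]"; the names `cells`, `parse_cell`, `uniq`, and `indicator` are hypothetical and do not come from my real code.

```r
# Example column; assumes each cell is a string like "[a,b]".
cells <- c("[a,b]", "[b,c,d]", "[a]")

# Split one cell into its values, dropping brackets and whitespace.
parse_cell <- function(x) {
  x <- gsub("\\[|\\]|\\s", "", x)
  vals <- strsplit(x, ",", fixed = TRUE)[[1]]
  vals[vals != ""]
}

values <- lapply(cells, parse_cell)
uniq <- sort(unique(unlist(values)))   # Step 1: all unique values

# Step 2: one row per cell, one column per unique value;
# an entry is 1 if the value appears in that cell, 0 otherwise.
indicator <- t(vapply(values,
                      function(v) as.integer(uniq %in% v),
                      integer(length(uniq))))
colnames(indicator) <- uniq
indicator
```

Storing all cells in a single matrix, rather than one small data frame per cell, keeps the data in a shape that matrix operations can work on directly.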

For Step 3, based on the result of Step 2, I can calculate the cosine similarity of each pair of rows, assuming that every cell is now a vector of 0s and 1s. However, I need to do this for every pair of rows in each column. The resulting cosine similarity table should look something like:

           | row_1  | row_2 | row_3  
    row_1  |   1    |  (r1) |  (r2)
    row_2  |  (r1)  |   1   |  (r3)
    row_3  |  (r2)  |  (r3) |    1
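
One possible way to fill such a table without looping over pairs is plain matrix algebra. The sketch below assumes the 0/1 rows from the Step 2 example are already stacked in a numeric matrix; `X`, `dots`, `norms`, and `cos_sim` are illustrative names only.

```r
# Indicator matrix from the Step 2 example: rows are cells, columns are a, b, c, d.
X <- rbind(row_1 = c(1, 1, 0, 0),
           row_2 = c(0, 1, 1, 1),
           row_3 = c(1, 0, 0, 0))

# Dot products of every pair of rows in one matrix multiplication.
dots <- tcrossprod(X)              # equivalent to X %*% t(X)

# Euclidean length of each row.
norms <- sqrt(rowSums(X^2))

# Cosine similarity: dot product divided by the product of the two lengths.
cos_sim <- dots / outer(norms, norms)
round(cos_sim, 3)
```

A row containing only zeros has length 0 and would produce NaN here, so empty cells would need separate handling. The full table also grows quadratically: with 10,000 rows it has 10^8 entries, roughly 800 MB as doubles, so for much larger data a sparse or blockwise approach may be needed.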

Is there a fast way to iterate over all the rows and calculate the cosine similarity of each pair?

Thank you so much! Right now, my code looks like the following. I have already collected all the unique values for each column in a list named "unique". However, the code does not give me a result. Is there a better way to do this? My data set is quite large.

    myfunction <- function(curr, unique) {
      arity <- length(unique)
      curr <- matrix()
      length(curr) <- arity
      dim(curr) <- c(1, arity)
      colnames(curr) <- unique
      curr.m <- gsub(" ", "", as.character(unique), fixed = TRUE)
      curr.m <- unlist(strsplit(curr.m, ",", fixed = TRUE))
      curr.m <- curr.m[curr.m != ""]
      curr[] <- 0L
      curr[, curr.m] = 1
    }

    for(c in seq_len(length(unique))) {
      curr <- all[,c]
      curr.u <- unique[[c]]
      new <- lapply(curr, myfunction, unique = curr.u)
    }
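
A few things in the code above look like they could explain the missing result: `curr <- matrix()` overwrites the cell that was passed in, `gsub()` is applied to `unique` rather than to the cell, the function ends on an assignment and so never returns the matrix, and the loop overwrites `new` on every iteration. A hedged sketch of one way to address these, assuming cells are comma-separated strings such as "a,b" (the names `dat`, `uniq`, `encode_cell`, and `encoded` are hypothetical stand-ins):

```r
# Hypothetical stand-ins for the objects in the question: `dat` plays the role
# of `all`, and `uniq` the role of `unique` (one vector of values per column).
dat  <- data.frame(var_1 = c("a,b", "b,c,d", "a"), stringsAsFactors = FALSE)
uniq <- list(var_1 = c("a", "b", "c", "d"))

# Turn one cell into a named 0/1 vector over the column's unique values.
encode_cell <- function(cell, uniq_vals) {
  vals <- gsub(" ", "", as.character(cell), fixed = TRUE)
  vals <- unlist(strsplit(vals, ",", fixed = TRUE))
  vals <- vals[vals != ""]
  out <- integer(length(uniq_vals))
  names(out) <- uniq_vals
  out[intersect(vals, uniq_vals)] <- 1L
  out                                   # return the vector explicitly
}

# Keep one indicator matrix per column instead of overwriting the result.
encoded <- vector("list", length(uniq))
names(encoded) <- names(uniq)
for (j in seq_along(uniq)) {
  rows <- lapply(dat[[j]], encode_cell, uniq_vals = uniq[[j]])
  encoded[[j]] <- do.call(rbind, rows)  # rows = cells, columns = unique values
}
encoded$var_1
```

Avoiding `unique`, `all`, and `c` as variable names also prevents shadowing the base R functions of the same names.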

A small example data set and the desired answer would help people understand what you want and provide possible solutions. — Mark Miller, Nov 28 '15 at 09:39
@MarkMiller, Thank you for your comment! I've updated the post. Does it make more sense? — Wenkai Ying, Nov 28 '15 at 09:55
