Each cell in my columns holds a list of values, and the lists differ in length from cell to cell. I want a similarity score for every pair of rows. How can I accomplish this? Right now, I'm thinking:
Step 1: Find all the unique values in a specific column.
For example, a column with 100 unique values (arity = 100).
Step 2: For each cell, create a small data frame whose row names are
all the unique values (nrow = arity = 100).
Set a row to 1 if its value appears in that specific cell, and to 0 otherwise.
Step 3: Calculate the cosine similarity.
For example, my data looks like this. All the unique values are [a,b,c,d]:
var_1
[a,b]
[b,c,d]
[a]
..... (> 10,000 rows)
For Step 2, I will ultimately change each cell to:
var_1
[1,1,0,0] <- in the order [a,b,c,d]; the 1st row has "a" and "b"
[0,1,1,1]
[1,0,0,0]
....
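To make Step 2 concrete, here is a minimal sketch on the three example rows. It assumes each cell is stored as a comma-separated string such as "a,b" (my real storage format may differ), and the names `cells`, `lev`, and `ind` are only for illustration:

```r
# Example cells from above, assumed to be comma-separated strings
var_1 <- c("a,b", "b,c,d", "a")

# Split every cell into its individual values, dropping spaces
cells <- strsplit(gsub(" ", "", var_1, fixed = TRUE), ",", fixed = TRUE)

# Step 1: all unique values in the column, in a fixed order
lev <- sort(unique(unlist(cells)))

# Step 2: one row per cell, one column per unique value;
# an entry is 1 if the value appears in that cell and 0 otherwise
ind <- do.call(rbind, lapply(cells, function(x) as.integer(lev %in% x)))
colnames(ind) <- lev
ind
```

This stacks all cells into a single rows-by-arity matrix rather than one small data frame per cell, which is the shape Step 3 needs.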
For Step 3, based on the result of Step 2, I can calculate the cosine similarity of each pair of rows, assuming every cell is now a vector of 0s and 1s. However, I need to do this for every pair of rows in each column. The resulting cosine similarity table should look something like this:
| row_1 | row_2 | row_3
row_1 | 1 | (r1) | (r2)
row_2 | (r1) | 1 | (r3)
row_3 | (r2) | (r3) | 1
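For the three example rows, the table I am after can be sketched as follows. This is only a small base R illustration under the assumption that the Step 2 result is available as a numeric matrix (called `ind` here for illustration); I am not sure it is the right approach for my full data:

```r
# Indicator rows from Step 2, in the order [a, b, c, d]
ind <- rbind(row_1 = c(1, 1, 0, 0),
             row_2 = c(0, 1, 1, 1),
             row_3 = c(1, 0, 0, 0))

# Dot products of every pair of rows in one matrix multiplication
dots <- tcrossprod(ind)

# Euclidean norm of each row (square root of the diagonal of dots)
norms <- sqrt(diag(dots))

# Cosine similarity: dot product divided by the product of the two norms
cos_sim <- dots / outer(norms, norms)
round(cos_sim, 3)
```

Two caveats with this sketch: a cell with no values gives a norm of 0 and therefore NaN entries, and a dense table for 10,000 rows has 10^8 entries (about 800 MB as doubles) per column.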
Is there a fast way to iterate over all the rows and calculate the cosine similarity of each pair?
Thank you so much! Right now, my code looks like the following. I have already collected all the unique values for each column and stored them in a list named "unique". So far it does not give me a usable result. Is there a better way to do this? My data set is quite large.
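myfunction <- function(curr, unique) {
  # Parse the cell itself (not the unique values) into its individual values
  curr.m <- gsub(" ", "", as.character(curr), fixed = TRUE)
  curr.m <- unlist(strsplit(curr.m, ",", fixed = TRUE))
  curr.m <- curr.m[curr.m != ""]
  # Build a 1 x arity row of zeros with one named column per unique value
  # (a separate object, so the cell in curr is not overwritten)
  out <- matrix(0L, nrow = 1, ncol = length(unique),
                dimnames = list(NULL, as.character(unique)))
  # Mark the values present in this cell
  out[, curr.m] <- 1L
  # Return the indicator row explicitly
  out
}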
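# Keep one result per column instead of overwriting `new` on every iteration
new <- vector("list", length(unique))
for (c in seq_along(unique)) {
  curr <- all[, c]
  curr.u <- unique[[c]]
  new[[c]] <- lapply(curr, myfunction, unique = curr.u)
}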