Efficient way to calculate cosine similarity without a for loop

Question

I am trying to calculate cosine similarity using the stringdist function from the stringdist package in R. For each row in scoring_dt, I want the average cosine similarity: compute the cosine similarity against every row of baseline_dt, then take the mean of those values. The code below gives the results I expect, but the nested for loop is very slow on a large dataset, so I am looking for a more efficient approach.

 baseline_dt <- read.table(text="id Product.Group.Code   R1   R2   R3   R4   S1   S2   S3   U1   U2   U3 U4 U6
    91  65418                164 0.68 0.70 0.50 0.59   NA   NA 0.96   NA 0.68   NA NA NA
    93  57142                164   NA 0.94   NA   NA 0.83   NA   NA 0.54   NA   NA NA NA
    99  66740                164 0.68 0.68 0.74   NA 0.63 0.68 0.72   NA   NA   NA NA NA
    100 76712                164 0.54 0.54 0.40   NA 0.39 0.39 0.39 0.50   NA 0.50 NA NA
    101 56463                164 0.67 0.67 0.76   NA   NA 0.76 0.76 0.54   NA   NA NA NA
    125 11713                164   NA   NA   NA   NA   NA 0.88   NA   NA   NA   NA NA NA",header=TRUE)


 scoring_dt <- read.table(text="id Product.Group.Code   R1   R2   R3   R4   S1   S2   S3   U1   U2   U3 U4 U6
11  999                164 0.68 0.70 0.50 0.59   0.7   NA 0.96   NA 0.68   NA NA NA
22  555                164   0 0.94   0   NA 0.83   0.6   NA 0.54   NA   NA NA NA",header=TRUE)

Please find R code below.

    # Result containers: per-pair scores (dc) and per-scoring-id averages (dt)
    dc <- setNames(data.frame(matrix(ncol = 3, nrow = 0)), c("baseline_id", "scoring_id", "cosine_score"))
    dt <- setNames(data.frame(matrix(ncol = 2, nrow = 0)), c("scoring_id", "Avg_cosine_score"))
    predictor <- c("R1", "R2", "R3", "R4", "S1", "S2", "S3", "U1", "U2", "U3", "U4", "U6")

    id <- "id"
    method <- "cosine"  # distance passed to stringdist(); "cosine" assumed from the title
    baseline_dt <- data.table::setDT(baseline_dt)
    scoring_dt <- data.table::setDT(scoring_dt)

    for (i in seq_along(scoring_dt[[id]])) {

      for (j in seq_along(baseline_dt[[id]])) {

        dc[j, 1] <- baseline_dt[[id]][j]
        dc[j, 2] <- scoring_dt[[id]][i]
        # Element-wise distance between baseline row j and scoring row i
        cos <- stringdist::stringdist(as.character(baseline_dt[, predictor, with = FALSE][j, ]),
                                      as.character(scoring_dt[, predictor, with = FALSE][i, ]),
                                      method = method, nthread = 8)
        cos[is.na(cos)] <- 0
        dc[j, 3] <- 1 - mean(cos)
      }
      # Average over all baseline rows for this scoring id
      dt[i, 1] <- scoring_dt[[id]][i]
      dt[i, 2] <- mean(dc[, 3])
    }

    View(dt)

I would like to make this code more efficient. I have tried parallel loops with foreach, but nothing seems to speed it up.

Note: my data is mixed, with character as well as binary (0 and 1) values, which is why I am using the stringdist function. I can't use the cosine function from the lsa package.
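
One possible direction, as a minimal sketch rather than a tested drop-in replacement: the loop above compares rows predictor by predictor, so `stringdist::stringdistmatrix()` can compute every scoring-by-baseline distance for one predictor in a single call, leaving only a short loop over the predictor columns. The function name `avg_cosine_score` and the default `method = "cosine"` are assumptions made for illustration. Missing values are passed as `NA` and their distances set to 0, following the `cos[is.na(cos)] <- 0` line; the row-wise `as.character()` in the loop may encode missing values differently, so the two results are worth comparing on a small sample.

```r
library(stringdist)

# Hypothetical helper: average similarity of each scoring row against all baseline rows.
avg_cosine_score <- function(scoring, baseline, predictor, id = "id",
                             method = "cosine") {
  # Running sum of distances: one row per scoring id, one column per baseline id.
  total <- matrix(0, nrow = nrow(scoring), ncol = nrow(baseline))
  for (p in predictor) {
    # All pairwise distances for predictor p in one vectorised call.
    d <- stringdist::stringdistmatrix(as.character(scoring[[p]]),
                                      as.character(baseline[[p]]),
                                      method = method)
    d[is.na(d)] <- 0  # a distance involving a missing value counts as 0
    total <- total + d
  }
  # 1 - mean distance over predictors is the pairwise score;
  # rowMeans() then averages that score over the baseline rows.
  data.frame(scoring_id = scoring[[id]],
             Avg_cosine_score = rowMeans(1 - total / length(predictor)))
}
```

With the objects defined above, the call would be `avg_cosine_score(scoring_dt, baseline_dt, predictor)`. The work inside the function grows with the number of predictors instead of the number of row pairs, which is where the nested loop spends most of its time.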

Your for loops are indexed by `id`. Should this be `i` and `j` for the outer and inner loops respectively? — davechilders, Jan 26 '18 at 21:11
Because I want them to be based on id. I want each id from scoring_dt with its respective avg_cosine value. — Rushabh Patel, Jan 26 '18 at 21:14
I don't believe your example is reproducible. `id` is not an object. — davechilders, Jan 26 '18 at 21:21
@davechilders The code I have written produces exactly what I expect. I am looking for a more efficient way to do it. I don't think I understand what you mean. — Rushabh Patel, Jan 26 '18 at 21:24
@davechilders I am sorry for that. I forgot to mention id variable. Check for edits. — Rushabh Patel, Jan 26 '18 at 21:25
