I am trying to calculate cosine similarity using the stringdist function from the stringdist package in R. For each row in scoring_dt, I want the average cosine similarity: I compute the cosine similarity against every row of baseline_dt and take the mean of those values. The code below gives me the results I expect, but the nested for loop is very slow on a large dataset, so I am looking for a more efficient approach.

    baseline_dt <- read.table(text="id Product.Group.Code   R1   R2   R3   R4   S1   S2   S3   U1   U2   U3 U4 U6
    91  65418                164 0.68 0.70 0.50 0.59   NA   NA 0.96   NA 0.68   NA NA NA
    93  57142                164   NA 0.94   NA   NA 0.83   NA   NA 0.54   NA   NA NA NA
    99  66740                164 0.68 0.68 0.74   NA 0.63 0.68 0.72   NA   NA   NA NA NA
    100 76712                164 0.54 0.54 0.40   NA 0.39 0.39 0.39 0.50   NA 0.50 NA NA
    101 56463                164 0.67 0.67 0.76   NA   NA 0.76 0.76 0.54   NA   NA NA NA
    125 11713                164   NA   NA   NA   NA   NA 0.88   NA   NA   NA   NA NA NA",header=TRUE)

    scoring_dt <- read.table(text="id Product.Group.Code   R1   R2   R3   R4   S1   S2   S3   U1   U2   U3 U4 U6
    11  999                  164 0.68 0.70 0.50 0.59  0.7   NA 0.96   NA 0.68   NA NA NA
    22  555                  164    0 0.94    0   NA 0.83  0.6   NA 0.54   NA   NA NA NA",header=TRUE)

Please find R code below.

    # `method` was undefined in my original snippet; "cosine" is assumed here
    method <- "cosine"
    predictor <- c("R1", "R2", "R3", "R4", "S1", "S2", "S3", "U1", "U2", "U3", "U4", "U6")
    id <- "id"

    baseline_dt <- data.table::setDT(baseline_dt)
    scoring_dt  <- data.table::setDT(scoring_dt)

    # dc holds the pairwise scores for the current scoring row; dt holds the averages
    dc <- setNames(data.frame(matrix(ncol = 3, nrow = 0)), c("baseline_id", "scoring_id", "cosine_score"))
    dt <- setNames(data.frame(matrix(ncol = 2, nrow = 0)), c("scoring_id", "Avg_cosine_score"))

    for (i in seq_len(nrow(scoring_dt))) {
      for (j in seq_len(nrow(baseline_dt))) {
        dc[j, 1] <- baseline_dt[[id]][j]
        dc[j, 2] <- scoring_dt[[id]][i]
        # Element-wise string distance between the predictor values of the two rows
        cos <- stringdist::stringdist(
          as.character(baseline_dt[, predictor, with = FALSE][j, ]),
          as.character(scoring_dt[, predictor, with = FALSE][i, ]),
          method = method, nthread = 8)
        cos[is.na(cos)] <- 0         # treat missing distances as 0
        dc[j, 3] <- 1 - mean(cos)    # similarity = 1 - mean distance
      }
      dt[i, 1] <- scoring_dt[[id]][i]
      dt[i, 2] <- mean(dc[, 3])      # average over all baseline rows
    }

    View(dt)

How can I make this code more efficient? I have tried parallel loops with foreach, but they did not make the code noticeably faster.

**Note:** My data is mixed, with character values as well as binary (0 and 1) values, which is why I am using the stringdist function. I can't use the cosine function from the lsa package.
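For context, here is a minimal base R sketch of the string cosine measure as I understand it: with `method = "cosine"` and the default `q = 1`, stringdist compares the character-count profiles of the two strings and returns 1 minus their cosine similarity. The helper name `qgram_cosine_sim` is hypothetical and only illustrates the idea; it is not part of the stringdist package and is not meant as a replacement for it.

```r
# Hypothetical helper: cosine similarity between two strings based on
# their single-character (q = 1) count profiles. Base R only.
qgram_cosine_sim <- function(a, b) {
  ca <- strsplit(a, "")[[1]]          # characters of the first string
  cb <- strsplit(b, "")[[1]]          # characters of the second string
  alphabet <- union(ca, cb)           # every character seen in either string
  # Count how often each character of the alphabet occurs in each string
  va <- tabulate(match(ca, alphabet), nbins = length(alphabet))
  vb <- tabulate(match(cb, alphabet), nbins = length(alphabet))
  sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))
}

qgram_cosine_sim("0.68", "0.68")  # identical strings give 1
qgram_cosine_sim("0.68", "0.70")  # 3 / sqrt(24), about 0.612
```

This shows why two values are compared as strings rather than as numbers: "0.68" and "0.70" are scored by the characters they share, not by their numeric difference. The sketch does not handle empty strings or NA.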

Rushabh Patel