
A similar question may already have been asked on this forum, but I think my requirement is unusual.

I have a data frame df1 with a variable "WrittenTerms" containing 40,000 observations, and another data frame df2 with a variable "suggestedterms" containing 17,000 observations.

I need to calculate the similarity between each "WrittenTerms" value and the "suggestedterms" values. I am using the stringdist package, but this approach takes a long time because of the number of observations.

df1$WrittenTerms

head pain

lung cancer

abdminal pain

df2$suggestedterms

cardio attack

breast cancer

abdomen pain

head ache

lung cancer

I need to get output like the following:

df1$WrittenTerms    df2$suggestedterms    Similarity_percentage
head pain           head ache             50%
lung cancer         lung cancer           100%
abdminal pain       abdomen pain          80%

I wrote the code below to meet this requirement, but it takes a long time because it uses a for loop. Is there a way to find the similarity using TF-IDF or any other approach that takes less time?

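library(stringdist)

df_list <- data.frame(check.names = FALSE) # empty data frame to collect the best matches

# For each written term, keep the most similar suggested term.
for (i in df1$WrittenTerms) {
  cand <- df2 # work on a copy so df2 is not overwritten between iterations
  cand$oldsim <- stringdist(i, cand$suggestedterms, method = "lv")
  # Turn the Levenshtein distance into a similarity score
  cand$oldsim <- 1 - cand$oldsim / nchar(as.character(cand$suggestedterms))
  cand <- head(cand[order(cand$oldsim, decreasing = TRUE), ], 1)
  df_list <- rbind(df_list, cand)
}

df1 <- cbind(df1, df_list)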
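
One possible way to avoid the per-term loop, offered as a sketch under assumptions rather than a definitive answer: `stringdistmatrix()` from the same stringdist package computes all pairwise distances in one call, and processing the written terms in chunks keeps each distance matrix small. The helper name `best_match` and the `chunk_size` default are made up for illustration; the similarity formula is the one from the loop above.

```r
library(stringdist)

# Hypothetical helper: best suggested term for each written term.
best_match <- function(written, suggested, chunk_size = 1000) {
  written   <- as.character(written)
  suggested <- as.character(suggested)
  len_sugg  <- nchar(suggested)

  # Split the row indices of `written` into chunks of at most chunk_size.
  chunks <- split(seq_along(written),
                  ceiling(seq_along(written) / chunk_size))

  res <- lapply(chunks, function(idx) {
    # Rows = written terms in this chunk, columns = all suggested terms.
    d   <- stringdistmatrix(written[idx], suggested, method = "lv")
    # Same normalisation as the loop: 1 - distance / nchar(suggested term).
    sim <- 1 - sweep(d, 2, len_sugg, "/")
    # Column index of the highest similarity in each row.
    best <- max.col(sim, ties.method = "first")
    data.frame(WrittenTerms   = written[idx],
               suggestedterms = suggested[best],
               oldsim         = sim[cbind(seq_along(idx), best)],
               stringsAsFactors = FALSE)
  })

  out <- do.call(rbind, res)
  rownames(out) <- NULL
  out
}
```

It could be called as `matches <- best_match(df1$WrittenTerms, df2$suggestedterms)`. A smaller `chunk_size` lowers peak memory, since each chunk holds a `chunk_size` by `length(suggested)` matrix.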
  • Isn't this a repeat of the question you asked a few days ago? https://stackoverflow.com/questions/58485947/calculating-similarity-between-two-vectors-strings-in-r – user2474226 Oct 25 '19 at 10:14
  • Thanks, but that approach takes quite a long time, so I am trying to use text2vec/TF-IDF to find the similarity, as I understand this approach will take less time. I now have a matrix with 231 rows and 70,098 columns (this is a sample) holding the similarity values. But with the approach below I am unable to unstack it because of the huge number of columns (70,098). Do you have an easy way to do this using TF-IDF that takes less time? I am thinking of the following steps: unstack the matrix after converting it into a data frame, sort (similarity, ver), drop duplicates. – Pavan kumar Oct 26 '19 at 15:23
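
The unstack, sort, and drop-duplicates steps described in the comment above amount to picking the highest-scoring column in each row, which can be done directly on the matrix. A minimal sketch, assuming a dense base R similarity matrix `sim` whose row names are written terms and whose column names are suggested terms; the helper name `top_match_per_row` and the toy values are invented for illustration, and a sparse matrix would first need converting (possibly in row chunks).

```r
# Hypothetical helper: best-scoring column for every row of a similarity matrix.
top_match_per_row <- function(sim) {
  best <- max.col(sim, ties.method = "first")  # column index of each row maximum
  data.frame(WrittenTerms   = rownames(sim),
             suggestedterms = colnames(sim)[best],
             similarity     = sim[cbind(seq_len(nrow(sim)), best)],
             stringsAsFactors = FALSE)
}

# Toy similarity matrix (values are made up).
sim <- matrix(c(0.1, 0.9, 0.2,
                0.0, 0.1, 1.0),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("head pain", "lung cancer"),
                              c("cardio attack", "head ache", "lung cancer")))

top_match_per_row(sim)
```

This yields one row per written term without ever building the long unstacked table.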

0 Answers