Computing similarity % in text strings by excluding the identical entries in R

Question

The R script below computes the percentage similarity between two names, as shown in the visual. There are two columns, "names1" and "names2", with their respective ids in id1 and id2. When the script runs, each name in "names1" should be compared with each name in "names2", except that an (id1,names1) entry must not be compared with its identical (id2,names2) entry. For illustration, the first (id1,names1) entry (1,Prabhudev Ramanujam) should be compared with every (id2,names2) entry except the first one, and similarly for all other pairs. Also, can the formula

percent(sapply(names1, function(i)RecordLinkage::levenshteinSim(i,names2)))

be tweaked to produce the same result faster? It slows down on large data. I am attaching a snapshot; please help.

library(stringdist)
library(RecordLinkage)
library(dplyr)
library(scales)
id1    <- 1:8
names1 <- c("Prabhudev Ramanujam", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam", "SriramKishore Sharma",
            "Deepak Subramaniam", "Sangamer Mahapatra")
id2    <- c(1, 2, 3, 4, 11, 13, 9, 10)
names2 <- c("Prabhudev Ramanujam", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam")
Name_Data <- data.frame(id1, names1, id2, names2)
Percent <- percent(sapply(names1, function(i)
  RecordLinkage::levenshteinSim(i, names2)))
Total_Value <- data.frame(id2, names2, Percent)

When you compare one element with all other values, the output vector would be 56. — akrun, Jan 06 '18 at 11:33
@akrun, thanks for replying. I think it should be 60: when you run the Name_Data variable, you see that four entries are identical, so they should be excluded, leaving a resultant vector of 60 values. — Adam Shaw, Jan 06 '18 at 11:39
Perhaps `i1 <- seq_len(nrow(Name_Data)); percent(sapply(i1, function(i) RecordLinkage::levenshteinSim(names1[i], names2[setdiff(i1, i)])))` — akrun, Jan 06 '18 at 11:46
`percent(unlist(lapply(1:length(names1), function(x) levenshteinSim(names1[x], names2[-x]))))` this will be a bit faster — erocoar, Jan 06 '18 at 12:10
@erocoar, thanks for the help, but my requirement is to check the similarity based on the ids too. With this solution, if I change an id somewhere, say the last value of "id2" from "10" to "2", the result should have fewer than 56 entries, but it currently gives 56 entries in all. Kindly suggest. — Adam Shaw, Jan 06 '18 at 12:35
@akrun, appreciate your effort here, but please help me to resolve the problem based on the id's here too. — Adam Shaw, Jan 06 '18 at 12:42
@AdamShaw Perhaps `outer` could be faster `percent(outer(names1, names2, FUN= RecordLinkage::levenshteinSim))` or do a crossjoin `library(data.table); CJ(names1, names2)[, percent(RecordLinkage::levenshteinSim(V1, V2))]` — akrun, Jan 06 '18 at 12:50
@akrun, thanks for a faster solution, but as per my requirement, kindly suggest an approach that first checks for clone (id1,names1) and (id2,names2) pairs, excludes such pairs while computing, and then gives the result. By my count, I should be getting a total of 60 entries here. Thanks. — Adam Shaw, Jan 06 '18 at 12:59
Sorry, I am not getting the 60 entries. As per your logic, it can be 56 by excluding the one that you are not comparing — akrun, Jan 06 '18 at 13:01
@erocoar, It can be, I'll be very clear here, if every (id1,names1) pair has a common (id2,names2) pair, I want to exclude such entries in % computation, rest every (id1,names1) pair has to be compared with all other (id2,names2) pairs, also, kindly avoid using if or loops to achieve this, thanks. — Adam Shaw, Jan 06 '18 at 13:37
@akrun, kindly check the tables above, names1 and names2 value might be same, but the id1 and id2 values make them different, I do not want clone pairs to be compared, but distinct pairs, that way, if you check here, a total of 60 values will be achieved. Only the first 4 "names1" values are having clone in "names2" column based on id's — Adam Shaw, Jan 06 '18 at 13:41

erocoar · Accepted Answer · 2018-01-07T00:19:05.630

Not much faster, but my suggestion would be:

percent(unlist(lapply(1:length(names1), function(x) {
  levenshteinSim(names1[x], names2[!(names2==names1[x] & id2==id1[x])])})))
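
The comment thread disagrees on whether the result should have 56 or 60 entries. As a minimal sketch using only base R and the data from the question, the same exclusion rule (same id and same name) can be evaluated for all pairs at once and counted. The name `clone` is chosen here for illustration and does not appear in the original post.

```r
id1    <- 1:8
names1 <- c("Prabhudev Ramanujam", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam", "SriramKishore Sharma",
            "Deepak Subramaniam", "Sangamer Mahapatra")
id2    <- c(1, 2, 3, 4, 11, 13, 9, 10)
names2 <- c("Prabhudev Ramanujam", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam")

# clone[i, j] is TRUE when (id1[i], names1[i]) is identical to (id2[j], names2[j])
clone <- outer(id1, id2, "==") & outer(names1, names2, "==")

n_excluded <- sum(clone)   # 4: only the first four rows have an identical pair
n_kept     <- sum(!clone)  # 60 of the 8 * 8 = 64 possible comparisons remain
```

Filtering by position alone (`names2[-x]`) drops one comparison per row and gives 56, whereas filtering on id and name together drops only the four true clones and gives 60.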

Edit:

Alternatively, this might be faster - I guess it varies:

sim  <- 1 - stringdistmatrix(names1, names2, method = "lv") /
  outer(nchar(names1), nchar(names2), pmax)
keep <- unlist(lapply(seq_along(names1), function(x) !(names2 == names1[x] & id2 == id1[x])))
as.vector(t(sim))[keep]
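
One possible variation on the matrix approach drops the `lapply` and keeps the ids next to each score, so it is clear which pair every value belongs to. This is only a sketch under stated assumptions: it uses base R's `utils::adist` (Levenshtein distance with its default unit costs) as a stand-in for `stringdistmatrix`, the helper `pairwise_similarity` and its column names are hypothetical, and no timing was measured, so it may or may not be faster on real data.

```r
# Hypothetical helper: similarity for every pair except identical (id, name) pairs
pairwise_similarity <- function(id1, names1, id2, names2) {
  # Levenshtein distance for every (names1[i], names2[j]) pair
  dist <- utils::adist(names1, names2)
  # Normalise by the longer string of each pair, as in the matrix version above
  longest <- outer(nchar(names1), nchar(names2), pmax)
  sim <- 1 - dist / longest
  # TRUE where both id and name match; those pairs are skipped
  clone <- outer(id1, id2, "==") & outer(names1, names2, "==")
  keep <- which(!clone, arr.ind = TRUE)
  data.frame(
    id1        = id1[keep[, "row"]],
    names1     = names1[keep[, "row"]],
    id2        = id2[keep[, "col"]],
    names2     = names2[keep[, "col"]],
    similarity = sim[!clone],
    stringsAsFactors = FALSE
  )
}

id1    <- 1:8
names1 <- c("Prabhudev Ramanujam", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam", "SriramKishore Sharma",
            "Deepak Subramaniam", "Sangamer Mahapatra")
id2    <- c(1, 2, 3, 4, 11, 13, 9, 10)
names2 <- c("Prabhudev Ramanujam", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam", "Sangamer Mahapatra",
            "SriramKishore Sharma", "Deepak Subramaniam")

res <- pairwise_similarity(id1, names1, id2, names2)
nrow(res)  # 60
```

The `similarity` column is a plain number between 0 and 1, so formatting it as a percentage can be left to the very end, for example with `sprintf("%.1f%%", 100 * res$similarity)`.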

edited Jan 07 '18 at 00:19

answered Jan 06 '18 at 13:35


thanks, however, if something can be done to make the script fast, kindly help. – Adam Shaw Jan 06 '18 at 13:59
You can speed it up with `paste(round(unlist(lapply(1:length(names1), function(x) { levenshteinSim(names1[x], names2[!(names2==names1[x] & id2==id1[x])])}))*100, 1), "%", sep="")` rather than calling `percent`. Most of the remaining computation time comes from calling `levenshteinSim`, which I am not sure how to speed up – erocoar Jan 06 '18 at 15:00
