I want to compute the percentage character match between two strings, i.e. two columns of names in my data frame. A solution using sqldf would be helpful. Below is an example of what I want to achieve for a pair of columns in the data frame.

Comparing FAYE to FAYE2, the output should be 90%.

The following formula is to be used:

total characters = length of 1st string + length of 2nd string = 4 + 5 = 9

percentage = (matched characters x 2) / total characters = (4 x 2) / 9 = 8/9 = 88.88%, or roughly 90%

We multiply the matched characters by 2 because there are 2 strings.
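The formula above can be sketched in base R as follows. This is only an illustration, not an sqldf solution: the helper name `char_match_pct` is hypothetical, and the sketch assumes that "matched characters" means the characters common to both strings, counted with multiplicity and ignoring position. The question does not define matching precisely, so a different definition (for example, position-by-position matching) would give different numbers.

```r
# Hypothetical helper: 100 * (2 * matched) / (nchar(a) + nchar(b)).
# Assumption: "matched" = characters shared by both strings,
# counted with multiplicity and regardless of position.
char_match_pct <- function(a, b) {
  ca <- table(strsplit(a, "")[[1]])   # character counts in a
  cb <- table(strsplit(b, "")[[1]])   # character counts in b
  common <- intersect(names(ca), names(cb))
  # for each shared character, take the smaller of the two counts
  matched <- sum(pmin(as.integer(ca[common]), as.integer(cb[common])))
  100 * 2 * matched / (nchar(a) + nchar(b))
}

# The example from the question: (4 x 2) / 9
char_match_pct("FAYE", "FAYE2")

# Applied row by row to two columns of a data frame
DF <- data.frame(string1 = c("FAYE", "FAYE2", "X"),
                 string2 = c("FAYE2", "FAYE", "FAYE"),
                 stringsAsFactors = FALSE)
DF$percent <- mapply(char_match_pct, DF$string1, DF$string2,
                     USE.NAMES = FALSE)
```

Under this assumption the measure is symmetric, so swapping the two columns does not change the result, and strings with no characters in common score 0.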

Thanks

1 Answer

We assume from the example in the question that we want to determine whether the first string is a substring of the second or vice versa. If so, we report the ratio of the shorter length to the longer one as a percentage; otherwise we report 0. Note that the ratio of the lengths in the example is 100 * 4 / 5 = 80%, not 90% as shown in the question.

# test data
DF <- data.frame(string1 = c("FAYE", "FAYE2", "X"), 
                 string2 = c("FAYE2", "FAYE", "FAYE"), stringsAsFactors = FALSE)

library(sqldf)

sqldf("select *, 
  max(100.0 * (instr(string2, string1) > 0) * length(string1) / length(string2),
      100.0 * (instr(string1, string2) > 0) * length(string2) / length(string1))
      percent from DF")

giving:

  string1 string2 percent
1    FAYE   FAYE2      80
2   FAYE2    FAYE      80
3       X    FAYE       0
G. Grothendieck
  • Thanks for the code. But I have been asked to use the following formula to calculate the percentage: total characters (adding the lengths of the 1st and 2nd strings) = 9; matched characters multiplied by 2, divided by total characters = (4 x 2) / 9 = 8/9 = 88.88%, or roughly 90%. We multiply the matched characters by 2 because there are 2 strings. – Gautam Biswas Feb 28 '19 at 06:05
  • If I compare string1 = "DUCK THRU" with string2 = "JERNIGAN OIL CO., INC." it gives me 0%, but it shouldn't: a few characters do match between the two strings. – Gautam Biswas Feb 28 '19 at 06:21