
I have the following data in an R data frame:

DF

structure(list(ID = c("VVC-110", "VVC-111", "VVC-111", "VVC-112", 
"VVC-113"), Add = c("255 3RD FLOOR A SQUARE PLOT NO 10 POCKET 4 SECTOR 11 ", 
"7045 Liberty Ave. Gastonia, Rose Street ", "22 S. Holly St. \nWinter Garden,.", 
"9416 Washington St. \nStafford, Leatherwood Circle", "466 Pawnee Street \nSicklerville,Ridgeview Court \nMundelein,.."
), State = c("Alabama", "Alaska", "Arizona ", "California ", 
"Colorado"), City = c("Birmingham", "Anchorage", "Phoenix", "Los Angeles", 
"Denver"), Zipcode = c(58765L, 75974L, 98052L, 89406L, 12421L
), Add_1 = c("255, 3rd FLOOR A SQUARE PLOT NO.10 POCKET 4 SECTOR 11, ", 
"7045 Liberty Ave. Gastonia, Rose Street View, New", "22 S. Holly St. \nWinter Garden,.", 
"9416, Washington St., \nStafford, Leather Wood", "466 Pawnee Street \nSicklerville"
), State_1 = c("Alabama", "Alaskaa", "Arizona", "California", 
"Colorado"), City_1 = c("Birmingham", "Anchorage", "Phoenix", 
"LosAngeles", "Den ver"), Zipcode_1 = c(58765L, 75974L, 98052L, 
89406L, 12421L)), class = "data.frame", row.names = c(NA, -5L
))

Using the data frame above, I want to determine the percentage match between two particular strings, so that I can tell for how many rows two fields/columns are likely the same.

  • % of String Match between Add and Add_1.
  • % of String Match between State and State_1.

Disclaimer: the percentages in the required output are arbitrary and may vary with the logic and approach used.

Jupiter
  • Do you define a match as the strings in corresponding rows being identical? – s_baldur Aug 09 '18 at 09:44
  • @snoram I want to get these percentages using fuzzy logic and the Levenshtein distance. – Jupiter Aug 09 '18 at 09:51
  • R has great printing capabilities (especially for data frames). Maybe try to update your code output to be more readable. It will greatly improve question quality! – franiis Aug 09 '18 at 09:53
  • Use the `stringdist` package: `stringdist::stringdist(DF$Add,DF$Add_1,method="lv")` gives you the Levenshtein distance, which you can divide by `nchar(DF$Add)` for a percentage. – Michael Bird Aug 09 '18 at 09:55
  • Whilst not directly related to the question, you should keep in mind that `123 fake street` has a Levenshtein distance of 1 from `124 fake street` but also from `123, fake street`, so a low Levenshtein distance does not always imply a correct address. It might be worth defining your own metric that gives digits more weight. – Michael Bird Aug 09 '18 at 10:06
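
The point about digits in the last comment can be checked with base R alone. What follows is only a minimal sketch: it assumes `utils::adist` (base R's generalized Levenshtein distance) as a stand-in for the package functions mentioned above, and `digits_match` is a hypothetical helper invented for illustration, not part of any package.

```r
# utils::adist() (attached by default) returns a matrix of Levenshtein distances.
ref   <- "123 fake street"
cands <- c("124 fake street", "123, fake street")

# Both candidates are a single edit away from the reference string.
lev <- as.vector(adist(ref, cands))

# Hypothetical helper: TRUE when two strings contain the same digits in the same order.
digits_match <- function(a, b) {
  gsub("[^0-9]", "", a) == gsub("[^0-9]", "", b)
}

same_digits <- digits_match(ref, cands)

# The distance alone cannot tell the two candidates apart; the digit check can.
data.frame(candidate = cands, lev = lev, same_digits = same_digits)
```

A digit check like this could be combined with the distance, for example by only trusting a low distance when the digits also agree. How much weight digits deserve is a judgment call that depends on the data.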

1 Answer


I use the following approach for the Levenshtein distance (incorporating the suggestion of @Michael Bird):

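 library(RecordLinkage)  # provides levenshteinDist()
 library(dplyr)

 # DF is the data frame from the question
 DF %>%
   mutate(levi_add = levenshteinDist(Add, Add_1),      # edit distance per row
          levi_state = levenshteinDist(State, State_1),
          n_char_add = nchar(Add),                     # length of the reference string
          n_char_State = nchar(State),
          # percent match = 100 minus the distance as a percentage of the reference length
          levi_add_percent = 100 - round(levi_add / n_char_add * 100, digits = 1),
          levi_state_percent = 100 - round(levi_state / n_char_State * 100, digits = 1)) %>%
   select(ID, levi_add_percent, levi_state_percent)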

Output is:

       ID levi_add_percent levi_state_percent
1 VVC-110             90.6              100.0
2 VVC-111             77.5               83.3
3 VVC-111            100.0               87.5
4 VVC-112             77.6               90.9
5 VVC-113             50.8              100.0
Stephan
  • It's worth noting that a low percentage is a good match and a high percentage is a poor match. Personally I'd use `100-round(levi/n_char*100, digits = 1)` – Michael Bird Aug 09 '18 at 10:03
  • This approach, however, is rather sensitive to minor, unimportant differences such as additional whitespace. – Daniel Fischer Aug 09 '18 at 12:04
  • Is there a reason for not using the levenshteinSim() function? It seems to me it can be multiplied by 100 and gives the same result. Or am I missing something? "levenshteinSim is a similarity function based on the Levenshtein distance, calculated by 1 - d(str1,str2) / max(A,B), where d is the Levenshtein distance function and A and B are the lengths of the strings." – Pomul Aug 20 '19 at 15:28
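
The last two comments can be combined in one sketch: normalise whitespace before comparing, then divide the distance by the length of the longer string, as in the quoted `levenshteinSim` formula. This is an illustration under stated assumptions, not the answer's method: it uses base R's `utils::adist` instead of `RecordLinkage`, and `squish` and `lev_sim_percent` are hypothetical helper names.

```r
# Hypothetical helper: collapse runs of whitespace (including newlines) and trim the ends.
squish <- function(x) trimws(gsub("\\s+", " ", x))

# Hypothetical helper: percent similarity, 100 * (1 - d / max(len_a, len_b)),
# where d is the Levenshtein distance computed by base R's utils::adist().
lev_sim_percent <- function(a, b) {
  a <- squish(a)
  b <- squish(b)
  # Row-wise distance between corresponding elements of a and b.
  d <- mapply(function(x, y) adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  longest <- pmax(nchar(a), nchar(b))
  # Two empty strings are treated as identical to avoid dividing by zero.
  ifelse(longest == 0, 100, round(100 * (1 - d / longest), 1))
}

# State values copied from the question's data frame.
state   <- c("Alabama", "Alaska", "Arizona ", "California ", "Colorado")
state_1 <- c("Alabama", "Alaskaa", "Arizona", "California", "Colorado")

state_percent <- lev_sim_percent(state, state_1)
state_percent
```

With whitespace normalised, the trailing spaces in "Arizona " and "California " no longer count as differences. Dividing by the longer string's length rather than by `nchar(State)` also means the result can differ slightly from the percentages in the answer whenever the two strings have different lengths.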