0

I am working on data that contains a list of treatment providers in one file and rural-urban codes in another file. The ultimate goal is to link the rural-urban codes to the treatment centers' locations via county and state locations. I have tried to do this with fuzzy join but the results I am getting back are very wrong and I"m not sure how to correct them. I've tried to condense exemplars of the issues I"m facing into a toy dataset:

df1 = data.frame(state = c("MO", "AK", "AR","LA","MN", "MN", "OK", "MI", "VA","KS","TX"),
                 county = c("Saint Louis City","Prince of Wales Hyder", "Lincoln",
                            "Acadia","Saint Louis","Crow Wing", "Lincoln", 
                            "Wayne", "Fauquier County", "Anderson","Anderson"))

df2 = data.frame(State = c("MO", "AK", "AR","LA","MN", "MN", "OK","MI", "VA", "KS","TX"),
                County_Name = c("Saint Louis City","Price of Wales Hyder Census Area","Lincoln County", 
                                "Acadia Parish", "St. Louis County", "Crow Wing County", "Lincoln County",
                                "Wayne County","Fauquier County","Anderson County","Anderson County"),
                Codes = seq(1,11))

I've tried to get a fuzzy join to work looking at other forums such as:

test = stringdist_join(df1, df2,
                 by = c("state" = "State", "county" = "County_Name"),
                 mode = "left",
                 method = "lcs",
                 max_dist = c(0,10),
                 ignore_case = TRUE,
                 distance_col='dist') 

However, most of the time this does not even get the state correct which I thought I forced to be an exact match with a max_dist=0 and it runs very slowly. I've played around with the max_dist argument and a few methods, but thought longest common string would work well as most of the time the issue is that County_Name has more characters than county (e.g., county = Anderson while County_Name = Anderson County). There are also several suffix variants in the County_Name column such as county, municipality, city, etc. and not all county names are one word (e.g., "Hot Springs County") that I've tried to represent in this toy. The expected output is that the correct code will be joined to df1's correct state and county.

Can someone please help me figure out a way to achieve this join? It doesn't have to be this method but I'm inexperienced with other solutions like regexpress.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
aarsmith
  • 65
  • 6

0 Answers0