0

I have two dataframes:

df1 <- data.frame(City=c("Munchen_Paris","Munchen_Paris","Barcelona_Milan", "Londen_Dublin","Madrid_Malaga"), 
                  value1=c(11,21,33,2,53))

df2 <- data.frame(City=c("Munnich_Parijs","Barcelona_Munster","Barcelona_Milan","London_Dub","London_Oxford","Pisa_Luik"), 
                  value2=c(22,2,44,54,29,65))

I try to merge these dataframes with fuzzyjoin.

The result I am looking for is:

           City.x  value1   City.y             value2  string_distance
1   Munchen_Paris      11   Munnich_Parijs     22      5
2   Munchen_Paris      21   Munnich_Parijs     22      5
3 Barcelona_Milan      33   Barcelona_Milan    44      0
4   Londen_Dublin       2   London_Dub         54      4

(for every row in df1 with a match in df2 for City with a string_distance < 9, I want a row in the new table containing all columns from df1 and df2 with the lowest string_distance) When I do:

df3 <- stringdist_semi_join(df1, df2, by = "City", max_dist = 9, distance_col = "string_distance")

I receive only these columns:

> df3
             City value1
1   Munchen_Paris     11
2   Munchen_Paris     21
3 Barcelona_Milan     33
4   Londen_Dublin      2

If I do a full join I receive this:

> df3 <- stringdist_full_join(df1, df2, by = "City", max_dist = 9, distance_col = "string_distance")

> df3
           City.x value1            City.y value2 string_distance
1   Munchen_Paris     11    Munnich_Parijs     22               5
2   Munchen_Paris     21    Munnich_Parijs     22               5
3 Barcelona_Milan     33 Barcelona_Munster      2               6
4 Barcelona_Milan     33   Barcelona_Milan     44               0
5   Londen_Dublin      2        London_Dub     54               4
6   Londen_Dublin      2     London_Oxford     29               7
7   Madrid_Malaga     53              <NA>     NA              NA
8            <NA>     NA         Pisa_Luik     65              NA

I can delete the rows containing NA and group_by City.x although then I loose one of the first two rows.

If I do inner_join I receive this:

    df3 <- stringdist_inner_join(df1, df2, by = "City", max_dist = 9, distance_col = "string_distance")

df3

> df3
           City.x value1            City.y value2 string_distance
1   Munchen_Paris     11    Munnich_Parijs     22               5
2   Munchen_Paris     21    Munnich_Parijs     22               5
3 Barcelona_Milan     33 Barcelona_Munster      2               6
4 Barcelona_Milan     33   Barcelona_Milan     44               0
5   Londen_Dublin      2        London_Dub     54               4
6   Londen_Dublin      2     London_Oxford     29               7

Is it strange that stringdist_semi_join does not shows the columns of df2? Is there another way to reach the result I am looking for in the first table above?

Thanks a lot!

user2165379
  • 445
  • 4
  • 20

1 Answers1

0

What a semi join does (from the dplyr documentation):

return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.

So the behavior you're seeing is expected.

You are looking for an inner join:

return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

In fuzzyjoin this is called stringdist_inner_join:

df3 <- stringdist_inner_join(df1, df2, by = "City", max_dist = 9, distance_col = "string_distance")
Bas
  • 4,628
  • 1
  • 14
  • 16
  • @ Bas Thanks! Although I was not clear enough in my question (I have modified my question and added inner_join output as well). I am only looking for the best match in df2 (only the match with the lowest string_distance) for every row in df1. Semi join does that, although it does not provide the columns from df2. – user2165379 Apr 10 '20 at 19:26
  • Semi join only keeps the rows of df1 that have a match. It does not look for a 'best match'. As far as I know, there is no way to achieve what you want with only a join. I'd stick to the `inner_join` and filter out the smallest value of `distance_col` for every `City.x`. – Bas Apr 11 '20 at 06:58
  • 1
    @ Bas. Thanks for your help. I have sticked to inner_join and added a column with row number as id. Next I added this filter df3 %>% dplyr::group_by(id.x) %>% slice(which.min(string_distance)). – user2165379 Apr 11 '20 at 07:44