fuzzy_left_join with match_fun %in%

Question

Some data

example_df <- data.frame(
  url = c('blog/blah', 'blog/?utm_medium=foo', 'blah', 'subscription/apples', 'UK/something'),
  numbs = 1:5
)

lookup_df <- data.frame(
  string = c('blog', 'subscription', 'UK'),
  group = c('blog', 'subs', 'UK')
)


library(fuzzyjoin)
data_combined <- example_df %>% 
  fuzzy_left_join(lookup_df, by = c("url" = "string"), 
                  match_fun = `%in%`)

data_combined
                   url numbs string group
1            blog/blah     1   <NA>  <NA>
2 blog/?utm_medium=foo     2   <NA>  <NA>
3                 blah     3   <NA>  <NA>
4  subscription/apples     4   <NA>  <NA>
5         UK/something     5   <NA>  <NA>

I expected data_combined to have values for string and group where there's a match based on match_fun. Instead all NA.

Example, the first value of string in lookup_df is 'blog'. Since this is %in% the first value of example_df string, expected a match with value 'blog' and 'blog' in string and group fields.

Is this always looking for a match before `/`, then `example_df %>% mutate(string = str_remove(url, "\\/.*")) %>% left_join(lookup_df)` — akrun, Mar 03 '21 at 19:17
RE your first comment, when trying like I get `Error in fuzzy_join(x, y, by, match_fun, mode = "left", ...) : object '%like%' not found` — Doug Fir, Mar 03 '21 at 19:18
sorry, it is from data.table. I think the second comment should work if the pattern you are matching is before the `/` — akrun, Mar 03 '21 at 19:18
RE your second comment the only thing I know for sure with this data is that the string in lookup_df will always be a string same as or shorter than it's equivalent in example_df url? — Doug Fir, Mar 03 '21 at 19:19
Ah! Loading data.table got it to work but introduced another issue. It duplicates rows where the match is more than once in the lookup. Argh, today is a tough one.. There's no flag you know of along the lines 'just use the first match'? — Doug Fir, Mar 03 '21 at 19:21
it is duplicating. I noticed that. Can you try the code in the solution posted — akrun, Mar 03 '21 at 19:22

score 2 · Accepted Answer · answered Mar 03 '21 at 19:21

2

If we want to do a partial match with the word before the / in the 'url' with the 'string' column in 'lookup_df', we could extract that substring as a new column and then do a regex_left_join

library(dplyr)
library(fuzzyjoin)
library(stringr)
example_df %>%
    mutate(string = str_remove(url, "\\/.*")) %>% 
    regex_left_join(lookup_df, by = 'string') %>%
    select(url, numbs, group)

-output

#                   url numbs group
#1            blog/blah     1  blog
#2 blog/?utm_medium=foo     2  blog
#3                 blah     3  <NA>
#4  subscription/apples     4  subs
#5         UK/something     5    UK

answered Mar 03 '21 at 19:21

akrun

874,273
37
540
662

1

Perfect, this works and does not seem to duplicate! Why would it not duplicate? Welp, not complaining – Doug Fir Mar 03 '21 at 19:24
@DougFir It would be the way the `%like%` is matching. Also, it is better to do the match more directed instead of the whole url – akrun Mar 03 '21 at 19:25

fuzzy_left_join with match_fun %in%

1 Answers1