1

Some data

example_df <- data.frame(
  url = c('blog/blah', 'blog/?utm_medium=foo', 'blah', 'subscription/apples', 'UK/something'),
  numbs = 1:5
)

lookup_df <- data.frame(
  string = c('blog', 'subscription', 'UK'),
  group = c('blog', 'subs', 'UK')
)


library(fuzzyjoin)
data_combined <- example_df %>% 
  fuzzy_left_join(lookup_df, by = c("url" = "string"), 
                  match_fun = `%in%`)

data_combined
                   url numbs string group
1            blog/blah     1   <NA>  <NA>
2 blog/?utm_medium=foo     2   <NA>  <NA>
3                 blah     3   <NA>  <NA>
4  subscription/apples     4   <NA>  <NA>
5         UK/something     5   <NA>  <NA>

I expected data_combined to have values for string and group where there's a match based on match_fun. Instead all NA.

Example, the first value of string in lookup_df is 'blog'. Since this is %in% the first value of example_df string, expected a match with value 'blog' and 'blog' in string and group fields.

Doug Fir
  • 19,971
  • 47
  • 169
  • 299
  • 1
    I think you may need `%like%` instead of `%in%` – akrun Mar 03 '21 at 19:11
  • Is this always looking for a match before `/`, then `example_df %>% mutate(string = str_remove(url, "\\/.*")) %>% left_join(lookup_df)` – akrun Mar 03 '21 at 19:17
  • RE your first comment, when trying like I get `Error in fuzzy_join(x, y, by, match_fun, mode = "left", ...) : object '%like%' not found` – Doug Fir Mar 03 '21 at 19:18
  • sorry, it is from data.table. I think the second comment should work if the pattern you are matching is before the `/` – akrun Mar 03 '21 at 19:18
  • RE your second comment the only thing I know for sure with this data is that the string in lookup_df will always be a string same as or shorter than it's equivalent in example_df url? – Doug Fir Mar 03 '21 at 19:19
  • Ah! Loading data.table got it to work but introduced another issue. It duplicates rows where the match is more than once in the lookup. Argh, today is a tough one.. There's no flag you know of along the lines 'just use the first match'? – Doug Fir Mar 03 '21 at 19:21
  • 1
    it is duplicating. I noticed that. Can you try the code in the solution posted – akrun Mar 03 '21 at 19:22

1 Answers1

2

If we want to do a partial match with the word before the / in the 'url' with the 'string' column in 'lookup_df', we could extract that substring as a new column and then do a regex_left_join

library(dplyr)
library(fuzzyjoin)
library(stringr)
example_df %>%
    mutate(string = str_remove(url, "\\/.*")) %>% 
    regex_left_join(lookup_df, by = 'string') %>%
    select(url, numbs, group)

-output

#                   url numbs group
#1            blog/blah     1  blog
#2 blog/?utm_medium=foo     2  blog
#3                 blah     3  <NA>
#4  subscription/apples     4  subs
#5         UK/something     5    UK
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Perfect, this works and does not seem to duplicate! Why would it not duplicate? Welp, not complaining – Doug Fir Mar 03 '21 at 19:24
  • @DougFir It would be the way the `%like%` is matching. Also, it is better to do the match more directed instead of the whole url – akrun Mar 03 '21 at 19:25