How to match corresponding values to part of string (before and after space)?

Question

I have two dataframes, and want to add values from the 2nd one to the 1st one according to string values, but use partial string matching if there is a space

df1:
cat
small dog
apple

df2:
cat  24
small 5
dog  400
apple 83
pear 55

I normally use "left_join" from tidyverse, which would be

df3 <- left_join(df1, df2, by="column_name")

df3:
cat 24
small dog NA
apple 83

but this means that "small dog" has a missing value. What I want to do this time is find the value for either "small" or "dog", and input whichever is bigger. I'm not able to find a function that will tell R to look separately before or after the space though

`regex_left_join` from `fuzzyjoin` may be the one you are after — akrun, Jan 24 '22 at 20:01
Do you have two columns in the second dataset or is it a single string `"cat 24"` — akrun, Jan 24 '22 at 20:04

score 1 · Answer 1 · answered Jan 24 '22 at 20:07

1

We may use regex_left_join from fuzzyjoin and then do a group by to summarise the second column with max values

library(dplyr)
library(fuzzyjoin)
regex_left_join(df1, df2, by = "column_name") %>% 
 group_by(column_name = column_name.x) %>% 
 summarise(col2 = max(col2))

-output

# A tibble: 3 × 2
  column_name  col2
  <chr>       <dbl>
1 apple          83
2 cat            24
3 small dog     400

data

df1 <- structure(list(column_name = c("cat", "small dog", "apple")), 
class = "data.frame", row.names = c(NA, 
-3L))

df2 <- structure(list(column_name = c("cat", "small", "dog", "apple", 
"pear"), col2 = c(24, 5, 400, 83, 55)), class = "data.frame", row.names = c(NA, 
-5L))

answered Jan 24 '22 at 20:07

akrun

874,273
37
540
662

couldn't test it because R crashed whenever I tried... but thanks – Agata Jan 24 '22 at 21:14
@Agata is the testing done on the same example or a very big dataset? if it is big, probably memory may be an issue. Or else you may have to check whether the packages are installed correctly – akrun Jan 24 '22 at 21:15
yes it was on a larger dataset and it seemed that it was a memory issue, but the solution below worked so its ok – Agata Jan 24 '22 at 21:17

PaulS · Accepted Answer · 2022-01-24T20:42:30.177

1

Another possible solution, based on inner_join:

library(tidyverse)

df1 %>% 
  mutate(spaces = row_number()*str_detect(column_name, " ")) %>% 
  separate_rows(column_name, sep = " ") %>% 
  inner_join(df2, by="column_name") %>% 
  group_by(spaces) %>% 
  mutate(col2 = if_else(spaces > 0, max(col2), col2),
         column_name = if_else(spaces > 0, str_c(column_name, collapse = " "), 
                       column_name)) %>% 
  ungroup %>% distinct %>% select(-spaces)

#> # A tibble: 3 × 2
#>   column_name  col2
#>   <chr>       <dbl>
#> 1 cat            24
#> 2 small dog     400
#> 3 apple          83

edited Jan 24 '22 at 20:42

answered Jan 24 '22 at 20:36

PaulS

21,159
2
9
26

1

Thanks! this works if both parts of the string are present in df2, but if one is missing then it selects NA as max value. I added mutate(col2 = replace_na(col2, 0))%>% after line 5 (before the conditional statement) and then it worked perfectly – Agata Jan 24 '22 at 21:13
1

Great, @Agata! And glad I helped you! – PaulS Jan 24 '22 at 21:18

How to match corresponding values to part of string (before and after space)?

2 Answers2

data