1

I have two dataframes, and want to add values from the 2nd one to the 1st one according to string values, but use partial string matching if there is a space

df1:
cat
small dog
apple

df2:
cat  24
small 5
dog  400
apple 83
pear 55


I normally use "left_join" from tidyverse, which would be

df3 <- left_join(df1, df2, by="column_name")

df3:
cat 24
small dog NA
apple 83

but this means that "small dog" has a missing value. What I want to do this time is find the value for either "small" or "dog", and input whichever is bigger. I'm not able to find a function that will tell R to look separately before or after the space though

user438383
  • 5,716
  • 8
  • 28
  • 43
Agata
  • 343
  • 3
  • 13

2 Answers2

1

We may use regex_left_join from fuzzyjoin and then do a group by to summarise the second column with max values

library(dplyr)
library(fuzzyjoin)
regex_left_join(df1, df2, by = "column_name") %>% 
 group_by(column_name = column_name.x) %>% 
 summarise(col2 = max(col2))

-output

# A tibble: 3 × 2
  column_name  col2
  <chr>       <dbl>
1 apple          83
2 cat            24
3 small dog     400

data

df1 <- structure(list(column_name = c("cat", "small dog", "apple")), 
class = "data.frame", row.names = c(NA, 
-3L))

df2 <- structure(list(column_name = c("cat", "small", "dog", "apple", 
"pear"), col2 = c(24, 5, 400, 83, 55)), class = "data.frame", row.names = c(NA, 
-5L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • couldn't test it because R crashed whenever I tried... but thanks – Agata Jan 24 '22 at 21:14
  • @Agata is the testing done on the same example or a very big dataset? if it is big, probably memory may be an issue. Or else you may have to check whether the packages are installed correctly – akrun Jan 24 '22 at 21:15
  • yes it was on a larger dataset and it seemed that it was a memory issue, but the solution below worked so its ok – Agata Jan 24 '22 at 21:17
1

Another possible solution, based on inner_join:

library(tidyverse)

df1 %>% 
  mutate(spaces = row_number()*str_detect(column_name, " ")) %>% 
  separate_rows(column_name, sep = " ") %>% 
  inner_join(df2, by="column_name") %>% 
  group_by(spaces) %>% 
  mutate(col2 = if_else(spaces > 0, max(col2), col2),
         column_name = if_else(spaces > 0, str_c(column_name, collapse = " "), 
                       column_name)) %>% 
  ungroup %>% distinct %>% select(-spaces)

#> # A tibble: 3 × 2
#>   column_name  col2
#>   <chr>       <dbl>
#> 1 cat            24
#> 2 small dog     400
#> 3 apple          83
PaulS
  • 21,159
  • 2
  • 9
  • 26
  • 1
    Thanks! this works if both parts of the string are present in df2, but if one is missing then it selects NA as max value. I added mutate(col2 = replace_na(col2, 0))%>% after line 5 (before the conditional statement) and then it worked perfectly – Agata Jan 24 '22 at 21:13
  • 1
    Great, @Agata! And glad I helped you! – PaulS Jan 24 '22 at 21:18