1

I am trying to perform a join in R based on a regex pattern from one table. From what I understand, the fuzzyjoin package should be exactly what I need, but I can't get it to work. Here is an example of what I'm trying to do:

library(tidyverse)
library(fuzzyjoin)

(typing_table <- tribble(
  ~typing,
  "DPB02:01",
  "DPB04:02"
)
)

(P_group_table <- tribble(
  ~P_group, ~Alleles,
  "DP1", "DPB01:01:01:01/DPB01:01:01:02/DPB01:01:01:03",
  "DP2", "DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03",
  "DP3", "DPB03:01:01:01/DPB03:01:01:02/DPB03:01:01:03",
  "DP4", "DPB04:01:01:01/DPB04:01:01:02/DPB04:01:01:03"
)
 )

I am trying to join the P_group_table to the typing_table by searching for the "typing" value in the "Alleles" string. I have used the following expression:

(typing_table %>% regex_left_join(P_group_table, by = c("typing" = "Alleles")))

Which results in a join, but the values from the right table are empty. I assume I must be misunderstanding the syntax of the regex_left_join expression, but I can't figure it out. I have verified that the "typing" value can be used as a regex pattern with the following code:

(typing_table_2 <- typing_table %>% slice_head)

(P_group_table %>% filter(str_detect(Alleles, typing_table_2$typing)))
Nick Brown
  • 55
  • 6
  • Yes, there should only be a single match based on the data. The right join works, but how would I set it up for a left join? – Nick Brown Aug 03 '22 at 16:04

1 Answers1

0

We could make use of str_detect as match_fun in fuzzy_.*_join

library(fuzzyjoin)
library(stringr)
fuzzy_right_join(P_group_table, typing_table, by = c("Alleles" = "typing"), 
       match_fun = str_detect)

Or may use

regex_right_join(P_group_table, typing_table, by = c("Alleles" = "typing"))
# A tibble: 2 × 3
  P_group Alleles                                      typing  
  <chr>   <chr>                                        <chr>   
1 DP2     DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03 DPB02:01
2 <NA>    <NA>                                         DPB04:02

Note the difference when we switch the pattern

> str_detect("DPB02:01", "DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03")
[1] FALSE
> str_detect("DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03", "DPB02:01")
[1] TRUE

One option to do the left_join is by getting the substring from the 'P_group_table' before doing the join

left_join(typing_table, P_group_table %>%
     mutate(typing = substr(Alleles, 1, 8)), by = "typing")
# A tibble: 2 × 3
  typing   P_group Alleles                                     
  <chr>    <chr>   <chr>                                       
1 DPB02:01 DP2     DPB02:01:02:01/DPB02:01:02:02/DPB02:01:02:03
2 DPB04:02 <NA>    <NA>          
akrun
  • 874,273
  • 37
  • 540
  • 662