Anti_join by pattern

Question

Tibble A contains keys/patterns that may be found in tibble B. My goal is to identify keys/patterns that were not found in tibble B.

I want to anti-join two tables by a matching pattern:

A <- tibble( colA = c("B12", "B19", "B202", "B87", "B61", "B55") )
B <- tibble( colB = c("aaB87aa", "bbbbB55", "B202cccc") )

I found that function from fuzzyjoin should work but I can't use it properly:

test <- regex_anti_join(A, B, by = c(colB = "colA"))

EDIT

At the end I would like to have something like this:

test
1 B12
2 B19
3 B61

Looks like there is a pattern, maybe using regex extract id as "starts with B followed by digits until a letter", then use that id as join column. — zx8754, Jul 17 '20 at 07:42
@YuriySaraykin it gives all `A` tibble as the output - it's not anti-join — amk, Jul 17 '20 at 07:58

score 2 · Accepted Answer · answered Jul 17 '20 at 08:26

regex_anti_join(A, B, by = c(colA = colB)) performs a str_detect of colB inside colA. In your case this return nothing, hence the empty result.

What you need is to do the inverse: search colA inside colB.

This can be achieved specifying the matching function to the fuzzy_join function that is used by regex_anti_join:

library(fuzzyjoin)

# This is exactly the inverse of the match_fun used inside regex_anti_join
match_fun <- function(v1, v2) {
  stringr::str_detect(v2, stringr::regex(v1))
}

fuzzy_join(A, B, by = c("colA" = "colB"), match_fun=match_fun, mode = 'anti')
#> # A tibble: 3 x 1
#>   colA 
#>   <chr>
#> 1 B12  
#> 2 B19  
#> 3 B61

^{Created on 2020-07-17 by the reprex package (v0.3.0)}

score 1 · Answer 2 · answered Jul 17 '20 at 08:23

1

One dplyr option given the example data could be:

A %>%
 filter(!colA %in% gsub("[a-z]", "", B$colB))

  colA 
  <chr>
1 B12  
2 B19  
3 B61

answered Jul 17 '20 at 08:23

tmfmnk

38,881
4
47
67

Anti_join by pattern

2 Answers2