R regex to get a partial match

Question

I tried to use stri_replace_all_regex to replace strings, but it did not give the result I want. Is there another way to do this? Thanks to anyone who can help!

First attempt:

> library(stringi)
> a <- c('abc2','xycd2','mnb345','tumb b~','lymavc') 
> b <- c('ab','abc','xyc','mnb','tum','mn','tumb','lym','lymav') 
> stri_replace_all_regex(a, "\\b" %s+% b %s+% "\\S+", b, vectorize_all=FALSE)

However, the result is:

> c("ab","xyc","mn" ,"tum b~","lym")

which is not what I want. The result I want is:

> c('abc','xyc','mnb','tumb','lymac')

Second attempt:

> pattern <- paste0("\\b(", b, ")\\S+", collapse = "|")
> gsub(pattern, "\\w", a)

However, this also failed. I am sorry that I did not express myself clearly. What I want is to replace each element of a with the matching element of b. As you can see, the elements of a and b share the same leading characters, and I want to remove the part of a that differs. The match should be greedy, though: for example, the result for 'tumb b~' should be 'tumb', not 'tum', and the result for 'mnb345' should be 'mnb', not 'mn'. I have only just started learning regular expressions, so my attempts may be complex and cumbersome. I look forward to your reply!

A new question has come up:

a <- c('tums310','tums310~20','tums320')
b <- c('tums1','tums2','tums3')

I want the result to be

"tums3" "tums3" "tums3"

Could you explain in words what your rules are? Your pattern `"\\b" %s+% b %s+% "\\S+"` looks for a word starting with one of your `b` patterns and replaces it with the `b` pattern. It's certainly confusing to me when you have overlapping `b` patterns like `ab` and `abc` - maybe you want to rewrite them so the extensions are optional like `abc?` or maybe not. I'm not sure. It's also confusing that for the input `'tumb b~'` the expected output is `'tumb'`, because the `\\S+` in your pattern specifically does **not** match spaces, but here you want to replace the space? – Gregor Thomas, Feb 14 '23 at 02:57
I am sorry that I did not express myself clearly. What I want is to replace each element of a with the matching element of b. As you can see, the elements of a and b share the same leading characters, and I want to remove the part of a that differs. The match should be greedy, though: for example, the result for 'tumb b~' should be 'tumb', not 'tum', and the result for 'mnb345' should be 'mnb', not 'mn'. I have only just started learning regular expressions, so my attempts may be complex and cumbersome. I look forward to your reply! – flora micy, Feb 14 '23 at 04:26
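
The rule described in the comment above (keep the longest entry of `b` that each element of `a` starts with) can be sketched with a plain regex. This is only a minimal illustration, not something taken from the answers: the helper name `longest_prefix_match` is made up for this example, and it assumes the entries of `b` contain no regex metacharacters. Sorting the alternatives by decreasing length makes the regex engine try `abc` before `ab`, which is what gives the "greedy" behaviour.

```r
# Hypothetical helper (not from the original thread): for each element of `a`,
# return the longest element of `b` that it starts with.
# Assumption: entries of `b` contain no regex metacharacters.
longest_prefix_match <- function(a, b) {
  # Longer alternatives first, so "abc" is tried before "ab".
  b_sorted <- b[order(nchar(b), decreasing = TRUE)]
  # Anchor at the start, capture the matched prefix, swallow the rest.
  pattern <- paste0("^(", paste(b_sorted, collapse = "|"), ").*$")
  # Elements of `a` that match no prefix are returned unchanged by sub().
  sub(pattern, "\\1", a, perl = TRUE)
}

a <- c('abc2', 'xycd2', 'mnb345', 'tumb b~', 'lymavc')
b <- c('ab', 'abc', 'xyc', 'mnb', 'tum', 'mn', 'tumb', 'lym', 'lymav')
res1 <- longest_prefix_match(a, b)

a2 <- c('tums310', 'tums310~20', 'tums320')
b2 <- c('tums1', 'tums2', 'tums3')
res2 <- longest_prefix_match(a2, b2)
```

Note that this only works when the wanted entry of `b` is an exact prefix of the element of `a`; it does no fuzzy matching.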

GKi · Answer 1 · 2023-02-14T15:27:21.880

2

Maybe you are looking for adist.

a <- c('abc2','xycd2','mnb345','tumb b~','lymavc') 
b <- c('ab','abc','xyc','mnb','tum','mn','tumb','lym','lymav')
b[apply(adist(b, a) + adist(b, a, partial=TRUE), 2, which.min)]
#[1] "abc"   "xyc"   "mnb"   "tumb"  "lymav"

a <- c('tums310','tums310~20','tums320')  
b <- c('tums1','tums2','tums3')
b[apply(adist(b, a) + adist(b, a, partial=TRUE), 2, which.min)]
#[1] "tums3" "tums3" "tums3"
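
To see why the `partial = TRUE` term is needed, it may help to look at the two distance matrices separately. The sketch below is only an illustration of the code above; the variable names `full`, `part`, `full_only` and `combined` are made up for this example. `adist(b, a)` compares whole strings, while `adist(b, a, partial = TRUE)` measures how well each entry of `b` matches some substring of each entry of `a`.

```r
a <- c('tums310', 'tums310~20', 'tums320')
b <- c('tums1', 'tums2', 'tums3')

# Edit distance between the complete strings (rows: b, columns: a).
full <- adist(b, a)
# Edit distance between each b and its best-matching substring of each a.
part <- adist(b, a, partial = TRUE)
# Label rows and columns so the matrices are easier to read when printed.
dimnames(full) <- dimnames(part) <- list(b, a)

# Picking the minimum of the full distance alone leaves ties,
# and which.min() resolves a tie in favour of the first entry of b.
full_only <- b[apply(full, 2, which.min)]

# Adding the partial distance breaks those ties in favour of the entry
# of b that occurs (almost) unchanged inside the element of a.
combined <- b[apply(full + part, 2, which.min)]
```

Printing `full` and `part` shows where the ties come from and how the second term resolves them.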

edited Feb 14 '23 at 15:27

answered Feb 14 '23 at 08:37

GKi


Thanks for your timely help. A new question has come up. With `a <- c('tums310','tums310~20','tums320')` and `b <- c('tums1','tums2','tums3')`, `b[apply(adist(b, a), 2, which.min)]` returns `[1] "tums1" "tums1" "tums2"`, but the result I want is `"tums3" "tums3" "tums3"`. – flora micy Feb 14 '23 at 11:35
I updated the answer to also use `partial = TRUE`, which gives the desired result. – GKi Feb 14 '23 at 15:28
Thank you very much!! I will spend some time learning these. Best wishes to you! – flora micy Feb 15 '23 at 02:02

Chris Ruehlemann · Answer 2 · 2023-02-14T12:12:14.247

0

Here's a fuzzyjoin solution using the function stringdist_join:

library(fuzzyjoin)
library(dplyr)  # provides %>%, group_by() and slice_min()
stringdist_join(
  # join `b` as a dataframe ...
  data.frame(b),
  # ... with `a` as a dataframe:
  data.frame(a),
  # join by ...:
  by = c("b" = "a"),
  # use left join:
  mode = 'left',
  # use Jaro-Winkler distance metric:
  method = "jw",
  # enable case-insensitive matching:
  ignore_case = TRUE,
  # name for distance column:
  distance_col = 'dist') %>%
  # retain only closest matches:
  group_by(a) %>%
  slice_min(order_by = dist, n = 1)
# A tibble: 5 × 3
# Groups:   a [5]
  b     a         dist
  <chr> <chr>    <dbl>
1 abc   abc2    0.0833
2 lymav lymavc  0.0556
3 mnb   mnb345  0.167 
4 tumb  tumb b~ 0.143 
5 xyc   xycd2   0.133

Column b now contains the most closely matching value for each value of a.

edited Feb 14 '23 at 12:12

answered Feb 14 '23 at 09:13

Chris Ruehlemann


Does the answer help you? If it does, consider accepting and/or upvoting it. – Chris Ruehlemann Feb 14 '23 at 12:07
I have edited the answer (it contained a mistake left over from an earlier version of the code). – Chris Ruehlemann Feb 14 '23 at 12:12
Thank you very much! I will give it a try. Best wishes to you! – flora micy Feb 15 '23 at 02:07
