
I tried to use stri_replace_all_regex to replace strings, but it did not give the result I want. Is there another way to do this? Thanks to anyone who can help!

My first attempt:

> library(stringi)
> a <- c('abc2','xycd2','mnb345','tumb b~','lymavc') 
> b <- c('ab','abc','xyc','mnb','tum','mn','tumb','lym','lymav') 
> stri_replace_all_regex(a, "\\b" %s+% b %s+% "\\S+", b, vectorize_all=FALSE)

However, the result is:

> c("ab","xyc","mn" ,"tum b~","lym")

which is not what I want. The result I want is:

> c('abc','xyc','mnb','tumb','lymac')

My second attempt:

> pattern <- paste0("\\b(", b, ")\\S+", collapse = "|")
> gsub(pattern, "\\w", a)

However, this failed as well. I am sorry that I did not express myself clearly. In fact, I want to replace each element of a with the element of b that matches it. As you can see, a and b share some characters on the left, and I want to remove the extra part from a. The match should be greedy, though. For example, the result for 'tumb b~' should be 'tumb', not 'tum', and the result for 'mnb345' should be 'mnb', not 'mn'. I have only just started learning regular expressions, so my attempts may be complex and cumbersome. Looking forward to your reply!

A new question has come up.

a <- c('tums310','tums310~20','tums320')
b <- c('tums1','tums2','tums3')

The result I want is:

"tums3" "tums3" "tums3"

flora micy
  • Could you explain in words what your rules are? Your pattern `"\\b" %s+% b %s+% "\\S+"` looks for a word starting with one of your `b` patterns and replaces it with the `b` pattern. It's certainly confusing to me when you have overlapping `b` patterns like `ab` and `abc` - maybe you want to rewrite them so the extensions are optional like `abc?` or maybe not. I'm not sure. It's also confusing that for the input `'tumb b~'` the expected output is `'tumb'` because the `\\S+` in your pattern is specifically **not** replacing spaces, but here you want to replace the space? – Gregor Thomas Feb 14 '23 at 02:57
  • I am sorry that I did not express myself clearly. In fact, I want to replace each element of a with the element of b that matches it. As you can see, a and b share some characters on the left, and I want to remove the extra part from a. The match should be greedy, though. For example, the result for 'tumb b~' should be 'tumb', not 'tum', and the result for 'mnb345' should be 'mnb', not 'mn'. I have only just started learning regular expressions, so my attempts may be complex and cumbersome. Looking forward to your reply! – flora micy Feb 14 '23 at 04:26
  • Should the last desired match be `lymac` or `lymav`? – GKi Feb 14 '23 at 08:38
  • Oh yes, that is an error. I will revise it... – flora micy Feb 14 '23 at 11:58

2 Answers


Maybe you are looking for adist.

a <- c('abc2','xycd2','mnb345','tumb b~','lymavc') 
b <- c('ab','abc','xyc','mnb','tum','mn','tumb','lym','lymav')
b[apply(adist(b, a) + adist(b, a, partial=TRUE), 2, which.min)]
#[1] "abc"   "xyc"   "mnb"   "tumb"  "lymav"

a <- c('tums310','tums310~20','tums320')  
b <- c('tums1','tums2','tums3')
b[apply(adist(b, a) + adist(b, a, partial=TRUE), 2, which.min)]
#[1] "tums3" "tums3" "tums3"
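
To see why the two distances are summed, it can help to look at the matrices separately. The sketch below is an illustration added to the answer's code, not part of it; the names `d_full`, `d_part`, and `best` are made up for this example. `adist(b, a)` compares whole strings, while `adist(b, a, partial=TRUE)` measures how well each element of `b` fits somewhere inside each element of `a`.

```r
a <- c('tums310', 'tums310~20', 'tums320')
b <- c('tums1', 'tums2', 'tums3')

# Full-string edit distance: rows follow b, columns follow a.
d_full <- adist(b, a)
# Partial distance: cost of matching each b against a substring of a.
d_part <- adist(b, a, partial = TRUE)

# Label rows and columns so the printed matrices are easier to read.
dimnames(d_full) <- list(b, a)
dimnames(d_part) <- list(b, a)
print(d_full)
print(d_part)

# For every column (element of a), take the row (element of b)
# whose summed distance is smallest.
best <- b[apply(d_full + d_part, 2, which.min)]
print(best)
```

On its own the full distance can tie or favour the wrong candidate, because the extra trailing characters of `a` dominate the cost; adding the partial distance appears to reward candidates that occur inside `a` unchanged.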
GKi
  • Thanks for your timely help. A new question has come up. With `a <- c('tums310','tums310~20','tums320')` and `b <- c('tums1','tums2','tums3')`, `b[apply(adist(b, a), 2, which.min)]` returns `[1] "tums1" "tums1" "tums2"`, but the result I want is `"tums3" "tums3" "tums3"`. – flora micy Feb 14 '23 at 11:35
  • I updated the answer to additionally use `partial`, which gives the desired result. – GKi Feb 14 '23 at 15:28
  • Thank you very much!! I will spend some time learning these. Best wishes to you! – flora micy Feb 15 '23 at 02:02
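
Since the question describes the rule as keeping the longest element of `b` that each element of `a` starts with, a plain prefix match in base R is another possible route. The following is a minimal sketch under that reading of the question, not part of either answer; `longest_prefix` is a hypothetical helper name, and it returns `NA` when no element of `b` is a prefix.

```r
a <- c('abc2', 'xycd2', 'mnb345', 'tumb b~', 'lymavc')
b <- c('ab', 'abc', 'xyc', 'mnb', 'tum', 'mn', 'tumb', 'lym', 'lymav')

# Hypothetical helper: for each string in x, return the longest
# candidate that x starts with, or NA if none matches.
longest_prefix <- function(x, candidates) {
  vapply(x, function(s) {
    hits <- candidates[startsWith(s, candidates)]
    if (length(hits) == 0L) return(NA_character_)
    hits[which.max(nchar(hits))]
  }, character(1), USE.NAMES = FALSE)
}

longest_prefix(a, b)
# "abc" "xyc" "mnb" "tumb" "lymav"

a2 <- c('tums310', 'tums310~20', 'tums320')
b2 <- c('tums1', 'tums2', 'tums3')
longest_prefix(a2, b2)
# "tums3" "tums3" "tums3"
```

Unlike the distance-based approaches, this only accepts exact prefixes, so it will not tolerate typos in `a`.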

Here's a fuzzyjoin solution using the function stringdist_join:

library(fuzzyjoin)
library(dplyr)  # for %>%, group_by() and slice_min()
stringdist_join(
  # join `b` as a dataframe ... 
  data.frame(b),
  # ... with `a` as a dataframe:
  data.frame(a),
  # join by ...:
  by = c("b" = "a"),
  # use left join:
  mode = 'left',
  # use Jaro-Winkler distance metric:
  method = "jw",
  # enable case-insensitive matching:
  ignore_case = TRUE,
  # name for distance column:
  distance_col = 'dist') %>% 
# retain only closest matches:
group_by(a) %>%
  slice_min(order_by = dist, n = 1)
# A tibble: 5 × 3
# Groups:   a [5]
  b     a         dist
  <chr> <chr>    <dbl>
1 abc   abc2    0.0833
2 lymav lymavc  0.0556
3 mnb   mnb345  0.167 
4 tumb  tumb b~ 0.143 
5 xyc   xycd2   0.133

Column b now contains the most closely matching value for each value of a.

Chris Ruehlemann