
I'm trying to build a function that uses R's agrepl for approximate matching. The pattern I pass is a regex, but as far as I can tell it is not being treated as one.

I came to this conclusion by running the following test in my REPL:

> patterns <- c("ha","^ha","ha$","^ha$","(^)ha","ha($)")

> sapply(patterns,agrepl,x="ha",max.distance=0L,fixed=FALSE)
  ha   ^ha   ha$  ^ha$ (^)ha ha($) 
TRUE  TRUE  TRUE  TRUE FALSE FALSE 

> sapply(patterns,grepl,x="ha",fixed=FALSE)
  ha   ^ha   ha$  ^ha$ (^)ha ha($) 
TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

I'm not that good with regex, but I'm pretty sure that all of my patterns should match "ha".

Assuming that I'm right and the above behavior should not be happening, would you be able to propose another function/solution to match my patterns to "ha"?

To be more specific, I need a fuzzy matcher that will help me find keywords in unstructured data.

UPDATE: I should point out that the only reason I'm using regular expressions is that I am looking for keywords (matches with spaces around them). If I can ensure that the keyword "ha" will not match "haha" but will match "ha foo", then regex is not necessary for this problem.
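
To make that requirement concrete, here is a rough sketch of one regex-free approach I could imagine: split the text on whitespace and compare the keyword against each token by edit distance using `adist` from the utils package. The helper name `fuzzy_keyword` and its `max_dist` parameter are made up for illustration, and I have not checked whether this behaves the same way as agrepl's own distance settings.

```r
# Hypothetical helper (not part of base R): fuzzy keyword match on
# whitespace-delimited tokens, so "ha" cannot match inside "haha".
fuzzy_keyword <- function(keyword, text, max_dist = 0L) {
  # Split the text into tokens on runs of whitespace.
  tokens <- strsplit(text, "[[:space:]]+")[[1]]
  # Drop empty tokens (e.g. from leading whitespace).
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) == 0L) return(FALSE)
  # adist() gives the generalized Levenshtein distance between the
  # keyword and every token; match if any token is close enough.
  any(adist(keyword, tokens) <= max_dist)
}

fuzzy_keyword("ha", "ha foo")                 # TRUE
fuzzy_keyword("ha", "haha")                   # FALSE
fuzzy_keyword("ha", "hq foo", max_dist = 1L)  # TRUE
```

With `max_dist = 0L` this is just exact token matching; raising it allows that many single-character edits per token, which may be too loose for very short keywords like "ha".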

sgp667
  • You can try `str_detect()` – BENY Apr 16 '18 at 19:26
  • You could play around with the `max.distance` argument – Mike H. Apr 16 '18 at 19:27
  • The usual regex solution to your problem would be to use word `\b`oundaries (but they would also match keywords between special chars, e.g. `¡ha!`) or lookarounds. I'm not too sure of what you're trying to do with `(^)ha` and `ha($)` as it seems useless to capture 0-length matches, but I agree it should work. I've no experience with R nor agrep so I hope I'm not misunderstanding your question. – Aaron Apr 16 '18 at 19:49
  • yes I do think that (^)ha is useless but if you look at results above it does not match results of ^ha which is odd to me. – sgp667 Apr 16 '18 at 20:25
  • Agreed, all regex flavours I'm familiar with would match without problem and return an empty capturing group along with the full match. It may be worth opening a bug report (if possible ; again, not familiar with R) as this is most likely an irregular use case and might uncover some implementation weaknesses / problems. – Aaron Apr 17 '18 at 09:33
