
In ?agrep (grep with fuzzy matching) it mentions that I can set the argument fixed=FALSE to let my pattern be interpreted as a regular expression.

However, I can't get it to work!

agrep('(asdf|fdsa)', 'asdf', fixed=F)
# integer(0)

The above should match as the regular expression "(asdf|fdsa)" exactly matches the test string "asdf" in this case.

To confirm:

grep('(asdf|fdsa)', 'asdf', fixed=F)
# 1 : it does match with grep

Even more confusingly, adist correctly gives the distance between the pattern and the string as 0, meaning that agrep should definitely return 1 rather than integer(0) (there's no possibility that 0 is greater than the default max.distance = 0.1).

adist('(asdf|fdsa)', 'asdf', fixed=F)
#      [,1]
# [1,]    0

Why is this not working? Is there something I don't understand? A workaround? I'm happy to use adist, but am not entirely sure how to convert agrep's default max.distance=0.1 parameter to adist's corresponding parameter.

(yes, I'm stuck on an old computer that can't do better than R 2.15.2)

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i686-redhat-linux-gnu (32-bit)    
locale:
 [1] LC_CTYPE=en_AU.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_AU.utf8        LC_COLLATE=en_AU.utf8    
 [5] LC_MONETARY=en_AU.utf8    LC_MESSAGES=en_AU.utf8   
 [7] LC_PAPER=C                LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_AU.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base 
mathematical.coffee
  • I am unable to explain this behaviour as well, but `aregexc` seem to work. `aregexc('asdf|fdsa', 'asdf')` (which takes default `fixed = FALSE`) and does regexp by default. But surprisingly `fixed=TRUE` here with `aregexc` gets back to the same issue. – Arun Apr 08 '13 at 05:34
  • Where did you get `aregexc` from? I'm guessing this is in a newer version of R than I have? – mathematical.coffee Apr 08 '13 at 05:35
  • It mentions under the help page of `?agrep`, but yes I do have R 3.0.0. your problem seems to be with `|`. Because `agrep('la[sb]y', 'lazy', fixed=FALSE)` works. – Arun Apr 08 '13 at 05:38
  • It's `aregexec`. Sorry for the typo. – Arun Apr 08 '13 at 05:41
  • Thanks for `aregexec` (my `?agrep` does not mention it). – mathematical.coffee Apr 08 '13 at 05:45
  • Okay, I really need a coffee. It doesn't mention in `?agrep`. Once again sorry about that. I've now no idea where I saw it! :) – Arun Apr 08 '13 at 05:50

2 Answers


tl;dr: agrep(..., fixed=F) does not seem to work with the '|' character. Use aregexec.

Upon further investigation, I think this is a bug, and that agrep(..., fixed=F) does not seem to work with '|' regexes (although adist(..., fixed=F) does).

To elaborate, note that

adist('(asdf|fdsa)', 'asdf', fixed=T) # 7
nchar('(asdf|fdsa)')                  # 11

If 'asdf' were agrep'd to the non-regular-expression string '(asdf|fdsa)', then it would have distance 7.

On that note:

agrep('(asdf|fdsa)', 'asdf', fixed=F, max.distance=7) # 1
agrep('(asdf|fdsa)', 'asdf', fixed=F, max.distance=6) # integer(0)

These are the results I'd expect if fixed=T. If fixed=F, my regex would match 'asdf' exactly and the distance would be 0, so I'd always get a result of '1' back out of agrep.

So it looks like agrep(pattern, x, fixed=F) does not work here, i.e. it actually treats fixed as TRUE for this sort of pattern.

As @Arun mentions, it might just be '|' regexes that don't work. For example, agrep('la[sb]y', 'lazy', fixed=FALSE) does work as expected.
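
Since adist(..., fixed=F) does handle this pattern, one possible fallback is to threshold its result by hand. The following is only a sketch: agrep_via_adist is a made-up helper name, and it assumes that a fractional max.distance means a fraction of the pattern length rounded up to the next integer (as ?agrep describes), with regex metacharacters counted in that length.

```r
# Hypothetical helper (not part of base R): mimic agrep's default threshold
# using adist, which treats the pattern as a regex when fixed = FALSE.
agrep_via_adist <- function(pattern, x, max.distance = 0.1) {
  # Assumption: fractional max.distance -> ceiling(fraction * pattern length)
  limit <- ceiling(max.distance * nchar(pattern))
  # adist returns a 1 x length(x) matrix for a single pattern
  d <- adist(pattern, x, fixed = FALSE)
  # which() silently drops NA distances
  which(as.vector(d) <= limit)
}

agrep_via_adist('(asdf|fdsa)', 'asdf')
```

The threshold may differ from what agrep uses internally for regex patterns, so treat the result as an approximation.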


EDIT: Workaround (thanks @Arun)

The function aregexec appears to work.

> aregexec('(asdf|fdsa)', 'asdf', fixed=F)
[[1]]
[1] 1 1
attr(,"match.length")
[1] 4 4
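
To get agrep-style indices back out of aregexec, something like the sketch below might do. agrep_regex is a made-up wrapper name; it relies on aregexec returning -1 as the first element of a list entry when that string does not match.

```r
# Hypothetical wrapper (not part of base R): return the indices of the
# elements of x that aregexec can match, similar to what agrep returns.
agrep_regex <- function(pattern, x, max.distance = 0.1) {
  matches <- aregexec(pattern, x, max.distance = max.distance, fixed = FALSE)
  # Each list element starts with the match position, or -1 for no match
  which(vapply(matches, function(m) m[1] != -1L, logical(1)))
}

agrep_regex('(asdf|fdsa)', c('asdf', 'fdsa'))
```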
mathematical.coffee

This has (finally) been fixed in the R sources ("trunk" / R-devel) and in R-patched, which will become R 3.5.1 in early July 2018.

Martin Mächler