Difficulties with `agrep(..., fixed=F)`

Question

The ?agrep help page (grep with fuzzy matching) says that I can set the argument fixed=FALSE to have my pattern interpreted as a regular expression.

However, I can't get it to work!

agrep('(asdf|fdsa)', 'asdf', fixed=F)
# integer(0)

The above should match as the regular expression "(asdf|fdsa)" exactly matches the test string "asdf" in this case.

To confirm:

grep('(asdf|fdsa)', 'asdf', fixed=F)
# 1 : it does match with grep

And even more confusingly, adist correctly gives the distance between the pattern and string as 0, meaning that agrep should definitely return 1 rather than integer(0) (there's no possibility that 0 is greater than the default max.distance = 0.1).

adist('(asdf|fdsa)', 'asdf', fixed=F)
#      [,1]
# [1,]    0

Why is this not working? Is there something I don't understand? A workaround? I'm happy to use adist, but am not entirely sure how to convert agrep's default max.distance=0.1 parameter to adist's corresponding parameter.

(yes, I'm stuck on an old computer that can't do better than R 2.15.2)

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i686-redhat-linux-gnu (32-bit)    
locale:
 [1] LC_CTYPE=en_AU.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_AU.utf8        LC_COLLATE=en_AU.utf8    
 [5] LC_MONETARY=en_AU.utf8    LC_MESSAGES=en_AU.utf8   
 [7] LC_PAPER=C                LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_AU.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

I can't explain this behaviour either, but `aregexec` seems to work: `aregexec('asdf|fdsa', 'asdf')` uses `fixed = FALSE` by default, so the pattern is treated as a regular expression. But surprisingly, `fixed=TRUE` with `aregexec` brings back the same issue. — Arun, Apr 08 '13 at 05:34
Where did you get `aregexec` from? I'm guessing this is in a newer version of R than I have? — mathematical.coffee, Apr 08 '13 at 05:35
It's mentioned on the `?agrep` help page, but yes, I do have R 3.0.0. Your problem seems to be with `|`, because `agrep('la[sb]y', 'lazy', fixed=FALSE)` works. — Arun, Apr 08 '13 at 05:38
Okay, I really need a coffee. It isn't mentioned in `?agrep`. Once again, sorry about that. I now have no idea where I saw it! :) — Arun, Apr 08 '13 at 05:50

mathematical.coffee · Accepted Answer · 2013-04-08T05:53:23.183

tl;dr: agrep(..., fixed=F) does not seem to work with the '|' character. Use aregexec.

Upon further investigation, I think this is a bug, and that agrep(..., fixed=F) does not seem to work with '|' regexes (although adist(..., fixed=F) does).

To elaborate, note that

adist('(asdf|fdsa)', 'asdf', fixed=T) # 7
nchar('(asdf|fdsa)')                  # 11

If 'asdf' were agrep'd to the non-regular-expression string '(asdf|fdsa)', then it would have distance 7.

On that note:

agrep('(asdf|fdsa)', 'asdf', fixed=F, max.distance=7) # 1
agrep('(asdf|fdsa)', 'asdf', fixed=F, max.distance=6) # integer(0)

These are the results I'd expect if fixed=T. If fixed=F, my regex would match 'asdf' exactly and the distance would be 0, so I'd always get a result of '1' back out of agrep.

So it looks like agrep(pattern, x, fixed=F) does not work here, i.e. it actually treats fixed as TRUE for this sort of pattern.

As @Arun mentions, it might just be '|' regexes that don't work. For example, agrep('la[sb]y', 'lazy', fixed=FALSE) does work as expected.

EDIT: Workaround (thanks @Arun)

The function aregexec appears to work.

> aregexec('(asdf|fdsa)', 'asdf', fixed=F)
[[1]]
[1] 1 1
attr(,"match.length")
[1] 4 4
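
The question also asked how to turn agrep's default max.distance=0.1 into a threshold for adist. Below is a minimal sketch of one way to do that, not a drop-in replacement for agrep. The helper name `agrep_via_adist` is hypothetical, and the conversion assumes what ?agrep describes: a fractional max.distance is multiplied by the pattern length and rounded up, with all edit costs left at 1.

```r
# Hypothetical helper (not part of base R): mimics agrep(pattern, x, fixed = FALSE)
# by thresholding the distances returned by adist().
agrep_via_adist <- function(pattern, x, max.distance = 0.1) {
  # Assumption: as described in ?agrep, a fractional max.distance is scaled by
  # the pattern length and rounded up (all edit costs taken as 1).
  if (max.distance < 1) {
    max.distance <- ceiling(max.distance * nchar(pattern))
  }
  # fixed = FALSE makes adist() treat `pattern` as a regular expression;
  # the result is a 1 x length(x) matrix of distances.
  d <- adist(pattern, x, fixed = FALSE)
  # Indices of the elements of x that fall within the allowed distance.
  which(as.vector(d) <= max.distance)
}

agrep_via_adist('(asdf|fdsa)', 'asdf')
# 1, because adist reports a distance of 0 for this pair
```

Note that the threshold is computed from the length of the regex source text (11 characters here, giving a threshold of 2), which is how the fraction is defined for agrep but may be more generous than intended for long alternations. Passing an integer max.distance avoids that ambiguity.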

Martin Mächler · Answer 2 · score 1 · answered Jun 20 '18 at 06:02

This has (finally) been fixed in the R sources, both in "trunk" (R-devel) and in R-patched, which will become R 3.5.1 in early July 2018.
