How can "abteam" be matched with "ab" using this code?

agrep("abteam",c("acb","abd","ab"),value=T,ignore.case = TRUE,max = list(del = 10, ins = 10, sub = 10))

The result is character(0), though I specified del=10, ins=10. What is the problem? How does agrep work?

Ritchie Sacramento
Kavipriya

1 Answer

From the help file:

If ‘cost’ is not given, ‘all’ defaults to 10%, and the other transformation number bounds default to ‘all’.

As far as I understand, this means that either cost or all remains the limiting factor even if you set del, ins and sub. If you want to allow 10 transformations, you can simply set max = 10. Additional parameters can be used to limit specific transformations, for example:

> x <- c("fooar","ooar","foobaz")
> agrep("foobar", x, value=T, max = list(all = 3, del = 0, ins = 0))
[1] "foobaz"

In your case you could use max = list(all = 10, del = 10, ins = 10, sub = 10).
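
A minimal sketch of the difference, using the strings from the question. The variable names `pat`, `x`, `without_all` and `with_all` are introduced here only for illustration, and the full argument name `max.distance` is spelled out instead of the abbreviated `max`:

```r
pat <- "abteam"
x <- c("acb", "abd", "ab")

# No `cost` and no `all`: `all` falls back to 10% of the pattern length,
# so the generous per-operation bounds never get a chance to apply.
without_all <- agrep(pat, x, value = TRUE, ignore.case = TRUE,
                     max.distance = list(del = 10, ins = 10, sub = 10))

# Same call with `all` raised as well, so the overall bound is no longer
# the limiting factor.
with_all <- agrep(pat, x, value = TRUE, ignore.case = TRUE,
                  max.distance = list(all = 10, del = 10, ins = 10, sub = 10))

print(without_all)
print(with_all)
```

Comparing the two printed results side by side shows whether the overall bound, rather than the per-operation bounds, was what blocked the match.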

zero323
  • Thanks! It works. Is it possible to give importance to particular part of the string? Say, matching should be done based on first name rather than second name. – Kavipriya Jun 23 '15 at 08:36
  • As far as I know, no. If you want something like that, you'll have to provide your own logic. – zero323 Jun 23 '15 at 12:17
  • To be honest, this still doesn't make sense to me. For the OP's example, if one uses `agrep(pat, x, value=T, max = list(all = 10 ,del = 10, ins = 0, sub = 10))` nothing is returned. That makes no sense - 4 deletions, nothing else, and `"abteam"` matches `"ab"`. – thelatemail Jun 23 '15 at 23:19
  • Good point. Interestingly, the [Python TRE library](https://github.com/laurikari/tre/) returns a match: `import tre; fz=tre.Fuzzyness(maxerr=10, maxdel=10, maxins=0, maxsub=10); pt=tre.compile("abteam"); pt.search("ab", fz)`. – zero323 Jun 24 '15 at 00:07
  • @zero323 - well, I'm stumped then. `agrep` is so utterly confusing to me that I have usually used `adist` in the past instead - at least it gives nice clear values I can test against. – thelatemail Jun 24 '15 at 04:48
  • @thelatemail The separation of insertions and deletions suggests the matching is actually asymmetric, but the Python behavior does not support that. There is definitely something weird going on here. `adist` is pretty nice. Another alternative is the [Biostrings](http://www.bioconductor.org/packages/release/bioc/html/Biostrings.html) package. – zero323 Jun 24 '15 at 22:24
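
For readers who want to try the `adist` route mentioned in the comments, here is a rough sketch on the strings from the question. The cutoff `max_edits` and the variable names are assumptions made for illustration and do not come from the thread:

```r
pat <- "abteam"
x <- c("acb", "abd", "ab")

# adist() returns a matrix of generalized Levenshtein distances:
# one row per pattern, one column per candidate string.
d <- adist(pat, x, ignore.case = TRUE)

# Hypothetical cutoff: keep candidates within `max_edits` edits of the pattern.
max_edits <- 4
close_enough <- x[d[1, ] <= max_edits]

print(d)
print(close_enough)
```

Unlike `agrep`, which matches the pattern approximately against substrings, `adist` compares whole strings by default and returns the distances themselves, so the threshold is explicit and easy to inspect.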