I used agrepl() for fuzzy matching between two sets of addresses. The documentation describes the default for max.distance as follows:

If cost is not given, all defaults to 10%, and the other transformation number bounds default to all. The component names can be abbreviated.

However, that doesn't seem to match the example given in this Q&A:

agrepl("cold", "cool")
#> [1] FALSE
agrepl("cool", "cold")
#> [1] TRUE

From the description, I would expect 10% to mean one change in a 10-letter word, but here it is one change in a 4-letter word. How exactly is this calculated?

tchoup

1 Answer

This is admittedly very confusing (at least to me!), but here's my attempt to explain it. The linked answer says:

The default maximum amount of transformations for a pattern of length 4 is 1.

How do we get from 0.1 (the default fraction) × 4 (pattern length) to 1? Well, ?agrepl notes that max.distance is expressed

as a fraction of the pattern length times the maximal transformation cost (will be replaced by the smallest integer not less than the corresponding fraction)

I take the parenthetical clause to mean that the maximum number of transformations is ceiling(0.1*4) = 1. We would need a pattern of length ≥ 11 for ceiling(0.1*pattern_length) to increase from 1 to 2.
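To make the arithmetic concrete, here is a minimal sketch of that rounding rule. The helper `default_budget` is hypothetical (it is not part of base R), and it assumes the default fraction of 0.1 and a maximal transformation cost of 1, which is my reading of the documentation rather than something agrepl() exposes directly.

```r
# Hypothetical helper, not part of base R: the edit budget that the
# documented rule implies for a given pattern.
# Assumptions: fraction = 0.1 (the documented default) and max_cost = 1.
default_budget <- function(pattern, fraction = 0.1, max_cost = 1) {
  # "smallest integer not less than the corresponding fraction"
  ceiling(fraction * nchar(pattern) * max_cost)
}

b4  <- default_budget("cool")         # 4 characters:  ceiling(0.4)
b10 <- default_budget("abcdefghij")   # 10 characters: ceiling(1.0)
b11 <- default_budget("abcdefghijk")  # 11 characters: ceiling(1.1)

print(c(b4, b10, b11))
```

Under these assumptions, patterns of 1 to 10 characters all get a budget of one edit, so short patterns are matched much more loosely than "10%" suggests.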

If you want to find out where this is actually implemented, you have to dig fairly deep into the C source code, i.e. lines 59-60 of agrep.c, in the amatch_regparams function, where we see

if(bound < 1) bound *= (patlen * max_cost);
params->max_cost = IntegerFromReal(ceil(bound), &warn);
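One way to sanity-check this reading from R, without touching the C code, is to compare the default call with a call that passes the integer bound explicitly. This is only a sketch: it assumes that an integer max.distance of 1 is used as-is (the `bound < 1` test above suggests so) and that the default fraction rounds up to the same value for a 4-character pattern.

```r
# If the default 0.1 becomes ceiling(0.1 * 4) = 1 for a 4-character pattern,
# then passing max.distance = 1 explicitly should give the same results.
same_cool <- identical(
  agrepl("cool", "cold"),
  agrepl("cool", "cold", max.distance = 1)
)
same_cold <- identical(
  agrepl("cold", "cool"),
  agrepl("cold", "cool", max.distance = 1)
)

print(c(same_cool, same_cold))
```

If you want the tolerance to be independent of pattern length, passing an integer max.distance explicitly is probably easier to describe than relying on the default fraction.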
Ben Bolker
  • Follow-up question, maybe better suited to Cross Validated (but not sure)... If I use this with the default settings for fuzzy matching and write that up in my methods, is there any text I can point to that shows this strategy is reasonable? After I did this I spot-checked the data with a random sample, but my advisor is looking for something more when I explain it. – tchoup May 27 '22 at 23:11
  • I have no idea. Maybe ask on the `r-help@r-project.org` mailing list ... ???? This might be the original author ... https://www.technikum-wien.at/en/staff/david-meyer/ – Ben Bolker May 27 '22 at 23:35