
I am trying to match cell phone tower IDs contained in one table against a master table of tower locations (latitude/longitude). The IDs in the locations table are formatted differently from the ones in the first table, so I am trying to use agrep() to do a fuzzy match. For example, say the ID I am trying to match is:

x <- c("405-800-125-39883")

A sample of IDs located in the locations table:

y <- c("405-810-1802-19883", "405-810-2101-29883", "405-810-1401-31883",
       "405-810-5005-49883","125-39883","405-810-660-39883")

I am then using agrep() with different combinations of max.distance:

agrep(x,y,max.distance=0.3,value=TRUE)

This returns:

[1] "405-810-1802-19883" "405-810-2101-29883" "405-810-1401-31883" "405-810-5005-49883"
[5] "405-810-660-39883"

Whereas the value I am really after is "125-39883". I have also tried the stringdist_join() function from the fuzzyjoin package (which builds on stringdist), applied to the two data frames while varying max_dist, but with no success. Basically, I am looking for a perfect match after the last hyphen, then a match on the number before the last hyphen, and so on. Is there any way of doing that?

Dhiraj
  • If you are looking for an exact match, then strip off the leading part with `substring()` and compare with `==`: `y[y == substring(x, 9)]` – akrun Mar 04 '18 at 06:59
  • Thanks @akrun but not quite. The data I have shared is just a sample for demonstration. Your solution will work for this particular example. However, the value might also be stored as `"800-125-39883"` or as `"5-800-125-39883"`. So can't really specify 9 in `substring()` – Dhiraj Mar 04 '18 at 07:04
  • The problem is that agrep and similar functions also match on the other digits, so they can't really differentiate between the last few elements. It is better to provide an example that covers all the cases – akrun Mar 04 '18 at 07:05
  • @akrun Just thinking aloud here, but do you think splitting on "-" and then matching by starting from the last value would help? I feel it might but not sure how to do that. – Dhiraj Mar 04 '18 at 07:08
  • It is still not clear to me which numbers you are matching – akrun Mar 04 '18 at 07:09
  • If you look at any ID of a cell tower, it is broken into 4 parts. For example, in 405-810-1802-19883, 405 is the Country Code, 810 is the Mobile Network Code, 1802 is the Location Area Code (LAC) and 19883 is the Cell ID. I would first like to match the Cell ID, but because it will not be unique as different states in the country could have the same Cell IDs, I would then match on LAC. I have already filtered for the Mobile network code and the country code. – Dhiraj Mar 04 '18 at 07:14

1 Answer


You can vectorize agrep so that every value of y is used as the pattern. Your aim is to find an element of y that appears, as a whole, inside x, so the pattern should be y and not x:

names(unlist(Vectorize(agrep)(y,x)))
[1] "125-39883"   

Alternatively, we can use adist with the argument partial = TRUE, which uses the same kind of approximate substring distance as agrep:

y[which.min(c(adist(y, x, partial = TRUE)))]
[1] "125-39883"

If x and y are both vectors, you would rather use adist than agrep, since it returns the full matrix of distances in one call. adist accepts similar options to agrep (such as costs, ignore.case and fixed). Check ?adist for further details.
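
As a rough sketch of the vector case (the second ID in `x` below is made up for illustration), `adist(y, x, partial = TRUE)` returns a matrix with one row per element of `y` and one column per element of `x`, so the closest candidate for each `x` is the row with the smallest value in its column:

```r
# Candidate IDs from the locations table (taken from the question)
y <- c("405-810-1802-19883", "405-810-2101-29883", "405-810-1401-31883",
       "405-810-5005-49883", "125-39883", "405-810-660-39883")

# IDs to look up; the second one is a hypothetical extra example
x <- c("405-800-125-39883", "405-810-2101-29883")

# Rows correspond to y, columns to x; partial = TRUE scores each y
# against its best-matching substring of the x in that column
d <- adist(y, x, partial = TRUE)

# For every x (column), pick the y with the smallest distance
best <- y[apply(d, 2, which.min)]
setNames(best, x)
```

Note that `which.min` keeps only the first minimum, so ties within a column are resolved by position in `y`.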

With your new question in the comments (using the extended y given there), you can do something like this:

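# Partial edit distance of each y against x
w <- adist(y, x, partial = TRUE)
# Length of the trailing run of matches ("M") in each edit transcript
z <- setNames(nchar(sub(".*?(M*)$", "\\1", c(attr(adist(y, x, counts = TRUE), "trafos")))), y)
# Among the closest candidates, keep the one with the longest matching tail
names(which.max(z[which(min(w) == w)]))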
[1] "126-39883"
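
The idea raised in the comments on the question, splitting each ID on "-" and comparing the parts from the last one backwards, could be sketched roughly as follows. This is an alternative to the edit-distance approach, not part of it: `suffix_score` is a hypothetical helper name, and requiring exact equality of each part is an assumption about the data. The sketch uses the original `y` from the question.

```r
# Count how many hyphen-separated parts agree, starting from the last one
suffix_score <- function(candidate, target) {
  a <- rev(strsplit(candidate, "-", fixed = TRUE)[[1]])
  b <- rev(strsplit(target, "-", fixed = TRUE)[[1]])
  n <- min(length(a), length(b))
  same <- a[seq_len(n)] == b[seq_len(n)]
  # Number of leading TRUEs = length of the shared suffix, in parts
  if (all(same)) n else which(!same)[1] - 1L
}

x <- "405-800-125-39883"
y <- c("405-810-1802-19883", "405-810-2101-29883", "405-810-1401-31883",
       "405-810-5005-49883", "125-39883", "405-810-660-39883")

# One score per candidate, named by the candidate ID
scores <- vapply(y, suffix_score, integer(1), target = x)

# Keep every candidate that shares the longest suffix (ties are all returned)
best <- names(scores)[scores == max(scores)]
best
```

If several candidates share the same number of trailing parts, all of them are returned, so a further rule would be needed to choose between them.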
Onyambu
  • Thanks @Onyambu. However I think I should be matching on the last set of numbers. For example, if `y <- c("405-810-1802-19883", "405-810-2101-29883", "405-810-1401-31883", "405-810-5005-49883","125-49883","405-810-660-39883","126-39883")` then the result I would be interested in would be `"126-39883"` rather than `"125-49883"` that your code returns. – Dhiraj Mar 04 '18 at 07:43
  • The problem is that you did not state this in your original question. Always make sure to give an example that captures your problem – Onyambu Mar 04 '18 at 07:45
  • Agreed, my bad. – Dhiraj Mar 04 '18 at 07:46