
I have written the following piece of code to find the similarity between two postal addresses:

    double similarAddr(String resAddr, String newAddr)
    {
        // asort alphabetically sorts the sentence passed as its parameter
        String sortedResAddr = asort(resAddr);
        String sortedNewAddr = asort(newAddr);
        String[] addrToks = sortedResAddr.split("[ ]+");
        String[] newToks = sortedNewAddr.split("[ ]+");
        int l1 = addrToks.length;
        int l2 = newToks.length;
        double similarity = 0.0;
        // lengths is the length of the shorter string, lengthl that of the longer string
        int lengths, lengthl;
        if (l1 < l2)
        {
            lengths = l1;
            lengthl = l2;
            for (int i = 0; i < l1; i++)
            {
                double max = 0.0;
                for (int j = i; j < l2; j++)
                {
                    // findSimilarity computes the edit distance between two strings,
                    // normalizes it by the longer string's length and subtracts it from 1
                    double curr_similarity = findSimilarity(addrToks[i], newToks[j]);
                    if (max < curr_similarity)
                        max = curr_similarity;
                }
                similarity += max;
            }
        }
        else
        {
            lengths = l2;
            lengthl = l1;
            for (int i = 0; i < l2; i++)
            {
                double max = 0.0;
                for (int j = i; j < l1; j++)
                {
                    double curr_similarity = findSimilarity(newToks[i], addrToks[j]);
                    if (max < curr_similarity)
                        max = curr_similarity;
                }
                similarity += max;
            }
        }
        similarity /= lengths;
        return similarity;
    }

But with this approach I am getting many false positives. I have set the threshold at 0.5, i.e. if the similarity score is above 0.5 the two addresses are considered potentially similar. Simply raising the threshold does not solve my problem, because many dissimilar addresses score about 0.7, while a higher threshold would miss many genuinely similar pairs that score around 0.6.

For example, the similarity between the two addresses 9/18, Ekdalia Road, Gariahat, Kolkata and 1/3, City Mall, Jessore Road, Near Dak Banglow More, Barasat, Kolkata - 700124 comes out as 0.6488, but they are not the same at all.

So I am asking if anyone can suggest a better approach for doing the same. Thank you.

Joy
  • What is the `findSimilarity(...)` method doing? Maybe post the code for us as I'm guessing some important calculations are going on there. – Trent Feb 13 '13 at 05:15
  • Yes, findSimilarity() calculates the edit distance between a pair of strings, divides it by the length of the longer string, and subtracts the result from 1. – Joy Feb 13 '13 at 07:54
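
For readers following along, the `findSimilarity` helper described in the comment above could look roughly like this. It is a minimal sketch reconstructed from that description, not the asker's actual code; the class name `EditSimilarity`, the helper `editDistance`, and the handling of two empty strings are assumptions made for illustration.

```java
class EditSimilarity {
    // Classic dynamic-programming Levenshtein distance:
    // insert, delete and substitute each cost 1.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }

    // 1 - (edit distance / length of the longer string), as described in the comment.
    static double findSimilarity(String s1, String s2) {
        int longer = Math.max(s1.length(), s2.length());
        if (longer == 0) return 1.0; // assumption: two empty strings count as identical
        return 1.0 - (double) editDistance(s1, s2) / longer;
    }
}
```

Under this measure, identical tokens such as `Road` or `Kolkata` score 1.0 no matter which part of the address they belong to.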

2 Answers


Token comparison on addresses will not give you very good results, because the components of the address have differing importance. For example, the similarity of street names does not matter much unless the city names also match.

To do a good job of address comparison, you need to attempt to parse out the hierarchical nature of the address - street, city, state, country, etc. and compare addresses in a hierarchical manner.
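
A hierarchical comparison along these lines might be sketched as follows. It assumes the addresses have already been parsed into fields, which is the hard part and is not shown here; the `ParsedAddress` class, the exact-match rule for state and city, and the pluggable `streetSim` function are all hypothetical choices for illustration.

```java
import java.util.function.ToDoubleBiFunction;

class HierarchicalCompare {
    // Hypothetical container for an address that has already been split into fields.
    static final class ParsedAddress {
        final String street, city, state;
        ParsedAddress(String street, String city, String state) {
            this.street = street;
            this.city = city;
            this.state = state;
        }
    }

    // Compare the coarse levels first; the street only counts when state and city agree.
    static double similarity(ParsedAddress a, ParsedAddress b,
                             ToDoubleBiFunction<String, String> streetSim) {
        if (!a.state.equalsIgnoreCase(b.state)) return 0.0;
        if (!a.city.equalsIgnoreCase(b.city)) return 0.0;
        return streetSim.applyAsDouble(a.street, b.street);
    }
}
```

With this gating, two addresses on identically named streets in different cities score 0 instead of being pulled together by the shared street name.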

If you don't want to go to this effort, you can improve your results by eliminating "stop words". For example, words like "street", "road", etc. occur frequently and are not good discriminators: they make addresses seem more similar than they are.
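
Stop-word removal before tokenizing might be sketched like this. The word list is purely illustrative and would need to be built from the actual data set; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AddressStopWords {
    // Illustrative list only; a real list should come from frequency counts on the data.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "road", "street", "near", "opposite", "floor", "building"));

    static List<String> tokensWithoutStopWords(String address) {
        List<String> kept = new ArrayList<>();
        // Lower-case, split on whitespace and commas, and drop stop words.
        for (String tok : address.toLowerCase(Locale.ROOT).split("[\\s,]+")) {
            if (!tok.isEmpty() && !STOP_WORDS.contains(tok)) {
                kept.add(tok);
            }
        }
        return kept;
    }
}
```

The remaining tokens can then be fed to whatever token comparison is already in place, so that frequent words no longer contribute to the score.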

kc2001
  • Thank you, Sir. Can you suggest a tool that segments addresses into components such as street, city, state, and country? Address segmentation is itself a difficult task: you need to construct a large number of training examples and build a good model that learns from them. I have gone through papers on segmentation but did not find a tool that I can use in my program to do that job. – Joy Feb 24 '13 at 04:36
  • I don't know of any such tool, although I'd bet that some exist. You might try Google code search to look for an address parser. – kc2001 Feb 24 '13 at 17:38
  • I don't see why you need to learn the address format from examples. Can't you apply knowledge that you already possess to parse the addresses? For example, with US addresses, you can start at the end (zip code) and then determine the state and city fairly easily. – kc2001 Feb 24 '13 at 17:45
  • But Sir, addresses in India are not like US addresses. Essential address attributes are House No, Street Name, City, State. In many of the addresses they may not be present and even if they are present they do not follow any regular order so basically addresses are noisy. Some examples are as follows: (A) Veera Desai Junction, J.P.Road,Opposite Apna Bazaar, 7 Bungalows, Andheri West, Mumbai (B) Infinity Shopping Complex, Ground Floor, Majiwada, Pokhran Road 2, Thane Area West, Mumbai (C) 89C, Maulana Abdul Kalam Azad Sarani, Inox Building, Near Swabhumi, E M Bypass, Kolkata etc. – Joy Feb 25 '13 at 01:58
  • That does make things significantly harder. Maybe you could incorporate lists of states and cities to determine the coarser levels of similarity? – kc2001 Feb 25 '13 at 13:09
  • Yes Sir, but I think the other approach you suggested, i.e. computing string similarity between the two address phrases after removing stop words, will give better results. – Joy Feb 25 '13 at 13:32

I think kc2001 is right: you need to parse the addresses out into separate fields. It looks like Gisgraphy has a parser that works for Indian addresses.

If you can also geocode the addresses to lat/long coordinates that also helps a lot, because sometimes the same place can be described with multiple addresses. From the description it seems Gisgraphy can do that, too.
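
Once both addresses have coordinates, the distance between them can be computed with the haversine formula. The sketch below assumes the latitude and longitude values have already been obtained from some geocoder; it does not call Gisgraphy or any other service, and the class name is hypothetical.

```java
class GeoDistance {
    static final double EARTH_RADIUS_M = 6_371_000.0; // mean Earth radius in meters

    // Great-circle distance in meters between two points given in decimal degrees.
    static double haversineMeters(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
    }
}
```

A distance like this is probably best used as one signal among several rather than as the only test, since geocoded coordinates can be imprecise.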

However, parsing the addresses is only the first step. After that you need to compare them, and I've found that you need a pretty fine-tuned comparator to get that to work. For example, 9/18, Ekdalia Road is a completely different place from 382/21, Ekdalia Road, even if the strings are very similar. I've had good results from using weighted Levenshtein comparison for street addresses and weighting digits higher than letters.
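
A weighted Levenshtein distance in which digits count more than letters might look roughly like this. The weights (2.0 for digits, 1.0 for letters, 0.1 for everything else) and the rule that a substitution costs the larger of the two character weights are assumptions chosen for illustration, not the settings of any particular tool.

```java
class WeightedLevenshtein {
    // Assumed weights: an edit touching a digit costs more than one touching a letter.
    static double weight(char c) {
        if (Character.isDigit(c)) return 2.0;
        if (Character.isLetter(c)) return 1.0;
        return 0.1; // punctuation and spaces barely matter
    }

    static double distance(String a, String b) {
        double[][] d = new double[a.length() + 1][b.length() + 1];
        // First column/row: cost of deleting or inserting every character so far.
        for (int i = 1; i <= a.length(); i++) d[i][0] = d[i - 1][0] + weight(a.charAt(i - 1));
        for (int j = 1; j <= b.length(); j++) d[0][j] = d[0][j - 1] + weight(b.charAt(j - 1));
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                char ca = a.charAt(i - 1), cb = b.charAt(j - 1);
                double sub = ca == cb ? 0.0 : Math.max(weight(ca), weight(cb));
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,
                          Math.min(d[i - 1][j] + weight(ca), d[i][j - 1] + weight(cb)));
            }
        }
        return d[a.length()][b.length()];
    }
}
```

With these weights, changing one digit in a house number costs twice as much as changing one letter in a street name, so differing house numbers on the same street are pushed further apart.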

I wrote a deduplication tool called Duke which will let you compare parsed addresses by comparing the fields separately using weighted Levenshtein and other comparators, and then combine the results for the various fields into a single similarity value. I've used it successfully to deduplicate both customer data and hotel data, among other things.
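
To illustrate the general idea of combining per-field results into a single value, here is a hand-rolled weighted average. It is not Duke's API and not necessarily how Duke combines field scores; the class name, the method, and the weights are hypothetical and would need tuning.

```java
class FieldScoreCombiner {
    // Weighted average of per-field similarities, each expected to lie in [0, 1].
    static double combine(double[] scores, double[] weights) {
        if (scores.length != weights.length) {
            throw new IllegalArgumentException("scores and weights must have the same length");
        }
        double sum = 0.0, total = 0.0;
        for (int i = 0; i < scores.length; i++) {
            sum += scores[i] * weights[i];
            total += weights[i];
        }
        return total == 0.0 ? 0.0 : sum / total;
    }
}
```

For example, giving the city field three times the weight of the street field means a perfect city match with no street match still yields 0.75.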

You need to configure and tune it a bit, but that should be vastly easier than doing all this yourself.

larsga
  • Good stuff (+1), larsga. Your comment about lat/lons ("If you can also geocode the addresses to lat/long coordinates that also helps a lot") raises a most important point - what similarity/distance are you attempting to measure? If it is purely geographical distance, then you might be able to largely dispense with the textual comparison and plug the addresses into a web service equivalent of Google Maps to calculate physical distance. – kc2001 Mar 05 '13 at 13:22
  • Purely geographic distance is in my experience not enough, for three reasons. In some cases the coordinates will be wrong. Further, the resolution is not perfect, so sometimes coordinates will be 100-300 meters off. And, finally, sometimes different addresses are really close together. – larsga May 14 '13 at 11:51