I have written the following piece of code to find the similarity between two postal addresses:
    double similarAddr(String resAddr, String newAddr)
    {
        // asort alphabetically sorts the sentence passed as its parameter
        String sortedResAddr = asort(resAddr);
        String sortedNewAddr = asort(newAddr);
        String[] addrToks = sortedResAddr.split("[ ]+");
        String[] newToks = sortedNewAddr.split("[ ]+");
        int l1 = addrToks.length;
        int l2 = newToks.length;
        double similarity = 0.0;
        // lengths is the token count of the shorter address, lengthl that of the longer one
        int lengths, lengthl;
        if (l1 < l2)
        {
            lengths = l1;
            lengthl = l2;
            for (int i = 0; i < l1; i++)
            {
                double max = 0.0;
                for (int j = i; j < l2; j++)
                {
                    // findSimilarity calculates the similarity between two strings based on
                    // their edit distance: it first calculates the edit distance, normalizes
                    // it by dividing by the longer string's length, and subtracts it from 1
                    double curr_similarity = findSimilarity(addrToks[i], newToks[j]);
                    if (max < curr_similarity)
                        max = curr_similarity;
                }
                similarity += max;
            }
        }
        else
        {
            lengths = l2;
            lengthl = l1;
            for (int i = 0; i < l2; i++)
            {
                double max = 0.0;
                for (int j = i; j < l1; j++)
                {
                    double curr_similarity = findSimilarity(newToks[i], addrToks[j]);
                    if (max < curr_similarity)
                        max = curr_similarity;
                }
                similarity += max;
            }
        }
        similarity /= lengths;
        return similarity;
    }
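For reference, findSimilarity as described in the comment can be sketched roughly as below. This is only a minimal sketch, assuming plain Levenshtein edit distance with unit costs; the class name EditSimilarity, the editDistance helper, and the handling of two empty strings are assumptions made for illustration and may differ from the actual helper.

```java
class EditSimilarity {
    // Levenshtein distance with unit costs for insert, delete and substitute,
    // computed with two rolling rows instead of a full matrix.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // 1 - (edit distance / length of the longer string), as described in the comment.
    static double findSimilarity(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 1.0; // assumption: two empty strings count as identical
        }
        return 1.0 - (double) editDistance(a, b) / longer;
    }
}
```

With this sketch, two identical tokens such as "Kolkata" and "Kolkata" score 1.0, while "Road" and "Mall" score 0.0 because all four characters differ.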
With the similarAddr approach above, however, I am getting many false positives. I use a threshold of 0.5, i.e. if the similarity score is above 0.5, the two addresses are treated as potentially similar. Simply raising the threshold does not solve my problem: many dissimilar addresses score around 0.7, so a higher threshold would still let them through while missing many genuinely similar pairs whose scores are around 0.6.
For example, the similarity between the two addresses "9/18, Ekdalia Road, Gariahat, Kolkata" and "1/3, City Mall, Jessore Road, Near Dak Banglow More, Barasat, Kolkata - 700124" comes out as 0.6488, but they are not the same at all.
Can anyone suggest a better approach for comparing postal addresses? Thank you.