7

I am planning to clean up some of my data.

Situation: my data has a country field containing user-entered country names. It may contain spelling mistakes or several different names for the same country (for example US, U.S.A, or United States for USA). I also have a list of correct country names.

What I want: to predict which country from my list an entry most likely refers to. For example, if U.S. is given, it should be changed to USA (the correct country name in our list).

Is there any way I can do this using Java, OpenNLP, or any other method?

MWiesner
AngryLeo

3 Answers

3

You can use the Getty API, which returns abbreviations for country names. Experiment with it to see whether it covers your cases.

OR

You can also use Levenshtein distance to find the closest country name in your list.

Try these out; they should help.
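
To illustrate the Levenshtein suggestion, here is a minimal sketch in plain Java. The class name `CountryMatcher`, the method names, and the `maxDistance` cutoff are all made up for this example, and the sample country list is just a placeholder for your own list of correct names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of any library.
class CountryMatcher {

    // Dynamic-programming Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the candidate closest to the input (case-insensitive),
    // or null when even the best candidate is farther than maxDistance.
    static String closest(String input, List<String> candidates, int maxDistance) {
        String normalized = input.trim().toLowerCase(Locale.ROOT);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : candidates) {
            int d = levenshtein(normalized, candidate.toLowerCase(Locale.ROOT));
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return bestDistance <= maxDistance ? best : null;
    }

    public static void main(String[] args) {
        // Placeholder list; substitute your own list of correct names.
        List<String> names = Arrays.asList("United States", "United Kingdom", "India");
        System.out.println(closest("Untied States", names, 3));
        System.out.println(closest("USA", names, 3));
    }
}
```

The cutoff is there so that an input with no plausible match comes back as `null` instead of being forced onto some unrelated country. Edit distance handles typos reasonably well, but abbreviations sit many edits away from the full name, so those probably need a separate lookup.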

iNikkz
  • Levenshtein distance is useful! But the issue is that for a country like `USA`, if the data has `United States`, the distance comes out much larger than it should. – AngryLeo Jan 27 '16 at 07:15
  • @AyushBanka: In that case, you can use the API I mentioned in the answer. This [gist](https://gist.github.com/maephisto/9228207) may also help; you can add it to your code. – iNikkz Jan 27 '16 at 07:18
1

You can try attaching Google's location autocomplete API to your text box or select element. With this API, users get Google-style autocomplete suggestions while typing.

Nitin Dhomse
  • I want to do the data cleanup in the back end, on data I have already collected. I am not sure autocomplete will be helpful here. Correct me if I am wrong. – AngryLeo Jan 27 '16 at 06:42
0

If you have city or state information that is already sanitized, you could use it to look up the country.

You could also define aliases in your list of country names and point each alias to the preferred notation. For example, US, United States, and USA are all aliases of U.S.A. You could have the program append new aliases to the alias database so that it improves as it is used. You might have to make multiple passes over the data, and a certain amount of manual work is involved.
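
A rough sketch of the alias idea in Java, assuming an in-memory map stands in for the alias database. The class `CountryAliases` and its methods are hypothetical names for this example, and `USA` is used as the preferred notation to match the question.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical alias table for illustration; a real one might live in a database.
class CountryAliases {
    private final Map<String, String> aliasToCanonical = new HashMap<>();

    // Lower-cases and drops everything except letters and digits,
    // so "U.S.A.", "u s a" and "USA" all share the key "usa".
    static String normalize(String raw) {
        return raw.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]", "");
    }

    // Registers an alias; call this again whenever a new variant is confirmed by hand.
    void addAlias(String alias, String canonical) {
        aliasToCanonical.put(normalize(alias), canonical);
    }

    // Looks up the preferred name, or returns empty when the alias is unknown.
    Optional<String> resolve(String raw) {
        return Optional.ofNullable(aliasToCanonical.get(normalize(raw)));
    }

    public static void main(String[] args) {
        CountryAliases aliases = new CountryAliases();
        aliases.addAlias("USA", "USA");
        aliases.addAlias("US", "USA");
        aliases.addAlias("United States", "USA");
        System.out.println(aliases.resolve("U.S."));
        System.out.println(aliases.resolve("Atlantis"));
    }
}
```

Entries that resolve to empty are the ones to review manually (or to try a fuzzy match on) before adding them with `addAlias`, which is how the table grows over multiple passes.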

Vasco