0

I have data of phone numbers and village names collected from the villagers via forms. Because of various reasons the data is inaccurate or incomplete.

The idea is to validate these two data points before adding them to the data base/store.

  1. The phone numbers are being formatted programmatically and validated via an external API. (That gives me the service provider and province information).

  2. The problem is with the addresses.

No standardized address line. Tons of ambiguity.

Numeric street names and door numbers exist.

Input string will sometimes contain an addressee.

Possible solutions I can think of

  • Reverse geocoding helps. But not very accurate when it comes to Indian context. The Google TOS also prohibits automated queries. (correct me if I'm wrong here)

  • Soundexing. Again not very accurate with Indian data.

I understand it's difficult to such highly unstructured data, but I'm looking for a ways to achieve atleast enough accuracy to map addresses to the nearest point of interest.

Queries

Given a village name from the villager who might spell it wrong or incorrectly or abbreviate it how do I get the correct official name of the village and location?

Any possible ways to sanitize bad location/addresses or decode complex/poorly formed addresses?

Are there any machine learning solutions that can help so I can learn from every computation?(I have 0 knowledge on ML, do correct me if I'm wrong here.)

bittterbotter
  • 445
  • 6
  • 18
  • To me the best thing you can do is to use a spell checker. Use wiki articles that contain description of Indian cities, towns and villages for training the spell checker. – Riyaz Dec 29 '15 at 13:32
  • Thanks for that @Riyaz. It would solving possible spelling errors. I guess that wouldn't solve the issue of complex/poorly formed addresses though. That being the main issue. Any thoughts on that? – bittterbotter Dec 29 '15 at 14:19

1 Answers1

1

What you want is a geolocation system that works with informal text input. I have a previously used a Text-based geolocation model trained on Twitter data.

To solve your problem, you need training data in the form of:

informal_text              village_name

If you have access to such data (e.g. using the addresses which can be geolocated) then you can train a text-based classifier that given a new informal address can predict where on the map it points to. In your case every village becomes a class label. You can use scikit-learn to train the classifier.

Ash
  • 3,428
  • 1
  • 34
  • 44