How to filter non-address information from a string

Question

I'm trying to reduce the time users need to spend filling in an address form. This form requires the street address, postal code, city, district, and sub-district.

To do so I query Open Street Map's Nominatim API like so:

var
  request = require('request'),
  address = 'Grand Parkview Asoke, Unit 255/109 15th Flr., Sukhumvit 21 Road',
  baseUri = 'http://nominatim.openstreetmap.org/search?format=json&addressdetails=1',
  query   = '&accept-language=en&q=' + encodeURIComponent(address);

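request(baseUri + query, function(err, res, body) {
  // On a network failure `body` is undefined, so bail out before parsing.
  if (err) return console.error(err);
  console.log(JSON.parse(body));
});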

I then parse things like the postcode from the returned body.

The problem is that this only works for "normal addresses" that do not contain irrelevant things like the floor number. In other words, this works:

var address = 'Sukhumvit 21 Road';

But this does not work:

var address = 'Grand Parkview Asoke, Unit 255/109 15th Flr., Sukhumvit 21 Road';

Now I am querying the API many times with a very crude set of possible trials, like so:

  //create trials
  var
    trials = [],
    addressParts = address.split(' ');

  for (var i = 0, il = addressParts.length; i < il; i++) {
    if (il - i >= 2) trials.push(addressParts.slice(i, il).join(' '));
  }

Which means that it will attempt all of these strings:

Grand Parkview Asoke, Unit 255/109 15th Flr., Sukhumvit 21 Road
Parkview Asoke, Unit 255/109 15th Flr., Sukhumvit 21 Road
Asoke, Unit 255/109 15th Flr., Sukhumvit 21 Road
Unit 255/109 15th Flr., Sukhumvit 21 Road
255/109 15th Flr., Sukhumvit 21 Road
15th Flr., Sukhumvit 21 Road
Flr., Sukhumvit 21 Road
Sukhumvit 21 Road ==> it works!

Which requires many requests and is therefore very slow.

Is there a smarter way to filter out this "non-address" information? Note that I'm also looking for a way to do this in non-western scripts such as Thai.

For US addresses, SmartyStreets has an [address extraction API](http://smartystreets.com/products/liveaddress-api/extract) that can help... Thai, though, is beyond me... — Matt, Apr 16 '14 at 13:15
@Matt I guess my real question is how your algorithm works. ;) — Tom, Apr 16 '14 at 13:46
Well, it's complicated, to say the least... I wish I could help you more here. — Matt, Apr 16 '14 at 14:47
Please don't perform bulk queries on OSM's Nominatim instance; it is against the [usage policy](https://wiki.openstreetmap.org/wiki/Nominatim_usage_policy). Instead use a different instance, for example the one provided by [MapQuest](http://developer.mapquest.com/web/products/open/nominatim), or [install a local instance](https://wiki.openstreetmap.org/wiki/Nominatim/Installation). — scai, Apr 22 '14 at 09:03
I think the best approach is to get a list of all Thai addresses, build a corpus with all words and their frequencies, then parse the returned address against that vocabulary with frequencies. The words that match have a good chance of being part of a postal address. This problem is about Named Entity Recognition. — Marcel, Apr 24 '14 at 14:25
The low-tech way is to just specify what info you expect in the input field, i.e. be specific about what info the user should enter. For instance, you might call this field "street address". You could also have commas trigger a warning popup explaining what the expected data is. — Magnus, Apr 25 '14 at 19:58

score 1 · Answer 1 · answered Apr 26 '14 at 07:12

If you do not have a well-defined form with specific address fields, a good approach would be to split the address string into well-defined parts, e.g.

"Grand Parkview Asoke, Unit 255/109 15th Flr., Sukhumvit 21 Road"
 1st part: "Grand Parkview Asoke"
 2nd part: "Unit 255/109 15th Flr."
 3rd part: "Sukhumvit 21 Road"
 // Of course it would be more complex than just splitting at commas.

Addresses are traditionally written so that the information gets denser towards the end of the string. By dense I mean that more information about that part is likely to be available on the internet. So querying your source of information with only the last part, i.e. Sukhumvit 21 Road, is likely to give you more results than querying with the full string in one shot.

Depending on the number of results you receive, you could build an approach like this:
1) More than one result: add more information to your query string, e.g. Unit 255/109 15th Flr., Sukhumvit 21 Road
2) No results: remove a part of the query string, e.g. 21 Road
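The two rules above can be sketched roughly as follows. This is only an illustration under some assumptions: the address is split at commas (which, as noted, is too simple for real input), and `geocode(query, callback)` is a hypothetical function you supply that calls back with an array of results, for instance a thin wrapper around the Nominatim request from the question. The names `splitParts` and `narrowDown` are made up for this example.

```javascript
// Split a free-form address into trimmed, non-empty comma-separated parts.
function splitParts(address) {
  return address.split(',')
    .map(function (part) { return part.trim(); })
    .filter(Boolean);
}

// Start with the last part and refine based on the number of results.
// `geocode(query, cb)` is a hypothetical, caller-supplied lookup function.
function narrowDown(address, geocode, done) {
  var parts = splitParts(address);
  var best = [];                // last non-empty result set seen
  if (parts.length === 0) return done(null, best);
  var next = parts.length - 2;  // index of the next part to prepend

  function attempt(query) {
    geocode(query, function (err, results) {
      if (err) return done(err);
      if (results.length === 0) {
        // The extra information broke the match: keep the earlier results.
        if (best.length > 0) return done(null, best);
        // Rule 2: nothing found yet, so drop the leading word and retry.
        var shorter = query.split(' ').slice(1).join(' ');
        return shorter ? attempt(shorter) : done(null, best);
      }
      best = results;
      // A unique match, or nothing left to add: stop here.
      if (results.length === 1 || next < 0) return done(null, best);
      // Rule 1: still ambiguous, so prepend the preceding part.
      attempt(parts[next--] + ', ' + query);
    });
  }

  attempt(parts[parts.length - 1]);
}
```

For the example address this queries "Sukhumvit 21 Road" first, so the common case needs one or two requests instead of one per word.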

Otherwise, as others have already suggested, if you break your form down into distinct address parts (street address, etc.), you will be in a better position to form queries.

Then again, this is just what I thought of. There will definitely be much better approaches based on mathematical modeling of the problem.

People usually don't include commas though, and the answer doesn't take into account situations where the address is in Chinese or Thai script. — Tom, Apr 27 '14 at 08:03

score 0 · Answer 2 · answered Apr 21 '14 at 09:58

You should look at sequence tagging problems---this is essentially what you have here. One prominent example is named entity recognition.

The task is to see what you're trying to extract from the string as words with a particular tag. Let's say the tag is 'Relevant'. You can then think of each word in your string as having a corresponding tag:

'Grand Parkview Asoke, Unit 255/109 15th Flr., Sukhumvit 21 Road'
 [NR]  [NR]     [NR]   [NR] [NR]    [NR] [NR]  [R]       [R][R]

where I used the tag [R] to indicate the word is relevant to the query. You can then build your query with just the relevant words (or indeed the longest contiguous string of relevant words to increase robustness, if that's more suitable).
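As a rough sketch of the "longest contiguous string of relevant words" step, assuming the tagger's output is available as an array of 'R' / 'NR' labels parallel to the whitespace-split tokens (the function name `longestRelevantRun` is made up for illustration):

```javascript
// Return the longest contiguous run of tokens tagged 'R', joined by spaces.
// `tokens` and `tags` are parallel arrays; ties go to the earliest run.
function longestRelevantRun(tokens, tags) {
  var bestStart = 0, bestLen = 0; // best run found so far
  var start = 0, len = 0;         // run currently being scanned

  for (var i = 0; i < tokens.length; i++) {
    if (tags[i] === 'R') {
      if (len === 0) start = i;
      len++;
      if (len > bestLen) {
        bestStart = start;
        bestLen = len;
      }
    } else {
      len = 0; // an irrelevant word ends the current run
    }
  }
  return tokens.slice(bestStart, bestStart + bestLen).join(' ');
}

var tokens = 'Grand Parkview Asoke, Unit 255/109 15th Flr., Sukhumvit 21 Road'.split(' ');
var tags = ['NR', 'NR', 'NR', 'NR', 'NR', 'NR', 'NR', 'R', 'R', 'R'];
console.log(longestRelevantRun(tokens, tags)); // 'Sukhumvit 21 Road'
```

Taking only the longest run means a single stray [R] elsewhere in the string does not end up in the query.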

The task then is to build a sequence tagger that identifies relevant from irrelevant words in the query. You'll want to approach this as a supervised problem, which means training data (although you could acquire training data by assuming that any query that returns a result is valid and all others are invalid). The most competitive sequence taggers are conditional random fields, which can do tagging quickly and with high accuracy.
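To give an idea of what the input to such a tagger might look like, here is a sketch that turns each word into a small set of features. Everything in it is an assumption made for illustration: the word lists are tiny hypothetical stand-ins for real gazetteers, and the feature names are invented. It does not train or run a conditional random field; it only shows the kind of per-word description a CRF library would consume.

```javascript
// Hypothetical keyword lists; real ones would be built from data.
var STREET_WORDS = ['road', 'rd', 'street', 'st', 'soi', 'avenue'];
var UNIT_WORDS = ['unit', 'flr', 'floor', 'room', 'apt'];

// Lowercase a token and strip periods and commas.
function normalize(token) {
  return token.toLowerCase().replace(/[.,]/g, '');
}

// True if any other token within `window` positions of i is in `words`.
function nearAny(tokens, i, words, window) {
  var from = Math.max(0, i - window);
  var to = Math.min(tokens.length - 1, i + window);
  for (var j = from; j <= to; j++) {
    if (j !== i && words.indexOf(normalize(tokens[j])) !== -1) return true;
  }
  return false;
}

// Describe the token at index i by position, shape, and neighbours.
function tokenFeatures(tokens, i) {
  var word = normalize(tokens[i]);
  return {
    relativePosition: tokens.length > 1 ? i / (tokens.length - 1) : 0,
    hasDigit: /\d/.test(word),
    endsWithComma: /,$/.test(tokens[i]),
    isStreetWord: STREET_WORDS.indexOf(word) !== -1,
    isUnitWord: UNIT_WORDS.indexOf(word) !== -1,
    nearStreetWord: nearAny(tokens, i, STREET_WORDS, 2)
  };
}

var words = 'Grand Parkview Asoke, Unit 255/109 15th Flr., Sukhumvit 21 Road'.split(' ');
console.log(tokenFeatures(words, 7)); // features for 'Sukhumvit'
```

Note that none of these features depends on the word "Sukhumvit" itself, so a street name the tagger has never seen can still be picked up from its position and its neighbours.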

Beware though---this isn't a quick fix. You'll need to invest a fair amount of time into gathering the data, identifying relevant features, and evaluating. I don't know how important this stuff is to you!

This is interesting, however, how can such training data ever distinguish between things like Grand Parkview Asoke being not relevant and Sukhumvit being relevant? These are both names and even with a million tests Sukhumvit may only be entered once, since this is just a single street name. — Tom, Apr 21 '14 at 12:12
The idea would be to use features of the words, like their position in the string, whether they match lists of known streets/buildings, and the nearby words (i.e. if it's next to "road" it's probably a street). Word identity is just one possible feature, and I agree in this instance it's probably not useful. — Ben Allison, Apr 21 '14 at 13:48
Wouldn't you say that, due to the arbitrary randomness of these strings, it makes more sense to simply create a more sensible order of trials? For example, it would make more sense to try words in batches rather than to start with the full string and then remove words one by one. — Tom, Apr 21 '14 at 15:09
