How does the Google geocoder work?

Question

I am curious as to how the Google geocoder works.

I have been studying some open source geocoder implementations, such as geocommons' geocoder and PostGIS's new Tiger Geocoder. This is roughly what I know so far (to hopefully prove that I have been doing my homework):

I realize that at the core of the open source geocoders, there are three main elements.

1.- An address normalizer that takes an arbitrary string and normalizes it (taking the example from here):

normalize_address('address string');

e.g.: SELECT naddy.* FROM normalize_address('29645 7th Street SW Federal Way 98023') AS naddy;

 address | predirabbrev |      streetname       | streettypeabbrev | postdirabbrev | internal | location | stateabbrev |  zip  | parsed
---------+--------------+-----------------------+------------------+---------------+----------+----------+-------------+-------+--------
   29645 |              | 7th Street SW Federal | Way              |               |          |          |             | 98023 |

and:

2.- A geocoder that does some magical fuzzy matching for names where the core algorithm is the Levenshtein Distance.

A good example is the one from the Wikipedia article where it calculates the Levenshtein distance between the words kitten and sitting (the distance is 3 since that is the number of edits required to change one string into the other):

kitten → sitten (substitution of 's' for 'k')
sitten → sittin (substitution of 'i' for 'e')
sittin → sitting (insertion of 'g' at the end).
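
The edit-distance calculation above can be sketched in a few lines of Python. This is the textbook dynamic-programming formulation, not code taken from any of the geocoders mentioned; the function name `levenshtein` is just an illustrative choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two rows of the DP table."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # prints 3
```

A geocoder would typically compare the requested street name against candidate names and keep the ones under some distance threshold; the threshold itself is a tuning choice.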

3.- Some interpolation of the street segments at the end to guess where the house is. I downloaded a chunk of the free Census Tiger street dataset to create this example.

street interpolation example

In the example above, the street segment of interest (Schaeffer Hills Dr) has a from node that starts at 300 (so 300 Schaeffer Hills Dr) and a to node that ends at 400 (400 Schaeffer Hills Dr). If I matched this segment and the request was for house number 310, the algorithm would simply interpolate along it (traverse 10% of its length) to where my green arrow is.
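
As a rough sketch of that interpolation step, assuming the segment is given as a list of planar (x, y) vertices ordered from the from node to the to node (real TIGER data is in longitude/latitude and would need projecting first), the helper below walks the requested fraction of the line's length. The name `interpolate_address` and the coordinates are made up for illustration.

```python
import math


def interpolate_address(house_number, from_number, to_number, line):
    """Return the (x, y) point on `line` for a house number in the segment's range.

    `line` is a list of planar (x, y) vertices running from the from node
    to the to node. This is an assumed input format for the sketch.
    """
    if to_number == from_number:
        fraction = 0.0
    else:
        fraction = (house_number - from_number) / (to_number - from_number)
    fraction = max(0.0, min(1.0, fraction))  # clamp numbers outside the range

    pairs = list(zip(line, line[1:]))
    lengths = [math.hypot(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in pairs]
    target = fraction * sum(lengths)  # distance to travel along the line

    for ((x1, y1), (x2, y2)), seg_len in zip(pairs, lengths):
        if seg_len > 0 and target <= seg_len:
            t = target / seg_len
            return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
        target -= seg_len
    return line[-1]  # rounding pushed us past the end


# 310 on a segment numbered 300 to 400 lands 10% of the way along it.
print(interpolate_address(310, 300, 400, [(0.0, 0.0), (100.0, 0.0)]))
```

Real geocoders usually also offset the point to the left or right of the centerline depending on whether the number is odd or even; that is left out here.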

This is what the open source geocoder tools do. Nevertheless, Google is clearly smarter than that and uses all kinds of non-traditional hints.

How so?

For example, I can type 680 Mission st (no city, state, county, or anything else). Most standard address normalizers would blow up because they would find too many matches. But since I am in SF, I am guessing Google uses my IP to get some GeoIP-like information, uses an expanding bounding box as a hint with some fuzzy search, and right away finds the closest matching segment and tells me that's my answer (which is correct!).
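
To make that guess concrete, here is a minimal sketch of a location-hinted search. It is only an illustration of the idea, not a description of what Google does: it uses an expanding radius as a stand-in for an expanding bounding box, and the candidate labels and coordinates are approximate, made-up values.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    radius_km = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def expanding_search(candidates, hint_lat, hint_lon, start_km=1.0, max_km=40000.0):
    """Grow the search radius around the hint until some candidate falls inside.

    `candidates` is a list of (label, lat, lon) tuples that already matched
    the street name; the hint would come from GeoIP or a map window.
    """
    def distance(c):
        return haversine_km(hint_lat, hint_lon, c[1], c[2])

    radius = start_km
    while radius <= max_km:
        hits = [c for c in candidates if distance(c) <= radius]
        if hits:
            return sorted(hits, key=distance)  # closest match first
        radius *= 2
    return []


# Hypothetical matches for "680 Mission st" in three different cities.
candidates = [
    ("candidate near San Diego", 32.75, -117.20),
    ("candidate near San Francisco", 37.787, -122.401),
    ("candidate near Santa Cruz", 36.97, -122.04),
]
print(expanding_search(candidates, 37.77, -122.42))
```

With a hint located in San Francisco, only the nearby candidate is returned, because the radius stops growing as soon as it contains a match.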

I am looking for answers that can shed some more light into how the Google geocoder works besides the techniques that I described above.

Update:

OK, so far we have two kinds of hints listed:

1.- GeoIP as a hint
2.- Area of interest bounding box (see Paul's example)

Others?

I suspect no-one with accurate information is going to be able to answer your question without violating a confidentiality agreement. — Jon Skeet, Jun 12 '12 at 06:11
You may have better luck asking this question here: http://gis.stackexchange.com/ — Suvi Vignarajah, Jun 12 '12 at 13:25
@Suvi I do know about gis.stackexchange. Nevertheless, this forum has orders of magnitude more eyes and I was hoping that could :-/ — rburhum, Jun 12 '12 at 19:25
You could go look at some patents like: http://www.google.com/patents/US20120265778 (by the looks of it, it's owned by SAP). — Adam Gent, Apr 08 '13 at 20:29

score 7 · Accepted Answer · answered Jun 12 '12 at 21:19

One of the things you can find by poking at the black box is that the Google geocoder isn't strictly sensitive to the order of the tokens (there's no enforced street/city/state/country order, though it does better when you follow it). That suggests to me that they might be dumping everything into some kind of full-text search and then seeing what comes back. Or perhaps not. Try searching "sault saint marie adams 200" and "sault saint marie 200 adams".
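
To illustrate why a full-text style lookup would be insensitive to token order, here is a toy inverted index that scores records by how many query tokens they share. It is purely a sketch of the speculation above; the records are invented and nothing here reflects Google's actual implementation.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase, treat commas as spaces, and split on whitespace."""
    return text.lower().replace(",", " ").split()


def build_index(records):
    """Map each token to the set of record ids that contain it."""
    index = defaultdict(set)
    for rec_id, text in enumerate(records):
        for token in tokenize(text):
            index[token].add(rec_id)
    return index


def search(index, records, query):
    """Rank records by the number of distinct query tokens they contain."""
    scores = defaultdict(int)
    for token in set(tokenize(query)):
        for rec_id in index.get(token, ()):
            scores[rec_id] += 1
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(records[rec_id], score) for rec_id, score in ranked]


# Invented address records, for illustration only.
records = [
    "200 Adams St, Sault Saint Marie",
    "200 Adams St, Springfield",
    "15 Marie Ln, Sault Saint Marie",
]
index = build_index(records)
print(search(index, records, "sault saint marie adams 200"))
print(search(index, records, "sault saint marie 200 adams"))
```

Both queries produce the same ranking, since the score depends only on which tokens are present and not on their positions.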

With respect to your Mission example, that's a great one, as you can see the map hint coming into play directly:

Query with map window over Europe: European results.

Query with map window over North America: American results.

score 4 · Answer 2 · answered Jun 12 '12 at 06:17

There is another source of data: county property maps. These include not just roads but also property lines (and their street addresses). You can often see this on Google's map, which will actually show faint lines separating adjacent properties. Sometimes they even outline buildings (county maps often include these too).

You can also do the reverse lookup: given your GPS coordinates, finding your exact address can be as simple as a 2D query to find which property polygon you're in. I've seen this work properly when I was physically far from the road but still inside the property, and it returned the correct street address despite the handset being closer to another street.
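
That 2D query can be sketched with a standard ray-casting point-in-polygon test. The parcels below are invented squares in planar coordinates; a real system would use a spatial index over actual parcel geometries rather than a linear scan.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def reverse_geocode(x, y, parcels):
    """Return the address of the first parcel containing the point, or None."""
    for address, polygon in parcels:
        if point_in_polygon(x, y, polygon):
            return address
    return None


# Hypothetical parcels: two adjacent square lots.
parcels = [
    ("12 Hypothetical Rd", [(0, 0), (10, 0), (10, 10), (0, 10)]),
    ("14 Hypothetical Rd", [(10, 0), (20, 0), (20, 10), (10, 10)]),
]
print(reverse_geocode(15, 9, parcels))
```

Because the lookup depends on which polygon contains the point rather than which road is nearest, it returns the parcel's address even when another street happens to be closer.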

Note that these maps tend to be public and some counties even have their own online interface. You can even look up who owns a particular plot.

Adam


Reverse geocoding is a much easier problem. Just get the lat/lon and snap to the closest feature (parcel or street segment). That brings up the topic of parcel features, which are another source of data for geocoding. The process for a *traditional* geocode against them is very similar to the street segment approach. So my question is still unanswered :( Thanks for pointing that out though. – rburhum Jun 12 '12 at 14:54
Thanks for clarifying my answer as incorrect, it's now removed. I thought the Google Geolocation White Paper was also discussing potential infrastructure related to geocoding that might somehow be relevant. To be sure, +1 for your answer. Cheers! – arttronics Jun 16 '12 at 22:38
