I am curious as to how the Google geocoder works.
I have been studying some implementations of open source geocoders like geocommons' geocoder or PostGIS's new Tiger Geocoder. This is roughly what I know so far (to hopefully prove that I have been doing my homework) :
I realize that at the core of the open source geocoders, there are three main elements.
1.- An address normalizer that takes an arbitrary string and normalizes it (taking the example from here):
normalize_address('address string');
e.g.: SELECT naddy.* FROM normalize_address('29645 7th Street SW Federal Way 98023') AS naddy;
address | predirabbrev | streetname | streettypeabbrev | postdirabbrev | internal | location | stateabbrev | zip | parsed
---------+-------------+-----------------------+------------------+---------------+----------+----------+-------------+-------+--------
29645 | | 7th Street SW Federal | Way | | | | | 98023 |
and:
2.- A geocoder that does some magical fuzzy matching for names where the core algorithm is the Levenshtein Distance.
A good example is the one from the Wikipedia article where it calculates the Levenshtein distance between the words kitten and sitting (the distance is 3 since that is the number of edits required to change one string into the other):
kitten → sitten (substitution of 's' for 'k')
sitten → sittin (substitution of 'i' for 'e')
sittin → sitting (insertion of 'g' at the end).
3.- Some interpolation of the street segments at the end to guess where the house is. I downloaded a chunk of the free Census Tiger street dataset to create this example.
In the example above, the street segment of interest (Schaeffer Hills Dr) has a from node that starts at 300 (so 300 Schaeffer Hills Dr) and a to node that ends on 400 (400 Schaeffer Hills Drv). If I matched to this Schaeffer Hills Drv, and the request was for street 310, then the algorithm would just interpolate to it (traverse 10% of it) to where my green arrow is.
This is what the Open Source geocoder tools do. Nevertheless, Google is clearly smarter than that and uses all kinds of non-traditional hints.
How so?
For example, I can type 680 Mission st (no city, state, county, anything at all). Most of the standard address normalizers would blow up because they would find too many matches. But since I am in SF, I am guessing google uses my ip to get some geoip-like information, does some expanding bounding as a hint with some fuzzy search, and right away finds the closest segment that matches and tells me that's my answer (which is correct!).
I am looking for answers that can shed some more light into how the Google geocoder works besides the techniques that I described above.
Update:
OK, so far we have two kinds of hints listed
- Geoip as hints
- Area of Interest Bounding Box (see Paul's example).
- Others?