
We have a set of tables, shown below, that our other tables reference for location data. Some example uses are:

  • Find all companies within X miles of X City
  • Create a company profile's location as X City

Table Schema

We solve the problem of multiple cities with similar names by matching on State as well, but now we've run into a different set of problems. We use Google's Place Autocomplete both for geocoding and for matching a user's query against our cities. This works fairly well until Google's format deviates from ours.

Example: St. Louis !== Saint Louis and Ameca del Torro !== Ameca Torro

Is there a way to fuzzy match cities in our queries?

Our query to match cities now looks like:

SELECT c.id
FROM city c
INNER JOIN state s
ON s.id = c.state_id
WHERE c.name = 'Los Angeles' AND s.short_name = 'CA'

I've also considered denormalizing city and simply storing coordinates on the company table to still accomplish the radius search. We have around 2 million rows in our company table now, so the radius search would be performed on that rather than on the city table with a JOIN on company. This would also mean we couldn't create custom regions for cities (simply, anyway) or add other attributes to cities in the future.
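For what it's worth, the denormalized radius search could be sketched with Postgres's cube and earthdistance extensions. This is only an outline under assumptions: the lat/lng columns on company are hypothetical, and 16093 is roughly 10 miles in meters.

```sql
-- Sketch only: assumes company has lat/lng columns and that the
-- cube + earthdistance extensions are installed.
CREATE EXTENSION IF NOT EXISTS cube;
CREATE EXTENSION IF NOT EXISTS earthdistance;

-- An expression index keeps the bounding-box check cheap on ~2M rows.
CREATE INDEX company_earth_idx ON company
    USING gist (ll_to_earth(lat, lng));

-- Companies within 10 miles (~16093 m) of a point:
SELECT id
FROM company
WHERE earth_box(ll_to_earth(34.0522, -118.2437), 16093)
          @> ll_to_earth(lat, lng)                        -- fast index filter
  AND earth_distance(ll_to_earth(34.0522, -118.2437),
                     ll_to_earth(lat, lng)) < 16093;      -- exact distance check
```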

I found this answer, but it basically affirms that our way of normalizing input is a good method; it doesn't say how to match against our local table (unless Google offers a city-name export I don't know about).

  • I also am open to suggestions on changing how we accomplish this as well. How do most company's normalize city data? Or do you just not? – Matt Weber Jun 27 '19 at 18:34

1 Answer


The short answer is that you can use Postgres's full text search functionality, with a customized search configuration.

Since you're dealing with place names, you probably want to avoid stemming, so you can use the simple configuration as a starting point. You can also add stop words that make sense for place names (given the examples above, you can probably treat "St.", "Saint", and "del" as stop words).

A pretty basic outline of setting up your customized configuration is below:

  1. Create a stopwords file and put it in your $SHAREDIR/tsearch_data Postgres directory. See https://www.postgresql.org/docs/9.1/static/textsearch-dictionaries.html#TEXTSEARCH-STOPWORDS.
  2. Create a dictionary that uses this stopwords list (you can probably use the pg_catalog.simple as your template dictionary). See https://www.postgresql.org/docs/9.1/static/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY.
  3. Create a search configuration for place names. See https://www.postgresql.org/docs/9.1/static/textsearch-configuration.html.
  4. Alter your search configuration to use the dictionary you created in Step 2 (cf. the link above).
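Steps 2 through 4 can be sketched in SQL roughly as follows. The dictionary and configuration names ("places_dict", "places") and the stop-word file name are placeholders; Step 1 (creating the stop-word file itself) happens on the filesystem, not in SQL.

```sql
-- Step 1 (filesystem, not SQL): create $SHAREDIR/tsearch_data/places.stop
-- with one stop word per line, e.g.:
--   saint
--   st
--   del

-- Step 2: a dictionary based on the simple template, using that stop-word list
CREATE TEXT SEARCH DICTIONARY places_dict (
    TEMPLATE  = pg_catalog.simple,
    STOPWORDS = places            -- refers to places.stop
);

-- Step 3: a search configuration for place names
CREATE TEXT SEARCH CONFIGURATION places (COPY = pg_catalog.simple);

-- Step 4: point the word token types at the new dictionary
ALTER TEXT SEARCH CONFIGURATION places
    ALTER MAPPING FOR asciiword, word WITH places_dict;
```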

Another consideration is internationalization. The issue with your second example (Ameca del Torro vs. Ameca Torro) may be a Spanish vs. English representation of the name. If that's the case, you could also consider storing both a "localized" and a "universal" (e.g. English) version of the city name.

In the end, your query (using full-text search) might look like this (where 'places' is the name of your search configuration):

SELECT cities."id"
FROM cities
    INNER JOIN "state" ON "state".id = cities.state_id
WHERE
    "state".short_name = 'CA'
    AND TO_TSVECTOR('places', cities.name) @@ TO_TSQUERY('places', 'Los & Angeles')
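As a quick sanity check of the configuration (assuming "st" and "saint" ended up in your stop-word file), both spellings of St. Louis should reduce to the same lexemes, so a query like this should return true — PLAINTO_TSQUERY is used here because it tokenizes free-form input without requiring explicit & operators:

```sql
SELECT TO_TSVECTOR('places', 'Saint Louis')
       @@ PLAINTO_TSQUERY('places', 'St. Louis');
```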
Zack