0

I'm working on an application that is location specific -- think of it as a store locator where store owners enter their address information and other users can only see nearby stores within a certain range. However, it's a little different in the sense that an exact location is not required, only the city/state is required (weird, I know). I have thought about the schema for storing locations, and have decided on this one.

Locations

id                      -- int
formatted_address       -- varchar(200)
is_point_of_interest    -- bool
name                    -- varchar(100) -- NULL
street_number           -- varchar(10)  -- NULL
street                  -- varchar(40)  -- NULL
city                    -- varchar(40)
state                   -- varchar(40)
state_code              -- varchar(3)
postal_code             -- varchar(10)
country                 -- varchar(40)
country_code            -- varchar(3)
latitude                -- float(10,6)
longitude               -- float(10,6)
last_updated_at         -- timestamp

Here are some notes about the application:

  • I want to keep the door open for international locations
  • I plan to use a geocoding service to search for and validate the locations specified by the store owner
  • I truly only need the lat/lon, but the other data is necessary for displaying store information
  • The formatted_address field will contain the fully formatted address -- e.g., Giants Stadium, 50 NJ-120, East Rutherford, NJ 07073, USA -- to allow for easier searching of stored locations
  • There will possibly be a lot of duplicate fields, because each row may have a different level of granularity -- for instance, 123 Main Street, City, State 12345 is different from Main Street, City, State 12345 because one has a specified street number and the other doesn't

I understand that the schema is not very normalized, but I also do not see the need to normalize it any more because locations are very complex, which is why I'm relying on a stable geocode service (google). Also, I plan to allow freeform text input/search, so theres no need for any dropdown lists.

Does anybody see anything wrong or have any improvements, taking into consideration what I've mentioned? I can see this table growing rather large.

Matt
  • 22,721
  • 17
  • 71
  • 112
BDuelz
  • 3,890
  • 7
  • 39
  • 62
  • Postal codes uniquely identify a city, state/province, and country. I would remove those fields and make them their own table. Give each postal code a surrogate key unique integer ID, in case two countries share a code. – Yuck Jan 06 '12 at 20:07
  • Sign posts? What do you mean? – BDuelz Jan 06 '12 at 20:07
  • @Yuck, I looked into it and saw that not all postal codes uniquely identify a city, as some span multiple cities. – BDuelz Jan 06 '12 at 20:09
  • 1
    Since the `PostalCode` table would have a surrogate key that's fine; they can be duplicated. I'd still normalize your design by moving those fields. You will otherwise have update anomalies. – Yuck Jan 06 '12 at 20:13

2 Answers2

1

Since you say "I truly only need the lat/lon" I encourage you to use 2 tables with a 1:1 relationship.

In many (most?) cases a LOT more of lat/lon pairs will be cached, speeding up your workhorse. If you need the additional info, get it when you need it.

Short form: Dont force the DB to move data you don't need through IO and RAM

Additionally, such a schema would keep your doors open for further natural expansion: Linking other info can be done by adding other tables rather than altering existing ones. I consider this a good thing for your SW quality.

Eugen Rieck
  • 64,175
  • 10
  • 70
  • 92
  • I'm sorry, maybe I should have been more clear. While the lat/lon is important for the distance calculations, the other data is necessary to display store information. – BDuelz Jan 06 '12 at 20:17
  • I'm not entirely sure what those 2 tables would be, and if there is a 1:1 relationship, I don't understand the point of it. Can you explain? – BDuelz Jan 06 '12 at 20:18
  • Same question again: Do you do most of your work on lat/lon? If yes, separate them to profit from cache. If not, combine them (i.e use as suggested in your question) – Eugen Rieck Jan 06 '12 at 20:18
  • 1st Table: id,lat,lon; 2nd Table rest of fields and id (join field to 1st table) – Eugen Rieck Jan 06 '12 at 20:19
  • Keeping the lat/long in a separate (MyISAM) table while all the others (or all, including lat/long) in InnoDB is also good if you want to create a spatial index for geo searches (what is nearby, nearest restaurant, bar, town, gas station, etc). – ypercubeᵀᴹ Jan 06 '12 at 20:21
1

I do not think so. Here is my two-minute synopsis:

This very badly normalized. At least city->country should be moved out to a different table (and normalized from there). I believe postal codes can cross city boundaries though (or I am very badly misremembering); I am not aware of such a city that crosses a state boundary.

formatted_address is an "optimization" and should likely be a computed field: that is, all the data to re-create it should exist elsewhere. (This means that it doesn't need to worried about now.)

Happy designing.


The simple "more-normalized" form just doing the above proposed:

LOCATIONS
location_id             -- int
is_point_of_interest    -- bool
name                    -- varchar(100) -- NULL
street_number           -- varchar(10)  -- NULL
street                  -- varchar(40)  -- NULL
city_id                 -- int
postal_code             -- varchar(10)
latitude                -- float(10,6)
longitude               -- float(10,6)
last_updated_at         -- timestamp

CITIES
city_id                 
name                    -- varchar
-- similarly, the following should be normalized to STATES and COUNTRIES
state                   -- varchar(40)
state_code              -- varchar(3)
country                 -- varchar(40)
country_code            -- varchar(3)

Of course, CITIES can be further normalized, and so could a POSTALS table: I don't know enough about postal codes, or the application domain though. postal_code acts as part of an implicit compound-surrogate-FK so it's not super terrible as it is there. However, moving it into a separate table could easily allow verification and integrity constraints.

Edit: Normalizing a POSTALs table would be best, as only a very samll number of postal codes are valid for a given city: I am not sure the relation between a postal code and a city, though, so I can't recommend how to do this. Perhaps look at existing schemas used?

  • I understand that it is poorly normalized, however I feel as though normalizing everything would be a pure headache for the reasons I mentioned above -- very complex. Also, although not a major concern, because this table could grow rather large, all those joins will hurt... – BDuelz Jan 06 '12 at 20:29
  • 2
    **No. No. No.** The headache will be from *not normalizing* - it is *easy* to turn a normalized group of table into a de-normalized table later. As for performance: **there is no performance problem until you have a performance test case**. Considering that RDMBS have *been designed for this "join" behavior and do so **really well** on indicies*, means, if anything, you may be slowing it down with more page reads! –  Jan 06 '12 at 20:31
  • okay... if it's not too much, can you propose a schema, just of the tables you feel are necessary? – BDuelz Jan 06 '12 at 20:32
  • 1
    I agree with this but lat/long should be in `Location` table. – ypercubeᵀᴹ Jan 06 '12 at 20:41
  • @ypercube Doh, missed that :) –  Jan 06 '12 at 20:44
  • I agree with this... it's actually the schema I initially thought of (fully normalized). the reason I initially decided against it is because of the headache to insert a new address. I would need to first check that the country is in the db, and if not create it, then check if the state is in the db, and if not create it, so on down until I have the full chain. With all of the complexities involved in addresses, I wanted to stay away. Anyways, I see that this is the best way to go (normalized) so I will work it out. – BDuelz Jan 06 '12 at 21:05
  • Also, formatted_address, while most likely a computed field, is returned by googles geocode service, so I just thought to keep it. – BDuelz Jan 06 '12 at 21:18
  • @BDuelz See what contract it has with googles geocode: if it *contains data that would otherwise be lost*, then it might be worth keeping. (Or rather, the additional data extracted, depending.) –  Jan 06 '12 at 23:26