
A company we do business with wants to give us a 1.2 gb CSV file every day containing about 900,000 product listings. Only a small portion of the file changes every day, maybe less than 0.5%, and it's really just products being added or dropped, not modified. We need to display the product listings to our partners.

What makes this more complicated is that our partners should only be able to see product listings available within a 30-500 mile radius of their zip code. Each product listing row has a field for that product's own radius (some are only 30 miles, some are 100, some are 500, etc.; 500 is the max). A partner in a given zip code is likely to only have 20 results or so, meaning that there's going to be a ton of unused data. We don't know all the partner zip codes ahead of time.

We have to consider performance, so I'm not sure what the best way to go about this is.

Should I have two databases: one with zip codes and latitude/longitude, using the Haversine formula to calculate distance, and the other the actual product database... and then what do I do? Return all the zip codes within a given radius and look for a match in the product database? For a 500-mile radius that's going to be a ton of zip codes. Or write a MySQL function?
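
For what that two-table approach might look like, here is a minimal sketch of the usual pattern: resolve the partner's zip to a latitude/longitude, prefilter candidate rows with an indexed bounding box at the 500-mile maximum, and only then apply the exact Haversine test against each product's own radius. The table and column names (`products`, `lat`, `lon`, `radius_miles`) are assumptions for illustration, and it presumes each product row carries a lat/lon resolved from its zip code at import time.

```python
import math

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def bounding_box(lat, lon, radius_miles):
    """Crude lat/lon box enclosing the circle; used only to prefilter via an index."""
    dlat = radius_miles / 69.0  # roughly 69 miles per degree of latitude
    dlon = radius_miles / (69.0 * max(math.cos(math.radians(lat)), 1e-6))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon

def products_visible_to(cursor, partner_lat, partner_lon, max_radius_miles=500):
    """Indexed bounding-box prefilter at the 500-mile maximum, then the exact
    Haversine test against each product's own radius. Schema is hypothetical."""
    lat_min, lat_max, lon_min, lon_max = bounding_box(
        partner_lat, partner_lon, max_radius_miles)
    cursor.execute(
        "SELECT id, lat, lon, radius_miles FROM products"
        " WHERE lat BETWEEN %s AND %s AND lon BETWEEN %s AND %s",
        (lat_min, lat_max, lon_min, lon_max),
    )
    return [row for row in cursor.fetchall()
            if haversine_miles(partner_lat, partner_lon, row[1], row[2]) <= row[3]]
```

The point of the bounding box is that the zip-within-500-miles list never gets materialized; the index narrows things down and Haversine only runs on the handful of survivors.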

We could use Amazon SimpleDB to store the database...but then I still have this problem with the zip codes. I could make two "domains" as Amazon calls them, one for the products, and one for the zip codes? I don't think you can make a query across multiple SimpleDB domains, though. At least, I don't see that anywhere in their documentation.

I'm open to some other solution entirely. It doesn't have to be PHP/MySQL or SimpleDB. Just keep in mind our dedicated server is a P4 with 2 gb. We could upgrade the RAM, it's just that we can't throw a ton of processing power at this. Or even store and process the database every night on a VPS somewhere where it wouldn't be a problem if the VPS were unbearably slow while that 1.2 gb CSV is being processed. We could even process the file offline on a desktop computer and then remotely update the database every day...except then I still have this problem with zip codes and product listings needing to be cross-referenced.

Phil
  • You should trim your wall of text. As it stands, it's not clear what your problem is or what you want to achieve. – dynamic Jun 07 '11 at 21:25
  • possible duplicate of [Fastest distance lookup given latitude/longitude?](http://stackoverflow.com/questions/1006654/fastest-distance-lookup-given-latitude-longitude) – D'Arcy Rittich Jun 07 '11 at 21:27
  • I suggest you ask them if it is possible to get a feed of differences, rather than a complete feed. This will probably be less work to process. This may be something they can do easily. – Brian Jun 07 '11 at 21:33
  • The problem is how to store 900,000 records that have to be cross-referenced by zip code radius on a server that only has a Pentium 4 with 2 gigs of RAM. Do I upgrade that and try to do this entirely within MySQL, or do I use Amazon SimpleDB? If I do it in MySQL, what's the best way to take into account zip code proximity in a second database, keeping in mind that I can't just return all zip codes within a 500 mile radius without suffering a performance hit? Or is there some other solution entirely? – Phil Jun 07 '11 at 21:36
  • @Phil: edit your first post and start trimming it down. Don't post more walls of text. – dynamic Jun 07 '11 at 21:37
  • Teach the clients about 'diff' and have them send only the diff output. Better to parse a few hundred K of data instead of 1+ jiggabytes. – Marc B Jun 07 '11 at 21:40
  • This isn't just a one-off "how do I write so and so query." I'm looking for advice on the best way to handle all of this data. – Phil Jun 07 '11 at 21:41
  • If the client can't/won't provide diffs, it's not exactly difficult to keep the previous day's file around and generate his own diff. – MarkD Jun 07 '11 at 21:43

2 Answers


You might want to look into PostgreSQL and PostGIS. It has features similar to MySQL's spatial indexing, without the need to use MyISAM (which, in my experience, tends to become corrupt, unlike InnoDB).

In particular, Postgres 9.1 allows k-nearest-neighbour search queries using GiST indexes.
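
For a rough idea of what such a query could look like, here is a hedged sketch using PostGIS's geography type and ST_DWithin, run from Python via psycopg2. The table and column names (`products`, `geog`, `radius_miles`) and the index are assumptions for illustration, not anything from the question.

```python
import psycopg2

MILES_TO_METERS = 1609.34

# Assumes a GiST index was created up front, e.g.:
#   CREATE INDEX products_geog_gist ON products USING GIST (geog);
def nearby_products(conn, partner_lat, partner_lon):
    """Return ids of products whose own radius reaches the partner's location."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id
            FROM products
            WHERE ST_DWithin(
                geog,
                ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
                radius_miles * %s)
            """,
            (partner_lon, partner_lat, MILES_TO_METERS),
        )
        return [r[0] for r in cur.fetchall()]
```

ST_DWithin on a geography column can use the GiST index, and because the third argument is each row's own `radius_miles`, the per-listing 30-500 mile limit is applied in the same query.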

Denis de Bernardy
  • MySQL (at least the version I have: 5.5.13) supports InnoDB as well. – John M. Jun 07 '11 at 21:31
  • @John: as far as I can tell from the docs, the spatial functionality exists in all tables, but the indexing (including in 5.5) only works with MyISAM. (This is much like full-text indexing.) – Denis de Bernardy Jun 07 '11 at 21:38
  • @yes: not full text indexes. Come to think of it I'm not sure it supports the operator either, but I'd be surprised if it doesn't. – Denis de Bernardy Jun 07 '11 at 21:44

Well, that is an interesting problem indeed.

This seems like it's actually two issues: first, how you should index the databases, and second, how you keep them up to date. The first you can achieve as you describe, but normalization may or may not be a problem, depending on how you are storing the zip code. This primarily comes down to what your data looks like.
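
As a rough illustration of the first issue, one workable shape is a zip lookup table plus a products table that denormalizes the lat/lon resolved from each product's zip code, so a radius query never has to expand a list of zip codes. The names below are illustrative MySQL-flavoured DDL (held in a Python string for consistency with the other sketches), not the actual schema.

```python
# Hypothetical schema: resolve each product's zip to lat/lon once at import time,
# so radius queries can use an ordinary index on (lat, lon) instead of a zip list.
SCHEMA = """
CREATE TABLE zips (
    zip  CHAR(5) PRIMARY KEY,
    lat  DOUBLE NOT NULL,
    lon  DOUBLE NOT NULL
);

CREATE TABLE products (
    id           VARCHAR(64) PRIMARY KEY,
    zip          CHAR(5) NOT NULL,
    lat          DOUBLE NOT NULL,    -- copied from zips during import
    lon          DOUBLE NOT NULL,
    radius_miles SMALLINT NOT NULL,  -- 30..500, per listing
    KEY idx_lat_lon (lat, lon)
);
"""
```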

As for the second one, this is more my area of expertise. You can have your client upload the CSV to you as they currently do, keep a copy of yesterday's file and run the two through a diff utility, or you can leverage Perl, PHP, Python, Bash or whatever other tools you have to find the lines that have changed. Pass those into a second step that updates your database (a sketch of that step is below). I have dealt with clients with issues along these lines, and scripting it away tends to be the best choice. If you need help organizing your script, I'm always available.
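
Here is a minimal sketch of that diff step in Python, assuming the first CSV column is a stable product identifier; the file names are placeholders.

```python
import csv

def csv_key_set(path, key_column=0):
    """Read one day's feed and return the set of product IDs it contains.
    Holds only the IDs in memory, not the full 1.2 GB of rows."""
    with open(path, newline="") as f:
        return {row[key_column] for row in csv.reader(f) if row}

def daily_changes(yesterday_path, today_path):
    """Return (added_ids, dropped_ids) between two daily feeds."""
    old_ids = csv_key_set(yesterday_path)
    new_ids = csv_key_set(today_path)
    return new_ids - old_ids, old_ids - new_ids

# added, dropped = daily_changes("feed_yesterday.csv", "feed_today.csv")
# Then INSERT the added rows and DELETE the dropped ones, instead of reloading
# the whole file; with <0.5% churn that is only a few thousand statements a day.
```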

Bob_Gneu