0

I've just obtained a large set of text files (8 GB total) containing all of the address ranges within the U.S. The set consists of:

  • 929 ZIP+4 files, each containing postal addresses of unique three-digit zip code. For example, file 606 would only contain addresses that have a five-digit zip code that begins with 606. The total number of records among these files are approximately 30 million.

  • City State file, containing a comprehensive list of zip codes and their corresponding city and state.

The City State Key can be used to join the City State file to the ZIP+4 files.

Given the size of the database and my lack of experience, I wanted to get some insight before beginning this endeavor. Should the ZIP+4 files be merged into one monster file and then indexed using zip code, or left separated by three-digit zip code so that the three-digit zip code file name can be used as a block matching criteria? If it is the latter, then wouldn't this be a hierarchical database model? Can I accommodate relationships with the City State file using a hierarchical model?

The above description of the data set is a vast simplification, but for the purposes of this question, a detailed description is unnecessary. A complete description can be found here.

I'm using Python and have not decided on an RDBMS yet. Any help would be much appreciated!

user1185790
  • 623
  • 8
  • 24

1 Answers1

1

If you're going to use a RDBMS, you will eventually have the contents of all 929 files in one database, most probably in more than table. I cannot tell you much more about the design of such a database since you don't provide enough detail about the contents of each of those files. The exact layout will be a normalized form of your 30 million rows in probably a handful of tables. The performance of modern RDBMS is good enough to handle data of that scale if (and only if) your indexes are properly set.

There is very little reason not to put that data into a RDBMS. The only reason I could think of is to eliminate the need of such a subsystem entirely, e.g. to simplify deployment of your solution. If you actually consider doing that then yes, a set of 929 files could act as a hierarchical database. The main difference to a RDBMS solution is that with such a set of flat files you can only reasonably query your data by one key - that being your zip code (or any part thereof).

Hazzit
  • 6,782
  • 1
  • 27
  • 46
  • Hazzit, that is my understanding of the limitations as well. I could partition addresses with unique five-digit zip codes into individual text files and arrange the text files within directories containing unique three-digit zip codes. Therefore, a search for an address containing the zip code 60601 would search for the 606 directory, and then search for the 60601 text file. But as you mentioned, I would only be able to query using one key - the zip code. In the event of a non-match within the five digit zip code, I would need to find ways to efficiently query by three-digit zip code or city. – user1185790 Jun 14 '13 at 14:21
  • @user1185790 if you have a use case that needs a different key, then you should definately go with a RDBMS. – Hazzit Jun 14 '13 at 17:22
  • Thank you Hazzit! I'll go with the RDBMS and apply composite indexes on the query fields. Perhaps one composite index consisting of Five-Digit Zip Code, Address, another consisting of City, Address, and another consisting of Three-Digit Zip Code, Address. – user1185790 Jun 14 '13 at 17:49
  • @user1185790 Without knowing much about your application, please note: A simple (non-composite) index on the Zip Code automatically acts as index on its first three digits as well. A simple index on City is probably the only other index you'll need. – Hazzit Jun 14 '13 at 18:00