I've just obtained a large set of text files (8 GB total) containing all of the address ranges within the U.S. The set consists of:
929 ZIP+4 files, each containing the postal addresses for a single three-digit zip code prefix. For example, file 606 contains only addresses whose five-digit zip code begins with 606. The total number of records across these files is approximately 30 million.
City State file, containing a comprehensive list of zip codes and their corresponding city and state.
The City State Key can be used to join the City State file to the ZIP+4 files.
Given the size of the database and my lack of experience, I wanted to get some insight before beginning this endeavor. Should the ZIP+4 files be merged into one monster file and then indexed by zip code, or left separated by three-digit zip code so that the three-digit prefix in the file name can be used as a block-matching criterion? If it is the latter, wouldn't this be a hierarchical database model? Can I accommodate relationships with the City State file using a hierarchical model?
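To make the "one merged, indexed table" option concrete, here is a rough sketch of what I have in mind, using Python's built-in sqlite3 module only because it needs no setup (I have not settled on an RDBMS). The table names, column names, key values, and sample rows are all made up for illustration and do not reflect the real file layout; the only things taken from the actual data are the ideas of a five-digit zip code, an address range, and a City State Key that links the two files.

```python
import sqlite3


def build_db(conn):
    """Create the two tables and the zip code index (hypothetical schema)."""
    conn.executescript("""
        CREATE TABLE city_state (
            city_state_key TEXT PRIMARY KEY,  -- assumed join key
            city           TEXT NOT NULL,
            state          TEXT NOT NULL
        );
        CREATE TABLE zip4 (
            zip            TEXT NOT NULL,     -- stored as text to keep leading zeros
            street         TEXT NOT NULL,
            addr_low       TEXT,
            addr_high      TEXT,
            city_state_key TEXT REFERENCES city_state(city_state_key)
        );
        -- One index on the merged table stands in for the 929 separate files.
        CREATE INDEX idx_zip4_zip ON zip4(zip);
    """)


def lookup(conn, zip_code):
    """Return all address ranges for one five-digit zip code, with city and state."""
    return conn.execute(
        """SELECT z.zip, z.street, z.addr_low, z.addr_high, c.city, c.state
           FROM zip4 AS z
           JOIN city_state AS c USING (city_state_key)
           WHERE z.zip = ?
           ORDER BY z.zip, z.street""",
        (zip_code,),
    ).fetchall()


def lookup_prefix(conn, prefix):
    """Return all address ranges whose zip code starts with a three-digit prefix."""
    return conn.execute(
        """SELECT z.zip, z.street, z.addr_low, z.addr_high, c.city, c.state
           FROM zip4 AS z
           JOIN city_state AS c USING (city_state_key)
           WHERE z.zip BETWEEN ? AND ?
           ORDER BY z.zip, z.street""",
        (prefix + "00", prefix + "99"),
    ).fetchall()


conn = sqlite3.connect(":memory:")
build_db(conn)

# Made-up sample rows; the real files would be parsed and bulk-inserted instead.
conn.executemany(
    "INSERT INTO city_state VALUES (?, ?, ?)",
    [("K1", "Chicago", "IL"), ("K2", "Springfield", "IL")],
)
conn.executemany(
    "INSERT INTO zip4 VALUES (?, ?, ?, ?, ?)",
    [
        ("60601", "EXAMPLE ST", "100", "198", "K1"),
        ("60602", "SAMPLE AVE", "1", "99", "K1"),
        ("62701", "PLACEHOLDER RD", "200", "298", "K2"),
    ],
)

print(lookup(conn, "60601"))
print(lookup_prefix(conn, "606"))
```

With this layout, the three-digit prefix becomes a range condition on the indexed zip column rather than a choice of file, which is what makes me wonder whether keeping the files separate buys anything.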
The above description of the data set is a vast simplification, but for the purposes of this question, a detailed description is unnecessary. A complete description can be found here.
I'm using Python and have not decided on an RDBMS yet. Any help would be much appreciated!