I have been given the task of extracting data from a large CSV file. It contains a lot of data about articles and publishers. I want to write a parser for it in Java so I can load the data into a data warehouse and present it through OLAP. Can anyone tell me how to get started?
1 Answer
First, see whether you can do it without programming (good programmers are notoriously lazy, so why break tradition?). Check whether the provider of your data warehouse allows importing CSV data directly. For example, in Oracle you can import CSV files via SQL Developer.
If that is not possible (say, a single line of CSV will end up in multiple tables), then I would start with a test program. Write classes for all the data that will be populated from the CSV file (article, magazine, publisher, author, etc.) and an uber object that contains an instance of each (several if a single line holds several) along with the CSV line itself. Define an interface for reading the file that returns a list of your uber objects, and an interface for writing that list.
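As a rough sketch of that layout (assuming Java 16 or later for records), it could look something like this. All the type and field names here (`Article`, `ParsedLine`, `CsvSource`, `ParsedLineSink`) are made up for illustration; the real fields depend on the columns in your file.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical domain objects; replace the fields with your real columns.
record Article(String title) {}
record Publisher(String name) {}
record Author(String name) {}

// The "uber object": the raw CSV line plus everything parsed out of it.
// A single line may carry several authors, hence the list.
record ParsedLine(String rawLine, Article article, Publisher publisher, List<Author> authors) {}

// Read side: one implementation per CSV library you want to try.
interface CsvSource {
    List<ParsedLine> read(Path csvFile) throws IOException;
}

// Write side: first a flat-file dump, later a database writer.
interface ParsedLineSink {
    void write(List<ParsedLine> lines) throws IOException;
}
```

Keeping the raw line inside the uber object is what lets you eyeball each parser's result against the input later on.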
Then create classes that implement the read interface using the common solutions: Java's String.split, opencsv, univocity-parsers, Apache Commons CSV, and Super CSV. Also create a service that takes the list of uber objects and writes the contents (original line, then parsed content) to a text file.
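To give an idea, here is a minimal sketch of the simplest candidate (plain `String.split`) together with the text-file dump. `SplitLine`, `SplitReader` and `TextFileWriter` are illustrative names, and `SplitLine` is a trimmed-down stand-in for the uber object that just holds the raw line and its fields.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Trimmed-down stand-in for the uber object: the raw line and its fields.
record SplitLine(String rawLine, List<String> fields) {}

// Baseline reader built on String.split; it does not understand quoting.
class SplitReader {
    List<SplitLine> read(Path csvFile) throws IOException {
        List<SplitLine> result = new ArrayList<>();
        for (String line : Files.readAllLines(csvFile, StandardCharsets.UTF_8)) {
            // A limit of -1 keeps trailing empty columns such as "a,b,,".
            result.add(new SplitLine(line, List.of(line.split(",", -1))));
        }
        return result;
    }
}

// Dumps the original line followed by its parsed fields, one field per row.
class TextFileWriter {
    void write(List<SplitLine> lines, Path out) throws IOException {
        List<String> text = new ArrayList<>();
        for (SplitLine l : lines) {
            text.add(l.rawLine());
            for (int i = 0; i < l.fields().size(); i++) {
                text.add("  [" + i + "] " + l.fields().get(i));
            }
        }
        Files.write(out, text, StandardCharsets.UTF_8);
    }
}
```

Note that the split version will cut a quoted field like `"Smith, J"` in two, which is exactly the kind of thing the dump makes visible and the reason to try the real CSV libraries next to it.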
Then write a main Java app that reads one of your CSV files with each of the reader implementations and writes each reader's result to its own flat file. If one fails, see if you can configure it to work, or if that becomes too annoying, drop it from your list. Eventually you will come down to a short list of parsers you like whose output files are all the same (so they all work, or they all failed). At that point, pick the one you like best.
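Checking that the output files are the same does not have to be done by eye. A small sketch, assuming Java 12 or later for `Files.mismatch`; the class name `OutputComparison` and the map of parser name to dump file are my own invention:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class OutputComparison {
    // Returns the names of parsers whose dump differs from the first entry's dump.
    // Pass a map with a stable iteration order (e.g. LinkedHashMap).
    static List<String> disagreeing(Map<String, Path> dumps) throws IOException {
        List<String> differing = new ArrayList<>();
        Path reference = null;
        for (Map.Entry<String, Path> entry : dumps.entrySet()) {
            if (reference == null) {
                reference = entry.getValue(); // first dump is the baseline
                continue;
            }
            // Files.mismatch returns -1 when both files have identical content.
            if (Files.mismatch(reference, entry.getValue()) != -1L) {
                differing.add(entry.getKey());
            }
        }
        return differing;
    }
}
```

An empty result only tells you the parsers agree with each other, so you still want to read through one of the dumps to confirm they agree on the right answer.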
Finally, replace the writer with an object that writes to the database, and modify the reader so it reads one record at a time, so you won't run out of memory when dealing with large files. Then you are done.
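For that last step, a minimal sketch of what record-at-a-time loading might look like. Everything here is an assumption for illustration: `RecordSink`, `StreamingLoader` and `JdbcArticleSink` are made-up names, the `article` table with `title` and `publisher` columns is a placeholder, each line is assumed to have at least two columns, and `String.split` stands in for whichever parser you ended up picking.

```java
import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Receives one parsed record at a time, so the whole file never sits in memory.
interface RecordSink {
    void accept(String[] fields) throws Exception;
}

class StreamingLoader {
    // Reads the file line by line, hands each record to the sink, returns the count.
    static long load(Path csvFile, RecordSink sink) throws Exception {
        long count = 0;
        try (BufferedReader in = Files.newBufferedReader(csvFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                // String.split stands in for the parser chosen in the comparison.
                sink.accept(line.split(",", -1));
                count++;
            }
        }
        return count;
    }
}

// Hypothetical database sink: table and column names are placeholders.
class JdbcArticleSink implements RecordSink, AutoCloseable {
    private static final int BATCH_SIZE = 1000;
    private final PreparedStatement insert;
    private int pending = 0;

    JdbcArticleSink(Connection connection) throws SQLException {
        insert = connection.prepareStatement(
                "INSERT INTO article (title, publisher) VALUES (?, ?)");
    }

    @Override
    public void accept(String[] fields) throws SQLException {
        insert.setString(1, fields[0]);
        insert.setString(2, fields[1]);
        insert.addBatch();
        if (++pending == BATCH_SIZE) { // send rows in batches, not one by one
            insert.executeBatch();
            pending = 0;
        }
    }

    @Override
    public void close() throws SQLException {
        try {
            if (pending > 0) insert.executeBatch(); // flush the last partial batch
        } finally {
            insert.close();
        }
    }
}
```

You would open the sink in a try-with-resources block around `StreamingLoader.load(file, sink)` so the final batch gets flushed even if the file ends mid-batch.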
:)
