While I have worked with large datasets in the past, I have never worked with one quite this huge. Here's the breakdown:
- Every month I will receive a 5GB flat file containing 1 database record per row
- This file contains approximately 10 million rows of data, with each subsequent month containing more data than the previous
- The data may have exact duplicates, which need to be ignored (see the sketch after this list)
- The data may also have inexact duplicates; these need to be analyzed, and only one of the rows should be inserted, with the rest discarded
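For the exact duplicates, here is roughly what I mean by "ignored", expressed as a preprocessing step. This is only a sketch to show the idea, not my real code: the file path, charset, and the choice of a SHA-256 digest per line are simplifications.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.function.Consumer;

    public class ExactDuplicateFilter {

        // Streams the monthly flat file and hands only the first occurrence of each
        // line to the row handler; later identical lines are skipped.
        public static void forEachDistinctRow(String path, Consumer<String> rowHandler)
                throws IOException, NoSuchAlgorithmException {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            Set<String> seenDigests = new HashSet<>(20_000_000); // sized for ~10 million rows
            try (BufferedReader reader = Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Keeping one digest per line in memory is far cheaper than keeping the 5GB of raw rows
                    String digest = new BigInteger(1, sha.digest(line.getBytes(StandardCharsets.UTF_8))).toString(16);
                    if (seenDigests.add(digest)) { // add() returns false for an exact duplicate
                        rowHandler.accept(line);
                    }
                }
            }
        }
    }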
I need to set up an initial load of 10 years' worth of data, which is approximately 1.5 billion rows. I have no problem with the database handling this, but I am having difficulty loading the data quickly.
I am currently inserting the rows one at a time and letting primary key constraint violations tell me when a row needs to be reviewed; those rows go into a separate table.
Loading a 5GB file takes approximately 10 hours, and that is the fastest I can get it. I have it set up to preprocess about 25,000 rows at a time and then loop through them, executing each one with a single Statement, like so:
    public void saveBatchSql(String[] sql) {
        try {
            Statement stmt = dbCon.createStatement();
            for (String query : sql) {
                try {
                    stmt.execute(query);
                } catch (Exception ee) {
                    // Primary key violation: redirect the row to the review table instead
                    query = query.replace("component_detail", "component_detail_duplicate");
                    stmt.execute(query);
                }
            }
            stmt.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
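I assume the usual suggestion will be JDBC batching with a PreparedStatement, so for reference here is roughly what I think that would look like for one 25,000-row chunk (the column names and types are placeholders, not my real schema). My concern is that a single duplicate key anywhere in the chunk throws a BatchUpdateException, and recovering the individual failed rows is messier than the simple per-row fallback I have now:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    public class BatchInserter {

        private final Connection dbCon;

        public BatchInserter(Connection dbCon) {
            this.dbCon = dbCon;
        }

        // Hypothetical batched insert for one 25,000-row chunk.
        // "component_detail" is my real table, but the two columns here are placeholders.
        public void saveBatch(List<String[]> rows) throws SQLException {
            String insert = "INSERT INTO component_detail (component_id, detail_value) VALUES (?, ?)";
            boolean previousAutoCommit = dbCon.getAutoCommit();
            dbCon.setAutoCommit(false);            // commit once per chunk instead of per row
            try (PreparedStatement ps = dbCon.prepareStatement(insert)) {
                for (String[] row : rows) {
                    ps.setString(1, row[0]);
                    ps.setString(2, row[1]);
                    ps.addBatch();
                }
                ps.executeBatch();                 // one batched round trip, driver permitting
                dbCon.commit();
            } catch (SQLException e) {
                dbCon.rollback();                  // a duplicate key anywhere in the chunk lands here
                throw e;
            } finally {
                dbCon.setAutoCommit(previousAutoCommit);
            }
        }
    }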
So far, this is the best I can come up with. Does anyone have ideas on how I can make this faster? As far as I can tell I can't process all the records in memory, but maybe there is some other way?
By the way, the database I am loading to is Sybase version 16.