
While I have worked with large datasets in the past, I have never worked with one quite this large. Here's the breakdown:

  • Every month I will receive a 5GB flat file containing 1 database record per row
  • This file contains approximately 10 million rows of data, with each subsequent month containing more data than the previous
  • The data may have exact duplicates which need to be ignored
  • The data may also have inexact duplicates, which need to be analyzed so that only one of the rows is inserted and the rest are discarded

I need to set up an initial load of 10 years' worth of data, which is approximately 1.5 billion rows. I have no problem with the database handling this, but I am having difficulty loading this data quickly.

I am currently inserting the rows one at a time, letting the primary key constraint tell me when I have a row that needs to be reviewed; those rows go into a separate table.

Loading a 5GB file takes approximately 10 hours, and this is the fastest I can get it. I have it set up to preprocess about 25,000 rows at a time, then I loop through them executing one statement at a time, like so:

public void saveBatchSql(String[] sql) {
    // Execute each pre-built INSERT one at a time; rows rejected by the
    // primary key constraint are redirected to the duplicates table.
    try (Statement stmt = dbCon.createStatement()) {
        for (String query : sql) {
            try {
                stmt.execute(query);
            }
            catch (SQLException ee) {
                // Assume the failure was a duplicate key and retry the same
                // row against the review table instead.
                query = query.replace("component_detail", "component_detail_duplicate");
                stmt.execute(query);
            }
        }
    }
    catch (SQLException e) {
        e.printStackTrace();
    }
}

So far, this is the best I can come up with. Does anyone have any ideas on how I can get this to work faster? As far as I can tell I can't process all the records in memory, but maybe there is some other way?

By the way, the database I am loading to is Sybase version 16.

Dan G
  • You could try batching the statements. BTW: preprocessing 25,000 rows at a time might be inefficient from a memory perspective. – Mark Rotteveel May 12 '18 at 16:25
  • Have you analyzed whether it's possible to partition the table, then split the input data into blocks and insert them via multiple threads? You could also add a checksum/hash key to every row to quickly check whether the data has changed. – IgrewupwithSlackware May 12 '18 at 16:33
  • I think the first step should be to profile the running code with a profiler. I'd bet the bottlenecks are nowhere near where you think they are. – markspace May 12 '18 at 16:33
  • @markspace I doubt profiling will produce any significant results as the bottleneck isn't Java code but the database. – lexicore May 12 '18 at 16:37
  • Don't use Java at all. Use Sybase's bulk loading utility. – Dan Bracuk May 12 '18 at 16:46
  • Maybe also check this link, which looks to do what you need: https://stackoverflow.com/questions/2112395/sybase-jconnect-enable-bulk-load-usage – Rich Campbell Jun 06 '18 at 12:25

2 Answers


This depends on the database, but usually the recommended way is to use the database's own tools to load data from some simple format like CSV. Depending on the database you may need to drop indices before importing and recreate them afterwards. Database documentation often has hints on how to make bulk data loading performant.

I would advise against using Java for bulk loading. What for?

Just for loading? Database tools are optimized for that; you won't be able to do it better in Java.

Data processing while loading? Well, usually you can do most of the data processing with SQL in the database. So load the data into an incoming table and run SQL queries on it. It is much more flexible than hardcoding processing logic in Java. In most cases it will also be more efficient, as databases are quite clever about optimizing SQL queries/statements.
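
As a rough illustration of that staging-table approach (not lexicore's exact method): assume the database's own loader has already filled a staging table. The staging table name component_detail_incoming, its key column component_id, and the JDBC URL/credentials below are invented for the sketch; only component_detail and component_detail_duplicate come from the question. The Java side then shrinks to a couple of set-based statements:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Illustration only: assumes the flat file has already been bulk-loaded into a
// staging table by the database's own tooling. Staging table, key column and
// connection details are placeholders. Older jConnect drivers may also need an
// explicit Class.forName(...) call to register the driver.
public class StagingLoadSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:sybase:Tds:dbhost:5000/mydb", "user", "password");
             Statement stmt = con.createStatement()) {

            // Rows whose key already exists in the target go to the review table.
            stmt.executeUpdate(
                "insert into component_detail_duplicate "
                + "select s.* from component_detail_incoming s "
                + "where exists (select 1 from component_detail t "
                + "              where t.component_id = s.component_id)");

            // DISTINCT drops exact duplicates; NOT EXISTS keeps out keys that are
            // already loaded. Inexact duplicates within the same file (same key,
            // different values) still need one more set-based step to pick a
            // single surviving row.
            stmt.executeUpdate(
                "insert into component_detail "
                + "select distinct s.* from component_detail_incoming s "
                + "where not exists (select 1 from component_detail t "
                + "                  where t.component_id = s.component_id)");

            stmt.executeUpdate("truncate table component_detail_incoming");
        }
    }
}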

lexicore
  • You are right about efficiency, but I am trying to provide my client with a means to upload this himself via a utility. There are also other functions they need on top of loading, but I may be able to split off the load into a Sybase-specific format; I will look into this. There's also some data massaging that takes place, as well as some validation, which may prove tricky in a trigger and may in fact become much slower when the data hits hundreds of millions of rows. I guess some experimentation is in order. – Dan G May 13 '18 at 16:44
  • Maybe check this thread re bulk load (i.e. BCP) option via Java : https://stackoverflow.com/questions/2112395/sybase-jconnect-enable-bulk-load-usage – Rich Campbell May 22 '18 at 15:39
  • @RichCampbell Please comment on the question; I think the OP wasn't notified of your comment. – lexicore May 22 '18 at 15:54

Assuming you're talking about Sybase ASE 16, work with your DBA to monitor your loads to see where you're experiencing bottlenecks.

Primary emphasis would be on capturing/analyzing MDA data (master..mon% tables); key items to look for are wait events/timings as well as performance of the individual inserts (cpu/memory usage, logical IOs, etc).

Figure out if you're having to update a lot of indexes ... or perhaps causing RI/foreign key checks to fire ... or perhaps causing triggers to fire ... all of these can greatly degrade the throughput of your inserts.


Some general design ideas ...

1) use prepared statements (and make sure you re-use the prepared statement rather than creating a new one for each insert) - the objective is to eliminate the overhead of having the database compile each insert statement

2) batch your inserts (i.e., wrap them in a begin/commit tran) and test to see what size works best for you - the objective is to cut down on the number of relatively slow log writes the db has to perform for each transaction

3) since reads (from your file) are likely faster than writes (to the database), look at having your reader thread feed into multiple/parallel writer threads - the objective being to make sure you're not bottlenecking on writes (a sketch combining points 1-3 follows this list)

4) parallelize your reads (with each reader feeding into multiple/parallel writers), e.g., have separate reader threads read from different data files

5) as lexicore has mentioned, see if you can use ASE's bcp (OS-level) utility to assist with bulk loading your data
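
Where points 1-3 come together, a minimal Java skeleton could look like the following. The JDBC URL, credentials, column names, chunk size and writer count are placeholders (not values from the question), and the file-parsing step is left as a comment:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Skeleton of points 1-3: one reader feeds chunks of parsed rows into a queue,
// several writers each hold one connection and one re-used PreparedStatement,
// and each chunk is committed as a single transaction. Connection details,
// column names and sizes are placeholders.
public class ParallelBatchLoader {

    private static final int CHUNK_SIZE = 5_000;             // rows per batch/transaction
    private static final int WRITERS = 4;                    // parallel writer threads (point 3)
    private static final List<String[]> POISON = List.of();  // end-of-input marker

    public static void main(String[] args) throws Exception {
        BlockingQueue<List<String[]>> queue = new ArrayBlockingQueue<>(16);
        ExecutorService pool = Executors.newFixedThreadPool(WRITERS);
        for (int i = 0; i < WRITERS; i++) {
            pool.submit(() -> writeChunks(queue));
        }

        // Reader goes here: parse the flat file into chunks of ~CHUNK_SIZE
        // parsed rows and queue.put(chunk) each one.

        for (int i = 0; i < WRITERS; i++) {
            queue.put(POISON);                                // tell every writer to finish
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }

    private static Void writeChunks(BlockingQueue<List<String[]>> queue) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:sybase:Tds:dbhost:5000/mydb", "user", "password")) {
            con.setAutoCommit(false);                         // point 2: explicit transactions
            try (PreparedStatement ps = con.prepareStatement(
                     "insert into component_detail (key_col, value_col) values (?, ?)")) {
                while (true) {
                    List<String[]> chunk = queue.take();
                    if (chunk == POISON) {
                        break;                                // no more input
                    }
                    for (String[] row : chunk) {              // point 1: one statement, re-used
                        ps.setString(1, row[0]);
                        ps.setString(2, row[1]);
                        ps.addBatch();
                    }
                    ps.executeBatch();                        // one round trip per chunk
                    con.commit();                             // one log flush per chunk
                }
            }
        }
        return null;
    }
}

Each queued chunk maps to one transaction, so the commit frequency from point 2 is tuned simply by changing the chunk size the reader produces.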

markp-fuso