
We have a requirement to read 100 million records, process them, and then insert them into a DB2 table as part of our application development.

We use an index on the primary keys, but we are not sure whether the insertion process will take a long time (possibly hours).

We are hashing on fields of the target table to distribute the data across different nodes, so I guess partitioning the table or hashing the field data would not help the inserts.

I would like to know what options are available to insert 100 million records into a DB2 table efficiently using Java. We are using Spring Batch (batches of 9k records at a time).

Thanks in advance.

  • This question is probably a better fit on [dba.se]. I'm flagging it for migration. – Jim Garrison Mar 11 '16 at 16:39
  • Ok, thanks Jim Garrison. Is there anything else we can do on the Java side? – techi Mar 11 '16 at 16:43
  • So, have you done any tests, taken any measurements? How long does it take to insert your 100M records with your current code? Do you observe high network traffic, high CPU / IO load on the DB, any load on the Java side? – vagelis Mar 11 '16 at 16:44
  • @vagelis, yes, currently we are handling 5 million records and it takes around 45 minutes; we run this as a daily job. I could not check the CPU load since I am just a developer on the team; I will check with others. – techi Mar 11 '16 at 16:46
  • Usually it's much more efficient to load all your data into the RDBMS using one of the tools available for that particular RDBMS (I guess it would be the LOAD utility for DB2) and then process your data in the database. I don't know whether it's possible in your case, but in my work history it was always possible and was always __much__ faster compared to the application's row-by-row approach. – MaxU - stand with Ukraine Mar 11 '16 at 16:51
  • What kind of processing do you do on the data, and in what way? Do you process all the data in advance and then send it to the DB, or process and send one row at a time? – vagelis Mar 11 '16 at 16:55
  • Thanks @MaxU for the info. It's not possible in my case, as I have to invoke other Java services between the read and the write back to DB2. – techi Mar 11 '16 at 17:03
  • @vagelis, once we read, we invoke another in-house engine (a separate jar with custom Java logic) to calculate some ratio values (we send 9k records as a batch), and then send the final 9k results as a batch to be persisted to DB2 using Spring Batch. – techi Mar 11 '16 at 17:05
  • So, to get this straight, you read 100M records in one go, then cut them in 9K batches, and process and insert each 9K batch, correct? – vagelis Mar 11 '16 at 17:14
  • @vagelis, I read 9k, process 9k and insert 9k at a time, and repeat this until all 100M are read, processed and inserted into DB2. Currently we are doing this for 5M rows and thinking about how to handle the 100M case. – techi Mar 11 '16 at 17:18
  • One important tip: when inserting huge data like this, don't create the indexes (and don't define the primary key, since it's implemented with an index) until after you have loaded all the data. On many databases, it's better to also delay the creation of any triggers and rules until after the insertion is done. – RealSkeptic Mar 11 '16 at 17:20
  • If you measure the time to read, process and insert a 9K batch, what percentage of the total time (for a 9K batch) does each step take? Make sure you measure the times for all batches and compute the averages. – vagelis Mar 11 '16 at 17:34
  • Adding to what @MaxU has to say, I've created DB2 rows in unload/load format and written the rows to QSAM (flat) files. As a final step, I've dropped all the indexes, used the load utility to load the rows, then rebuilt all the indexes. It didn't matter that it was faster. What mattered was that we didn't have to lock the tables until we were ready to load. – Gilbert Le Blanc Mar 11 '16 at 18:16
  • What is the throughput requirement? How are you inserting the 9k records – in a single `batchUpdate()` or individually for each record? Are you able to parallelize the process? – Ian Bjorhovde Mar 11 '16 at 19:13
  • Thanks all, but it's not a one-time load where I can insert data without indexes; it's a daily batch job. Moreover, after this read and write there is a dependent Spring Batch task that reads the records persisted in the earlier step in 9k batches, and this second step also reads 9k, processes 9k and writes 9k. The reason we do it in two steps is that we need to aggregate the data after each step (it's mandatory). So I think it is not possible to automatically disable the indexes, run the job, re-enable the indexes after the job finishes, and then start the second step... – techi Mar 11 '16 at 19:14
  • @IanBjorhovde, we are using Spring Batch (yes, it uses `batchUpdate()` for the 9k records). Can you elaborate on what you mean by parallelizing the process? – techi Mar 11 '16 at 19:17
  • @techi Sorry, I meant does your writer process use the JDBC `addBatch()` and `executeBatch()` calls (a minimal sketch of that pattern follows these comments). For parallelizing, do you have multiple separate threads writing to the database (i.e. partition the input and have multiple threads processing / writing)? Also, what is your throughput requirement? How fast does this need to run? – Ian Bjorhovde Mar 11 '16 at 19:44
  • @IanBjorhovde, as I mentioned earlier, we are using Spring Batch's JdbcBatchItemWriter, which I think executes in batches internally. At present we are not supposed to do multithreading, but even if we did, one thread would have to wait for another thread to complete its write to the table, because locks will be applied to the table (depending on which node the table is created). How did you handle this? – techi Mar 11 '16 at 21:55
  • @techi With a properly designed database, multiple threads inserting in parallel should not be an issue. Unless 15 hours is a reasonable time to process your 100M rows, you'll need to parallelize or further optimize your database and/or code. – Ian Bjorhovde Mar 14 '16 at 21:37
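
A minimal sketch of the JDBC `addBatch()` / `executeBatch()` pattern discussed in the comments above, assuming a made-up target table `RATIO_RESULTS(ID, RATIO)`. Spring Batch's `JdbcBatchItemWriter` does roughly this under the hood, so this only illustrates what one 9k chunk looks like at the JDBC level, not the asker's actual code:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;

public class RatioBatchWriter {

    /** Simple value holder for one computed row (illustrative only). */
    public static class RatioRow {
        final long id;
        final double ratio;
        public RatioRow(long id, double ratio) { this.id = id; this.ratio = ratio; }
    }

    // Assumed target table and columns; replace with the real schema.
    private static final String INSERT_SQL =
            "INSERT INTO RATIO_RESULTS (ID, RATIO) VALUES (?, ?)";

    private final DataSource dataSource;

    public RatioBatchWriter(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    /** Inserts one chunk (e.g. 9k rows) as a single JDBC batch with one commit. */
    public void writeChunk(List<RatioRow> chunk) throws SQLException {
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(INSERT_SQL)) {
            con.setAutoCommit(false);        // commit once per chunk, not per row
            for (RatioRow row : chunk) {
                ps.setLong(1, row.id);
                ps.setDouble(2, row.ratio);
                ps.addBatch();               // queue the row on the client side
            }
            ps.executeBatch();               // send the whole chunk in one batched call
            con.commit();
        }
    }
}
```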

1 Answer


OPTION-1 (Java): I think you can look into partitioning your source table. That will help a lot, but it comes with some sacrifices, e.g. you won't be able to update the column you choose to partition on. You also have to choose the right column to partition on; you don't want to end up with thousands of partitions. Instead, you want to partition on segregated data that may not have any relation. For example, if you have a CITY column in your source, you may try partitioning on CITY, so you end up with maybe 50 partitions if your data has 50 cities. It helps a lot. You can also try partitioning on a date range or quarter of the year. These helped me a lot when selecting data, and you may also look into multithreading on each partition (a rough sketch of that idea follows below).
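
A rough sketch of the "multithread each partition" idea, assuming the input can be split by some partition key such as the hypothetical CITY column above. The interface, thread count and error handling are illustrative only, and whether parallel writers actually help depends on the table's locking and partitioning setup mentioned in the comments:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PartitionedLoad {

    /** Illustrative stand-in for "read 9k, process 9k, insert 9k" for one partition. */
    interface PartitionWorker {
        void loadPartition(String partitionKey) throws Exception;
    }

    /** Runs one worker per partition key (e.g. one per CITY) on a fixed thread pool. */
    public static void loadAll(List<String> partitionKeys,
                               PartitionWorker worker,
                               int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String key : partitionKeys) {
            pool.submit(() -> {
                try {
                    worker.loadPartition(key);   // each thread touches only its own partition
                } catch (Exception e) {
                    e.printStackTrace();         // real code should record the failure and fail the job
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS); // wait for all partitions to finish
    }
}
```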

OPTION-2 (STORED PROCEDURE (SP)): You may try to use a stored procedure (PL/SQL) in your database so you don't have any Java concerns. SPs are 100x faster than even batches from Java. But I would still recommend partitioning the source tables if possible, even if you use SPs (a minimal sketch of calling such a procedure from Java follows below).
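
If the stored-procedure route is taken, the Java side can shrink to a single call per run or per batch. A minimal sketch, assuming a hypothetical LOAD_RATIOS procedure (taking a run date) already exists in DB2; the procedure name and parameter are made up for illustration:

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.Date;
import java.sql.SQLException;
import javax.sql.DataSource;

public class StoredProcLoader {

    private final DataSource dataSource;

    public StoredProcLoader(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    /** Delegates the heavy insert work to a hypothetical LOAD_RATIOS procedure inside DB2. */
    public void runDailyLoad(Date runDate) throws SQLException {
        try (Connection con = dataSource.getConnection();
             CallableStatement cs = con.prepareCall("{call LOAD_RATIOS(?)}")) {
            cs.setDate(1, runDate);  // the procedure reads, computes and inserts entirely in the DB
            cs.execute();
        }
    }
}
```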