
I have around 10 million files stored as BLOBs in my database, and I need to convert them and save them as PDFs. Each file is between 0.5 and 10 MB, and the combined size is around 20 TB. I'm trying to implement this using Spring Batch. My question is: when I run the batch, can the server's memory hold that much data? I'm using chunk-based processing and a thread pool task executor. Please suggest whether this is the best approach for processing that much data in the least time.

Kalsid
  • If you use chunks why would all the data need to be in memory? – M. Deinum Feb 11 '21 at 08:24
  • Correct me if I'm wrong, but when we first read from the database, the item reader holds the complete data set in memory, right? – Kalsid Feb 11 '21 at 14:20
  • No, it will only hold the chunk in memory that is being processed. – M. Deinum Feb 11 '21 at 14:49
  • OK, thank you for the clarification. I'm reading keys in the item reader and sending them to the processor, which fetches the blob by key, converts it, and saves it as a PDF; the item writer then writes the status to another table. In this case, does the item processor free the memory once it has processed the data, or do I need to trigger GC myself? – Kalsid Feb 12 '21 at 03:47
  • No, you don't; GC will run when needed. – M. Deinum Feb 12 '21 at 18:58

1 Answer


Each file is 0.5 to 10 MB, so the chunk-based approach you mentioned is a good fit. You can get more control over the job, and monitor its progress, with the following steps.

  1. Partition the file table based on the thread pool size (which depends on your system resources).
  2. Have the reader in each partition step select only one file at a time.
  3. Calculate the memory required from the number of parallel steps and pass it as a VM argument.
  4. Configure the chunk commit size based on the memory calculation for all parallel steps.
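Step 1 can be sketched in plain Java, assuming the file table has a roughly contiguous numeric key. The class `KeyRangeSplitter` and its `KeyRange` type are hypothetical names for illustration, not part of Spring Batch; in a real job the same arithmetic would typically sit inside an implementation of Spring Batch's `Partitioner` interface, with each range stored in that partition's `ExecutionContext`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits an inclusive key range into near-equal slices, one per worker. */
final class KeyRangeSplitter {

    /** One partition's inclusive key range. */
    static final class KeyRange {
        final long minId;
        final long maxId;

        KeyRange(long minId, long maxId) {
            this.minId = minId;
            this.maxId = maxId;
        }
    }

    /**
     * @param minId    smallest key in the file table (assumed known, e.g. from SELECT MIN(id))
     * @param maxId    largest key in the file table
     * @param gridSize number of partitions, usually the thread pool size
     */
    static List<KeyRange> split(long minId, long maxId, int gridSize) {
        if (gridSize <= 0 || maxId < minId) {
            throw new IllegalArgumentException("need gridSize > 0 and minId <= maxId");
        }
        long total = maxId - minId + 1;
        long base = total / gridSize;       // every slice gets at least this many keys
        long remainder = total % gridSize;  // the first `remainder` slices get one extra key

        List<KeyRange> ranges = new ArrayList<>();
        long start = minId;
        // Stop early if there are fewer keys than partitions.
        for (int i = 0; i < gridSize && start <= maxId; i++) {
            long size = base + (i < remainder ? 1 : 0);
            ranges.add(new KeyRange(start, start + size - 1));
            start += size;
        }
        return ranges;
    }
}
```

With 10 million keys and 8 threads, `KeyRangeSplitter.split(1, 10_000_000L, 8)` yields 8 ranges of 1,250,000 keys each. If the keys have large gaps, equal key ranges will not mean equal workloads, and partitioning by row number may balance better.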

Please refer to the question below for example code.

Spring Batch multiple process for heavy load with multiple thread under every process

Rakesh
  • Thank you. I think I'm following the above steps; however, I'm partitioning the data by dividing the number of records equally among the threads, and I'm passing the chunk size in the application properties file. – Kalsid Feb 11 '21 at 16:06
  • How do I configure the chunk commit size based on memory? Does Spring Batch offer anything for this? – Kalsid Feb 12 '21 at 03:52
  • Based on the number of parallel processes and the file size, you can calculate it and pass it as a VM argument at startup. – Rakesh Feb 12 '21 at 07:07
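The calculation mentioned in the last comment might look like the rough sketch below. Everything in it is an assumption for illustration: `HeapEstimate` is a hypothetical helper, the overhead factor of 3 (source bytes, PDF output, conversion buffers) and the 512 MB baseline are guesses that should be replaced with measured values, and the result is only a starting point for a heap setting such as `-Xmx`.

```java
/** Hypothetical back-of-the-envelope heap estimate for a multi-threaded blob conversion job. */
final class HeapEstimate {

    /**
     * @param threads            number of partitions running in parallel
     * @param blobsPerThread     blobs held in memory at once per thread; 1 if the processor
     *                           fetches, converts and saves one blob at a time, otherwise
     *                           the chunk size if blobs are carried through to the writer
     * @param maxBlobMb          size of the largest blob, in MB
     * @param overheadFactor     multiplier for copies made during conversion (assumed value)
     * @param baselineMb         heap reserved for the framework and application itself (assumed value)
     * @return suggested heap size in MB
     */
    static long estimateHeapMb(int threads, int blobsPerThread, long maxBlobMb,
                               double overheadFactor, long baselineMb) {
        double workingSetMb = (double) threads * blobsPerThread * maxBlobMb * overheadFactor;
        return baselineMb + (long) Math.ceil(workingSetMb);
    }

    public static void main(String[] args) {
        // 8 threads, one 10 MB blob in flight per thread, 3x overhead, 512 MB baseline.
        long heapMb = estimateHeapMb(8, 1, 10, 3.0, 512);
        System.out.println("-Xmx" + heapMb + "m"); // prints -Xmx752m
    }
}
```

In the design described in the question, the reader emits only keys and the processor handles one blob at a time, so the chunk size mostly affects how many status rows are committed per transaction rather than how many blobs sit in memory. As far as I know, the commit interval in Spring Batch is an item count (or a custom completion policy), not something derived from memory automatically.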