
I am in the process of implementing a Spring Batch job for our file upload process. My requirement is to read a flat file, apply business logic, store the result in the DB, and then post a Kafka message.

I have a single chunk-based step that uses a custom reader, processor, and writer. The process works fine but takes a lot of time to process a big file.

It takes 15 minutes to process a file with 60K records. I need to reduce that to under 5 minutes, as we will be consuming much bigger files than this.

As per https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html, I understand that making the step multi-threaded would give a performance boost, at the cost of restartability. However, I am using a FlatFileItemReader and a custom ItemProcessor and ItemWriter, and none of them is thread-safe.
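For illustration, here is a minimal sketch of what a multi-threaded step could look like, with the FlatFileItemReader wrapped in a SynchronizedItemStreamReader so that concurrent threads can call read() safely. The bean names, the Message type, the chunk size, and the pool sizes are assumptions based on this post, and the custom processor and writer would also have to be stateless or otherwise thread-safe:

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.ItemProcessor;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.batch.item.file.FlatFileItemReader;
    import org.springframework.batch.item.support.SynchronizedItemStreamReader;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

    @Configuration
    public class FileUploadStepConfig {

        // Wraps the non-thread-safe FlatFileItemReader so that multiple
        // chunk-processing threads can share it.
        @Bean
        public SynchronizedItemStreamReader<Message> synchronizedReader(FlatFileItemReader<Message> delegate) {
            SynchronizedItemStreamReader<Message> reader = new SynchronizedItemStreamReader<>();
            reader.setDelegate(delegate);
            return reader;
        }

        @Bean
        public Step fileUploadStep(StepBuilderFactory steps,
                                   SynchronizedItemStreamReader<Message> reader,
                                   ItemProcessor<Message, Message> processor,
                                   ItemWriter<Message> writer) {
            ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
            taskExecutor.setCorePoolSize(4);
            taskExecutor.setMaxPoolSize(4);
            taskExecutor.initialize();

            return steps.get("fileUploadStep")
                    .<Message, Message>chunk(500)
                    .reader(reader)
                    .processor(processor)
                    .writer(writer)
                    .taskExecutor(taskExecutor)   // chunks are now processed concurrently
                    .throttleLimit(4)             // max concurrent chunk-processing threads
                    .build();
        }
    }

This concurrency is what costs restartability: the reader's position can no longer be saved reliably, so state saving is typically switched off (saveState(false)) for multi-threaded steps.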

Any suggestions as to how to improve performance here?

Here is the writer code:

    @Override
    public void write(List<? extends Message> items) {
        items.forEach(this::process);
    }

    private void process(Message message) {
        if (message == null)
            return;
        try {
            // message is a DTO that has info about success or failure
            if (message.isSuccess()) { // hypothetical accessor for the DTO's success flag
                // post a Kafka message using Spring Cloud Stream
                // insert the record in the DB using a Spring Data JpaRepository
            } else {
                // insert the record in the DB using a Spring Data JpaRepository
            }
        } catch (Exception e) {
            // throw exception
        }
    }

Best regards, Preeti

  • Before going to multi-threading or partitioning, have you profiled your current job? What is the chunk size? Low values mean a lot of transactions, which could be a performance issue. What is the bottleneck of your job: your processing logic or the I/O (read/write operations)? Those questions are really important to see if you really need to scale your job, and if yes, which scaling strategy to implement. – Mahmoud Ben Hassine Mar 02 '21 at 10:59
  • Thanks @MahmoudBenHassine for getting back. I have defined the chunk size as 500. I logged time metrics around the reader, processor, and writer; the writer was the one taking most of the time. Here are the Micrometer stats generated by Spring Batch: Writer (spring.batch.chunk.write) TOTAL_TIME: 766.972706343; Process (spring.batch.item.process) TOTAL_TIME: 3.238209216; Read (spring.batch.item.read) TOTAL_TIME: 4.164657738 – Preeti Mar 02 '21 at 19:48
  • Thank you for the updates. Can you share your writer config? Also, which job repository do you use? The default Map-based job repository is probably slowing things down. – Mahmoud Ben Hassine Mar 03 '21 at 16:13
  • Thank you. I am using the default MapJobRegistry. The writer implements ItemWriter<Message>. I have updated my original post with the writer's logic. – Preeti Mar 03 '21 at 16:29
  • The map-based job repository can be slow and is deprecated: https://github.com/spring-projects/spring-batch/issues/3780, so I recommend using the JDBC-based job repository. Moreover, your writer does not seem to use bulk updates: you are issuing a save operation for each item in a loop. You should do something like `saveAll(items)` to save all items at once in a single bulk operation. We introduced similar improvements in 4.3: https://docs.spring.io/spring-batch/docs/4.3.x/reference/html/whatsnew.html#performanceImprovements which you can use for inspiration. – Mahmoud Ben Hassine Mar 03 '21 at 16:51
  • Sorry, I was wrong. I am using the JDBC-based job repository. I understand what you are saying, but my requirement is to save each element individually after posting to Kafka. – Preeti Mar 03 '21 at 19:34
  • In this case, if you hit the limits and you think your job cannot be optimized further, then you can try a multi-threaded step or a locally partitioned step (i.e. a thread per worker step). In both cases, you need to make sure your batch artifacts (reader, processor, writer) are thread-safe. – Mahmoud Ben Hassine Mar 03 '21 at 20:10
  • Thanks for the quick response. I wanted to pick your brain on another approach: using a CompositeItemWriter, i.e. a JpaItemWriter to perform the DB commits plus a separate writer that posts messages to Kafka in an @Async call. This did help speed up the process. However, I am not able to envision the behavior: 1. When something goes wrong while posting a message to Kafka and execution stops, would it roll back the DB inserts? 2. What are the repercussions on restartability? I understand that with multi-threading I would lose restart behavior, and partitioning would complicate my flow further. – Preeti Mar 04 '21 at 16:06
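For reference, here is a minimal sketch of the CompositeItemWriter approach raised in the last comment, assuming Spring Batch 4.x; the bean names, the Message.isSuccess() accessor, and the Spring Cloud Stream call are placeholders, not code from this project. A JpaItemWriter persists the whole chunk inside the chunk's transaction, while a small delegate writer posts the Kafka messages:

    import java.util.Arrays;
    import javax.persistence.EntityManagerFactory;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.batch.item.database.JpaItemWriter;
    import org.springframework.batch.item.support.CompositeItemWriter;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class CompositeWriterConfig {

        // Posts one Kafka message per successful item. isSuccess() and the
        // commented-out send call are placeholders for the DTO's success flag
        // and whatever Spring Cloud Stream binding the project uses.
        @Bean
        public ItemWriter<Message> kafkaItemWriter() {
            return items -> {
                for (Message message : items) {
                    if (message.isSuccess()) {
                        // e.g. streamBridge.send("file-upload-out-0", message);
                    }
                }
            };
        }

        // Persists the whole chunk via JPA in one go instead of saving each
        // item individually.
        @Bean
        public JpaItemWriter<Message> jpaItemWriter(EntityManagerFactory entityManagerFactory) {
            JpaItemWriter<Message> writer = new JpaItemWriter<>();
            writer.setEntityManagerFactory(entityManagerFactory);
            return writer;
        }

        // Delegates are called in order for every chunk, inside the chunk's
        // transaction.
        @Bean
        public CompositeItemWriter<Message> compositeItemWriter(ItemWriter<Message> kafkaItemWriter,
                                                                JpaItemWriter<Message> jpaItemWriter) {
            CompositeItemWriter<Message> writer = new CompositeItemWriter<>();
            writer.setDelegates(Arrays.asList(kafkaItemWriter, jpaItemWriter));
            return writer;
        }
    }

On the two questions in the last comment: as long as both delegates run synchronously inside the chunk transaction, an exception thrown while posting to Kafka fails the chunk and rolls back that chunk's JPA inserts (already-sent Kafka messages cannot be rolled back, and a restart may re-send them). Once the Kafka call is made @Async, the failure happens outside the chunk transaction and no rollback occurs. Restartability is not affected by the writer itself as long as the step stays single-threaded.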

1 Answer


Please refer to the SO threads below (and the GitHub source code linked there) for parallel processing; a rough sketch of the partitioned-step approach follows the links.

Spring Batch multiple process for heavy load with multiple thread under every process

Spring batch to process huge data
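For context, here is a rough sketch of the locally partitioned approach those threads describe, assuming Spring Batch 4.x. The LineRangePartitioner, grid size, thread pool, total line count, and the existence of a thread-safe, step-scoped "workerStep" (whose reader honours minLine/maxLine, e.g. via setCurrentItemCount/setMaxItemCount) are all assumptions for illustration, not code from the linked posts:

    import java.util.HashMap;
    import java.util.Map;
    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.core.partition.support.Partitioner;
    import org.springframework.batch.item.ExecutionContext;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

    @Configuration
    public class PartitionedStepSketch {

        // Hypothetical partitioner: splits the file into line ranges, one range
        // per worker execution. A step-scoped reader reads minLine/maxLine from
        // its step execution context and processes only that slice of the file.
        static class LineRangePartitioner implements Partitioner {
            private final int totalLines;

            LineRangePartitioner(int totalLines) {
                this.totalLines = totalLines;
            }

            @Override
            public Map<String, ExecutionContext> partition(int gridSize) {
                Map<String, ExecutionContext> partitions = new HashMap<>();
                int linesPerPartition = (totalLines + gridSize - 1) / gridSize;
                for (int i = 0; i < gridSize; i++) {
                    ExecutionContext context = new ExecutionContext();
                    context.putInt("minLine", i * linesPerPartition + 1);
                    context.putInt("maxLine", Math.min((i + 1) * linesPerPartition, totalLines));
                    partitions.put("partition" + i, context);
                }
                return partitions;
            }
        }

        // Manager step: runs the worker chunk step once per partition on a local
        // thread pool. 60_000 is the record count mentioned in the question.
        @Bean
        public Step managerStep(StepBuilderFactory steps, Step workerStep) {
            ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
            taskExecutor.setCorePoolSize(4);
            taskExecutor.initialize();

            return steps.get("managerStep")
                    .partitioner("workerStep", new LineRangePartitioner(60_000))
                    .step(workerStep)
                    .gridSize(4)
                    .taskExecutor(taskExecutor)
                    .build();
        }
    }

Unlike a multi-threaded step, a locally partitioned step keeps restartability, since each worker execution tracks its own state; the trade-off is the extra partitioning plumbing.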

Rakesh