
I'm using Spring Batch with a Kubernetes Cron Manager to schedule and run a job. The job involves calling an external API to read, process, and write data. However, when I execute the job with a dataset of 200,000 items, it takes an excessively long time to complete, at least 5 hours.

In my configuration, I have set up a single replica set with a single pod in the Kubernetes cluster, and I have also configured Spring Batch to use 40 task executors for concurrent processing. Despite these settings, the job execution time is still significantly slow.

I would appreciate any insights or suggestions on how to improve the execution time of my job. Are there any specific optimizations or best practices I should consider when working with Spring Batch and Kubernetes Cron Manager in such scenarios?

This is my Kubernetes resource configuration for the Spring Batch pod:

resources:
    requests:
      cpu: 8
      memory: 8Gi
    limits:
      cpu: 16
      memory: 16Gi

This is my task executor:

    @Bean
    @StepScope
    public TaskExecutor taskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(40);
        executor.setMaxPoolSize(40);
        executor.setThreadNamePrefix("spring_batch_worker-");
        executor.setWaitForTasksToCompleteOnShutdown(true);
        return executor;
    }


This is my writer:

public class Writer<T> implements ItemWriter<String> {
        private static final Logger logger = LoggerFactory.getLogger(Writer.class);

        private final String paymentType;
        private final String type;

        public Writer(String paymentType,
                      String type) {
                this.paymentType = paymentType;
                this.type = type;
        }

        @Autowired
        private TaskExecutor taskExecutor;

        @Autowired
        private Client client;

        @Override
        public void write(List<? extends String> users) throws Exception {
                for (String userId : users) {
                        taskExecutor.execute(() -> {
                                String currentThreadName = Thread.currentThread()
                                                                 .getName();
                                try {
                                        logger.info("action", "repayment_processing_item",
                                                "threadName", currentThreadName,
                                                "userId", userId,
                                                "paymentType", paymentType,
                                                "type", type);
                                        client.makePayment(userId, paymentType, type);
                                } catch (Exception e) {
                                        logger.error("action", "repayment_failed_to_process_item",
                                                "errorMessage", GeneralUtil.getErrorMessage(e),
                                                "threadName", currentThreadName,
                                                "userId", userId,
                                                "paymentType", paymentType,
                                                "type", type);
                                }
                        });
                }
        }
}
Faisal
  • `I have also configured Spring Batch to use 40 task executors for concurrent processing`: This is probably the issue. Adding more threads does not necessarily improve performance. In fact, it can be counter-productive, as your app will spend more time managing threads than doing the business logic. Can you share what your app is doing and how it is designed to distribute the work among threads? Without the code, we can't really help you efficiently. – Mahmoud Ben Hassine Jun 19 '23 at 09:19
  • I have updated the question – Faisal Jun 19 '23 at 11:15
  • Your architecture is confusing; you're using a non-distributed application on a distributed architecture. You're using kubernetes as if it were a simple VM. It isn't. You should write single-threaded applications and let kubernetes take care of the parallelism for you, across many smaller machines. If you want 40 threads, create 40 pods across a handful of smaller nodes. If you need to correlate the results from each 'fetch' then put the results on a bus and have another process (pod) reduce them to your final result-set. – Software Engineer Jun 19 '23 at 11:26
  • If I have 40 pods, that means my data will also be fetched 40 times from the external API. Instead, I want to fetch it only once and then distribute it to 40 pods for further processing. How will I achieve that? – Faisal Jun 19 '23 at 16:49

1 Answer


What you shared is not 40 task executors; it is one task executor with a pool of 40 threads. That is different.

Moreover, you are submitting items to write as tasks to different threads in the item writer. Technical concerns like concurrency should be handled by the framework, and Spring Batch already provides concurrent processing if you set the task executor on the step itself. The item writer should rather focus on writing items, something like:

@Override
public void write(List<? extends String> users) throws Exception {
   for (String userId : users) {
      client.makePayment(userId, paymentType, type);
   }
}
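
Here is a minimal sketch of what setting the task executor on the step could look like. This is not your actual job: the bean names, chunk size and throttle limit are illustrative assumptions, using the Spring Batch 4.x builder API that matches your List-based write method.

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.TaskExecutor;

@Configuration
public class PaymentStepConfig {

    // Illustrative chunk-oriented step: Spring Batch runs chunks concurrently
    // because the task executor is set on the step itself, so the writer
    // no longer needs to submit tasks on its own.
    @Bean
    public Step paymentStep(StepBuilderFactory stepBuilderFactory,
                            ItemReader<String> userIdReader,      // assumed reader bean
                            ItemWriter<String> paymentWriter,     // the simple writer shown above
                            TaskExecutor taskExecutor) {
        return stepBuilderFactory.get("paymentStep")
                .<String, String>chunk(1000)    // tune the chunk size to the API latency
                .reader(userIdReader)           // must be thread-safe in a multi-threaded step
                .writer(paymentWriter)
                .taskExecutor(taskExecutor)     // the framework handles the concurrency
                .throttleLimit(40)              // raise from the default of 4 to use the full pool
                .build();
    }
}

Note that with a multi-threaded step the reader must be thread-safe, and restartability is limited because the read state cannot be tracked reliably across threads.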

Now if you want to scale such a process on different pods, you can use a partitioned job where each partition is handled by a Pod.
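
As a rough illustration, here is a minimal local-partitioning sketch. The bean names, range keys and grid size are made up; it runs partitions on threads of the same JVM, and for one pod per partition you would swap the task executor for a remote partition handler, as described in the links below.

import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.TaskExecutor;

@Configuration
public class PartitionedJobConfig {

    // Manager step: splits the data set into partitions and delegates each one
    // to the worker step. Replace the task executor with a remote partition
    // handler to launch one pod per partition.
    @Bean
    public Step managerStep(StepBuilderFactory stepBuilderFactory,
                            Step workerStep,                 // assumed chunk-oriented worker step bean
                            TaskExecutor taskExecutor) {
        return stepBuilderFactory.get("managerStep")
                .partitioner(workerStep.getName(), rangePartitioner(200_000))
                .step(workerStep)
                .gridSize(40)                    // number of partitions
                .taskExecutor(taskExecutor)
                .build();
    }

    // Splits [0, totalItems) into equal index ranges; each partition gets its
    // own ExecutionContext with "fromIndex"/"toIndex" keys.
    private Partitioner rangePartitioner(int totalItems) {
        return gridSize -> {
            Map<String, ExecutionContext> contexts = new HashMap<>();
            int rangeSize = totalItems / gridSize;
            for (int i = 0; i < gridSize; i++) {
                ExecutionContext context = new ExecutionContext();
                context.putInt("fromIndex", i * rangeSize);
                context.putInt("toIndex", i == gridSize - 1 ? totalItems : (i + 1) * rangeSize);
                contexts.put("partition" + i, context);
            }
            return contexts;
        };
    }
}

Each worker then reads only its own fromIndex/toIndex slice from the step execution context instead of the whole data set, so the data is not fetched once per pod.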

Mahmoud Ben Hassine
  • I see, thanks for the clarification. Do you have an example of this: `Now if you want to scale such a process on different pods, you can use a partitioned job where each partition is handled by a Pod.`? – Faisal Jun 20 '23 at 15:43
  • There are several ways to scale the job on k8s, please check my blog post here: [Spring Batch on Kubernetes: Efficient batch processing at scale](https://spring.io/blog/2021/01/27/spring-batch-on-kubernetes-efficient-batch-processing-at-scale) – Mahmoud Ben Hassine Jun 20 '23 at 16:37
  • Thanks once again. So in my case, if I use partitioning, how should I configure my Kubernetes? – Faisal Jun 20 '23 at 17:18
  • You can find a similar example here: https://dataflow.spring.io/docs/feature-guides/batch/partitioning/. – Mahmoud Ben Hassine Jun 20 '23 at 18:53
  • It's running fast on my local machine, but when I deploy the application in Kubernetes, the processing time increases drastically. I am currently using a single pod in Kubernetes. Do I need any special configuration in my Kubernetes setup? – Faisal Jun 20 '23 at 19:04