
I have a scenario where I need roughly 50-60 different processes running concurrently, each executing a task.

Every process must fetch its data from the DB using a SQL query, passing in a value; the fetched rows are then used in the subsequent task:

    select col_1, col_2, col_3 from table_1 where col_1 = :Process_1;

    @Bean
    public Job partitioningJob() throws Exception {
        return jobBuilderFactory.get("parallelJob")
                .incrementer(new RunIdIncrementer())
                .flow(masterStep())
                .end()
                .build();
    }

    @Bean
    public Step masterStep() throws Exception {
        // How do I fetch the values from configuration and pass them to the partitioner one by one?
        // Can we give a name to every process so that it is helpful in logs and monitoring?
        return stepBuilderFactory.get("masterStep")
                .partitioner("slaveStep", partitioner())
                .step(slaveStep())
                .gridSize(10)
                .taskExecutor(new SimpleAsyncTaskExecutor())
                .build();
    }

    @Bean
    public Partitioner partitioner() throws Exception {
        // Hit the DB with the SQL query and fetch the data.
        return null; // to be implemented; this is the part I am asking about
    }

    @Bean
    public Step slaveStep() throws Exception {
        return stepBuilderFactory.get("slaveStep")
                .<Map<String, String>, Map<String, String>>chunk(1)
                // reader/processor/writer that run the task against each fetched row
                .build();
    }
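Something like the sketch below is the shape of partitioner I have in mind (a rough sketch, assuming a JdbcTemplate field injected into this configuration class; the "keyName" context key and the partition name prefix are placeholders I made up), with one partition created per distinct value returned by the query:

    @Bean
    public Partitioner partitioner() {
        return gridSize -> {
            // one partition per distinct value; each partition carries its
            // value in its own ExecutionContext
            List<String> values = jdbcTemplate.queryForList(
                    "SELECT DISTINCT col_1 FROM table_1", String.class);
            Map<String, ExecutionContext> partitions = new HashMap<>();
            for (String value : values) {
                ExecutionContext context = new ExecutionContext();
                context.putString("keyName", value);
                // the map key becomes the step execution name, visible in
                // logs and in BATCH_STEP_EXECUTION
                partitions.put("partition-" + value, context);
            }
            return partitions;
        };
    }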

As we have Aggregator and parallelProcessing in Apache Camel, does Spring Batch have any similar feature that does the same job?

I am new to Spring Batch and currently exploring whether it can handle the volume. This would be a heavily loaded application running 24*7, every process needs to run concurrently, and each process should be able to run multiple threads internally.

Is there a way to monitor these processes so that if one of them gets terminated for any reason, I can restart that particular process? Kindly help with a solution to this problem.


1 Answer


Please find below the answers to the above questions.

  1. parallelProcessing - Local and remote partitioning support parallel processing and can handle huge volumes; we currently handle 200 to 300 million records per day.

  2. Can it handle the volume - Yes, it can handle huge volumes and is well proven.

  3. Every process needs to run concurrently, with multiple threads inside a process - Spring Batch takes care of this based on your thread pool. Make sure you configure the pool based on your system resources (a pool configuration sketch follows this list).

  4. Is there a way to monitor these processes in case one gets terminated - Yes. Each parallel partition is a step execution, and you can monitor it in the BATCH_STEP_EXECUTION table, which has all the details.

  5. Should be able to restart that particular process - Yes, restarting from the failed step is a built-in feature. For huge-volume jobs we always use fault tolerance so that rejected items can be processed later; this is also built in.
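As a rough sketch of point 3 (the pool sizes here are assumptions that you must tune to your system resources), a bounded pool for the master step would look like this; prefer it over SimpleAsyncTaskExecutor, which spawns an unbounded number of threads:

    @Bean
    public TaskExecutor taskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(50);    // roughly one thread per concurrent partition
        executor.setMaxPoolSize(60);
        executor.setQueueCapacity(100);  // extra partitions wait here when the pool is full
        executor.setThreadNamePrefix("partition-");
        executor.initialize();
        return executor;
    }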

Example project below

https://github.com/ngecom/springBatchLocalParition/tree/master

The database used is H2, and the create-table script is available in the resource folder. We always prefer data source pooling, with a pool size greater than your thread pool size.

Summary of the example project

  1. Read from the table "customer" and divide the rows into step partitions.
  2. Each step partition writes to a new table "new_customer".
  3. The thread pool config is available in JobConfiguration.java, method name "taskExecutor()".
  4. The chunk size is set in slaveStep() (a sketch of the step-scoped reader wiring follows this list).
  5. You can calculate the memory size based on your parallel steps and configure it as the VM max memory.
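As a rough sketch (not the exact code of the linked project; the Customer class and the "keyName" context key are assumptions), this is how a step-scoped reader picks up the value that the partitioner put into each partition's ExecutionContext, which is also how data gets passed from the partitioner to the step:

    @Bean
    @StepScope
    public JdbcCursorItemReader<Customer> customerReader(
            DataSource dataSource,
            @Value("#{stepExecutionContext['keyName']}") String keyName) {
        JdbcCursorItemReader<Customer> reader = new JdbcCursorItemReader<>();
        reader.setDataSource(dataSource);
        reader.setSql("SELECT col_1, col_2, col_3 FROM customer WHERE col_1 = ?");
        // bind this partition's value into the query
        reader.setPreparedStatementSetter(ps -> ps.setString(1, keyName));
        reader.setRowMapper(new BeanPropertyRowMapper<>(Customer.class));
        return reader;
    }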

The queries below will help you analyze the results, based on your above questions, after executing:

SELECT * FROM NEW_CUSTOMER;   
SELECT * FROM BATCH_JOB_EXECUTION bje;
SELECT * FROM BATCH_STEP_EXECUTION bse WHERE JOB_EXECUTION_ID=2; 
SELECT * FROM BATCH_STEP_EXECUTION_CONTEXT bsec WHERE STEP_EXECUTION_ID=4; 

If you want to switch to MySQL, add the below datasource configuration:

spring.datasource.hikari.minimum-idle=5 
spring.datasource.hikari.maximum-pool-size=100
spring.datasource.hikari.idle-timeout=600000 
spring.datasource.hikari.max-lifetime=1800000 
spring.datasource.hikari.auto-commit=true 
spring.datasource.hikari.poolName=SpringBoot-HikariCP
spring.datasource.url=jdbc:mysql://localhost:3306/ngecomdev
spring.datasource.username=ngecom
spring.datasource.password=ngbilling
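Note that Spring Boot usually infers the driver class from the URL, but if you need to set it explicitly (assuming MySQL Connector/J 8 is on the classpath), the property would be:

    spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver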

Please always refer to the GitHub URL below. You will get a lot of ideas from it.

https://github.com/spring-projects/spring-batch/tree/master/spring-batch-samples

  • Thanks for the help and for building confidence in the tool. I am facing issues while passing data from the partition to the step. I need to execute a query in the partitioner, and based on that output the same number of processes needs to be created, with the output being used inside those processes. A sample project would be really helpful. – djyo02 Feb 12 '21 at 09:15
  • I will modify the above answer and add a sample project – Rakesh Feb 12 '21 at 13:19
  • Thanks for the link Rakesh. It is quite helpful for me as I am exploring Spring Batch. – djyo02 Feb 15 '21 at 06:58
  • However, I have one scenario on which I would like to get your opinion. As these jobs would be cron based, if depending on the situation I need to run another instance of this application, how do I make sure the partitions are not duplicated? – djyo02 Feb 15 '21 at 07:01
  • @Override public Map<String, ExecutionContext> partition(int threadCorePoolSize) { List<String> min = jdbcTemplate.queryForList("SELECT DISTINCT wo.col_name FROM table_name wo", String.class); Map<String, ExecutionContext> result = new HashMap<>(); for (String site : min) { ExecutionContext context = new ExecutionContext(); context.putString(DEFAULT_KEY_NAME, site); result.put(site, context); } return result; } ... This is my partition logic, so that I can have some 40-50 threads running in parallel. – djyo02 Feb 15 '21 at 07:04
  • How do I avoid duplicated partitions if I run multiple instances at runtime? – djyo02 Feb 15 '21 at 07:05
  • For every job instance, pass a job parameter so that the partitioner selects only that value from the table. Each instance will then process only its specific selection in the where clause. – Rakesh Feb 15 '21 at 07:18
  • I guess that is something I need to take care of via the deployment model. However, I have a query: let's say the cron job is scheduled for every 5 minutes and there are a total of 5 workers after partitioning. If one of the workers keeps running for more than 5 minutes, is there a way that during the next cron run only the other 4 workers run, while this one waits until the previous run completes? – djyo02 Feb 16 '21 at 08:52
  • Every scheduled run starts only after all workers of the previous run have completed. You can plan for this in the partitioning, based on the volume and how much time it takes to process. – Rakesh Feb 16 '21 at 08:59
  • Based on the data and load, sometimes a worker might finish fast and sometimes it may be delayed. But the cron expression that starts the job cannot change frequently. Waiting for all workers is good, but can selected workers wait until the last iteration? In my case the workers are defined to be mutually exclusive, so a particular worker should wait while the others proceed normally. Can we check the job status at the individual worker level and make its execution wait? – djyo02 Feb 16 '21 at 09:08
  • We went through the same situation in one project that processes a large number of files. In our case 90% of the files were less than 2MB and those workers finished in seconds, but 10% of the files were above 800MB and held up job completion. What we did was create a separate job instance, route the bigger files to that job's partition, and assign it more memory. So the smaller files were processed by one job and the bigger files by another. – Rakesh Feb 16 '21 at 09:35
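To illustrate the overlapping-run question discussed in the comments above, here is a minimal sketch of a launch guard (the JobScheduler class, cron expression, and "run.id" parameter are hypothetical; findRunningJobExecutions is part of Spring Batch's JobExplorer) that skips a scheduled launch while a previous execution of the same job is still running:

    @Component
    public class JobScheduler {

        private final JobLauncher jobLauncher;
        private final JobExplorer jobExplorer;
        private final Job partitioningJob;

        public JobScheduler(JobLauncher jobLauncher, JobExplorer jobExplorer, Job partitioningJob) {
            this.jobLauncher = jobLauncher;
            this.jobExplorer = jobExplorer;
            this.partitioningJob = partitioningJob;
        }

        @Scheduled(cron = "0 */5 * * * *") // every 5 minutes
        public void runIfIdle() throws Exception {
            // skip this trigger if the previous execution has not finished yet
            if (!jobExplorer.findRunningJobExecutions("parallelJob").isEmpty()) {
                return;
            }
            JobParameters params = new JobParametersBuilder()
                    .addLong("run.id", System.currentTimeMillis())
                    .toJobParameters();
            jobLauncher.run(partitioningJob, params);
        }
    }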