
I use the following logic to restart uncompleted Spring Batch jobs (for example, after an abnormal termination of the application):

public void restartUncompletedJobs() {

    LOGGER.info("Restarting uncompleted jobs");

    try {
        jobRegistry.register(new ReferenceJobFactory(documetPipelineJob));

        List<String> jobs = jobExplorer.getJobNames();
        for (String job : jobs) {
            Set<JobExecution> runningJobs = jobExplorer.findRunningJobExecutions(job);

            for (JobExecution runningJob : runningJobs) {
                runningJob.setStatus(BatchStatus.FAILED);
                runningJob.setEndTime(new Date());
                jobRepository.update(runningJob);
                jobOperator.restart(runningJob.getId());
                LOGGER.info("Job restarted: " + runningJob);
            }
        }
    } catch (Exception e) {
        LOGGER.error(e.getMessage(), e);
    }
}

This works fine, but with one side effect: it doesn't restart the failed job execution; it creates a new execution instead. How can I change this logic so that it restarts the failed execution from the failed step without creating a new execution?

UPDATED

When I try the following code:

public void restartUncompletedJobs() {
    try {
        jobRegistry.register(new ReferenceJobFactory(documetPipelineJob));

        List<String> jobs = jobExplorer.getJobNames();
        for (String job : jobs) {
            Set<JobExecution> jobExecutions = jobExplorer.findRunningJobExecutions(job);

            for (JobExecution jobExecution : jobExecutions) {
                jobOperator.restart(jobExecution.getId());
            }
        }
    } catch (Exception e) {
        LOGGER.error(e.getMessage(), e);
    }
}

it fails with the following exception:

2018-07-30 06:50:47.090 ERROR 1588 --- [           main] c.v.p.d.service.batch.BatchServiceImpl   : Illegal state (only happens on a race condition): job execution already running with name=documetPipelineJob and parameters={ID=826407fa-d3bc-481a-8acb-b9643b849035, inputDir=/home/public/images, STORAGE_TYPE=LOCAL}

org.springframework.batch.core.UnexpectedJobExecutionException: Illegal state (only happens on a race condition): job execution already running with name=documetPipelineJob and parameters={ID=826407fa-d3bc-481a-8acb-b9643b849035, inputDir=/home/public/images, STORAGE_TYPE=LOCAL}
    at org.springframework.batch.core.launch.support.SimpleJobOperator.restart(SimpleJobOperator.java:283) ~[spring-batch-core-4.0.1.RELEASE.jar!/:4.0.1.RELEASE]
    at org.springframework.batch.core.launch.support.SimpleJobOperator$$FastClassBySpringCGLIB$$44ee6049.invoke(<generated>) ~[spring-batch-core-4.0.1.RELEASE.jar!/:4.0.1.RELEASE]
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204) [spring-core-5.0.6.RELEASE.jar!/:5.0.6.RELEASE]
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:684) [spring-aop-5.0.6.RELEASE.jar!/:5.0.6.RELEASE]
    at org.springframework.batch.core.launch.support.SimpleJobOperator$$EnhancerBySpringCGLIB$$7659d4c.restart(<generated>) ~[spring-batch-core-4.0.1.RELEASE.jar!/:4.0.1.RELEASE]
    at com.example.pipeline.domain.service.batch.BatchServiceImpl.restartUncompletedJobs(BatchServiceImpl.java:143) ~[domain-0.0.1.jar!/:0.0.1]

The following code creates new executions in the job store database:

public void restartUncompletedJobs() {
    try {
        jobRegistry.register(new ReferenceJobFactory(documetPipelineJob));

        List<String> jobs = jobExplorer.getJobNames();
        for (String job : jobs) {
            Set<JobExecution> jobExecutions = jobExplorer.findRunningJobExecutions(job);

            for (JobExecution jobExecution : jobExecutions) {
                jobExecution.setStatus(BatchStatus.STOPPED);
                jobExecution.setEndTime(new Date());
                jobRepository.update(jobExecution);

                Long jobExecutionId = jobExecution.getId();
                jobOperator.restart(jobExecutionId);
            }
        }
    } catch (Exception e) {
        LOGGER.error(e.getMessage(), e);
    }
}

The question is: how can I continue running the old uncompleted executions, without creating new ones, after an application restart?

PAA
alexanoid
  • Please don't duplicate questions. See my answer in https://stackoverflow.com/questions/51568654/spring-batch-correctly-restart-uncompleted-jobs-in-clustered-environment#51579529 – Mahmoud Ben Hassine Jul 29 '18 at 10:51
  • This is not a duplicate: this question is about a simple restart of uncompleted jobs from the same execution and step. The linked question is about restarting uncompleted jobs in a clustered environment. I apologize, but your answer there doesn't solve the described issue. It is not clear how to distinguish uncompleted jobs that are not running from uncompleted jobs that are still running. – alexanoid Jul 29 '18 at 15:45
  • On a single-node cluster, after an application restart, I'm pretty sure that no job is running at that moment, so I can restart all of them. But in a multi-node cluster with a shared job repository, I don't know which jobs are running and which are not. – alexanoid Jul 29 '18 at 15:45
  • Well, if you change the question there, how do you expect the answer to still solve the issue described there? – Mahmoud Ben Hassine Jul 29 '18 at 15:53
  • Yes, I made a mistake when creating that question and apologized for it in the comments. – alexanoid Jul 29 '18 at 15:54
  • `It is not clear how to distinguish uncompleted and not running jobs from uncompleted but running jobs`: If it is running, it is not completed yet (either successfully or with a failure); it is still running. If it is not running, it is either `COMPLETED`, `FAILED`, `ABANDONED` or `UNKNOWN`. If this is not clear, please clearly define what "uncompleted" means to you. – Mahmoud Ben Hassine Jul 29 '18 at 15:56
  • By uncompleted I mean an execution that was not in one of these states (`FINISHED`, `COMPLETED`, `FAILED` or `ABANDONED`) before the application was terminated (for example, by a power cut). My question is: what should I do to continue an existing job execution (preferably from the last uncompleted step in the job flow) after the application restarts? – alexanoid Jul 29 '18 at 15:59
  • To be more clear: my SB flow job contains 5 different steps, and due to the power cut the application was terminated during step #3. How do I properly continue the uncompleted job after the application restart, within the same (existing) execution, from step #3? – alexanoid Jul 29 '18 at 16:28
  • By default, when you restart a failed execution, the steps that were successfully executed in the previous run will be skipped unless they are marked with `allowStartIfComplete` (more details here: https://docs.spring.io/spring-batch/4.0.x/reference/html/step.html#allowStartIfComplete). So by default, your restarted job execution should continue from step #3. – Mahmoud Ben Hassine Jul 29 '18 at 19:33
  • Thanks! Do I need to do anything to restart the failed execution, or will Spring Batch automatically detect uncompleted executions and continue them after the application restarts? – alexanoid Jul 30 '18 at 06:34
  • I have updated my question with more details – alexanoid Jul 30 '18 at 07:06
  • _"how to continue to run the old uncompleted executions without creating new ones after application restart?"_: The design of Spring Batch does not work like this. Whenever you submit a job, it creates a new execution. However, if you submit a job with the parameters of a failed job instance, the new job execution will start from the previously failed position. – Adrian Shum Jul 30 '18 at 07:33
  • @Adrian Shum Ah, okay, thank you very much! This finally answers my question. I was really confused by this behavior. – alexanoid Jul 30 '18 at 07:37
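The step-skipping rule from the `allowStartIfComplete` comment above can be illustrated with a small toy model. It is plain Java rather than Spring Batch: `ToyStepSkipping`, `Step`, `stepsToRun` and the step names are hypothetical, and the model assumes a simple sequential flow in which the set of steps that completed in the previous execution is already known.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model of the restart rule from the comment above, NOT Spring Batch code:
// a step that completed in the previous run is skipped on restart unless it
// is flagged allowStartIfComplete.
class ToyStepSkipping {

    static final class Step {
        final String name;
        final boolean allowStartIfComplete;

        Step(String name, boolean allowStartIfComplete) {
            this.name = name;
            this.allowStartIfComplete = allowStartIfComplete;
        }
    }

    // Returns the names of the steps that would run on restart, in flow order.
    static List<String> stepsToRun(List<Step> flow, Set<String> completedBefore) {
        List<String> toRun = new ArrayList<>();
        for (Step step : flow) {
            boolean completed = completedBefore.contains(step.name);
            if (!completed || step.allowStartIfComplete) {
                toRun.add(step.name);
            }
        }
        return toRun;
    }

    public static void main(String[] args) {
        // Hypothetical 5-step flow; the previous run died during step3.
        List<Step> flow = List.of(
                new Step("step1", false), new Step("step2", false), new Step("step3", false),
                new Step("step4", false), new Step("step5", false));
        System.out.println(stepsToRun(flow, Set.of("step1", "step2")));
    }
}
```

With the sample flow, where step1 and step2 completed before the crash, the restart would run `[step3, step4, step5]`. Flagging a step with `allowStartIfComplete` in this model puts it back into the list even though it completed before.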

1 Answer


TL;DR: Spring Batch will always create a new job execution; it will not reuse a previously failed job execution to continue its work.

Longer answer: first you need to understand three similar but different concepts in Spring Batch: Job, Job Instance and Job Execution.

I always use this example:

  • Job : End-Of-Day Batch
  • Job Instance : End-Of-Day Batch for 2018-01-01
  • Job Execution: End-Of-Day Batch for 2018-01-01, execution #1
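For readers who think better in code, the three concepts can be sketched as a toy, plain-Java data model. This is not the Spring Batch API: `ToyJobInstance`, `ToyJobExecution`, `addExecution` and `ToyModelDemo` are hypothetical names chosen for illustration, and the per-instance execution number only mirrors the "execution #1" naming used above.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, NOT the Spring Batch API: it only mirrors the
// Job -> Job Instance -> Job Execution relationship from the example above.
class ToyJobExecution {
    final int number;     // execution #1, #2, ... within one instance
    final String status;  // e.g. "FAILED" or "COMPLETED"

    ToyJobExecution(int number, String status) {
        this.number = number;
        this.status = status;
    }
}

class ToyJobInstance {
    final String jobName;     // the Job, e.g. "End-Of-Day Batch"
    final String parameters;  // identifying parameters, e.g. "2018-01-01"
    final List<ToyJobExecution> executions = new ArrayList<>();

    ToyJobInstance(String jobName, String parameters) {
        this.jobName = jobName;
        this.parameters = parameters;
    }

    // Each run of this instance is recorded as a NEW execution.
    ToyJobExecution addExecution(String status) {
        ToyJobExecution execution = new ToyJobExecution(executions.size() + 1, status);
        executions.add(execution);
        return execution;
    }

    @Override
    public String toString() {
        return jobName + " for " + parameters;
    }
}

class ToyModelDemo {
    public static void main(String[] args) {
        ToyJobInstance instance = new ToyJobInstance("End-Of-Day Batch", "2018-01-01");
        instance.addExecution("FAILED");     // execution #1
        instance.addExecution("COMPLETED");  // execution #2, created by the rerun
        for (ToyJobExecution e : instance.executions) {
            // e.g. "End-Of-Day Batch for 2018-01-01, execution #1: FAILED"
            System.out.println(instance + ", execution #" + e.number + ": " + e.status);
        }
    }
}
```

Both executions hang off the same instance rather than off the job directly; in this model, that is what lets a rerun with the same parameters find the earlier failure.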

At a high level, this is how Spring Batch's recovery works:

Assume your first execution failed in step 3. You can submit the same Job (End-Of-Day Batch) with the same parameters (2018-01-01). Spring Batch will look up the last Job Execution (End-Of-Day Batch for 2018-01-01, execution #1) of the submitted Job Instance (End-Of-Day Batch for 2018-01-01) and find that it previously failed in step 3. Spring Batch will then create a NEW execution, [End-Of-Day Batch for 2018-01-01, execution #2], and start that execution from step 3.
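The recovery flow described above can be sketched as a toy, in-memory repository. Assumptions to flag loudly: this is plain Java, not Spring Batch; `ToyRecovery`, `Execution`, `submit` and the `failAt` parameter (used only to simulate a crash) are hypothetical; the job instance is keyed by a simple string built from the job name and parameters; and a completed instance is rejected with a plain `IllegalStateException`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of the recovery flow described above, NOT Spring Batch code:
// no persistence, no transactions, no concurrency control.
class ToyRecovery {

    // One execution record of a job instance.
    static final class Execution {
        final int number;      // execution #1, #2, ...
        final int startStep;   // 1-based step this execution started from
        final int failedStep;  // 1-based step that failed, or 0 if it completed

        Execution(int number, int startStep, int failedStep) {
            this.number = number;
            this.startStep = startStep;
            this.failedStep = failedStep;
        }
    }

    // A job instance is identified by job name + identifying parameters.
    private final Map<String, List<Execution>> executionsByInstance = new HashMap<>();

    // Submits the job. failAt simulates a failure in that 1-based step (0 = success).
    Execution submit(String jobName, String parameters, int failAt) {
        String instanceKey = jobName + " for " + parameters;
        List<Execution> executions =
                executionsByInstance.computeIfAbsent(instanceKey, key -> new ArrayList<>());

        int startStep = 1;
        if (!executions.isEmpty()) {
            // Look up the last execution of the SAME instance.
            Execution last = executions.get(executions.size() - 1);
            if (last.failedStep == 0) {
                throw new IllegalStateException("Already completed: " + instanceKey);
            }
            startStep = last.failedStep; // resume from the step that failed
        }

        // Always record a NEW execution; the old one is never reused.
        Execution execution = new Execution(executions.size() + 1, startStep, failAt);
        executions.add(execution);
        return execution;
    }

    public static void main(String[] args) {
        ToyRecovery repository = new ToyRecovery();
        Execution first = repository.submit("End-Of-Day Batch", "2018-01-01", 3);
        Execution second = repository.submit("End-Of-Day Batch", "2018-01-01", 0);
        System.out.println("#" + first.number + " started at step " + first.startStep);
        System.out.println("#" + second.number + " started at step " + second.startStep);
    }
}
```

With the sample `main`, execution #1 starts at step 1 and fails in step 3, and the resubmission creates execution #2, which starts at step 3. Nothing ever resumes or mutates execution #1 itself.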

So, by design, what Spring Batch tries to recover is a previously failed Job Instance (not a Job Execution). Spring Batch will not reuse an execution when you re-run a previously failed one.

Adrian Shum
  • @Adrian, can you please help me out here: https://stackoverflow.com/questions/63713450/spring-batch-error-nosuchjobexception-no-such-job-either-in-registry-or-in-his – john Sep 05 '20 at 05:17