I am using Spring Batch (4.2.2.RELEASE) together with Spring Boot Actuator (2.2.6.RELEASE). Since version 4.2, Spring Batch supports batch monitoring and metrics based on Micrometer (https://docs.spring.io/spring-batch/docs/4.2.x/reference/html/monitoring-and-metrics.html).
For example, the spring_batch_job metric shows how often a job was executed, along with its status and duration.
I want to monitor this metric with Grafana and Prometheus and raise an alert if a job failed in the last xx minutes.
If the Spring Batch application runs as a long-lived service, the metrics appear to accumulate until the service is stopped. For example, if a job was started 12 times in the last hour, the metrics output could be the following:
spring_batch_job_seconds_count{name="mainJob",status="COMPLETED",} 10.0
spring_batch_job_seconds_sum{name="mainJob",status="COMPLETED",} 354.354538083
spring_batch_job_seconds_count{name="mainJob",status="FAILED",} 2.0
spring_batch_job_seconds_sum{name="mainJob",status="FAILED",} 0.880157862
So two instances of mainJob failed. Assuming all 12 jobs in the next hour succeed, the metrics output would be:
spring_batch_job_seconds_count{name="mainJob",status="COMPLETED",} 22.0
spring_batch_job_seconds_sum{name="mainJob",status="COMPLETED",} 708.704538083
spring_batch_job_seconds_count{name="mainJob",status="FAILED",} 2.0
spring_batch_job_seconds_sum{name="mainJob",status="FAILED",} 0.880157862
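To make the problem concrete: the value I am after is the change in the FAILED count between two scrapes, not the running total. Below is a minimal Python sketch of that arithmetic, using the two outputs above. The function name failures_in_window and the dictionary layout are made up for illustration and are not part of Spring Batch, Micrometer, or Prometheus.

```python
# Hypothetical helper: each snapshot maps (job name, status) to the value of
# spring_batch_job_seconds_count at one scrape.
def failures_in_window(earlier, later, job="mainJob"):
    """Return how many FAILED executions were added between two scrapes."""
    key = (job, "FAILED")
    before = earlier.get(key, 0.0)
    after = later.get(key, 0.0)
    # A lower value in the later scrape means the service restarted and the
    # count began again from zero, so the later value is the whole increase.
    if after < before:
        return after
    return after - before


# The two outputs quoted above, plus an empty snapshot for service start.
startup = {}
first_hour = {("mainJob", "COMPLETED"): 10.0, ("mainJob", "FAILED"): 2.0}
second_hour = {("mainJob", "COMPLETED"): 22.0, ("mainJob", "FAILED"): 2.0}

print(failures_in_window(startup, first_hour))      # 2.0
print(failures_in_window(first_hour, second_hour))  # 0.0
```

With the numbers above this gives 2 failures for the first hour and 0 for the second, which is the behaviour I would like the alert to have.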
How can I check whether a job failed in the last xx minutes? Because the count is cumulative, the following expression still returns the two failed job instances: spring_batch_job_seconds_count{status="FAILED"}[15m]