
My Flink processor listens to Kafka, and its business logic involves calling external REST services that may be down. When a call fails, I would like to replay the tuple back into the processor. Is there any way to do that? In Storm, we could fail a tuple so that it would not be acknowledged, and the same tuple would then be replayed to the processor.

In Flink, the tuple is acknowledged automatically once the message is consumed by the Flink Kafka consumer. There are ways to work around this; one is to publish the message back to the same queue or to a retry queue. But I am looking for a solution similar to Storm's.

I know that Flink's savepoints/checkpoints are used for fault tolerance. But in my understanding, tuples are only replayed in case of a Flink failure. I would like to get ideas on how to handle transient failures.

Thank you

Raghavendar

1 Answer


When interacting with external systems I would recommend using Flink's async I/O operator. It allows you to execute asynchronous tasks without blocking the execution of an operator.

If you want to retry failed operations without restarting the Flink job from the last successful checkpoint, then I would suggest implementing the retry policy yourself. It could look like the following:

new AsyncFunction<IN, OUT>() {
    @Override
    public void asyncInvoke(IN input, ResultFuture<OUT> resultFuture) throws Exception {
        // FutureUtils is a Flink-internal utility class, not part of the stable public API
        FutureUtils
            .retrySuccessfulWithDelay(
                () -> triggerAsyncOperation(input),      // your asynchronous call
                Time.seconds(1L),                        // delay between attempts
                Deadline.fromNow(Duration.ofSeconds(10L)), // give up after this deadline
                this::decideWhetherToRetry,              // acceptance predicate
                new ScheduledExecutorServiceAdapter(new DirectScheduledExecutorService()))
            .whenComplete((result, throwable) -> {
                if (result != null) {
                    resultFuture.complete(Collections.singleton(result));
                } else {
                    resultFuture.completeExceptionally(throwable);
                }
            });
    }
}

Here triggerAsyncOperation encapsulates your asynchronous operation and decideWhetherToRetry encapsulates your retry strategy. If decideWhetherToRetry returns true, then resultFuture will be completed with the value of that operation attempt.
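Since FutureUtils is a Flink-internal class, the same retry-with-delay pattern can also be sketched with plain JDK primitives. The following is a minimal, self-contained approximation (the names triggerAsyncOperation and decideWhetherToRetry mirror the placeholders above; the flaky-service simulation in main is purely illustrative):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.*;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class RetryWithDelay {

    // Retries the async operation with a fixed delay until the result is
    // accepted by the predicate or the deadline has passed.
    static <T> CompletableFuture<T> retrySuccessfulWithDelay(
            Supplier<CompletableFuture<T>> operation,
            Duration delay,
            Instant deadline,
            Predicate<T> isAcceptable,
            ScheduledExecutorService scheduler) {
        CompletableFuture<T> resultFuture = new CompletableFuture<>();
        attempt(operation, delay, deadline, isAcceptable, scheduler, resultFuture);
        return resultFuture;
    }

    private static <T> void attempt(
            Supplier<CompletableFuture<T>> operation,
            Duration delay,
            Instant deadline,
            Predicate<T> isAcceptable,
            ScheduledExecutorService scheduler,
            CompletableFuture<T> resultFuture) {
        operation.get().whenComplete((result, throwable) -> {
            if (throwable == null && isAcceptable.test(result)) {
                // The result was accepted: complete the outer future.
                resultFuture.complete(result);
            } else if (Instant.now().plus(delay).isBefore(deadline)) {
                // Not accepted yet: schedule another attempt after the delay.
                scheduler.schedule(
                    () -> attempt(operation, delay, deadline, isAcceptable, scheduler, resultFuture),
                    delay.toMillis(), TimeUnit.MILLISECONDS);
            } else {
                // Deadline exceeded: fail the outer future.
                resultFuture.completeExceptionally(
                    throwable != null ? throwable : new TimeoutException("deadline exceeded"));
            }
        });
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        int[] calls = {0};
        // Simulated flaky REST call: returns a failure sentinel twice, then succeeds.
        Supplier<CompletableFuture<Integer>> triggerAsyncOperation = () -> {
            calls[0]++;
            return CompletableFuture.completedFuture(calls[0] < 3 ? -1 : 42);
        };
        Predicate<Integer> decideWhetherToRetry = r -> r > 0; // accept only positive results

        Integer result = retrySuccessfulWithDelay(
                triggerAsyncOperation,
                Duration.ofMillis(50),
                Instant.now().plus(Duration.ofSeconds(5)),
                decideWhetherToRetry,
                scheduler)
            .get(10, TimeUnit.SECONDS);

        System.out.println("result=" + result + " attempts=" + calls[0]);
        scheduler.shutdown();
    }
}
```

In a real Flink job, the completion callback would feed resultFuture.complete(...) / resultFuture.completeExceptionally(...) exactly as in the AsyncFunction above.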

If resultFuture is completed exceptionally, then it will trigger a failover, which causes the job to restart from the last successful checkpoint.

Till Rohrmann