
I tried to upgrade the Flink version in my cluster to 1.3.1 (and 1.3.2 as well), and I got the following exception in my task managers:

2018-02-28 12:57:27,120 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask           - Error during disposal of stream operator.
org.apache.kafka.common.KafkaException: java.lang.InterruptedException
        at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:424)
        at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.close(FlinkKafkaProducerBase.java:317)
        at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:43)
        at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:126)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:429)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:334)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1252)
        at java.lang.Thread.join(Thread.java:1326)
        at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:422)
        ... 7 more

The job manager showed that it failed to connect with the task managers.

I am using FlinkKafkaProducer08. Any ideas?

Etan Grundstein
  • I think that in Flink 1.3.1, they try to close operators before actually instantiating them. I guess the problem is with the Kafka producer. I'm having this problem as well, and it is preventing me from upgrading from 1.2 to 1.3.1 – OmriManor Feb 28 '18 at 13:21
  • which version of Kafka are you running? – diegoreico Feb 28 '18 at 13:55
  • @diegoreico 0.8.2 – OmriManor Feb 28 '18 at 16:28
  • Do you have Flink Checkpointing enabled? https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/connectors/kafka.html#kafka-producers-and-fault-tolerance – diegoreico Mar 01 '18 at 00:09
  • And pls try to add more info about what you are exactly doing, like a example of how you connect to Kafka. – diegoreico Mar 01 '18 at 00:11
  • @diegoreico Yes, we do have checkpointing enabled. We are connecting to Kafka (0.8.2) through the built-in Flink Kafka consumers/producers, nothing fancy (this works in 1.2.1 in production at high scale) – Etan Grundstein Mar 01 '18 at 12:23

1 Answer


First of all, regarding the stack trace above: the exception was thrown during operator cleanup after a non-graceful termination (this code path is not executed otherwise). It should be accompanied in the log by the real exception that caused the initial failure. Can you provide more of the log?

If the JobManager fails to connect to any TaskManager that should run your job, the whole job is cancelled (and retried according to your restart strategy). The same may happen on the TaskManager side. That may be the root cause and needs further investigation.
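How many retries happen, and how quickly, is controlled by the restart strategy. As a sketch (the keys below follow the Flink 1.3 configuration reference; the values are only illustrative, not a recommendation for this cluster), a fixed-delay strategy can be set in flink-conf.yaml:

```
# Restart the job a fixed number of times, waiting between attempts.
restart-strategy: fixed-delay
# Hypothetical values for illustration: give up after 3 attempts,
# waiting 10 seconds between consecutive restarts.
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

With a very aggressive strategy (many attempts, short delay), the log of the first failed attempt scrolls away quickly, so it is worth scrolling back to the first failure to find the original exception rather than the InterruptedException from cleanup.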

Nico Kruber