
I tried to upgrade the Flink version in my cluster to 1.3.1 (and 1.3.2 as well), and I got the following exception in my task managers:

2018-02-28 12:57:27,120 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask           - Error during disposal of stream operator.
org.apache.kafka.common.KafkaException: java.lang.InterruptedException
        at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:424)
        at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.close(FlinkKafkaProducerBase.java:317)
        at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:43)
        at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:126)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:429)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:334)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1252)
        at java.lang.Thread.join(Thread.java:1326)
        at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:422)
        ... 7 more

The job manager showed that it failed to connect with the task managers.

I am using FlinkKafkaProducer08. Any ideas?

Etan Grundstein
  • I think that in Flink 1.3.1, they try to close operators before actually instantiating them. I guess the problem is with the Kafka producer. I'm having this problem as well, and it is preventing me from upgrading from 1.2 to 1.3.1 – OmriManor Feb 28 '18 at 13:21
  • which version of Kafka are you running? – diegoreico Feb 28 '18 at 13:55
  • @diegoreico 0.8.2 – OmriManor Feb 28 '18 at 16:28
  • Do you have Flink Checkpointing enabled? https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/connectors/kafka.html#kafka-producers-and-fault-tolerance – diegoreico Mar 01 '18 at 00:09
  • And pls try to add more info about what you are exactly doing, like a example of how you connect to Kafka. – diegoreico Mar 01 '18 at 00:11
  • @diegoreico Yes, we do have checkpointing enabled. We are connecting to Kafka (0.8.2) through the built-in Flink Kafka consumers/producers, nothing fancy (this works in 1.2.1 in production at high scale) – Etan Grundstein Mar 01 '18 at 12:23

1 Answer


First of all, regarding the stack trace above: the exception was thrown during operator cleanup after a non-graceful termination (this code path is not executed otherwise). It should be accompanied in the log by the real exception that caused the initial failure. Can you provide more of the log?

If the JobManager fails to connect to any TaskManager that should run your job, the whole job is cancelled (and retried according to your restart strategy). The same may happen on the TaskManager side. That may be the root cause and needs further investigation.
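How many retries happen, and how quickly, is controlled by the restart strategy. As a sketch (the keys below follow the Flink 1.3 configuration reference; the values are only illustrative, not a recommendation for this cluster), a fixed-delay strategy can be set in flink-conf.yaml:

```
# Restart the job a fixed number of times, waiting between attempts.
restart-strategy: fixed-delay
# Hypothetical values for illustration: give up after 3 attempts,
# waiting 10 seconds between consecutive restarts.
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

With a very aggressive strategy (many attempts, short delay), the log of the first failed attempt scrolls away quickly, so it is worth scrolling back to the first failure to find the original exception rather than the InterruptedException from cleanup.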

Nico Kruber