
My Spark Streaming application reads each record exactly once when I run it locally, but when I deploy it on a standalone cluster it reads the same message from Kafka twice. I've also double-checked that this is not a problem with the Kafka producer.

This is how I create the stream:

import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val stream = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent,
                             Subscribe[String, String]("aTopic", kafkaParams))

This is the kafkaParams configuration:

"bootstrap.servers" -> KAFKA_SERVER,
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "test-group",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)

The cluster has two workers with one executor per worker, and it looks like each executor consumes the same message. Can anybody help me, please?

EDIT

As an example, I send a single point to Kafka. With this code:

    stream.foreachRDD((rdd, time) => {
        if (!rdd.isEmpty) {
            log.info("Taken " + rdd.count() + " points")
        }
    })

I obtain "Taken 2 points". If I print them, they are equal. Am I doing something wrong?

I'm using

  • "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.2.0"
  • spark 2.2.0
  • kafka_2.11-0.11.0.1
  • Just to be clear. Do you see data duplication in the stream? If you do, this question would benefit from additional details (versions, exact configuration, some form of MCVE). Such behavior would be an unusual bug if confirmed. – Alper t. Turker Jun 03 '18 at 10:46
  • @user8371915 I've added more details. Thank you – ggagliano Jun 03 '18 at 12:08
  • If you use the console consumer from Kafka 0.10, how many messages do you see? When you do the same using 0.11, how many do you see? (The way clients read the logs changed between these versions.) – OneCricketeer Jun 03 '18 at 12:11
  • Ok, I've tried with the console consumer and the messages are duplicated even on the same machine. I'll double-check the producer, thanks – ggagliano Jun 03 '18 at 12:37
  • In the end it turned out that I had two producers running in the background, thank you – ggagliano Jun 03 '18 at 15:35

0 Answers