
I'm using Apache NiFi, Kafka, and Spark to send messages between them. First, I ingest data with NiFi and send it to Spark to process it. Then I send the data from Spark back to NiFi to insert it into a DB.

My problem is that each time I run Spark, I get the same 3,142 records. The first part of the NiFi flow is stopped and the second part is running, yet every Spark run produces the same 3,142 records, and I can't figure out where this data comes from.

Where does it come from?

I've checked whether there is data on Kafka-Queue-I (from NiFi to Spark) or Kafka-Queue-II (from Spark to NiFi), and in both cases the answer is no. Only when I run Spark do 3,142 records appear in Kafka-Queue-II, but this doesn't happen on Kafka-Queue-I...

In NiFi, PublishKafka_1_0 1.7.0:

(screenshot: PublishKafka_1_0 1.7.0 processor configuration)

In Spark, KafkaConsumer:

val raw_data = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_servers)
  .option("group.id", "spark_kafka_consumer")
  .option("startingOffsets", "latest")
  .option("enable.auto.commit", true)
  .option("failOnDataLoss", "false")
  .option("subscribe", input_topic)
  .load()

It shows me a lot of false_alarm warnings:

(screenshot: false_alarm warnings)

Some processing...

var parsed_data = raw_data
  .select(from_json(col("value").cast("string"), schema).alias("data"))
  .select("data.*")
  . ...

(screenshot: Kafka source in Spark)

var query = parsed_data
  .select(to_json(schema_out).alias("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_servers)
  .option("zookeeper.connect", zookeeper_servers)
  .option("checkpointLocation", checkpoint_location)
  .option("topic", output_topic)
  .start()
query.awaitTermination()

And the Kafka consumer, 1.7.0, in NiFi:

(screenshot: NiFi Kafka consumer configuration)


1 Answer


I suspect that you're using a new (auto-generated) consumer group every time you start Spark and you have an offset reset policy of earliest. The result of this would be that Spark starts from the beginning of the topic every time.

Kafka does not remove messages from the topic when they are consumed (unlike other pub-sub systems). To avoid seeing old messages, you will need to set a consumer group and have Spark commit the offsets as it processes. These offsets are stored, and the next time a consumer starts with that group, it will pick up from the last stored offset for that group.
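One caveat for Spark Structured Streaming specifically: its Kafka source does not use a Kafka consumer group for progress tracking at all. Options like group.id and enable.auto.commit are ignored, and offsets are tracked in the query's checkpoint directory instead, which is why the same records can replay or disappear depending on the checkpoint state. A minimal sketch of the source side (hedged; spark, kafka_servers, input_topic are the names from the question):

```scala
// Sketch only: how the Structured Streaming Kafka source tracks progress.
// "group.id" and "enable.auto.commit" have no effect here; the offsets the
// query resumes from live in the checkpoint directory set on the query.
val raw_data = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_servers)
  .option("subscribe", input_topic)
  // consulted only on the FIRST run, before any checkpoint exists;
  // afterwards the checkpointed offsets win
  .option("startingOffsets", "latest")
  .option("failOnDataLoss", "false")
  .load()
```

This matches the behaviour described in the comments below: deleting the checkpoint folder resets what the query considers "already read".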

I would also note that Kafka, outside of some very specific usage patterns and technology choices, does not promise "exactly-once" messaging but instead "at-least-once"; my general advice there would be to try to be tolerant of duplicated records.

Levi Ramsey
  • The checkpoint location should be storing the offsets, thus preventing group resets – OneCricketeer Feb 13 '20 at 17:35
  • I've established a consumer group and I've set a checkpoint location; is that not enough? – Krakenudo Feb 14 '20 at 08:23
  • I am tolerant of duplicated records, but it adds time to my process, which is why I'm trying to skip these 3,142 records. What surprises me is that the first queue has no records when I try to read from the CLI, so where does this data come from? Now I've reset the containers and restarted the PC, and the problematic records went down from 3,142 to 1,642... – Krakenudo Feb 14 '20 at 08:28
  • It seems to be related to checkpoints. If I delete the HDFS checkpoint folder, then no records appear when I run Spark. Does that make sense? – Krakenudo Feb 14 '20 at 09:31