I'm using Apache NiFi, Spark and Kafka to pass messages between them. First, I ingest data with NiFi and send it to Spark for processing. Then, I send the data from Spark back to NiFi, which inserts it into a database.
My problem is that each time I run Spark, I get the same 3.142 records. The first part of the NiFi flow is stopped and the second is running, yet every run of Spark produces the same 3.142 records, and I can't make sense of this data.
Where does it come from?
I've checked whether there is any data on Kafka-Queue-I (from NiFi to Spark) or Kafka-Queue-II (from Spark to NiFi), and in both cases the answer is no. Only when I run Spark do 3.142 records appear in Kafka-Queue-II, but this never happens on Kafka-Queue-I...
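For what it's worth, a one-off batch read of the input topic is a way to see exactly what is sitting in Kafka-Queue-I, independent of the streaming query. A minimal sketch, assuming the same kafka_servers and input_topic values used in the streaming code below:

// One-shot batch read: unlike readStream, spark.read dumps the topic's
// current contents once, so a zero count here means Kafka really holds
// no records between runs.
val snapshot = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_servers)
  .option("subscribe", input_topic)
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()
println(s"records currently in the topic: ${snapshot.count()}")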
In NiFi, I use PublishKafka_1_0 1.7.0.
In Spark, this is the Kafka consumer:
val raw_data = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_servers)
  // ignored by Structured Streaming: consumer settings need the "kafka."
  // prefix, and Spark manages its own group id internally
  .option("group.id", "spark_kafka_consumer")
  // only consulted when the checkpoint location holds no offsets yet
  .option("startingOffsets", "latest")
  // also ignored: the Kafka source never commits offsets back to Kafka
  .option("enable.auto.commit", true)
  .option("failOnDataLoss", "false")
  .option("subscribe", input_topic)
  .load()
It shows me a lot of "false alarm" warnings in the log (the data-loss warnings that failOnDataLoss = "false" turns from errors into warnings)...
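To see what the source actually delivers on each run, a debugging sink can help; a minimal sketch, with a hypothetical throwaway checkpoint path:

// Prints topic, partition and offset of every record the Kafka source
// delivers, independent of the downstream NiFi flow.
val debug_query = raw_data
  .selectExpr("topic", "partition", "offset", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .option("truncate", "false")
  .option("checkpointLocation", "/tmp/debug_checkpoint") // hypothetical path
  .start()

If the same offsets print on every run, the source is re-reading old records rather than receiving new data.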
Some processing...
val parsed_data = raw_data
  .select(from_json(col("value").cast("string"), schema).alias("data"))
  .select("data.*")
  ...
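One detail worth checking at this step: from_json returns a null struct for payloads that don't match the schema, so malformed records disappear silently. A sketch, assuming the input might not always be valid JSON:

// Splitting out null structs makes parse failures visible instead of silent.
val parsed_data = raw_data
  .select(from_json(col("value").cast("string"), schema).alias("data"))
  .filter(col("data").isNotNull) // drop records that failed to parse
  .select("data.*")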
Kafka sink in Spark:
val query = parsed_data
  // to_json expects a struct Column (e.g. struct(...)), not a StructType
  .select(to_json(schema_out).alias("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_servers)
  // the Structured Streaming Kafka sink does not use ZooKeeper;
  // this option is ignored
  .option("zookeeper.connect", zookeeper_servers)
  // this checkpoint is also where Spark records which source offsets have
  // been processed; "startingOffsets" above only applies when it is empty
  .option("checkpointLocation", checkpoint_location)
  .option("topic", output_topic)
  .start()

query.awaitTermination()
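In case it helps to narrow this down, a small monitoring sketch that polls the query instead of blocking on awaitTermination (purely illustrative): lastProgress reports how many rows each micro-batch pulled from Kafka, which would make the repeated 3.142 records visible without inspecting the output topic.

// Each micro-batch reports the number of input rows it consumed from Kafka.
while (query.isActive) {
  val progress = query.lastProgress
  if (progress != null) {
    println(s"batch=${progress.batchId} inputRows=${progress.numInputRows}")
  }
  Thread.sleep(10000)
}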
And, in NiFi, ConsumeKafka_1_0 1.7.0.