All,

I am consuming data from Kafka and dumping it into HDFS. I can consume the data, and I want to get the total count of records from Kafka and save it as a file in HDFS so that I can use that file for validation. I was able to print the records to the console, but I am not sure how to create a file containing the total count.

Query to pull records from Kafka:

// Filter the Kafka stream with the expression passed as the sixth argument.
Dataset<Row> ds1 = ds.filter(args[5]);

// Write the filtered records to HDFS as Parquet, processing the available data once.
StreamingQuery query = ds1
        .coalesce(10)
        .writeStream()
        .format("parquet")
        .option("path", path.toString())
        .option("checkpointLocation", args[6] + "/checkpoints" + args[2])
        .trigger(Trigger.Once())
        .start();

try {
    query.awaitTermination();
} catch (StreamingQueryException e) {
    e.printStackTrace();
    System.exit(1);
}

And here is the code I wrote to get the record count:

// I actually want the count without groupBy. I tried
// long stream = ds1.count();
// but got an error.
Dataset<Row> stream = ds1.groupBy("<column_name>").count();

StreamingQuery query1 = stream.coalesce(1)
        .writeStream()
        .format("csv")
        .option("path", path + "/record")
        .start();

try {
    query1.awaitTermination();
} catch (StreamingQueryException e) {
    e.printStackTrace();
    System.exit(1);
}

This is not working. Can you please help me solve this problem?

Rab

1 Answer

The number of records at any time in a topic is a moving target.

You would need to use the old Spark Streaming API to find the number of records per Spark partition batch, then use an Accumulator to count all records processed, but that is the closest you could get.
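The accumulate-per-batch idea can be sketched in plain Java, without any Spark dependency. This is only an illustration under stated assumptions: `RecordCounter`, `addBatch`, and `writeTotal` are hypothetical names, the `LongAdder` stands in for a Spark accumulator, the batches are hard-coded lists rather than Kafka records, and the total is written to a local temporary file rather than to HDFS.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Collection;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper: a plain-Java stand-in for a Spark accumulator.
class RecordCounter {
    private final LongAdder total = new LongAdder();

    // Call once per processed batch; returns the size of that batch.
    long addBatch(Collection<?> batch) {
        long n = batch.size();
        total.add(n);
        return n;
    }

    // Running total of all records seen so far.
    long total() {
        return total.sum();
    }

    // Writes the running total as a single line of text.
    // Assumption: a local path; a real job would target HDFS instead.
    void writeTotal(Path target) {
        try {
            Files.write(target, (total() + "\n").getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class RecordCounterDemo {
    public static void main(String[] args) throws IOException {
        RecordCounter counter = new RecordCounter();

        // Two made-up batches standing in for micro-batches read from Kafka.
        counter.addBatch(Arrays.asList("r1", "r2", "r3"));
        counter.addBatch(Arrays.asList("r4"));

        // Persist the total and read it back to confirm what was written.
        Path out = Files.createTempFile("record-count", ".txt");
        counter.writeTotal(out);
        String written = new String(Files.readAllBytes(out), StandardCharsets.UTF_8).trim();
        System.out.println(written);
    }
}
```

Because the topic keeps receiving records, a total produced this way only reflects the batches processed so far, not the size of the topic.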

Spark + Kafka is claimed to have exactly-once processing semantics, so I would suggest you focus on error capturing and monitoring rather than just count validation.

OneCricketeer