
I tried to consume my Kafka topic with the code below, as described in the documentation:

df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092,") \
  .option("subscribe", "first_topic") \
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

and I get the error:

AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".

So I tried:

./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 ...

to install the Kafka package and its dependencies, but I get this error:

21/06/21 13:45:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'File file:/home/soheil/spark-3.1.2-bin-hadoop3.2/... does not exist'.  Please specify one with --class.
    at org.apache.spark.deploy.SparkSubmit.error(SparkSubmit.scala:968)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:486)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

what should I do to install this package?


1 Answer


The error you're getting here is not related to Kafka:

file:/home/soheil/spark-3.1.2-bin-hadoop3.2/... does not exist

This references your HADOOP_HOME and/or HADOOP_CONF_DIR environment variables that Spark depends on. Check that these are configured correctly, and that you can run the Spark Structured Streaming word-count example that uses Kafka, before running your own scripts:

$ bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
     structured_kafka_wordcount.py \
     host1:port1,host2:port2 subscribe topic1,topic2
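
If you're working in a notebook or a plain Python session rather than through spark-submit, you can also request the connector when building the session via the spark.jars.packages config. A minimal sketch (the app name is arbitrary):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kafka-structured-streaming")
    # Downloads the connector and its transitive dependencies from Maven
    # at startup; equivalent to passing --packages to spark-submit.
    # Match the Scala version (2.12) and Spark version (3.1.2) to your install.
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2")
    .getOrCreate()
)
```

Note this config only takes effect when the session is first created, not on an already-running SparkSession.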

The next part, Please specify one with --class, says that the CLI parser failed to find your application file; most likely you mistyped the spark-submit options, or one of your file paths contains a space.
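
For example, a complete command looks like this: all --options come first, then the application script as the final positional argument, quoted if the path could contain spaces (the path below is hypothetical):

```shell
bin/spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
  "/home/soheil/my spark jobs/kafka_consumer.py"
```

In your command, the ... after --packages was taken as the application path, and since no file exists at that path, spark-submit fell back to looking for a JAR and asked for --class.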

OneCricketeer