
I'd appreciate help running a Spark Streaming program with Spark 2.0.2.

The run fails with "java.lang.ClassNotFoundException: Failed to find data source: kafka". I modified the POM file as shown below.

The Spark session is created successfully, but the error occurs when the load from Kafka is called.

Creating the Spark session:

    val spark = SparkSession
      .builder()
      .master(master)
      .appName("Apache Log Analyzer Streaming from Kafka")
      .config("hive.metastore.warehouse.dir", hiveWarehouse)
      .config("fs.defaultFS", hdfs_FS)
      .enableHiveSupport()
      .getOrCreate()

Creating the Kafka stream:

    val logLinesDStream = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:2181")
      .option("subscribe", topics)
      .load()

Error message:

Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark-packages.org

pom.xml:

    <scala.version>2.10.4</scala.version>
    <scala.compat.version>2.10</scala.compat.version>
    <spark.version>2.0.2</spark.version>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.5</version>
        </dependency>
    </dependencies>
Ruslan Ostafiichuk
Aavik

3 Answers


You're referencing the Kafka integration for the old streaming API when you actually need the one matching Spark 2.0.2's Structured Streaming. For spark.readStream you need to use the sql-kafka artifact:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql-kafka-0-10_2.10</artifactId>
        <version>2.0.2</version>
    </dependency>

Note that this Structured Streaming source is supported only for Kafka >= 0.10.

Yuval Itzchakov
  • Thanks for the details. As suggested, I modified the POM file (scala.version 2.10.4, spark.version 2.0.2) but still end up with the same error. Am I missing any detail that keeps the compiler from identifying "kafka"? P.S. the current POM file looks like the one above (modified in the question). – Aavik Dec 13 '16 at 06:49
  • In the previous version the parameters "group.id" -> "consumergroup" and "metadata.broker.list" -> "localhost:9092" were set. Let me try. – Aavik Dec 13 '16 at 07:02
  • In the built jar, I could not see a kafka directory under "/org/apache/spark/sql/execution/datasources/". I can see the other source formats: csv, parquet, jdbc, json and text. Any help identifying how I get the kafka format? – Aavik Dec 13 '16 at 07:48
  • Just upgrade the Spark version from 2.0.0 to 2.2.0 along with the suggested dependency for Spark 2.2.0. Thanks for your suggestion. – Rajeev Rathor Nov 22 '17 at 17:18

I faced the same issue. I upgraded the Spark version from 2.0.0 to 2.2.0 and added the spark-sql-kafka dependency. It works perfectly for me. Please find the dependencies below.

<spark.version>2.2.0</spark.version>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>${spark.version}</version>
    <scope>test</scope>
</dependency>

<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka -->
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.11</artifactId>
    <version>0.10.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.10.2.0</version>
</dependency>
Paul Roub
Rajeev Rathor

Got it fixed by changing pom.xml:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
Aavik
  • Aren't you using Scala 2.10? – Yuval Itzchakov Dec 13 '16 at 09:21
  • Then how are you compiling with a Scala 2.11 dependency? – Yuval Itzchakov Dec 13 '16 at 11:39
  • @YuvalItzchakov, the return datatype is sql.DataFrame, but the one created from val logLinesDStream = KafkaUtils.createStream has datatype DStream[String]. Is spark.readStream the right option for developing a streaming application? I am aware that spark.readStream is from 2.0. Thank you – Aavik Dec 13 '16 at 16:43
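Regarding the last comment, a minimal Structured Streaming sketch (the broker address localhost:9092 and the topic name are assumptions for illustration, not from the thread): spark.readStream returns a streaming DataFrame rather than a DStream, and nothing runs until a query is started with writeStream:

```scala
// Sketch: Structured Streaming from Kafka (Spark 2.x).
// Note the bootstrap servers point at the Kafka broker (9092),
// not ZooKeeper (2181) as in the question.
val lines = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "apache-logs")
  .load()
  .selectExpr("CAST(value AS STRING)")  // streaming DataFrame, not a DStream

// Start the query; here it just prints batches to the console.
val query = lines
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()
```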