
I want to read a Kafka topic and then write it to a Kudu table with Spark Streaming.

My first approach

// imports
import org.apache.kudu.client.CreateTableOptions
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._

// sessions and contexts
val conf = new SparkConf().setMaster("local[2]").setAppName("TestMain")
val sparkSession = SparkSession.builder().config(conf).getOrCreate()
val sparkContext = sparkSession.sparkContext
val kuduContext = new KuduContext("...", sparkContext);

// structure
val schema: StructType = StructType(
  StructField("userNo", IntegerType, true) ::
  StructField("bandNo", IntegerType, false) ::
  StructField("ipv4", StringType, false) :: Nil);

// kudu - prepare table
kuduContext.deleteTable("test_table");
kuduContext.createTable("test_table", schema, Seq("userNo"), new CreateTableOptions()
  .setNumReplicas(1)
  .addHashPartitions(List("userNo").asJava, 3))

// get stream from kafka
val parsed = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("startingOffsets", "latest")
  .option("subscribe", "feed_api_band_get_popular_post_list")
  .load()
  .select(from_json(col("value").cast("string"), schema).alias("parsed_value"))

// write it to kudu
kuduContext.insertRows(parsed.toDF(), "test_table");

Now it complains

Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
kafka
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)

My second approach

It seems I have to change my code to use the traditional KafkaUtils.createDirectStream.

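For completeness, the ssc, topics, and kafkaParams used below are set up roughly like this (the group id and batch interval are placeholders):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// batch interval is a placeholder
val ssc = new StreamingContext(sparkContext, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "...",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "test-consumer-group",          // placeholder group id
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics = Array("feed_api_band_get_popular_post_list")
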
KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
).foreachRDD(rdd => {
  rdd.foreach(record => {
    // write to kudu.............
    println(record.value());
  })
});

ssc.start();
ssc.awaitTermination();

So, which one is the right approach? Or is there any way to make the first approach work?

Spark version is 2.2.0.

Jihun No

4 Answers


Both approaches seem right. The first one uses the Spark Structured Streaming way of doing things, where the streaming data is treated as a table that rows get appended to. The second one does it the traditional DStream way.

Rakshith

I believe that, at the present time, there is no Kudu support for using the KuduContext with Spark Structured Streaming. I had a similar issue and had to fall back on the traditional Kudu client and implement a ForeachWriter[Row] class. I used the examples here and was able to achieve a solution.
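
For illustration, a minimal sketch of that idea, using only the plain Kudu Java client inside a ForeachWriter[Row] (the class name, the choice of upsert, and the master address are assumptions; the column names come from the question's schema):

import org.apache.kudu.client.{KuduClient, KuduSession, KuduTable}
import org.apache.spark.sql.{ForeachWriter, Row}

class KuduRowWriter(kuduMaster: String, tableName: String) extends ForeachWriter[Row] {
  // created per partition in open(); the Kudu client is not serializable
  @transient private var client: KuduClient = _
  @transient private var table: KuduTable = _
  @transient private var session: KuduSession = _

  override def open(partitionId: Long, version: Long): Boolean = {
    client = new KuduClient.KuduClientBuilder(kuduMaster).build()
    table = client.openTable(tableName)
    session = client.newSession() // default flush mode applies each operation synchronously
    true
  }

  override def process(value: Row): Unit = {
    val upsert = table.newUpsert()
    val row = upsert.getRow
    row.addInt("userNo", value.getAs[Int]("userNo"))
    row.addInt("bandNo", value.getAs[Int]("bandNo"))
    row.addString("ipv4", value.getAs[String]("ipv4"))
    session.apply(upsert)
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (session != null) session.close()
    if (client != null) client.close()
  }
}

// usage: flatten the parsed struct and write each row through the writer
// parsed.select("parsed_value.*")
//   .writeStream
//   .foreach(new KuduRowWriter("kudu-master:7051", "test_table"))
//   .start()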

alex

The first approach is incorrect, as you can already see from the error, which is very clear: Queries with streaming sources must be executed with writeStream.start(). kuduContext.insertRows will only work on batch DataFrames.

The second one uses DStreams, so it is not Structured Streaming.

There are a third and a fourth approach.

Starting with Kudu 1.9.0, Structured Streaming is supported (the relevant issue has been fixed) and can be used as expected:

    parsed
      .writeStream
      .format("kudu")
      .option("kudu.master", kuduMaster)
      .option("kudu.table", tableName)
      .option("kudu.operation", operation)
      .start()
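
For context, kuduMaster, tableName, and operation above are placeholders; with the question's table they might be set along these lines (values assumed), and the returned query should be awaited:

val kuduMaster = "kudu-master:7051"   // placeholder Kudu master address
val tableName  = "test_table"
val operation  = "upsert"             // e.g. insert or upsert, depending on what you need

val query = parsed
  .select("parsed_value.*")           // flatten the struct produced by from_json
  .writeStream
  .format("kudu")
  .option("kudu.master", kuduMaster)
  .option("kudu.table", tableName)
  .option("kudu.operation", operation)
  .option("checkpointLocation", "/tmp/kudu-checkpoint")  // checkpointing is generally required
  .start()

query.awaitTermination()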

Note that if you are using Cloudera, this method will only work with cdh6.2.0 and above:

<!-- https://mvnrepository.com/artifact/org.apache.kudu/kudu-spark2 -->
<dependency>
    <groupId>org.apache.kudu</groupId>
    <artifactId>kudu-spark2_2.11</artifactId>
    <version>1.9.0-cdh6.2.0</version>
</dependency>
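
If the project uses sbt rather than Maven, a roughly equivalent setup would be (the Cloudera repository URL is an assumption):

// build.sbt (sketch)
resolvers += "cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
libraryDependencies += "org.apache.kudu" % "kudu-spark2_2.11" % "1.9.0-cdh6.2.0"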

My solution was to look at the code of KuduContext to see what kuduContext.insertRows(df, table) and the other methods do, and then create a ForeachWriter[Row]:

val kuduContext = new KuduContext(master, sparkContext)

parsed
  .toDF()
  .writeStream
  .foreach(new ForeachWriter[Row] {
    override def open(partitionId: Long, version: Long): Boolean =
      kuduContext.tableExists(table)

    override def process(value: Row): Unit = {
      val kuduClient = kuduContext.syncClient
      val kuduSession = kuduClient.newSession()
      kuduSession.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND)
      kuduSession.setIgnoreAllDuplicateRows(ignoreDuplicates)

      val kuduTable = kuduClient.openTable(table)
      val operation = getOperationFunction(kuduTable) // get the kuduTable.newInsert(), newUpsert(), etc.

      val row = operation.getRow
      row.addInt("userNo", value.getAs[Int]("userNo"))
      row.addInt("bandNo", value.getAs[Int]("bandNo"))
      row.addString("ipv4", value.getAs[String]("ipv4"))
      kuduSession.apply(operation)

      kuduSession.flush()
      kuduSession.close()
    }

    override def close(errorOrNull: Throwable): Unit = ()
  })
  .start()
Shikkou

We can also load structured streaming data into a Kudu table using Spark 2.2.0 and Cloudera CDH 5.14. You just need to download the kudu-spark2_2.11 JAR built for CDH 6.2 (or later) and pass it with --jars in your spark-submit command. This lets Spark recognize the kudu format in the statement below and load the DataFrame without issue.

parsed
  .writeStream
  .format("kudu")
  .option("kudu.master", kuduMaster)
  .option("kudu.table", tableName)
  .option("kudu.operation", operation)
  .start()

JAR can be downloaded from : https://mvnrepository.com/artifact/org.apache.kudu/kudu-spark2_2.11/1.10.0-cdh6.3.2

Spark-submit statement:

spark2-submit --master local[*] --deploy-mode client --jars spark-sql-kafka-0-10_2.11-2.2.0.jar,kafka-clients-0.10.0.0.jar,spark-streaming-kafka-0-10_2.11-2.2.0.jar,kudu-spark2_2.11-1.10.0-cdh6.3.2.jar,kudu-client-1.10.0-cdh6.3.2.jar /path_of_python_code/rdd-stream-read.py

Note: the kudu-client JAR is optional; it might have to be used with cluster deploy mode.

The writeStream statement used:

query = (dfCols.writeStream
    .format("kudu")
    .option("kudu.master", "host:7051,host:7051,host:7051")
    .option("kudu.table", "impala::db.kudu_table_name")
    .option("kudu.operation", "upsert")
    .option("checkpointLocation", "file:///path_of_dir/checkpoint/")
    .start())