
I have a Spark Streaming application that listens to a Kafka topic. When the data arrives I need to process it and send it to Kudu. Currently I am using the org.apache.kudu.spark.kudu.KuduContext API and calling the insert action with a DataFrame. To build that DataFrame from my data I have to call collect() first so that I can create it with sqlContext.

Is there a way to create the DataFrame / insert the data into Kudu without calling collect(), which is of course costly?

We are using Spark 1.6
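One way to avoid collect() is to build the DataFrame inside foreachRDD, so the records stay distributed on the executors. A minimal sketch follows; the Reading case class, the comma-split parsing, and the table name are illustrative assumptions, and insertRows is the KuduContext method for inserting a DataFrame (check the kudu-spark version you have on the classpath):

```scala
// Sketch only -- assumes Spark 1.6 with spark-streaming-kafka and kudu-spark
// available. Reading, the parsing logic, and the table name are made up.
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

case class Reading(id: Long, value: Double)

def writeToKudu(stream: DStream[(String, String)],
                kuduContext: KuduContext): Unit = {
  stream.foreachRDD { rdd =>
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    import sqlContext.implicits._

    // The map runs on the executors; the records are never
    // pulled back to the driver with collect().
    val df = rdd.map { case (_, value) =>
      val fields = value.split(",")
      Reading(fields(0).toLong, fields(1).toDouble)
    }.toDF()

    kuduContext.insertRows(df, "impala::default.readings")
  }
}
```

The key point is that toDF() (or sqlContext.createDataFrame(rdd)) accepts a distributed RDD directly, so there is no need to materialize the records on the driver first.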

LubaT (edited by tk421)
  • Have you considered using Kafka Connect for this? – Robin Moffatt Aug 08 '18 at 14:36
  • I am not familiar with this, will read about it, thanks. – LubaT Aug 09 '18 at 05:38
  • In Kafka Connect, can we define our own processing to convert the data from the topic? In our case we need to do some calculation and processing before the data is ready for Kudu. – LubaT Aug 09 '18 at 05:48
  • The pattern to follow would be a streams processing application (e.g. Kafka Streams, KSQL, etc) would apply the transformation to the data and write that back to a Kafka topic. Kafka Connect then streams that topic to the target. Separation of responsibilities - easier to develop, operate, scale, etc :) – Robin Moffatt Aug 09 '18 at 08:58

1 Answer


The Kudu sink for Spark now supports Structured Streaming: https://issues.apache.org/jira/browse/KUDU-2640
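With that change in place (Kudu 1.9+ and a Spark 2.x Structured Streaming job), writing the stream to Kudu can look roughly like this sketch. The broker address, topic, master address, and table name are placeholders, and the transformation step is elided:

```scala
// Sketch assuming Kudu 1.9+ (KUDU-2640) and Spark 2.x.
// All addresses, topic, and table names are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-to-kudu").getOrCreate()

// Read the Kafka topic as a streaming DataFrame.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "readings")
  .load()

// ...transform `input` into the target Kudu table's schema here...

// Write the stream to Kudu via the "kudu" data source.
val query = input.writeStream
  .format("kudu")
  .option("kudu.master", "kudu-master:7051")
  .option("kudu.table", "impala::default.readings")
  .outputMode("append")
  .start()

query.awaitTermination()
```

Note that Structured Streaming requires Spark 2.x, so this path would mean upgrading from Spark 1.6.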

Greg S