
I have a Spark Streaming application that listens to a Kafka topic. When the data arrives I need to process it and send it to Kudu. Currently I am using the org.apache.kudu.spark.kudu.KuduContext API and calling the insert action with a DataFrame. To build that DataFrame from my data I have to call collect() first so that I can create it with sqlContext.

Is there a way to create the DataFrame / insert the data into Kudu without calling collect(), which is of course costly?

We are using Spark 1.6
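One way to avoid collect() is to build the DataFrame inside foreachRDD, so the records stay distributed on the executors. A minimal sketch follows; the Reading case class, the comma-split parsing, and the table name are illustrative assumptions, and insertRows is the KuduContext method for inserting a DataFrame (check the kudu-spark version you have on the classpath):

```scala
// Sketch only -- assumes Spark 1.6 with spark-streaming-kafka and kudu-spark
// available. Reading, the parsing logic, and the table name are made up.
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

case class Reading(id: Long, value: Double)

def writeToKudu(stream: DStream[(String, String)],
                kuduContext: KuduContext): Unit = {
  stream.foreachRDD { rdd =>
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    import sqlContext.implicits._

    // The map runs on the executors; the records are never
    // pulled back to the driver with collect().
    val df = rdd.map { case (_, value) =>
      val fields = value.split(",")
      Reading(fields(0).toLong, fields(1).toDouble)
    }.toDF()

    kuduContext.insertRows(df, "impala::default.readings")
  }
}
```

The key point is that toDF() (or sqlContext.createDataFrame(rdd)) accepts a distributed RDD directly, so there is no need to materialize the records on the driver first.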

LubaT (edited by tk421)
  • Have you considered using Kafka Connect for this? – Robin Moffatt Aug 08 '18 at 14:36
  • I am not familiar with this, will read about it, thanks. – LubaT Aug 09 '18 at 05:38
  • In Kafka Connect, can we define our own processing to convert the data from the topic? In our case we need to do some calculation and processing before the data is ready for Kudu. – LubaT Aug 09 '18 at 05:48
  • The pattern to follow would be a streams processing application (e.g. Kafka Streams, KSQL, etc) would apply the transformation to the data and write that back to a Kafka topic. Kafka Connect then streams that topic to the target. Separation of responsibilities - easier to develop, operate, scale, etc :) – Robin Moffatt Aug 09 '18 at 08:58

1 Answer


The Kudu sink for Spark now supports Structured Streaming: https://issues.apache.org/jira/browse/KUDU-2640
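With that change in place (Kudu 1.9+ and a Spark 2.x Structured Streaming job), writing the stream to Kudu can look roughly like this sketch. The broker address, topic, master address, and table name are placeholders, and the transformation step is elided:

```scala
// Sketch assuming Kudu 1.9+ (KUDU-2640) and Spark 2.x.
// All addresses, topic, and table names are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-to-kudu").getOrCreate()

// Read the Kafka topic as a streaming DataFrame.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "readings")
  .load()

// ...transform `input` into the target Kudu table's schema here...

// Write the stream to Kudu via the "kudu" data source.
val query = input.writeStream
  .format("kudu")
  .option("kudu.master", "kudu-master:7051")
  .option("kudu.table", "impala::default.readings")
  .outputMode("append")
  .start()

query.awaitTermination()
```

Note that Structured Streaming requires Spark 2.x, so this path would mean upgrading from Spark 1.6.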

Greg S