
I've just learned about SnappyData (and watched some videos about it), and it looks interesting, mainly because it claims that performance can be many times faster than that of a regular Spark job.

Could the following code snippet leverage SnappyData's capabilities to improve the performance of the job while providing the same behavior?

Dataset<EventData> ds = spark
  .readStream()
  .format("kafka")
  (...)
  .as(Encoders.bean(EventData.class)); 

KeyValueGroupedDataset<String, EventData> kvDataset = ds.groupByKey(new MapFunction<EventData, String>() {
  public String call(EventData value) throws Exception {
    return value.getId();
  }
}, Encoders.STRING());

Dataset<EventData> processedDataset = kvDataset.mapGroupsWithState(
  new MapGroupsWithStateFunction<String, EventData, EventData, EventData>() {
    public EventData call(String key, Iterator<EventData> values, GroupState<EventData> state) throws Exception {

      /* state control code */

      EventData processed = EventHandler.validate(key, values);

      return processed;
    }
  },
  Encoders.bean(EventData.class), Encoders.bean(EventData.class));

StreamingQuery query = processedDataset.writeStream()
  .outputMode("update")
  .format("console")
  .start();

1 Answer


I doubt SnappyData will optimize this pipeline. Its optimizations are designed to work on DataFrames (managed as in-memory tables) and on common operators such as group-by, join, and scan.

In your example, I would imagine the mapping functions dominate the processing time. Perhaps it is possible to convert the Dataset<EventData> to a Dataset<Row> (using toDF()), store it in a table, and then operate on it with either built-in Spark SQL operators or UDFs. That could change the ingestion rate significantly. A rough sketch of the idea follows.
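For illustration only, here is a minimal sketch of that approach. It assumes EventData exposes id and payload columns and that EventHandler has a per-record validate(String, String) overload; those names, the validateEvent UDF, and the count aggregate are made up for the sketch and are not part of your snippet:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

// Drop the typed view; the bean schema (id, payload, ...) is kept in the DataFrame.
Dataset<Row> df = ds.toDF();

// Register the validation logic as a UDF so it can be planned alongside built-in operators.
// validate(String, String) is a hypothetical per-record overload.
spark.udf().register("validateEvent",
  (UDF2<String, String, String>) (id, payload) -> EventHandler.validate(id, payload),
  DataTypes.StringType);

// Apply the UDF per row, then aggregate with a built-in operator instead of manual state handling.
Dataset<Row> processed = df
  .withColumn("validated", callUDF("validateEvent", col("id"), col("payload")))
  .groupBy(col("id"))
  .agg(count(col("validated")).as("validatedCount"));

Whether such an aggregate can replace your /* state control code */ depends on what that code does; the point is only that work expressed through DataFrame operators is what the engine can actually optimize.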

In this simple example you are outputting to the console. In the real world, I assume you ingest this state into some store, and this is where SnappyData could make a big difference.
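For example, a minimal sketch of ingesting the processed stream into a store, assuming Spark 2.4+ (for foreachBatch); the event_state table, the JDBC URL, and the checkpoint path are placeholders, and SnappyData also ships its own structured-streaming sink, which is worth checking in its docs before relying on plain JDBC:

import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.streaming.StreamingQuery;

StreamingQuery query = processedDataset.writeStream()
  .outputMode("update")
  .option("checkpointLocation", "/path/to/checkpoint")          // placeholder path
  .foreachBatch((VoidFunction2<Dataset<EventData>, Long>) (batchDs, batchId) -> {
    // Write each micro-batch into a pre-created table over JDBC;
    // the store's JDBC driver must be on the classpath.
    batchDs.write()
      .format("jdbc")
      .option("url", "jdbc:snappydata://locator-host:1527/")    // placeholder connection string
      .option("dbtable", "event_state")                         // placeholder table
      .mode("append")
      .save();
  })
  .start();

Note that with update output mode a plain append writes one row per state update, so in practice an idempotent or upsert-style write into the store would likely be needed.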
