
I've just learned about SnappyData (and watched some videos about it), and it looks interesting, mainly because it claims that performance can be many times faster than that of a regular Spark job.

Could the following code snippet leverage SnappyData's capabilities to improve the performance of the job while providing the same behavior?

Dataset<EventData> ds = spark
  .readStream()
  .format("kafka")
  (...)
  .as(Encoders.bean(EventData.class)); 

KeyValueGroupedDataset<String, EventData> kvDataset = ds.groupByKey(new MapFunction<EventData, String>() {
  public String call(EventData value) throws Exception {
    return value.getId();
  }
}, Encoders.STRING());

Dataset<EventData> processedDataset = kvDataset.mapGroupsWithState(
  new MapGroupsWithStateFunction<String, EventData, EventData, EventData>() {
    public EventData call(String key, Iterator<EventData> values, GroupState<EventData> state) throws Exception {

      /* state control code */

      EventData processed = EventHandler.validate(key, values);

      return processed;
    }
  },
  Encoders.bean(EventData.class), Encoders.bean(EventData.class));

StreamingQuery query = processedDataset.writeStream()
  .outputMode("update")
  .format("console")
  .start();

1 Answer


I doubt SnappyData will optimize this pipeline. Its optimizations are designed to work on DataFrames (managed as in-memory tables) and on common operators such as group-by, join, and scan.

In your example, I would imagine the mapping functions dominate the processing time. Perhaps it is possible to convert the Dataset<EventData> to a Dataset<Row> (using toDF()), store it in a table, and then operate on it with either built-in Spark SQL operators or UDFs. That could change the ingestion rate significantly. A rough sketch of the idea follows.
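For illustration only, here is a minimal sketch of that approach. It assumes EventData exposes id and payload columns and that EventHandler has a per-record validate(String, String) overload; those names, the validateEvent UDF, and the count aggregate are made up for the sketch and are not part of your snippet:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

// Drop the typed view; the bean schema (id, payload, ...) is kept in the DataFrame.
Dataset<Row> df = ds.toDF();

// Register the validation logic as a UDF so it can be planned alongside built-in operators.
// validate(String, String) is a hypothetical per-record overload.
spark.udf().register("validateEvent",
  (UDF2<String, String, String>) (id, payload) -> EventHandler.validate(id, payload),
  DataTypes.StringType);

// Apply the UDF per row, then aggregate with a built-in operator instead of manual state handling.
Dataset<Row> processed = df
  .withColumn("validated", callUDF("validateEvent", col("id"), col("payload")))
  .groupBy(col("id"))
  .agg(count(col("validated")).as("validatedCount"));

Whether such an aggregate can replace your /* state control code */ depends on what that code does; the point is only that work expressed through DataFrame operators is what the engine can actually optimize.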

In this simple example you are outputting to the console. In the real world, I assume you ingest this state into some store, and this is where SnappyData could make a big difference.
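For example, a minimal sketch of ingesting the processed stream into a store, assuming Spark 2.4+ (for foreachBatch); the event_state table, the JDBC URL, and the checkpoint path are placeholders, and SnappyData also ships its own structured-streaming sink, which is worth checking in its docs before relying on plain JDBC:

import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.streaming.StreamingQuery;

StreamingQuery query = processedDataset.writeStream()
  .outputMode("update")
  .option("checkpointLocation", "/path/to/checkpoint")          // placeholder path
  .foreachBatch((VoidFunction2<Dataset<EventData>, Long>) (batchDs, batchId) -> {
    // Write each micro-batch into a pre-created table over JDBC;
    // the store's JDBC driver must be on the classpath.
    batchDs.write()
      .format("jdbc")
      .option("url", "jdbc:snappydata://locator-host:1527/")    // placeholder connection string
      .option("dbtable", "event_state")                         // placeholder table
      .mode("append")
      .save();
  })
  .start();

Note that with update output mode a plain append writes one row per state update, so in practice an idempotent or upsert-style write into the store would likely be needed.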
