We are implementing a use case with Spark Streaming (with Flume) and Spark SQL, using windowing to perform CEP-style calculations over a set of data (see below for how the data is captured and used). The idea is to use SQL to perform some action when certain conditions match. The problem is that executing the query on each incoming event batch becomes progressively slower.
By "slow" I mean the following: I have configured a window size of 600 seconds and a batch interval of 20 seconds, and I am pumping in data at a rate of one input every 2 seconds. After about 10 minutes the number of records in the window is constant, so the SQL query should take the same time on every batch.
Instead, once that point is reached the query starts taking longer, and the time grows gradually: for about 300 records, a SELECT count(*) query initially takes 1 second, but after 15 minutes it takes 2 to 3 seconds and keeps increasing.
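As a sanity check on the numbers (the class name below is just for illustration): at one event every 2 seconds, a 600-second window holds a steady state of 600 / 2 = 300 records, which matches the record count we observe, so the growing query time cannot be explained by a growing window.

```java
public class WindowSteadyState {
    public static void main(String[] args) {
        int windowSeconds = 600;   // configured window length (seconds)
        int secondsPerEvent = 2;   // one input every 2 seconds
        // Records in a full window once the stream has run longer than the window
        int steadyStateRecords = windowSeconds / secondsPerEvent;
        System.out.println("records per full window: " + steadyStateRecords); // prints 300
    }
}
```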
Would appreciate it if anyone can suggest a better approach to implementing this use case. Given below are the steps we perform to achieve this -
// Creating the Spark and streaming contexts (20-second batch interval)
JavaSparkContext sc = new JavaSparkContext(sparkConf);
JavaStreamingContext ssc = new JavaStreamingContext(sc, Durations.seconds(20));

JavaReceiverInputDStream<SparkFlumeEvent> flumeStream = FlumeUtils.createStream(ssc, "localhost", 55555);

// Adding the events on a sliding window (600-second window, sliding every 20 seconds)
Duration WINDOW_LENGTH = Durations.seconds(600);
Duration SLIDE_INTERVAL = Durations.seconds(20);
JavaDStream<SparkFlumeEvent> windowDStream =
    flumeStream.window(WINDOW_LENGTH, SLIDE_INTERVAL);

// sc is the existing JavaSparkContext
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

windowDStream.foreachRDD(new Function<JavaRDD<SparkFlumeEvent>, Void>()
{
    public Void call(JavaRDD<SparkFlumeEvent> eventsData)
        throws Exception
    {
        // Timestamp used to give each batch a unique temp table name
        long lTempTime = System.currentTimeMillis();

        // Map the raw Flume events to our event bean
        JavaRDD<AVEventPInt> inputRDD1 = eventsData.map(new Function<SparkFlumeEvent, AVEventPInt>()
        {
            @Override
            public AVEventPInt call(SparkFlumeEvent eventsData) throws Exception
            {
                ...
                return avevent;
            }
        });

        DataFrame schemaevents = sqlContext.createDataFrame(inputRDD1, AVEventPInt.class);
        schemaevents.registerTempTable("avevents" + lTempTime);
        sqlContext.cacheTable("avevents" + lTempTime);

        // here the time taken by the query increases gradually
        long t4 = System.currentTimeMillis();
        Long lTotalEvent = sqlContext.sql("SELECT count(*) FROM avevents" + lTempTime).first().getLong(0);
        System.out.println("time for total event count: " + (System.currentTimeMillis() - t4) / 1000L + " seconds \n");

        sqlContext.dropTempTable("avevents" + lTempTime);
        sqlContext.clearCache();
        return null;
    }
});

ssc.start();
ssc.awaitTermination();