I am new to Spark. I am experimenting with Spark 2.1 for CEP, specifically to detect events that have gone missing in the last 2 minutes. I convert the received input into a JavaPairDStream of (id, eventTime) pairs, then perform reduceByKeyAndWindow on inputEvents and execute Spark SQL on the windowed result.
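For context, inputEvents and my reduce function MaxTimeFuntion look roughly like this (a minimal sketch; the class name MissingEventSetup, the socket source, and the "id,timestamp" line format are just assumptions for illustration):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class MissingEventSetup {

    // Reduce function: keep the most recent event time per id
    public static class MaxTimeFuntion implements Function2<Long, Long, Long> {
        @Override
        public Long call(Long t1, Long t2) {
            return Math.max(t1, t2);
        }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cep-test").setMaster("local[*]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(2000));

        // (id, eventTime) pairs parsed from "id,timestamp" lines on a socket
        JavaPairDStream<String, Long> inputEvents = jssc
                .socketTextStream("localhost", 9999)
                .mapToPair(line -> {
                    String[] parts = line.split(",");
                    return new Tuple2<>(parts[0], Long.parseLong(parts[1]));
                });

        // ... reduceByKeyAndWindow + SQL as below ...

        jssc.start();
        jssc.awaitTermination();
    }
}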
JavaPairDStream<String, Long> reduceWindowed = inputEvents.reduceByKeyAndWindow(
        new MaxTimeFuntion(), Durations.seconds(124), new Duration(2000));

reduceWindowed.foreachRDD((rdd, time) -> {
    SparkSession spark = TestSparkSessionSingleton.getInstance(rdd.context().getConf());

    // Wrap each (id, eventTime) pair in an EventData bean
    JavaRDD<EventData> rowRDD = rdd.map(new org.apache.spark.api.java.function.Function<Tuple2<String, Long>, EventData>() {
        @Override
        public EventData call(Tuple2<String, Long> tuple) {
            EventData record = new EventData();
            record.setId(tuple._1);
            record.setEventTime(tuple._2);
            return record;
        }
    });

    Dataset<Row> eventDataFrames = spark.createDataFrame(rowRDD, EventData.class);
    eventDataFrames.createOrReplaceTempView("events");

    // Ids whose most recent event is at least 2 minutes old
    Dataset<Row> resultRows = spark.sql(
            "select id, max(eventTime) as maxval from events "
          + "group by id having (unix_timestamp() * 1000 - maxval >= 120000)");
    resultRows.show();   // an output action is needed so the query actually runs
});
I performed the same filtering using RDD functions:
JavaPairDStream<String, Long> filteredStream = reduceWindowed.filter(new Function<Tuple2<String,Long>, Boolean>() {
public Boolean call(Tuple2<String,Long> val)
{
return (System.currentTimeMillis() - val._2() >= 120000);
}
});
filteredStream.print();
Both approaches give me the same result for the Dataset and the RDD.
Am I using Spark SQL properly?
In local mode, the Spark SQL query consumes noticeably more CPU than the RDD function for the same input rate. Can anyone help me understand why Spark SQL consumes so much more CPU than the RDD filter function?