
I am new to Spark. I am experimenting with Spark 2.1 for CEP purposes: detecting events that have gone missing in the last 2 minutes. I convert the received input into a JavaDStream of input events, then perform reduceByKeyAndWindow on inputEvents and execute Spark SQL:

    JavaPairDStream<String, Long> reduceWindowed = inputEvents.reduceByKeyAndWindow(
            new MaxTimeFuntion(), Durations.seconds(124), new Duration(2000));

    reduceWindowed.foreachRDD((rdd, time) -> {
        SparkSession spark = TestSparkSessionSingleton.getInstance(rdd.context().getConf());

        // Convert each (id, latestEventTime) pair into an EventData bean.
        JavaRDD<EventData> rowRDD = rdd.map(
                new org.apache.spark.api.java.function.Function<Tuple2<String, Long>, EventData>() {
                    @Override
                    public EventData call(Tuple2<String, Long> tuple) {
                        EventData record = new EventData();
                        record.setId(tuple._1);
                        record.setEventTime(tuple._2);
                        return record;
                    }
                });

        Dataset<Row> eventDataFrames = spark.createDataFrame(rowRDD, EventData.class);
        eventDataFrames.createOrReplaceTempView("events");

        // Ids whose latest event is at least 2 minutes old.
        Dataset<Row> resultRows = spark.sql(
                "select id, max(eventTime) as maxval from events group by id "
              + "having (unix_timestamp()*1000 - maxval >= 120000)");
    });

I performed the same filtering using RDD functions:

    JavaPairDStream<String, Long> filteredStream = reduceWindowed.filter(
            new Function<Tuple2<String, Long>, Boolean>() {
                @Override
                public Boolean call(Tuple2<String, Long> val) {
                    return (System.currentTimeMillis() - val._2() >= 120000);
                }
            });

    filteredStream.print();

Both approaches give me the same result, for the Dataset as well as the RDD.

Am I using Spark SQL properly?

In local mode, the Spark SQL query consumes noticeably more CPU than the RDD function for the same input rate. Can anyone help me understand why Spark SQL consumes more CPU than the RDD filter function?

Abirami
1 Answer


Spark SQL uses Catalyst (a query optimizer), which:

  1. Analyzes the SQL query
  2. Applies logical optimizations
  3. Performs physical planning
  4. Generates code
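
You can watch Catalyst do this for the query in the question by printing the plan (a minimal sketch; `resultRows` is the Dataset produced by `spark.sql` above):

    // extended=true prints the parsed, analyzed, optimized, and physical
    // plans, i.e. the output of the Catalyst phases listed above.
    resultRows.explain(true);

Since the query runs inside foreachRDD, this analysis, planning, and code generation happens for every micro-batch, which is likely where much of the extra CPU you observe in local mode goes.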

Datasets are rows internally and JVM objects externally, so they can be used in a type-safe way while remaining fast, though they are slower than DataFrames and not as good for interactive analysis. The Dataset API, released as an API preview in Spark 1.6, aims to provide the best of both worlds: the familiar object-oriented programming style and compile-time type safety of the RDD API, combined with the performance benefits of the Catalyst query optimizer. Datasets also use the same efficient off-heap storage mechanism as the DataFrame API.
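
As an illustration (a sketch, not from the original post; it assumes EventData exposes a getEventTime() getter matching its setter), the same staleness check can be written as a typed Dataset operation, which keeps compile-time type safety while still going through Catalyst:

    import org.apache.spark.api.java.function.FilterFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;

    // Decode the untyped rows into EventData beans; the predicate below is
    // checked by the Java compiler, unlike a string of SQL.
    Dataset<EventData> events = eventDataFrames.as(Encoders.bean(EventData.class));
    Dataset<EventData> stale = events.filter(
            (FilterFunction<EventData>) e ->
                    System.currentTimeMillis() - e.getEventTime() >= 120000);

Note that a lambda predicate like this is opaque to Catalyst: it must deserialize each row and call your function, whereas a SQL expression can be optimized and compiled into generated code.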

An RDD, on the other hand, is merely a Resilient Distributed Dataset: more of a black box of data that cannot be optimized, because the operations that can be performed against it are not as constrained.
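
Concretely (a sketch against the question's filteredStream), all Spark can show you for the RDD path is its lineage; it has no expression-level view into the filter predicate:

    filteredStream.foreachRDD(rdd ->
            // Prints the chain of transformations. Spark sees only
            // "filter(<function>)"; it cannot rewrite the predicate the way
            // Catalyst rewrites a SQL expression.
            System.out.println(rdd.toDebugString()));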

FaigB
  • I've seen it said in a few places that untyped Datasets are faster than typed ones. But how can this be?? – georgiosd Apr 16 '18 at 12:14