
If I execute a filter on a KuduRDD, does the Spark job first read all the data from the Kudu table and apply the filter inside the Spark application, or does the filtering happen on the Kudu server so that the Spark application receives only the filtered data?

gszecsenyi

1 Answer


With the RDD API, all data will be fetched into Spark first. kuduRDD returns just a plain RDD[Row]:

def kuduRDD(sc: SparkContext,
            tableName: String,
            columnProjection: Seq[String] = Nil): RDD[Row] = { ...

and no special optimizations are applied afterwards, so a filter on the result is an ordinary Spark transformation evaluated on rows that have already been read from Kudu.
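
A minimal sketch of what that means in practice, assuming the kudu-spark KuduContext; the master address, table name and column names are placeholders:

import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kudu-rdd-filter").getOrCreate()

// "kudu-master:7051" and "my_table" are hypothetical values for illustration.
val kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext)

// kuduRDD returns a plain RDD[Row]: the scan projects columns but knows
// nothing about the filter applied below.
val rows = kuduContext.kuduRDD(spark.sparkContext, "my_table", Seq("key", "value"))

// This filter is an ordinary Spark transformation, evaluated on the
// executors after the rows have already been fetched from Kudu.
val filtered = rows.filter(row => row.getInt(0) > 100)

println(filtered.count())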

With the DataFrame API, according to Up and running with Apache Spark on Apache Kudu, the following predicates can be pushed down (a sketch follows the list):

Equal to (=)

Greater than (>)

Greater than or equal (>=)

Less than (<)

Less than or equal (<=)
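
A sketch of the DataFrame route, again with a placeholder master address and table name (older kudu-spark releases use the long format name shown in the comment):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kudu-df-pushdown").getOrCreate()

// "kudu-master:7051" and "my_table" are hypothetical values for illustration.
val df = spark.read
  .format("kudu") // older releases: .format("org.apache.kudu.spark.kudu")
  .option("kudu.master", "kudu-master:7051")
  .option("kudu.table", "my_table")
  .load()

// A comparison such as >= can be pushed down to the Kudu scanner, so only
// matching rows are returned to Spark; explain() should list the predicate
// under PushedFilters in the physical plan.
val filtered = df.filter(df("value") >= 100)
filtered.explain()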

Alper t. Turker