
If I execute a filter on a KuduRDD, does the Spark job first read all the data from the Kudu table and apply the filter inside the Spark application, or does the filtering happen on the Kudu server so that the Spark application receives only the filtered data?

gszecsenyi

1 Answer


With the RDD API, all data will be fetched into Spark first. kuduRDD returns just a plain RDD[Row]:

def kuduRDD(sc: SparkContext,
            tableName: String,
            columnProjection: Seq[String] = Nil): RDD[Row] = { ...

and no special optimizations are applied afterwards, so a filter on the result is an ordinary Spark transformation evaluated on rows that have already been read from Kudu.
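
A minimal sketch of what that means in practice, assuming the kudu-spark KuduContext; the master address, table name and column names are placeholders:

import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kudu-rdd-filter").getOrCreate()

// "kudu-master:7051" and "my_table" are hypothetical values for illustration.
val kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext)

// kuduRDD returns a plain RDD[Row]: the scan projects columns but knows
// nothing about the filter applied below.
val rows = kuduContext.kuduRDD(spark.sparkContext, "my_table", Seq("key", "value"))

// This filter is an ordinary Spark transformation, evaluated on the
// executors after the rows have already been fetched from Kudu.
val filtered = rows.filter(row => row.getInt(0) > 100)

println(filtered.count())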

With the DataFrame API, according to Up and running with Apache Spark on Apache Kudu, the following predicates can be pushed down (a sketch follows the list):

Equal to (=)

Greater than (>)

Greater than or equal (>=)

Less than (<)

Less than or equal (<=)
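
A sketch of the DataFrame route, again with a placeholder master address and table name (older kudu-spark releases use the long format name shown in the comment):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kudu-df-pushdown").getOrCreate()

// "kudu-master:7051" and "my_table" are hypothetical values for illustration.
val df = spark.read
  .format("kudu") // older releases: .format("org.apache.kudu.spark.kudu")
  .option("kudu.master", "kudu-master:7051")
  .option("kudu.table", "my_table")
  .load()

// A comparison such as >= can be pushed down to the Kudu scanner, so only
// matching rows are returned to Spark; explain() should list the predicate
// under PushedFilters in the physical plan.
val filtered = df.filter(df("value") >= 100)
filtered.explain()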

Alper t. Turker