If I execute a filter on a KuduRDD, does the Spark job first read all the data from the Kudu table and apply the filter within the Spark application, or does the filtering happen on the Kudu server, so that the Spark application receives only the filtered data?
Spark gets all the data from the Kudu server into a Spark RDD and then applies the filter – Sandish Kumar H N Feb 05 '18 at 10:08
1 Answer
With the RDD API, all data will be fetched to Spark first. `kuduRDD` returns just a plain `RDD[Row]`:
def kuduRDD(sc: SparkContext,
tableName: String,
columnProjection: Seq[String] = Nil): RDD[Row] = { ...
and there are no special optimizations afterwards.
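To make the consequence concrete, here is a minimal sketch of the RDD path; the master address (`kudu-master:7051`), table name (`my_table`), and column names are hypothetical placeholders:

```scala
import org.apache.spark.SparkContext
import org.apache.kudu.spark.kudu.KuduContext

// Hypothetical Kudu master address and table name, for illustration only.
val kuduContext = new KuduContext("kudu-master:7051", sc)
val rdd = kuduContext.kuduRDD(sc, "my_table", Seq("key", "value"))

// This predicate is NOT pushed down to Kudu: every row of the projected
// columns is shipped to Spark, and the filter runs inside the executors.
val filtered = rdd.filter(row => row.getInt(0) > 100)
```

The only server-side reduction available on this path is the column projection passed to `kuduRDD`; row filtering always happens in Spark.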
With the DataFrame API, according to *Up and running with Apache Spark on Apache Kudu*, the following predicates can be pushed down:
Equal to (=)
Greater than (>)
Greater than or equal (>=)
Less than (<)
Less than or equal (<=)
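By contrast, here is a sketch of the DataFrame route, where a comparison from the list above can be pushed to the Kudu tablet servers; the master address and table name are again hypothetical, and depending on the kudu-spark version the short format name `"kudu"` may need to be written out as `"org.apache.kudu.spark.kudu"`:

```scala
import org.apache.kudu.spark.kudu._

// Hypothetical connection options, for illustration only.
val df = spark.read
  .options(Map("kudu.master" -> "kudu-master:7051",
               "kudu.table"  -> "my_table"))
  .format("kudu")
  .load()

// A supported comparison such as this one is translated into a Kudu scan
// predicate, so only matching rows are sent to Spark.
val filtered = df.filter(df("key") > 100)

// Inspecting the physical plan (the PushedFilters entry) shows whether
// the predicate was actually pushed down.
filtered.explain()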

Alper t. Turker