
Is partition pruning enabled for cached TempTables in Apache Spark? If so, how do I configure it?

My data is a bunch of sensor readings from different installations; one row contains installationName, tag, timestamp, and value.

I have written the data in parquet format using the following commands:

rdd.toDF("installationName", "tag", "timestamp", "value")
  .repartition($"installationName", $"tag")
  .write.partitionBy("installationName","tag").mode("overwrite").parquet(config.output)

I read this data into a SQL table using Spark HiveContext with the following commands:

val parquet = hc.read.parquet("/path_to_table/tablename")
parquet.registerTempTable("tablename")

Now if I run a SQL query on this table, it does partition pruning as expected:

hc.sql("select * from tablename where installationName = 'XXX' and tag = 'YYY'")

And the query takes around 8 seconds. But if I cache the table in memory, and then perform the same query, it always takes around 50 seconds:

hc.sql("CACHE TABLE tablename")
hc.sql("select * from tablename where installationName = 'XXX' and tag = 'YYY'")

I am currently using Spark 1.6.1.
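
For reference, one way to check whether pruning happens is to inspect the query plan; a minimal sketch ('XXX' and 'YYY' are placeholder values, as above):

// explain(true) prints the parsed, analyzed, optimized, and physical plans,
// so the effect of the installationName/tag filters on the scan can be inspected.
hc.sql("select * from tablename where installationName = 'XXX' and tag = 'YYY'")
  .explain(true)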

  • Hi, thanks for your comment. Indeed I do a repartition operation before I write the data to parquet. I also tested the above query with the repartitioning, and it is more efficient, with a query time of 20 s, but it is still slower than reading from the parquet files without caching. My purpose is to avoid writing to parquet files altogether. Could you maybe provide a source for how you know that partition pruning is not supported after caching? If you write an answer here, I can accept it. – Juha-Antti Feb 13 '17 at 07:12
  • Correction: caching in memory reduces the query time to less than 1 second, which is of course already acceptable. I wonder whether it scales, though: this is only part of my data, and I actually have over 200 times more, continually growing. The more data I have, the more time scanning through all the partitions takes, so partition pruning would seem beneficial here. – Juha-Antti Feb 14 '17 at 14:12

1 Answer


This happens because of how caching works in Spark.

When you run an operation on a DataFrame, RDD, or Dataset, Spark builds an execution plan for it. For example:

val df = sc.parallelize(1 to 10000).toDF("line")
df.withColumn("new_line", col("line") * 10).queryExecution

Calling queryExecution returns that plan. Here are the plans for the code above:

== Parsed Logical Plan ==
Project [*,('line * 10) AS new_line#7]
+- Project [_1#4 AS line#5]
   +- LogicalRDD [_1#4], MapPartitionsRDD[9] at 

== Analyzed Logical Plan ==
line: int, new_line: int
Project [line#5,(line#5 * 10) AS new_line#7]
+- Project [_1#4 AS line#5]
   +- LogicalRDD [_1#4], MapPartitionsRDD[9] at 

== Optimized Logical Plan ==
Project [_1#4 AS line#5,(_1#4 * 10) AS new_line#7]
+- LogicalRDD [_1#4], MapPartitionsRDD[9] at intRddToDataFrameHolder at 

== Physical Plan ==
Project [_1#4 AS line#5,(_1#4 * 10) AS new_line#7]
+- Scan ExistingRDD[_1#4]

Here you can see the whole process that your code will run. Now add a cache call, like this:

df.withColumn("new_line", col("line") * 10).cache().queryExecution

The result will be like this:

== Parsed Logical Plan ==
'Project [*,('line * 10) AS new_line#8]
+- Project [_1#4 AS line#5]
   +- LogicalRDD [_1#4], MapPartitionsRDD[9] at intRddToDataFrameHolder at <console>:34

== Analyzed Logical Plan ==
line: int, new_line: int
Project [line#5,(line#5 * 10) AS new_line#8]
+- Project [_1#4 AS line#5]
   +- LogicalRDD [_1#4], MapPartitionsRDD[9] at intRddToDataFrameHolder at <console>:34

== Optimized Logical Plan ==
InMemoryRelation [line#5,new_line#8], true, 10000, StorageLevel(true, true, false, true, 1), Project [_1#4 AS line#5,(_1#4 * 10) AS new_line#8], None

== Physical Plan ==
InMemoryColumnarTableScan [line#5,new_line#8], InMemoryRelation [line#5,new_line#8], true, 10000, StorageLevel(true, true, false, true, 1), Pro...

This time the optimized logical plan contains an InMemoryRelation: Spark will keep the data in an in-memory structure, and if the data is really big it will spill to disk.
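
The StorageLevel(true, true, false, true, 1) in the plan decodes to MEMORY_AND_DISK (use disk, use memory, no off-heap, deserialized, replication 1), which is the default level for DataFrame caching in Spark 1.6. A minimal sketch of setting the level explicitly with persist instead of cache:

import org.apache.spark.storage.StorageLevel

// Equivalent to cache() for DataFrames in Spark 1.6: keep the data in
// memory and spill partitions that do not fit to local disk.
df.withColumn("new_line", col("line") * 10).persist(StorageLevel.MEMORY_AND_DISK)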

Building this cache across the cluster takes time, so the first execution will be a little slow; but whenever you access the same data again somewhere else, the DataFrame or RDD is already stored and Spark will not run the computation again.
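
A minimal sketch of that behavior, reusing the df from above:

val cached = df.withColumn("new_line", col("line") * 10).cache()
cached.count()  // first action: runs the plan and materializes the cache
cached.count()  // later actions: served from the InMemoryRelation, not recomputed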

– Thiago Baldim
  • Thank you for your answer! In Spark, table caching is an eager operation, which means the data was already cached when I ran my query for the first time. Caching the data actually takes 500 seconds here, and indeed after the cache the performance of the query is improved: it now takes only 50 seconds to scan through all the partitions. No matter how many times I run the query, the performance is always about the same. Your answer does not address my question, which is about partition pruning after caching. – Juha-Antti Feb 13 '17 at 07:15
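
Side note on the eagerness mentioned above, as a minimal sketch (assuming Spark 1.6's CACHE LAZY TABLE syntax):

hc.sql("CACHE TABLE tablename")       // eager: scans and materializes immediately
hc.sql("CACHE LAZY TABLE tablename")  // lazy variant: materializes on first query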