You should use sqlContext.cacheTable("table_name") to cache it, or alternatively run a CACHE TABLE table_name SQL query.
Here's an example. I've got this file on HDFS:
1|Alex|alex@gmail.com
2|Paul|paul@example.com
3|John|john@yahoo.com
Then the code in PySpark:
from pyspark.sql import Row

people = sc.textFile('hdfs://sparkdemo:8020/people.txt')
# split each line on '|' and convert it into a Row
people_t = people.map(lambda x: x.split('|')).map(lambda x: Row(id=x[0], name=x[1], email=x[2]))
tbl = sqlContext.inferSchema(people_t)
tbl.registerTempTable('people')
Now we have a table and can query it:
sqlContext.sql('select * from people').collect()
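Side note: inferSchema is the older API; if you happen to be on Spark 1.3 or newer, it is deprecated in favor of createDataFrame, so there the schema inference step would look roughly like this (assuming the same people_t RDD as above, everything else unchanged):
# Spark 1.3+ way of building the table from an RDD of Rows
tbl = sqlContext.createDataFrame(people_t)
tbl.registerTempTable('people')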
To persist it, we have 3 options:
# 1st - using SQL
sqlContext.sql('CACHE TABLE people').collect()
# 2nd - using SQLContext
sqlContext.cacheTable('people')
# cacheTable is lazy, so run a query to actually materialize the cache
sqlContext.sql('select count(*) from people').collect()
# 3rd - using Spark cache on the underlying RDD
tbl.cache()
sqlContext.sql('select count(*) from people').collect()
The 1st and 2nd options are preferred, as they cache the data in an optimized in-memory columnar format, while the 3rd caches it just like any other RDD, in a row-oriented fashion.
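By the way, each of these options has a mirror image in case you later want to free up the memory (a quick sketch, assuming the same table and RDD names as above):
# undo the 1st option
sqlContext.sql('UNCACHE TABLE people')
# undo the 2nd option
sqlContext.uncacheTable('people')
# undo the 3rd option
tbl.unpersist()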
So going back to your question, here's one possible solution:
output = sqlContext.sql("SELECT * FROM people")
output.registerTempTable('people2')
sqlContext.cacheTable('people2')
sqlContext.sql("SELECT count(*) FROM people2").collect()