You should use sqlContext.cacheTable("table_name") to cache it, or alternatively run a CACHE TABLE table_name SQL query.
Here's an example. I've got this file on HDFS:
1|Alex|alex@gmail.com
2|Paul|paul@example.com
3|John|john@yahoo.com
Then the code in PySpark:
from pyspark.sql import Row

people = sc.textFile('hdfs://sparkdemo:8020/people.txt')
# split each line on '|' and convert it into a Row
people_t = people.map(lambda x: x.split('|')).map(lambda x: Row(id=x[0], name=x[1], email=x[2]))
tbl = sqlContext.inferSchema(people_t)
tbl.registerTempTable('people')
Now we have a table and can query it:
sqlContext.sql('select * from people').collect()
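Side note: inferSchema is the older API; if you happen to be on Spark 1.3 or newer, it is deprecated in favor of createDataFrame, so there the schema inference step would look roughly like this (assuming the same people_t RDD as above, everything else unchanged):
# Spark 1.3+ way of building the table from an RDD of Rows
tbl = sqlContext.createDataFrame(people_t)
tbl.registerTempTable('people')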
To persist it, we have 3 options:
# 1st - using SQL
sqlContext.sql('CACHE TABLE people').collect()
# 2nd - using SQLContext
sqlContext.cacheTable('people')
# cacheTable is lazy, so run a query to actually materialize the cache
sqlContext.sql('select count(*) from people').collect()
# 3rd - using Spark cache on the underlying RDD
tbl.cache()
sqlContext.sql('select count(*) from people').collect()
The 1st and 2nd options are preferred, as they cache the data in an optimized in-memory columnar format, while the 3rd caches it just like any other RDD, in a row-oriented fashion.
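By the way, each of these options has a mirror image in case you later want to free up the memory (a quick sketch, assuming the same table and RDD names as above):
# undo the 1st option
sqlContext.sql('UNCACHE TABLE people')
# undo the 2nd option
sqlContext.uncacheTable('people')
# undo the 3rd option
tbl.unpersist()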
So going back to your question, here's one possible solution:
output = sqlContext.sql("SELECT * FROM people")
output.registerTempTable('people2')
sqlContext.cacheTable('people2')
sqlContext.sql("SELECT count(*) FROM people2").collect()