Caching DataFrame in Spark Thrift Server

Question

I have a Spark Thrift Server. I connect to it and query data from a Hive table. If I query the same table again, it loads the files into memory again and re-executes the query.

Is there any way to cache the table data using Spark Thrift Server? If so, how do I do it?

score 2 · Answer 1 · answered Aug 16 '17 at 09:55

Two things:

Use CACHE LAZY TABLE, as described in the answers to "Spark SQL: how to cache sql query result without using rdd.cache()" and "cache tables in apache spark sql".
Set spark.sql.hive.thriftServer.singleSession=true so that other clients can use this cached table.

Remember that with LAZY the table is not cached immediately; it is cached when it is first computed, that is, by the first query that reads it.
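
A minimal sketch of the lazy variant in Spark SQL, run from any JDBC or beeline session connected to the Thrift Server. The table name my_hive_table is a placeholder that does not come from the question, and the server is assumed to have been started with spark.sql.hive.thriftServer.singleSession=true so that other connections see the same cache:

```sql
-- my_hive_table is a hypothetical name; substitute your own Hive table.
-- LAZY only marks the table for caching; nothing is read yet.
CACHE LAZY TABLE my_hive_table;

-- The first query that scans the table materializes the cache.
SELECT COUNT(*) FROM my_hive_table;

-- Later queries can read from the cached data instead of the files.
SELECT * FROM my_hive_table LIMIT 10;

-- Release the memory when the table is no longer needed.
UNCACHE TABLE my_hive_table;
```

Running a full scan such as COUNT(*) right after CACHE LAZY TABLE is one way to make the first "real" query from a client fast, at the cost of paying the load time up front.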

T. Gawęda

Is there any UI or something to see the cached table? – Aditya Calangutkar Aug 16 '17 at 11:21
@AdityaCalangutkar It will be visible on the Storage tab of the Spark UI, however as an RDD, not as a DataFrame or Dataset – T. Gawęda Aug 16 '17 at 12:47
Can you control the cache via SQL, as the persist method does (memory / disk)? – Thomas Decaux Oct 21 '17 at 20:30

Thomas Decaux · Answer 2 · 2017-10-21T20:43:39.693

0

Note that the memory may be consumed by the driver rather than the executors (depending on your settings: local, cluster, ...), so don't forget to allocate more memory to your driver.

To load data into the cache:

CACHE TABLE today AS
SELECT * FROM datahub WHERE year=2017 AND fullname IN ("api.search.search") LIMIT 40000

Start by limiting the data, then watch how memory is consumed, to avoid an out-of-memory (OOM) exception.
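
One way to follow that advice, sketched with the table names from the example above. The OPTIONS clause for choosing a storage level is assumed to require Spark 3.0 or later; on older versions leave it out and the default storage level applies:

```sql
-- Eagerly cache a small slice first
-- (CACHE TABLE without LAZY runs the query immediately).
CACHE TABLE today AS
SELECT * FROM datahub WHERE year = 2017 AND fullname IN ('api.search.search') LIMIT 40000;

-- Inspect the Storage tab of the Spark UI for the cached size,
-- then drop the cache and the temporary view before retrying.
UNCACHE TABLE today;
DROP VIEW IF EXISTS today;

-- Assumption: Spark 3.0+ syntax. Cache the larger slice with an
-- explicit storage level so data can spill to disk.
CACHE TABLE today OPTIONS ('storageLevel' 'MEMORY_AND_DISK') AS
SELECT * FROM datahub WHERE year = 2017 AND fullname IN ('api.search.search');
```

Growing the cached slice step by step keeps a mis-sized cache from taking down a Thrift Server that other clients share.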

edited Oct 21 '17 at 20:43

answered Oct 17 '17 at 16:07

Thomas Decaux

Please clarify the driver comment. You mean for collect, I presume? – thebluephantom Jun 01 '19 at 08:32
