
I'd like to test the Delta cache in local cluster mode (Jupyter).

1. What I want to do:

Avoid re-downloading whole Delta-formatted files on every read; only new data should have to be downloaded.

2. What I've tried

...
# cell1: enable the Databricks disk cache (a.k.a. Delta cache)
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# cell2: register an external Delta table over the S3 path
spark.sql("""
CREATE TABLE my_table2
USING DELTA
LOCATION 'MY_S3_DELTA_FORMATTED_PATH'
""")

# cell3: time a read of the table
import time
s = time.time()
spark.sql("select * from my_table2").show()
print(time.time() - s)

But cell3 shows roughly the same elapsed time every time I run it, which suggests the table is not being cached.
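(A side note on the timing methodology: show() only fetches the first 20 rows, so it may not trigger a full read of the table. A minimal sketch of a full-scan timing, reusing my_table2 from above and Spark 3's built-in "noop" sink:)

# cell3b: force a full scan; writing to the "noop" sink reads every
# row without materializing any output, whereas show() stops after
# the first 20 rows
import time
s = time.time()
spark.table("my_table2").write.format("noop").mode("overwrite").save()
print(time.time() - s)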

Did I miss something?

edited by Alex Ott
asked by user3595632

1 Answer


https://docs.databricks.com/optimizations/disk-cache.html

https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-cache-cache-table.html

The Databricks disk cache is a different mechanism from Spark's own caching via SQL.

Read the two links above and the difference should become apparent: you have indeed missed something.

What you need here is CACHE ... TABLE ... (Spark SQL caching).
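For example, a minimal sketch of Spark SQL caching applied to the question's my_table2 (CACHE TABLE is eager by default in Spark 3.x and works on any Spark cluster, including local mode):

# cell4: cache the table via Spark SQL; unlike the Databricks disk
# cache, this does not depend on Databricks worker instance types
spark.sql("CACHE TABLE my_table2")           # eager: materializes the cache now

# confirm the table is registered in the cache
print(spark.catalog.isCached("my_table2"))   # True

# subsequent scans are served from cached blocks instead of S3
spark.sql("SELECT * FROM my_table2").show()

# release the cache when finished
spark.sql("UNCACHE TABLE my_table2")

Re-running the timing cell after CACHE TABLE should show a clear drop on the second and later runs.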

answered by thebluephantom