
I'd like to test the Delta cache in local cluster mode (Jupyter).

1. What I want to do:

Avoid re-downloading whole Delta-formatted files on every read; only new data should have to be downloaded.

2. What I've tried

...
# cell1: enable the Databricks disk cache (a.k.a. Delta cache)
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# cell2: register an external Delta table over the S3 path
spark.sql("""
CREATE TABLE my_table2
USING DELTA
LOCATION 'MY_S3_DELTA_FORMATTED_PATH'
""")

# cell3: time a read of the table
import time
s = time.time()
spark.sql("select * from my_table2").show()
print(time.time() - s)

But cell3 shows roughly the same elapsed time every time I run it, which suggests the table is not being cached.
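(A side note on the timing methodology: show() only fetches the first 20 rows, so it may not trigger a full read of the table. A minimal sketch of a full-scan timing, reusing my_table2 from above and Spark 3's built-in "noop" sink:)

# cell3b: force a full scan; writing to the "noop" sink reads every
# row without materializing any output, whereas show() stops after
# the first 20 rows
import time
s = time.time()
spark.table("my_table2").write.format("noop").mode("overwrite").save()
print(time.time() - s)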

Did I miss something?

edited by Alex Ott
asked by user3595632

1 Answer


https://docs.databricks.com/optimizations/disk-cache.html

https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-cache-cache-table.html

The Databricks disk cache is a different mechanism from Spark's own caching via SQL.

Read the two links above and the difference should become apparent: you have indeed missed something.

What you need here is CACHE ... TABLE ... (Spark SQL caching).
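For example, a minimal sketch of Spark SQL caching applied to the question's my_table2 (CACHE TABLE is eager by default in Spark 3.x and works on any Spark cluster, including local mode):

# cell4: cache the table via Spark SQL; unlike the Databricks disk
# cache, this does not depend on Databricks worker instance types
spark.sql("CACHE TABLE my_table2")           # eager: materializes the cache now

# confirm the table is registered in the cache
print(spark.catalog.isCached("my_table2"))   # True

# subsequent scans are served from cached blocks instead of S3
spark.sql("SELECT * FROM my_table2").show()

# release the cache when finished
spark.sql("UNCACHE TABLE my_table2")

Re-running the timing cell after CACHE TABLE should show a clear drop on the second and later runs.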

answered by thebluephantom