
Suppose I explicitly cache a query as below:

CACHE SELECT * FROM boxes

and later run another query such as SELECT C1 FROM boxes. Will that query be able to use the same cache, or does it need to have the same construct as the cached query to use the disk cache? Also, if the disk cache is used, will it help reduce compute cost?

Rajib Deb

1 Answer


There probably isn't much benefit in caching a SELECT *, but you can write a subset or preprocessed portion of the data to another Delta table.

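# Assumes an active SparkSession named `spark`, as in a Databricks notebook
boxes_df = spark.table("boxes")
# Keep only the rows of interest and save them as a separate Delta table
smaller_df = boxes_df.filter(boxes_df.price > 20)
smaller_df.write.format("delta").saveAsTable("less_boxes")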

Then you can query the subset as follows:

SELECT * FROM less_boxes

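This pattern can reduce compute cost when the subset is queried repeatedly, since each later query scans less data. Whether it pays off depends on the circumstances, such as how much smaller the subset is and how often it is read.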
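If you would rather warm the disk cache than write a second table, a rough sketch of narrowing the cached data is below. It assumes a Databricks runtime where CACHE SELECT accepts a column list and an optional WHERE predicate. The helper names build_cache_select and warm_disk_cache are made up for illustration, and C1 and price are simply the column names used in the question and in the example above.

```python
def build_cache_select(table, columns=None, predicate=None):
    """Build a CACHE SELECT statement string (hypothetical helper).

    columns: list of column names to cache, or None for all columns.
    predicate: optional row filter, e.g. "price > 20".
    """
    column_list = ", ".join(columns) if columns else "*"
    statement = f"CACHE SELECT {column_list} FROM {table}"
    if predicate:
        statement += f" WHERE {predicate}"
    return statement


def warm_disk_cache(spark, table, columns=None, predicate=None):
    """Run the statement on an existing SparkSession.

    Assumption: `spark` is connected to a Databricks cluster, since
    CACHE SELECT is Databricks SQL and not part of open source Spark.
    """
    spark.sql(build_cache_select(table, columns, predicate))


# Building the statement needs no cluster, so it can be inspected first.
print(build_cache_select("boxes", ["C1"], "price > 20"))
# prints: CACHE SELECT C1 FROM boxes WHERE price > 20
```

In a notebook you would then call warm_disk_cache(spark, "boxes", ["C1"], "price > 20") once before running the queries that read those columns.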

Powers