I have streaming data in Azure Databricks that is stored in Delta table format. For optimization, I am currently using Z-ordering. Are there any benefits to using the Hyperspace indexing subsystem over Z-ordering?
-
I believe this is a valid question - it's not about "what is better", but about what the benefits of using it could be – Alex Ott May 03 '21 at 12:04
1 Answer
Disclaimer: I haven't used Hyperspace myself; I have only read the documentation & code examples.
In terms of functionality, Hyperspace is closer to the Data Skipping feature of the Databricks Delta implementation: it lets you read only the data that is actually needed. But on Databricks, data is indexed automatically when it is written, while with Hyperspace you need to build the indexes & maintain them yourself.
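To make the idea of data skipping concrete, here is a minimal toy sketch in plain Python. It is not the Databricks or Hyperspace implementation; `FileStats` and `files_to_read` are hypothetical names, and the sketch assumes the only statistics kept per file are the minimum and maximum of a single column.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file statistics for one column."""
    path: str
    min_value: int
    max_value: int


def files_to_read(files, lo, hi):
    """Keep only files whose [min, max] range can overlap the query range [lo, hi]."""
    return [f.path for f in files if f.max_value >= lo and f.min_value <= hi]


files = [
    FileStats("part-0", 0, 99),
    FileStats("part-1", 100, 199),
    FileStats("part-2", 200, 299),
]

# A filter such as "value BETWEEN 120 AND 150" only needs part-1.
selected = files_to_read(files, 120, 150)
print(selected)
```

The narrower each file's min/max range is, the more files a filter can skip, which is why the physical placement of the data matters as well.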
ZOrder is a different kind of functionality: it optimizes the placement of the data, so there is a higher probability that data that is often used together is also stored together, and you'll read fewer files. Hyperspace doesn't have that - it just indexes the data, and the placement of the data is defined by the underlying file format.
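The effect of data placement can be illustrated with a small toy model, sketched here under stated assumptions: rows are pairs of small integers, "files" are fixed-size chunks of the sorted rows, and Z-ordering is modeled as sorting by interleaved bits (a Morton code). The helper names are hypothetical, and this is not how Databricks implements ZOrder internally.

```python
def interleave_bits(x, y, bits=8):
    """Build a Z-order (Morton) key by interleaving the bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def split_into_files(rows, rows_per_file):
    """Cut an ordered list of rows into fixed-size chunks ("files")."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_touched(files, x_range, y_range):
    """Count files whose per-column min/max ranges overlap the query on both columns."""
    count = 0
    for rows in files:
        xs = [r[0] for r in rows]
        ys = [r[1] for r in rows]
        if (min(xs) <= x_range[1] and max(xs) >= x_range[0]
                and min(ys) <= y_range[1] and max(ys) >= y_range[0]):
            count += 1
    return count


rows = [(x, y) for x in range(16) for y in range(16)]  # 256 rows in total

# Layout 1: plain sort by (x, y); each file holds a single x and every y.
by_x = split_into_files(sorted(rows), 16)
# Layout 2: sort by the Z-order key; each file holds a 4x4 block of (x, y).
by_z = split_into_files(sorted(rows, key=lambda r: interleave_bits(*r)), 16)

# Filter on y only: 0 <= y <= 3.
linear_y = files_touched(by_x, (0, 15), (0, 3))
zorder_y = files_touched(by_z, (0, 15), (0, 3))
# Filter on both columns: 0 <= x <= 3 and 0 <= y <= 3.
linear_xy = files_touched(by_x, (0, 3), (0, 3))
zorder_xy = files_touched(by_z, (0, 3), (0, 3))
print(linear_y, zorder_y, linear_xy, zorder_xy)
```

In this toy setup the plain sort only helps filters on `x`, while the Z-ordered layout keeps the min/max ranges of both columns narrow in every file, so filters on either column can skip files.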
P.S. Databricks has a good blog post about Data Skipping and ZOrder.

-
Do you have a reference for: "But on Databricks, indexing of data happens automatically when they are written"? I am not sure this is necessarily true. – Nitish Upreti May 17 '21 at 18:41
-
Statistics are collected automatically on the first N columns (configurable, 32 by default). The same goes for bloom filters – Alex Ott May 17 '21 at 19:09