0

I have streaming data in Azure Databricks that is stored in Delta table format. For optimization, I am currently using Z-ordering. Are there any benefits to using the Hyperspace indexing subsystem over Z-ordering?

Priyanshu
  • I believe this is a valid question - it's not about "what is better", but about what the benefits of using it could be – Alex Ott May 03 '21 at 12:04

1 Answer

0

Disclaimer: I haven't used Hyperspace myself; I have only read the documentation & code examples.

In terms of functionality, Hyperspace is closer to the Data Skipping feature of the Databricks Delta implementation: it lets you read only the data that is necessary. But on Databricks, the data is indexed automatically when it is written, while with Hyperspace you need to build the indexes and maintain them yourself.
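
To make the data-skipping idea concrete, here is a toy sketch in plain Python. It is not Databricks or Hyperspace code: `FileStats`, `files_to_read`, and the file names are hypothetical, and it only assumes that a minimum and maximum value per column are recorded for each file at write time.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileStats:
    """Min/max statistics for one column of one data file (hypothetical model)."""
    name: str
    min_value: int
    max_value: int


def files_to_read(stats: List[FileStats], lo: int, hi: int) -> List[str]:
    """Return the files whose [min, max] range overlaps the query range [lo, hi]."""
    return [s.name for s in stats if s.max_value >= lo and s.min_value <= hi]


# Statistics as they might be recorded when the files are written.
table_stats = [
    FileStats("part-0", 0, 99),
    FileStats("part-1", 100, 199),
    FileStats("part-2", 200, 299),
]

# A filter such as `WHERE id BETWEEN 120 AND 150` can only match rows in part-1.
selected = files_to_read(table_stats, 120, 150)
print(selected)
```

Files whose range cannot contain a matching row are never opened, which is where the saving comes from.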

ZOrder is different functionality: it optimizes the placement of the data, so that there is a higher probability that data that is often used together is actually stored together, so you'll read fewer files. Hyperspace doesn't have that; it just indexes the data, and the placement of the data is determined by the underlying file format.
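
The effect of data placement can be sketched with a toy model, again in plain Python rather than real Delta code. It assumes a table with two integer columns, fixed-size files of 16 rows, and min/max-based file pruning; `z_value`, `split_into_files`, and `files_scanned` are hypothetical helpers. The rows are laid out once sorted by the first column only, and once sorted by an interleaved-bits (Morton) key, which is the general idea behind Z-ordering.

```python
from itertools import product


def z_value(x: int, y: int, bits: int = 4) -> int:
    """Interleave the bits of x and y into a single Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i + 1)
        key |= ((y >> i) & 1) << (2 * i)
    return key


def split_into_files(rows, rows_per_file=16):
    """Cut an ordered list of rows into fixed-size 'files'."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_scanned(files, column, value):
    """Count files whose min/max range for the column could contain the value."""
    count = 0
    for rows in files:
        values = [row[column] for row in rows]
        if min(values) <= value <= max(values):
            count += 1
    return count


rows = list(product(range(16), range(16)))  # every (x, y) pair on a 16x16 grid

linear_files = split_into_files(sorted(rows))  # ordered by x, then y
zorder_files = split_into_files(sorted(rows, key=lambda r: z_value(*r)))

# Files that must be read for the filters `x = 5` and `y = 5` under each layout.
linear_x = files_scanned(linear_files, 0, 5)
linear_y = files_scanned(linear_files, 1, 5)
zorder_x = files_scanned(zorder_files, 0, 5)
zorder_y = files_scanned(zorder_files, 1, 5)
print("linear sort:", linear_x, linear_y)
print("z-order:    ", zorder_x, zorder_y)
```

In this model the linear sort reads 1 of 16 files for a filter on `x` but all 16 for a filter on `y`, while the Z-ordered layout reads 4 files for either filter. That is the trade-off: somewhat worse pruning on the leading column in exchange for useful pruning on every column in the key.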

P.S. Here is a good blog post from Databricks about Data Skipping and ZOrder.

Alex Ott
  • Do you have a reference for: "But on Databricks, indexing of data happens automatically when they are written"? I am not sure if this is necessarily true. – Nitish Upreti May 17 '21 at 18:41
  • Statistics are collected automatically on the first N columns (configurable, 32 by default). The same goes for bloom filters – Alex Ott May 17 '21 at 19:09