
I am working on a notebook that end users can run by providing a table name as input, and that returns an efficient sample query (utilising the partition key and Z-order column). I can get the partition column with DESCRIBE TABLE or spark.catalog, but I cannot find a way to get the Z-order column from the table metadata.

The code for getting the partition columns is given below.

columns = spark.catalog.listColumns(tableName=tablename, dbName=dbname)
partition_columns_details = list(filter(lambda c: c.isPartition, columns))
partition_columns = [c.name for c in partition_columns_details]
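As a sketch of the end goal, the partition and Z-order column lists could be folded into a sample query string. The helper name, the `?` placeholder style, and the TABLESAMPLE fraction are assumptions for illustration, not from the question:

```python
def build_sample_query(table, partition_columns, zorder_columns, fraction=10):
    """Hypothetical helper: build a sample query that samples the table,
    filters on partition columns (so Spark can prune partitions), and
    selects the Z-order columns first."""
    select_cols = ", ".join(zorder_columns + partition_columns) or "*"
    query = f"SELECT {select_cols} FROM {table} TABLESAMPLE ({fraction} PERCENT)"
    # Equality predicates on partition columns enable partition pruning.
    where = " AND ".join(f"{c} = ?" for c in partition_columns)
    if where:
        query += f" WHERE {where}"
    return query

print(build_sample_query("sales", ["region"], ["customer_id"]))
# SELECT customer_id, region FROM sales TABLESAMPLE (10 PERCENT) WHERE region = ?
```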

1 Answer


First, an important thing to know about the difference between partitioning and Z ordering: once you partition a table on a column, the partitioning persists across transactions. If you do an insert, update, optimize, ..., the table will still be partitioned on that column. This is not the case for Z ordering: if you Z-order a table during an OPTIMIZE and afterwards do an insert, update, ..., it is possible that the data is no longer well clustered.

That being said, here is an example to find which column(s) the last Z ordering was done on:

from pyspark.sql import functions as F

df = spark.sql(f'DESCRIBE HISTORY {table_name}')
(df.filter(F.col('operation') == 'OPTIMIZE')
   .orderBy(F.desc('timestamp'))
   .select('operationParameters.zOrderBy')
   .collect()[0].zOrderBy)

You can probably expand the code above with some extra information, e.g. were there many other transactions after the Z ordering? Note that not all transactions 'destroy' the Z ordering result: VACUUM, for example, is registered in the history but does not impact Z ordering, and after a small INSERT you will also probably still benefit from the Z ordering that was done before.
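As one possible expansion, here is a sketch that counts data-changing operations since the last OPTIMIZE. The history rows are mocked as plain dicts for illustration; in practice they would come from `spark.sql(f'DESCRIBE HISTORY {table_name}').collect()`, and the set of "harmless" operations is an assumption based on the reasoning above:

```python
# Operations assumed not to disturb Z ordering (VACUUM only removes
# stale files; it does not rewrite data).
HARMLESS = {"VACUUM START", "VACUUM END"}

def writes_since_last_optimize(history):
    """history: DESCRIBE HISTORY rows, newest first.
    Returns the number of data-changing operations recorded after the
    most recent OPTIMIZE, or None if no OPTIMIZE is in the history."""
    count = 0
    for row in history:
        if row["operation"] == "OPTIMIZE":
            return count
        if row["operation"] not in HARMLESS:
            count += 1
    return None  # no OPTIMIZE found in the retained history

# Mocked history, newest first
history = [
    {"operation": "WRITE"},
    {"operation": "VACUUM END"},
    {"operation": "VACUUM START"},
    {"operation": "OPTIMIZE"},
    {"operation": "WRITE"},
]
print(writes_since_last_optimize(history))  # 1
```

A result of 0 or a small count suggests the last Z ordering is probably still effective; a large count suggests re-running OPTIMIZE ... ZORDER BY.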
