
I have a Delta table created using Spark 3.x and Delta 0.7.x:

data = spark.range(0, 5)
data.write.format("delta").mode("overwrite").save("tmp/delta-table")
# add some more files
data = spark.range(20, 100)
data.write.format("delta").mode("append").save("tmp/delta-table")

df = spark.read.format("delta").load("tmp/delta-table")
df.show()

Now quite a few files have been generated for the table (many Parquet files that are way too small).

%ls tmp/delta-table

I want to compact them:

df.createGlobalTempView("my_delta_table")
spark.sql("OPTIMIZE my_delta_table ZORDER BY (id)")

fails with:

ParseException: 
mismatched input 'OPTIMIZE' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)

== SQL ==
OPTIMIZE my_delta_table ZORDER BY (id)
^^^

Questions:

  1. How can I get this (OPTIMIZE) to work without the query failing?
  2. Is there a more native API than calling out to text-based SQL?

Note:

Spark is started like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

from delta.tables import *
2 Answers


OPTIMIZE is not available in OSS Delta Lake. If you would like to compact files, you can follow the instructions in the Compact files section of the documentation. If you would like to use ZORDER, you currently need to use the Databricks Runtime.

-- edit --

But it seems to be under development.
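
For reference, a minimal sketch of the approach from the Compact files docs, applied to the table from the question (the path and target file count are just assumptions for illustration): read the table, repartition it down to fewer files, and overwrite it in place with dataChange set to false so the rewrite is treated as a compaction rather than new data.

path = "tmp/delta-table"
num_files = 1  # target file count for this tiny example; tune for real data sizes

(spark.read.format("delta").load(path)
    .repartition(num_files)
    .write
    .option("dataChange", "false")  # rewrite existing data without marking it as new data
    .format("delta")
    .mode("overwrite")
    .save(path))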

  • But: https://docs.delta.io/latest/delta-utility.html lists `OPTIMIZE` as supported? And even: `spark.sql("OPTIMIZE my_delta_table")` fails. However, https://github.com/delta-io/delta/issues/368 looks like your answer is still correct as of today. – Georg Heiler Aug 30 '20 at 06:39
  • Indeed, confirmed. It is not implemented. Further reading material: https://github.com/databricks/tech-talks/tree/master/2020-08-27%20%7C%20How%20Delta%20Lake%20Supercharges%20Data%20Lakes and https://www.youtube.com/watch?v=u1VfOiHVeMI – Georg Heiler Aug 30 '20 at 15:48
  • a `df.repartition(10, col("foo"), col("bar")).sortWithinPartitions`, potentially also a `repartitionByRange`, could be a good approximation in OSS. – Georg Heiler Aug 30 '20 at 15:49
  • This will now work in > 1.2.0 https://docs.delta.io/1.2.0/optimizations-oss.html#compaction-bin-packing – Michael Hoffman May 14 '22 at 01:44
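
Following up on the last comment, a minimal sketch of how this looks on newer OSS releases (my assumptions: Delta Lake 1.2.0+ for OPTIMIZE, Delta Lake 2.0+ for ZORDER BY and the DeltaTable API, each with a matching Spark 3.x build):

# Bin-packing compaction via SQL, available in OSS Delta Lake from 1.2.0:
spark.sql("OPTIMIZE delta.`tmp/delta-table`")

# Z-ordering landed later (Delta Lake 2.0+):
spark.sql("OPTIMIZE delta.`tmp/delta-table` ZORDER BY (id)")

# Delta Lake 2.0+ also offers a more native API than text-based SQL:
from delta.tables import DeltaTable
DeltaTable.forPath(spark, "tmp/delta-table").optimize().executeCompaction()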

If you are running Delta locally, that means you must be using OSS Delta Lake. The OPTIMIZE command is available only in Databricks Delta Lake. For file compaction in OSS, you can follow https://docs.delta.io/latest/best-practices.html#compact-files

  • I am trying to run a compaction on a Delta table (streaming sink) where a Spark job performs a MERGE on this table every 5 minutes. If there are updates happening on the same table while compaction is being executed, how does that work? Should we wait until the compaction has finished before resuming the updates, or can Spark manage it? Also, are there any guidelines for choosing the repartition number? I mean, if I have 1000 small files, is it best to repartition to 10, or even to 5? – Rak Apr 20 '21 at 19:26