
I have a Delta table created using Spark 3.x and Delta 0.7.x:

data = spark.range(0, 5)
data.write.format("delta").mode("overwrite").save("tmp/delta-table")
# add some more files
data = spark.range(20, 100)
data.write.format("delta").mode("append").save("tmp/delta-table")

df = spark.read.format("delta").load("tmp/delta-table")
df.show()

Now quite a few files have been generated for the table (many Parquet files that are way too small).

%ls tmp/delta-table

I want to compact them:

df.createGlobalTempView("my_delta_table")
spark.sql("OPTIMIZE my_delta_table ZORDER BY (id)")

fails with:

ParseException: 
mismatched input 'OPTIMIZE' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)

== SQL ==
OPTIMIZE my_delta_table ZORDER BY (id)
^^^

Questions:

  1. How can I get this (OPTIMIZE) to work without the query failing?
  2. Is there a more native API than calling out to text-based SQL?

Note:

Spark is started like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

from delta.tables import *
2 Answers


OPTIMIZE is not available in OSS Delta Lake. If you would like to compact files, you can follow the instructions in the Compact files section of the documentation. If you would like to use ZORDER, you currently need to use the Databricks Runtime.

-- edit --

But it seems to be under development.
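
For reference, a minimal sketch of the approach from the Compact files docs, applied to the table from the question (the path and target file count are just assumptions for illustration): read the table, repartition it down to fewer files, and overwrite it in place with dataChange set to false so the rewrite is treated as a compaction rather than new data.

path = "tmp/delta-table"
num_files = 1  # target file count for this tiny example; tune for real data sizes

(spark.read.format("delta").load(path)
    .repartition(num_files)
    .write
    .option("dataChange", "false")  # rewrite existing data without marking it as new data
    .format("delta")
    .mode("overwrite")
    .save(path))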

  • But: https://docs.delta.io/latest/delta-utility.html lists `OPTIMIZE` as supported? And even: `spark.sql("OPTIMIZE my_delta_table")` fails. However, https://github.com/delta-io/delta/issues/368 looks like your answer is still correct as of today. – Georg Heiler Aug 30 '20 at 06:39
  • Indeed, confirmed. It is not implemented. Further reading material: https://github.com/databricks/tech-talks/tree/master/2020-08-27%20%7C%20How%20Delta%20Lake%20Supercharges%20Data%20Lakes and https://www.youtube.com/watch?v=u1VfOiHVeMI – Georg Heiler Aug 30 '20 at 15:48
  • a `df.repartition(10, col("foo"), col("bar")).sortWithinPartitions`, potentially also a `repartitionByRange`, could be a good approximation in OSS. – Georg Heiler Aug 30 '20 at 15:49
  • This will now work in > 1.2.0 https://docs.delta.io/1.2.0/optimizations-oss.html#compaction-bin-packing – Michael Hoffman May 14 '22 at 01:44
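
Following up on the last comment, a minimal sketch of how this looks on newer OSS releases (my assumptions: Delta Lake 1.2.0+ for OPTIMIZE, Delta Lake 2.0+ for ZORDER BY and the DeltaTable API, each with a matching Spark 3.x build):

# Bin-packing compaction via SQL, available in OSS Delta Lake from 1.2.0:
spark.sql("OPTIMIZE delta.`tmp/delta-table`")

# Z-ordering landed later (Delta Lake 2.0+):
spark.sql("OPTIMIZE delta.`tmp/delta-table` ZORDER BY (id)")

# Delta Lake 2.0+ also offers a more native API than text-based SQL:
from delta.tables import DeltaTable
DeltaTable.forPath(spark, "tmp/delta-table").optimize().executeCompaction()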

If you are running Delta locally, that means you must be using OSS Delta Lake. The OPTIMIZE command is available only in Databricks Delta Lake. For file compaction in OSS, you can follow https://docs.delta.io/latest/best-practices.html#compact-files

  • I am trying to run a compaction on a Delta table (streaming sink) where a Spark job performs a MERGE on this table every 5 minutes. If there are updates happening on the same table while compaction is being executed, how does that work? Should we wait until the compaction has finished before resuming the updates, or can Spark manage it? Also, are there any guidelines for choosing the repartition number? I mean, if I have 1000 small files, is it best to repartition to 10, or even to 5? – Rak Apr 20 '21 at 19:26