It seems there is no way to do this for the time being.
As shown in SPARK-14922, the target version for this fix is 3.0.0 and it is still in progress.
In the meantime, there are two possible workarounds, as I see it.
Let's set up the problem using Spark 2.4.3:
// We create the table
spark.sql("CREATE TABLE IF NOT EXISTS potato (size INT) PARTITIONED BY (hour STRING)")
// Enable dynamic partitioning
spark.conf.set("hive.exec.dynamic.partition.mode","nonstrict")
// Insert some dummy records
(1 to 9).foreach(i => spark.sql(s"INSERT INTO potato VALUES ($i, '2020-06-07T0$i')"))
// Verify inserts
spark.table("potato").count // 9 records
Now, dropping a single partition from inside Spark works!
spark.sql("""ALTER TABLE potato DROP IF EXISTS PARTITION (hour='2020-06-07T01')""")
spark.table("potato").count // 8 records
Trying to drop multiple partitions in a single statement doesn't work.
spark.sql("""ALTER TABLE potato DROP IF EXISTS PARTITION (hour="2020-06-07T02", hour="2020-06-07T03")""")
org.apache.spark.sql.catalyst.parser.ParseException:
Found duplicate keys 'hour'.(line 1, pos 34)
== SQL ==
ALTER TABLE potato DROP IF EXISTS PARTITION (hour="2020-06-07T02", hour="2020-06-07T03")
----------------------------------^^^
Using a comparison operator to drop a range of partitions also doesn't work.
spark.sql("""ALTER TABLE potato DROP IF EXISTS PARTITION (hour<="2020-06-07T03")""")
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '<=' expecting {')', ','}(line 1, pos 49)
== SQL ==
ALTER TABLE potato DROP IF EXISTS PARTITION (hour<="2020-06-07T03")
-------------------------------------------------^^^
The parse error comes from Spark itself: in Spark 2.x the partition spec grammar only accepts key=value pairs, so comparison operators are rejected at parse time regardless of the column type. Adding predicate-based partition specs is exactly what SPARK-14922 tracks.
The solution I found is:
- Get the list of partitions and filter them according to your condition.
- Either drop the individual partitions one by one, or pass them as a sequence of Map[String,String] (TablePartitionSpec) to the catalog's dropPartitions function.
Step 1:
// Get the external catalog
val catalog = spark.sharedState.externalCatalog
// Get the spec (a Map[String,String]) of every partition
val partitions = catalog.listPartitions("default", "potato").map(_.spec)
// Filter according to the condition you attempted. Flattening the maps
// into (column, value) tuples only works because there is a single
// partition column; see the multi-column note further down.
val filteredPartitions = partitions.flatten.filter(_._2 <= "2020-06-07T03")
                                   .map(t => Map(t._1 -> t._2))
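As a quick check, filteredPartitions should now hold one single-entry map per matching partition; since T01 was already dropped above, that's T02 and T03:
// Each element is a TablePartitionSpec, i.e. a Map[String,String]
filteredPartitions.foreach(println)
// Map(hour -> 2020-06-07T02)
// Map(hour -> 2020-06-07T03)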
Step 2:
We pass each (column, value) tuple to an individual ALTER TABLE ... DROP PARTITION statement.
filteredPartitions.flatten.foreach(t =>
spark.sql(s"""ALTER TABLE potato DROP IF EXISTS PARTITION (${t._1}="${t._2}")"""))
spark.table("potato").count // 6 records
Alternatively, pass the specs all at once to the catalog's dropPartitions function.
// If you purge data, it gets deleted immediately and isn't moved to trash.
// This takes precedence over retainData, so even if you retainData but purge,
// your data is gone.
catalog.dropPartitions("default", "potato", filteredPartitions,
ignoreIfNotExists=true, purge=true, retainData=false)
spark.table("potato").count // 6 records
I hope that was helpful. Let me know if you have a better solution for Spark 2.x.