
How can I delete all data and drop all partitions from a Hive table, using Spark 2.3.0?

truncate table my_table; // Deletes all data, but keeps partitions in metastore

alter table my_table drop partition(p_col > 0) // does not work from spark

The only thing that worked for me was to iterate over the output of show partitions my_table, replace / with , and drop each partition individually. But there must be a cleaner way. And it does not even work if the partition columns are of type string, since the values then need to be quoted. Any suggestions?
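For reference, the workaround described above can be sketched like this. buildDropSpec is a hypothetical helper name, and quoting every value is an assumption that also makes the generated spec valid for string-typed partition columns:

```scala
// Sketch of the "iterate over SHOW PARTITIONS" workaround.
// A row from SHOW PARTITIONS looks like "p_col=1/dt=2020-01-01";
// we turn it into "p_col='1',dt='2020-01-01'". Quoting every value
// keeps the spec valid for string-typed partition columns as well.
def buildDropSpec(partitionRow: String): String =
  partitionRow
    .split("/")
    .map { kv =>
      val Array(key, value) = kv.split("=", 2)
      s"$key='$value'"
    }
    .mkString(",")

// With a SparkSession in scope, the drop loop would then be roughly:
// spark.sql("SHOW PARTITIONS my_table").collect().foreach { row =>
//   val spec = buildDropSpec(row.getString(0))
//   spark.sql(s"ALTER TABLE my_table DROP IF EXISTS PARTITION ($spec)")
// }
```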

Joha

2 Answers


Let's set up the problem using Spark 2.4.3:

// We create the table
spark.sql("CREATE TABLE IF NOT EXISTS potato (size INT) PARTITIONED BY (hour STRING)")

// Enable dynamic partitioning 
spark.conf.set("hive.exec.dynamic.partition.mode","nonstrict")

// Insert some dummy records
(1 to 9).foreach(i => spark.sql(s"INSERT INTO potato VALUES ($i, '2020-06-07T0$i')"))

// Verify inserts
spark.table("potato").count // 9 records

We use the external catalog's listPartitions and dropPartitions functions.

// Get External Catalog
val catalog = spark.sharedState.externalCatalog

// Get the spec from the list of all partitions 
val partitions = catalog.listPartitions("default", "potato").map(_.spec)

// We pass them to the Catalog's dropPartitions function.
// If you purge data, it gets deleted immediately and isn't moved to trash.
// This takes precedence over retainData, so even if you retainData but purge,
// your data is gone.
catalog.dropPartitions("default", "potato", partitions,
                   ignoreIfNotExists=true, purge=true, retainData=false)
spark.table("potato").count // 0 records
catalog.listPartitions("default", "potato").length // 0 partitions

This is all well and good for MANAGED tables, but what about an EXTERNAL table?

// We repeat the setup above but after creating an EXTERNAL table
// After dropping we see that the partitions appear to be gone (or are they?).
catalog.listPartitions("default", "potato").length // 0 partitions

// BUT repairing the table simply adds them again, the partitions/data 
// were NOT deleted from the underlying filesystem. This is not what we wanted!
spark.sql("MSCK REPAIR TABLE potato")
catalog.listPartitions("default", "potato").length // 9 partitions again!   

To get around this, we alter the table from EXTERNAL to MANAGED before dropping the partitions.

import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.CatalogTableType

// Identify the table in question
val identifier = TableIdentifier("potato", Some("default"))

// Get its current metadata
val tableMetadata = catalog.getTableMetadata(identifier)

// Clone the metadata while changing the tableType to MANAGED
val alteredMetadata = tableMetadata.copy(tableType = CatalogTableType.MANAGED)

// Alter the table using the new metadata
catalog.alterTable(alteredMetadata)

// Now drop!
catalog.dropPartitions("default", "potato", partitions,
                   ignoreIfNotExists=true, purge=true, retainData=false)
spark.table("potato").count // 0 records
catalog.listPartitions("default", "potato").length // 0 partitions
spark.sql("MSCK REPAIR TABLE potato") // Won't add anything
catalog.listPartitions("default", "potato").length // Still 0 partitions!

Don't forget to change the table back to EXTERNAL afterwards: repeat the alterTable step, this time copying the metadata with tableType = CatalogTableType.EXTERNAL.

kfkhalili

Hive has two types of tables: managed and external. For a managed table, Hive manages both the schema and the data, so dropping a managed table removes the schema, metadata, and data. An external table, however, has its data sitting somewhere else (say, an external source such as S3), so dropping it removes only the metadata and the table definition; the data remains intact at the source.
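The difference can be illustrated with a toy model (purely illustrative, not Hive itself): treat the metastore as a Map from table name to location, and the table data as a directory on disk. Dropping a managed table removes both; dropping an external table removes only the metastore entry.

```scala
// Toy model of drop semantics for the two table types.
// The "metastore" is a Map, the "data" is a temp directory on disk.
import java.nio.file.{Files, Path}
import scala.collection.mutable

val metastore = mutable.Map.empty[String, Path]

// "Create" a table: make a data directory with one file, register it.
def createTable(name: String): Path = {
  val location = Files.createTempDirectory(name)
  Files.createFile(location.resolve("part-00000"))
  metastore(name) = location
  location
}

// Dropping a MANAGED table removes the metadata AND the data.
def dropManaged(name: String): Unit = {
  val location = metastore.remove(name).get
  Files.walk(location)
    .sorted(java.util.Comparator.reverseOrder()) // files before dirs
    .forEach(p => Files.delete(p))
}

// Dropping an EXTERNAL table removes only the metadata entry.
def dropExternal(name: String): Unit = {
  metastore.remove(name)
  ()
}

val managedLoc = createTable("managed_potato")
dropManaged("managed_potato")   // data directory is deleted too

val externalLoc = createTable("external_potato")
dropExternal("external_potato") // data directory survives on disk
```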

In your case, when you truncate the table, Hive is supposed to keep the metastore entry: the table still exists in Hive, only its data is removed. The metastore itself doesn't hold data; it contains only the schema and other related table details.

I hope this answers your question to some extent.

EDIT1:

Similar Post

Prashant
  • Thank you for your answer. Yes, it explains why truncate does not delete the metadata, and I understand that, but unfortunately it does not solve my problem. – Joha May 20 '19 at 13:48
  • I found a similar question, so I added it to my answer. It may help in case you haven't seen it. – Prashant May 20 '19 at 15:19