I'm working on a Spark cluster using PySpark and Hive.

I've seen a lot of questions here on SO about the "Cannot overwrite table that is also being read from" Hive error. I understand this comes from how Hive itself works, and we are planning to move to Kudu because of this; but right now I need to get the job done with Hive. I've found some solutions, almost all relying on the "break the data lineage" principle:

  • save to a temp table, read it back, and then save to the target table
  • checkpoint the dataframe (rough sketch below)
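
For reference, the checkpoint variant looks roughly like this (just a sketch; the checkpoint directory is a placeholder):

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # placeholder path
bananaDF = spark.sql("select * from banana")
# ... transformations ...
bananaDF = bananaDF.checkpoint()  # materializes the data, truncating the lineage
bananaDF.write.mode("overwrite").saveAsTable("banana")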

I don't want to use these (or similar) approaches, mainly because they're not efficient on large datasets, so I came up with this instead:

bananaDF = spark.sql("select * from banana")
# hundreds of transformations - no actions
bananaDF.write.mode("overwrite").saveAsTable("banana_temp")
spark.sql("DROP TABLE IF EXISTS banana")
spark.sql("ALTER TABLE banana_temp RENAME TO banana")

Basically, I'm just

  • writing my updated table to a temp one
  • dropping the original table
  • renaming the temp table to match the original name

With this approach I'm writing the data only once, and the drop-rename part is fairly efficient (around 1 second for a 3K-record table). Since I know I'm not that brilliant at coding... can anybody spot what could go wrong with this? It seems like an easy workaround, but I haven't found anything similar on the Internet (which makes me very suspicious).

Thanks!

ercaran

1 Answer


Why don't you use INSERT OVERWRITE instead? Although your way may work, it is not as robust as the approach Hive itself provides for this. INSERT OVERWRITE also avoids possible drift in where the table files are stored in your blob store (e.g. S3), since the table location never changes.

Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;

Hive Docs: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
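
From PySpark, a rough equivalent would be something like this (a sketch only; whether Spark lets you insert-overwrite a table you are reading from in the same job is a separate question, as the comment below notes):

bananaDF = spark.sql("select * from banana")
# ... transformations ...
bananaDF.createOrReplaceTempView("banana_updated")
spark.sql("INSERT OVERWRITE TABLE banana SELECT * FROM banana_updated")
# or, via the DataFrameWriter API:
# bananaDF.write.insertInto("banana", overwrite=True)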

Another cool option would be to just use Delta instead of Hive, which is pretty straightforward if you are working in a Databricks environment. The reasoning for this is that it allows for quicker insertions via upserts. See more here: https://databricks.com/blog/2019/03/19/efficient-upserts-into-data-lakes-databricks-delta.html
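
A minimal upsert sketch with the Delta Lake API (assumes a Delta-enabled session, a Delta table named banana, an updates dataframe updatesDF, and a hypothetical id key column):

from delta.tables import DeltaTable

banana = DeltaTable.forName(spark, "banana")
(banana.alias("t")
    .merge(updatesDF.alias("s"), "t.id = s.id")  # "id" is a hypothetical key
    .whenMatchedUpdateAll()       # update rows that already exist
    .whenNotMatchedInsertAll()    # insert rows that don't
    .execute())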

  • I still have to test it, but I believe I cannot insertOverwrite from Spark. From what I've understood searching online, Spark is the problem. Meanwhile, I found a similar question (and answer) [here](https://stackoverflow.com/questions/46517244/how-can-you-perform-a-insert-overwrite-using-spark). I'll update you as soon as I have time to try your suggestion. Tx – ercaran Aug 26 '20 at 09:23