I'm working on a Spark cluster using PySpark and Hive.

I've seen a lot of questions here on SO about the "Cannot overwrite table that is also being read from" Hive error. I understand this comes from how Hive itself works, and we are planning to move to Kudu because of this; but right now I need to get the job done with Hive. I've found some solutions, almost all relying on the "break the data lineage" principle:

  • save to a temp table, read it back, and then save to the target table
  • checkpoint the dataframe (rough sketch below)
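
For reference, the checkpoint variant looks roughly like this (just a sketch; the checkpoint directory is a placeholder):

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # placeholder path
bananaDF = spark.sql("select * from banana")
# ... transformations ...
bananaDF = bananaDF.checkpoint()  # materializes the data, truncating the lineage
bananaDF.write.mode("overwrite").saveAsTable("banana")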

I don't want to use these (or similar) approaches, mainly because they're not efficient on large datasets, so I came up with this instead:

bananaDF = spark.sql("select * from banana")
# hundreds of transformations - no actions
bananaDF.write.mode("overwrite").saveAsTable("banana_temp")
spark.sql("DROP TABLE IF EXISTS banana")
spark.sql("ALTER TABLE banana_temp RENAME TO banana")

Basically, I'm just

  • writing my updated table to a temp one
  • dropping the original table
  • renaming the temp table to match the original name

With this approach I'm writing the data only once, and the drop-rename part is fairly efficient (around 1 second for a 3K-record table). Since I know I'm not that brilliant at coding... can anybody spot what could go wrong with this? It seems like an easy workaround, but I haven't found anything similar on the Internet (which makes me very suspicious).

Thanks!

ercaran

1 Answer


Why don't you use INSERT OVERWRITE instead? Although your way may work, it is not as robust as the approach Hive itself provides for this. INSERT OVERWRITE also avoids possible drift in where the table files are stored in your blob store (e.g. S3), since the table location never changes.

Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;

Hive Docs: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
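
From PySpark, a rough equivalent would be something like this (a sketch only; whether Spark lets you insert-overwrite a table you are reading from in the same job is a separate question, as the comment below notes):

bananaDF = spark.sql("select * from banana")
# ... transformations ...
bananaDF.createOrReplaceTempView("banana_updated")
spark.sql("INSERT OVERWRITE TABLE banana SELECT * FROM banana_updated")
# or, via the DataFrameWriter API:
# bananaDF.write.insertInto("banana", overwrite=True)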

Another cool option would be to just use Delta instead of Hive, which is pretty straightforward if you are working in a Databricks environment. The reasoning for this is that it allows for quicker insertions via upserts. See more here: https://databricks.com/blog/2019/03/19/efficient-upserts-into-data-lakes-databricks-delta.html
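
A minimal upsert sketch with the Delta Lake API (assumes a Delta-enabled session, a Delta table named banana, an updates dataframe updatesDF, and a hypothetical id key column):

from delta.tables import DeltaTable

banana = DeltaTable.forName(spark, "banana")
(banana.alias("t")
    .merge(updatesDF.alias("s"), "t.id = s.id")  # "id" is a hypothetical key
    .whenMatchedUpdateAll()       # update rows that already exist
    .whenNotMatchedInsertAll()    # insert rows that don't
    .execute())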

  • I still have to test it, but I believe I cannot insertOverwrite from Spark. From what I've understood searching online, Spark is the problem. Meanwhile, I found a similar question (and answer) [here](https://stackoverflow.com/questions/46517244/how-can-you-perform-a-insert-overwrite-using-spark). I'll update you as soon as I have time to try your suggestion. Tx – ercaran Aug 26 '20 at 09:23