I'm working on a Spark cluster using PySpark and Hive.
I've seen a lot of questions here on SO regarding the "Cannot overwrite table that is also being read from" Hive error. I understand this comes from how Hive itself works, and we're planning to move to Kudu because of it; but right now I need to get the job done with Hive. I've found some solutions, almost all relying on the "break the data lineage" principle:
- save on temp table, read back and then save on the right table
- checkpoint the dataframe
and similar ones. I don't want to use them, mainly because they're not efficient on large datasets.
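For reference, here's a minimal sketch of what I mean by those two (here df stands for a DataFrame read from the table, and the checkpoint directory is just a placeholder path):
# Workaround 1: write to a staging table, read it back, then overwrite the original
df.write.mode("overwrite").saveAsTable("banana_staging")
spark.table("banana_staging").write.mode("overwrite").saveAsTable("banana")
# Workaround 2: checkpoint the DataFrame to truncate its lineage
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # placeholder path
df = df.checkpoint()  # materializes the data and breaks the lineage
df.write.mode("overwrite").saveAsTable("banana")
So instead, I came up with this: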
bananaDF = spark.sql("select * from banana")
# hundreds of transformations - no actions
bananaDF.write.mode("overwrite").saveAsTable("banana_temp")
spark.sql("DROP TABLE IF EXISTS banana")
spark.sql("ALTER TABLE banana_temp RENAME TO banana")
Basically, I'm just:
- writing my updated table to a temp one
- dropping the original table
- renaming the temp table to match the original name
With this approach I'm writing the data only once, and the drop-rename part is fairly fast (around 1 second for a 3K-record table). Since I know I'm not that brilliant at coding... can anybody spot what could go wrong with this? It seems like an easy workaround, but I haven't found anything similar on the Internet (and that makes me very suspicious).
Thanks!