
I am trying to overwrite a particular partition of a Hive table using PySpark, but each time I try, all the other partitions get wiped out. I went through a couple of posts here regarding this and implemented the steps, but it seems I am still running into the same issue. The code I am using is:

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
df.write.format('parquet').mode('overwrite').partitionBy('col1').option("partitionOverwriteMode", "dynamic").saveAsTable(op_dbname+'.'+op_tblname)

Initially the partitions are col1=m and col1=n, and when I try to overwrite only the partition col1=m, it wipes out col1=n as well.

Spark version is 2.4.4

Appreciate any help.

  • You could try providing the path up to the partition folder while writing, so it won't affect other partitions, e.g. `/foo/bar/baz/col1=m/` – samkart Nov 08 '22 at 06:55
  • Have you tried `partitionBy('col1=m')` and removing `.option("partitionOverwriteMode", "dynamic")`? Because if you mention the partition name/value, it's a static partition operation. – Koushik Roy Nov 08 '22 at 12:22
  • I was going through the Hive wiki and I think you can use `spark.sql` to run `insert overwrite table partition (col1=m) select * from sparktable`. [see](https://cwiki.apache.org/confluence/display/hive/languagemanual+dml#LanguageManualDML-InsertingvaluesintotablesfromSQL:~:text=datafile%20is%20compressed.-,Inserting%20data%20into%20Hive%20Tables%20from%20queries,-Query%20Results%20can) – samkart Nov 09 '22 at 11:16
  • Thanks for writing back. Actually, I don't want to hardcode the partition value, since in my case the number of partitions can run to 100+ and they are not fixed, i.e., the values can change based on the input feed. Using a particular value would require additional logic to work out which partitions are present in the current df, and to split it into multiple such dfs to write to the table. Hence I am looking for a more generic option that solves this without hardcoding the partition column value in the write command. – Kaushik Ghosh Nov 14 '22 at 16:06
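For reference, the `spark.sql` route suggested in the comments can also be written without hardcoding any partition value: leaving the partition column unvalued in `PARTITION (col1)` makes it a dynamic-partition insert, so only the partitions present in the incoming data are replaced. A minimal sketch, reusing the question's variable names and assuming a temp view named `src` (a placeholder) created from `df`:

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

# Expose the DataFrame to SQL
df.createOrReplaceTempView("src")

# No partition value is hardcoded; note that the partition column
# must come last in the SELECT's column order
spark.sql("""
    INSERT OVERWRITE TABLE {}.{} PARTITION (col1)
    SELECT * FROM src
""".format(op_dbname, op_tblname))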

1 Answer


After multiple rounds of trial and error, this is the method that worked for me:

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
op_df.write.format(op_fl_frmt).mode(operation_mode) \
    .option("partitionOverwriteMode", "dynamic") \
    .insertInto(op_dbname + '.' + op_tblname, overwrite=True)

When I tried to use saveAsTable, no matter what I did it always wiped out all the other partitions. Setting only the 'spark.sql.sources.partitionOverwriteMode' flag to dynamic did not seem to work either. Hence I use insertInto, along with an overwrite flag inside it, to achieve the desired output.
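One related caveat: unlike saveAsTable, `insertInto` resolves columns by position, not by name, so the DataFrame's column order must match the target table's schema, with the partition column(s) last. A minimal sketch reusing the variable names above:

# insertInto matches columns by position, not name, so align the
# DataFrame's column order with the table's schema before writing
target_cols = spark.table(op_dbname + '.' + op_tblname).columns
op_df.select(*target_cols).write.insertInto(op_dbname + '.' + op_tblname, overwrite=True)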