Apologies if I'm being really basic here, but I need a little PySpark help trying to dynamically overwrite partitions in a Hive table. The tables are drastically simplified, but the issue I'm struggling with is (I hope) clear. I'm pretty new to PySpark and have been searching through Stack Overflow for enough hours to finally create an account and ask! Thanks in advance!!
I have a massive partitioned Hive table (HIVETABLE_TRX) built from a dataframe (trx). I extract more data as a second dataframe (trxup) and want to append to, or overwrite as appropriate, the relevant partitions in HIVETABLE_TRX.
Dataframe (trx)
+---------------+----------+------+
|PRODUCT_LN_NAME|LOCAL_DATE|   TRX|
+---------------+----------+------+
|          HOTEL|2019-01-01| 14298|
|          HOTEL|2019-01-02| 19020|
|          HOTEL|2019-01-03| 18927|
+---------------+----------+------+
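For reproducibility, this is roughly how trx is built; the real extract is much wider, so treat it as a minimal sketch (the literal values and session setup are stand-ins, not my actual environment):
from pyspark.sql import SparkSession

# Hive-enabled session; stand-in for my actual setup
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Simplified stand-in for the real extract
trx = spark.createDataFrame(
    [("HOTEL", "2019-01-01", 14298),
     ("HOTEL", "2019-01-02", 19020),
     ("HOTEL", "2019-01-03", 18927)],
    ["PRODUCT_LN_NAME", "LOCAL_DATE", "TRX"])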
trx.write \
    .partitionBy("PRODUCT_LN_NAME", "LOCAL_DATE") \
    .saveAsTable("HIVETABLE_TRX", mode='overwrite')
# Have a look at the partitioned Hive table
trxchk = spark.sql("""select * from HIVETABLE_TRX""")
trxchk.show()
+------+---------------+----------+
|   TRX|PRODUCT_LN_NAME|LOCAL_DATE|
+------+---------------+----------+
| 14298|          HOTEL|2019-01-01|
| 19020|          HOTEL|2019-01-02|
| 18927|          HOTEL|2019-01-03|
+------+---------------+----------+
The dataframe (trxup) to add to the Hive table has one overlapping row I want to overwrite ('HOTEL', '2019-01-03') and 3 incremental ones to append.
# Have a look at the second dataframe (trxup)
+---------------+----------+------+
|PRODUCT_LN_NAME|LOCAL_DATE|   TRX|
+---------------+----------+------+
|         FLIGHT|2019-01-03| 14410|
|          HOTEL|2019-01-03| 18927|
|         FLIGHT|2019-01-04| 15430|
|          HOTEL|2019-01-04| 19198|
+---------------+----------+------+
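It's built the same way as trx (again a sketch with stand-in values):
# Simplified stand-in for the incremental extract
trxup = spark.createDataFrame(
    [("FLIGHT", "2019-01-03", 14410),
     ("HOTEL", "2019-01-03", 18927),
     ("FLIGHT", "2019-01-04", 15430),
     ("HOTEL", "2019-01-04", 19198)],
    ["PRODUCT_LN_NAME", "LOCAL_DATE", "TRX"])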
I try to insert trxup into HIVETABLE_TRX as follows:
trxup.write \
    .insertInto("HIVETABLE_TRX", overwrite=True)
My understanding was that this would overwrite the one row common to both trxup and HIVETABLE_TRX and append the remaining ones.
# Have a look at HIVETABLE_TRX after the basic insertInto
trxchk2 = spark.sql("""select * from HIVETABLE_TRX""")
trxchk2.show()
+----+---------------+----------+
| TRX|PRODUCT_LN_NAME|LOCAL_DATE|
+----+---------------+----------+
|null|     2019-01-03|     14410|
|null|     2019-01-03|     18927|
|null|     2019-01-04|     15430|
|null|     2019-01-04|     19198|
+----+---------------+----------+
As you can see, it fails to align the columns by name (insertInto matches columns by position, not by name) and overwrites all existing partitions in HIVETABLE_TRX.
So:
1. How do I make sure the columns are aligned for the insertInto? This is the best I could come up with and, although successful, it doesn't feel like that's how it should be done:
colList = spark.sql("""select * from HIVETABLE_TRX""").columns
trxup.selectExpr(colList) \
    .write \
    .insertInto("HIVETABLE_TRX")
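A slightly leaner variant I've considered pulls the column order from the catalog via spark.table instead of running a select (same idea, just a sketch):
# Reorder trxup to the table's column order; insertInto matches by position
trxup.select(*spark.table("HIVETABLE_TRX").columns) \
    .write \
    .insertInto("HIVETABLE_TRX")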
2. Can I insert the second df (trxup) into the partitioned Hive table (HIVETABLE_TRX), just adding/overwriting the appropriate partitions?
Other things I've tried after much Googling, Stack Overflow and soul searching:
Added these options to the session configuration:
hive.exec.dynamic.partition = true
hive.exec.dynamic.partition.mode = nonstrict
spark.sql.sources.partitionOverwriteMode = dynamic
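For reference, this is how I apply them (a sketch against an existing Hive-enabled session; my understanding is that spark.sql.sources.partitionOverwriteMode=dynamic, available from Spark 2.3, is what should make insertInto(..., overwrite=True) replace only the partitions present in trxup):
# The first two are Hive settings, so they go through SQL;
# the third is a Spark conf and can be set at runtime
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Columns reordered to the table layout, then a dynamic-partition overwrite
trxup.select(*spark.table("HIVETABLE_TRX").columns) \
    .write \
    .insertInto("HIVETABLE_TRX", overwrite=True)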
Tried to partitionBy trxup on the insertInto:
trxup.write \
    .partitionBy("PRODUCT_LN_NAME", "LOCAL_DATE") \
    .insertInto("HIVETABLE_TRX", overwrite=True)
AnalysisException: u"insertInto() can't be used together with partitionBy(). Partition columns have already be defined for the table. It is not necessary to use partitionBy().;"
Removed overwrite=True from the insertInto, which actually did what I expected at this point, if not what I want:
+------+---------------+----------+
|   TRX|PRODUCT_LN_NAME|LOCAL_DATE|
+------+---------------+----------+
| 14298|          HOTEL|2019-01-01|
| 19020|          HOTEL|2019-01-02|
| 18927|          HOTEL|2019-01-03|
|  null|     2019-01-03|     14410|
|  null|     2019-01-03|     18927|
|  null|     2019-01-04|     15430|
|  null|     2019-01-04|     19198|
+------+---------------+----------+
I realise that I could convert trxup to a partitioned Hive table (HIVETABLE_TRXUP) and then merge the two, but that feels like a suboptimal way to do it; it sort of defeats the purpose of partitioning the table, no?
trxjoined = spark.sql("""
    select coalesce(tu.PRODUCT_LN_NAME, t.PRODUCT_LN_NAME) as PRODUCT_LN_NAME,
           coalesce(tu.LOCAL_DATE, t.LOCAL_DATE) as LOCAL_DATE,
           coalesce(tu.TRX, t.TRX) as TRX
    from HIVETABLE_TRX t full outer join HIVETABLE_TRXUP tu
    on t.PRODUCT_LN_NAME = tu.PRODUCT_LN_NAME and t.LOCAL_DATE = tu.LOCAL_DATE""")
trxjoined.persist().count()  # materialise before dropping the source tables
spark.sql("""drop table if exists HIVETABLE_TRX""")
spark.sql("""drop table if exists HIVETABLE_TRXUP""")
trxjoined.write \
    .partitionBy("PRODUCT_LN_NAME", "LOCAL_DATE") \
    .saveAsTable("HIVETABLE_TRX", mode='overwrite')