Apologies if I'm being really basic here, but I need a little PySpark help trying to dynamically overwrite partitions in a Hive table. The tables are drastically simplified, but the issue I'm struggling with is (I hope) clear. I'm pretty new to PySpark and have been searching through Stack Overflow for enough hours to finally create an account and ask! Thanks in advance!!
I have a massive partitioned Hive table (HIVETABLE_TRX) built from a dataframe (trx). I extract more data as a second dataframe (trxup) and want to append to, or overwrite as appropriate, the relevant partitions in HIVETABLE_TRX.
Dataframe (trx)
+---------------+----------+------+
|PRODUCT_LN_NAME|LOCAL_DATE|   TRX|
+---------------+----------+------+
|          HOTEL|2019-01-01| 14298|
|          HOTEL|2019-01-02| 19020|
|          HOTEL|2019-01-03| 18927|
+---------------+----------+------+
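For reproducibility, this is roughly how trx is built; the real extract is much wider, so treat it as a minimal sketch (the literal values and session setup are stand-ins, not my actual environment):
from pyspark.sql import SparkSession

# Hive-enabled session; stand-in for my actual setup
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Simplified stand-in for the real extract
trx = spark.createDataFrame(
    [("HOTEL", "2019-01-01", 14298),
     ("HOTEL", "2019-01-02", 19020),
     ("HOTEL", "2019-01-03", 18927)],
    ["PRODUCT_LN_NAME", "LOCAL_DATE", "TRX"])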
trx.write \
    .partitionBy("PRODUCT_LN_NAME", "LOCAL_DATE") \
    .saveAsTable("HIVETABLE_TRX", mode='overwrite')
# Have a look at the partitioned Hive table
trxchk = spark.sql("""select * from HIVETABLE_TRX""")
trxchk.show()
+------+---------------+----------+
|   TRX|PRODUCT_LN_NAME|LOCAL_DATE|
+------+---------------+----------+
| 14298|          HOTEL|2019-01-01|
| 19020|          HOTEL|2019-01-02|
| 18927|          HOTEL|2019-01-03|
+------+---------------+----------+
The dataframe (trxup) to add to the Hive table has one overlapping row I want to overwrite ('HOTEL', '2019-01-03') and 3 incremental ones to append.
# Have a look at the second dataframe (trxup)
+---------------+----------+------+
|PRODUCT_LN_NAME|LOCAL_DATE|   TRX|
+---------------+----------+------+
|         FLIGHT|2019-01-03| 14410|
|          HOTEL|2019-01-03| 18927|
|         FLIGHT|2019-01-04| 15430|
|          HOTEL|2019-01-04| 19198|
+---------------+----------+------+
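It's built the same way as trx (again a sketch with stand-in values):
# Simplified stand-in for the incremental extract
trxup = spark.createDataFrame(
    [("FLIGHT", "2019-01-03", 14410),
     ("HOTEL", "2019-01-03", 18927),
     ("FLIGHT", "2019-01-04", 15430),
     ("HOTEL", "2019-01-04", 19198)],
    ["PRODUCT_LN_NAME", "LOCAL_DATE", "TRX"])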
I try to insert trxup into HIVETABLE_TRX as follows:
trxup.write \
    .insertInto("HIVETABLE_TRX", overwrite=True)
My understanding was that this would overwrite the one row common to both trxup and HIVETABLE_TRX and append the remaining ones.
# Have a look at HIVETABLE_TRX after the basic insertInto
trxchk2 = spark.sql("""select * from HIVETABLE_TRX""")
trxchk2.show()
+----+---------------+----------+
| TRX|PRODUCT_LN_NAME|LOCAL_DATE|
+----+---------------+----------+
|null|     2019-01-03|     14410|
|null|     2019-01-03|     18927|
|null|     2019-01-04|     15430|
|null|     2019-01-04|     19198|
+----+---------------+----------+
As you can see, it fails to align the columns by name (insertInto matches columns by position, not by name) and overwrites all existing partitions in HIVETABLE_TRX.
So:
1. How do I make sure the columns are aligned for the insertInto? This is the best I could come up with and, although successful, it doesn't feel like that's how it should be done:
colList = spark.sql("""select * from HIVETABLE_TRX""").columns
trxup.selectExpr(colList) \
    .write \
    .insertInto("HIVETABLE_TRX")
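A slightly leaner variant I've considered pulls the column order from the catalog via spark.table instead of running a select (same idea, just a sketch):
# Reorder trxup to the table's column order; insertInto matches by position
trxup.select(*spark.table("HIVETABLE_TRX").columns) \
    .write \
    .insertInto("HIVETABLE_TRX")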
2. Can I insert the second df (trxup) into the partitioned Hive table (HIVETABLE_TRX), just adding/overwriting the appropriate partitions?
Other things I've tried after much Googling, Stack Overflow and soul searching:
Added these options to the session configuration:
hive.exec.dynamic.partition = true
hive.exec.dynamic.partition.mode = nonstrict
spark.sql.sources.partitionOverwriteMode = dynamic
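For reference, this is how I apply them (a sketch against an existing Hive-enabled session; my understanding is that spark.sql.sources.partitionOverwriteMode=dynamic, available from Spark 2.3, is what should make insertInto(..., overwrite=True) replace only the partitions present in trxup):
# The first two are Hive settings, so they go through SQL;
# the third is a Spark conf and can be set at runtime
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Columns reordered to the table layout, then a dynamic-partition overwrite
trxup.select(*spark.table("HIVETABLE_TRX").columns) \
    .write \
    .insertInto("HIVETABLE_TRX", overwrite=True)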
Tried to partitionBy trxup on the insertInto:
trxup.write \
    .partitionBy("PRODUCT_LN_NAME", "LOCAL_DATE") \
    .insertInto("HIVETABLE_TRX", overwrite=True)
AnalysisException: u"insertInto() can't be used together with partitionBy(). Partition columns have already be defined for the table. It is not necessary to use partitionBy().;"
Removed overwrite=True from the insertInto, which actually did what I expected at this point, if not what I want:
+------+---------------+----------+
|   TRX|PRODUCT_LN_NAME|LOCAL_DATE|
+------+---------------+----------+
| 14298|          HOTEL|2019-01-01|
| 19020|          HOTEL|2019-01-02|
| 18927|          HOTEL|2019-01-03|
|  null|     2019-01-03|     14410|
|  null|     2019-01-03|     18927|
|  null|     2019-01-04|     15430|
|  null|     2019-01-04|     19198|
+------+---------------+----------+
I realise that I could convert trxup to a partitioned Hive table (HIVETABLE_TRXUP) and then merge the two, but that feels like a suboptimal way to do it; it sort of defeats the purpose of partitioning the table, no?
trxjoined = spark.sql("""
    select coalesce(tu.PRODUCT_LN_NAME, t.PRODUCT_LN_NAME) as PRODUCT_LN_NAME,
           coalesce(tu.LOCAL_DATE, t.LOCAL_DATE) as LOCAL_DATE,
           coalesce(tu.TRX, t.TRX) as TRX
    from HIVETABLE_TRX t full outer join HIVETABLE_TRXUP tu
    on t.PRODUCT_LN_NAME = tu.PRODUCT_LN_NAME and t.LOCAL_DATE = tu.LOCAL_DATE""")
trxjoined.persist().count()  # materialise before dropping the source tables
spark.sql("""drop table if exists HIVETABLE_TRX""")
spark.sql("""drop table if exists HIVETABLE_TRXUP""")
trxjoined.write \
    .partitionBy("PRODUCT_LN_NAME", "LOCAL_DATE") \
    .saveAsTable("HIVETABLE_TRX", mode='overwrite')