
I am seeing a situation where, when I save a PySpark dataframe to a Hive table partitioned on multiple columns, it overwrites the data in the subpartition too. Or maybe I am wrong to assume it is a subpartition.

I want to treat the column 'month' as a subpartition, so that I see 4 records in the Hive table instead of 2 when I save df2 to the same table.

mode='append' will work. But if year and month are the same, I want the data to be overwritten. Is there a way to do this when saving a PySpark dataframe?

>>> df1 = spark.sql('select * from test_input')
>>> df1.show()
+---+---+----+-----+
| f1| f2|year|month|
+---+---+----+-----+
|  a|  b|2018|   01|
|  c|  d|2018|   01|
+---+---+----+-----+

>>> df1.write.saveAsTable('test_output',mode='overwrite',partitionBy=('year','month'))
>>> spark.sql('select * from test_output').show()
+---+---+----+-----+
| f1| f2|year|month|
+---+---+----+-----+
|  a|  b|2018|   01|
|  c|  d|2018|   01|
+---+---+----+-----+

>>> df2 = spark.sql('select * from test_input')
>>> df2.show()
+---+---+----+-----+
| f1| f2|year|month|
+---+---+----+-----+
|  a|  b|2018|   02|
|  c|  d|2018|   02|
+---+---+----+-----+

>>> df2.write.saveAsTable('test_output',mode='overwrite',partitionBy=('year','month'))
>>> spark.sql('select * from test_output').show()
+---+---+----+-----+
| f1| f2|year|month|
+---+---+----+-----+
|  a|  b|2018|   02|
|  c|  d|2018|   02|
+---+---+----+-----+

1 Answer


It seems like you misunderstand the concept of partitioning.

This is not the window-function partitioning you would come across in a SQL statement; it refers instead to the way data is stored and referenced in memory or on a file system. Here's a helpful introduction.

Changing the partitioning of a Spark dataframe will never alter the number of rows in that dataframe.
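
As an illustration of what file-system partitioning means here: with partitionBy=('year','month'), each distinct (year, month) pair becomes its own subdirectory under the table's storage location. A sketch of how to inspect this (the layout shown is an assumption about typical output; the exact path and part-file names will differ on your system):

>>> # The 'Location' row of this output gives the table's directory
>>> spark.sql('describe formatted test_output').show(truncate=False)
>>> # Under that directory, each partition is a subdirectory, e.g.:
>>> #   .../test_output/year=2018/month=01/part-...
>>> #   .../test_output/year=2018/month=02/part-...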

Kirk Broadhurst
  • Thanks for the reply. I am not asking about the number of rows in the dataframe. I was asking why Spark would treat year=2018,month=01 and year=2018,month=02 as the same partition. These are partitions in a Hive table, not the dataframe. – Prasanna Saraswathi Krishnan Mar 01 '18 at 23:30
  • You should show the files on the filesystem, not the results of a spark query. The query results won't change. – Kirk Broadhurst Mar 01 '18 at 23:33
  • Not sure I understand. I am querying the Hive table after inserting records into it. How can the query result not change when additional rows are added to the Hive table? That is my question anyway. When I write some data into partition (col1=a, col2=b) and I write some other data into partition (col1=a, col2=c), since the col2 value is changing, shouldn't they be considered two different partitions? – Prasanna Saraswathi Krishnan Mar 02 '18 at 16:05
  • You are overwriting the table (`mode='overwrite'`). Maybe that's where the confusion comes from. Granted, the file structure will be different, given the partition option, but the `overwrite` option means the entire table will be replaced. – Kirk Broadhurst Mar 02 '18 at 16:10
  • So the `mode='overwrite'` option will always overwrite the entire table and not just the partition mentioned? Because a Hive query like `INSERT OVERWRITE TABLE ... PARTITION (...)` will only overwrite the partition mentioned and not the entire table. Is there a way to achieve the same behavior from a data frame? – Prasanna Saraswathi Krishnan Mar 02 '18 at 17:31
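
For reference on that last question: Spark 2.3 added a dynamic partition-overwrite mode that mimics Hive's INSERT OVERWRITE ... PARTITION behavior. A minimal sketch, assuming Spark 2.3+ and that test_output already exists:

>>> # Assumption: Spark 2.3+. In dynamic mode, overwrite replaces only the
>>> # partitions present in the incoming dataframe, not the whole table.
>>> spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
>>> # insertInto matches columns by position (not name), so df2's columns
>>> # must be in the table's order, with year and month last.
>>> df2.write.insertInto('test_output', overwrite=True)

With this setting, saving df2 after df1 would leave year=2018/month=01 in place and replace only year=2018/month=02, giving the 4 rows asked about in the question.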