I am stuck on what seems to be a simple problem, but I can't see what I'm doing wrong, or why .dropDuplicates() is not behaving as expected.
A variable I use:
print type(pk)
<type 'tuple'>
print pk
('column1', 'column4')
I have a dataframe:
df_new.show()
+-------+----------------+---------+-------+-------------+-----------------+
|column1| column2| column3|column4|dml_operation| ingest_date|
+-------+----------------+---------+-------+-------------+-----------------+
| data6| z| update| z| 2|20190308190720942|
| data7| y| update| y| 2|20190308190720942|
| data8| x| update| x| 2|20190308190720942|
| data9| f| f| f| 0|20190308190720942|
| data1| d| b| c| 2|20190308190720942|
| data4| f| c| b| 1|20190308190720942|
| data3| a| b| b| 0|20190308190720942|
| date6|this should drop|more text| z| 2|20190308190720942|
| data8|this should drop| here| x| 1|20190308190720942|
| date6|this should drop|more text| z| 0|20190308190720942|
+-------+----------------+---------+-------+-------------+-----------------+
Then I perform:
print_df = df_new.dropDuplicates(pk)
print_df.show()
+-------+----------------+---------+-------+-------------+-----------------+
|column1| column2| column3|column4|dml_operation| ingest_date|
+-------+----------------+---------+-------+-------------+-----------------+
| data3| a| b| b| 0|20190308190720942|
| date6|this should drop|more text| z| 2|20190308190720942|
| data7| y| update| y| 2|20190308190720942|
| data8| x| update| x| 2|20190308190720942|
| data9| f| f| f| 0|20190308190720942|
| data4| f| c| b| 1|20190308190720942|
| data6| z| update| z| 2|20190308190720942|
| data1| d| b| c| 2|20190308190720942|
+-------+----------------+---------+-------+-------------+-----------------+
As you can see, the function works as expected for the rows containing "data8 and x", but it only drops one of the two duplicates for "data6 and z". This is what I can't figure out.
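A minimal sketch for double-checking which key pairs are exact duplicates, using plain Python on the (column1, column4) values copied from the df_new.show() output above. The names rows, key_counts and repeated are made up for illustration. On the real dataframe, something like df_new.groupBy(*pk).count() should give the same per-key counts.

```python
from collections import Counter

# (column1, column4) pairs copied by hand from the df_new.show() output.
rows = [
    ("data6", "z"),
    ("data7", "y"),
    ("data8", "x"),
    ("data9", "f"),
    ("data1", "c"),
    ("data4", "b"),
    ("data3", "b"),
    ("date6", "z"),
    ("data8", "x"),
    ("date6", "z"),
]

# Count how often each exact key pair occurs (string comparison is exact).
key_counts = Counter(rows)

# Only pairs that occur more than once can be collapsed by dropDuplicates(pk).
repeated = {key: n for key, n in key_counts.items() if n > 1}

for key, n in sorted(repeated.items()):
    print(key, n)
print("distinct keys:", len(key_counts))
```

Any pair that looks duplicated in the table but shows up with a count of 1 here differs from its neighbours in at least one character.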
Some things I have already ruled out:
- column types
- wrong type of pk being fed in
- manually passed in column names to double check
The only other thing I can think of is that the data is being partitioned and to my knowledge .dropDuplicates() only keeps the first occurrence in each partition (see here: spark dataframe drop duplicates and keep first). This seems unlikely in my case as my test data is small.
I'm out of ideas. Does anyone see why this behavior is happening?