I have a single transformation whose sole purpose is to drop duplicates. With PySpark 2.x, the output has some duplicates removed, but not all. With the Java API of Apache Spark 2.x, the output is as expected, with all duplicates removed.
I am currently running Spark on YARN. My dataset is roughly 125 million rows by 200 columns. Certain columns are expected to contain null values. For my use case, I do indeed have pure duplicates (the reasons for this are out of scope). So far, I have tried the following (a PySpark sketch follows the list):
1. `dropDuplicates(df.columns)` / `dropDuplicates()`, PySpark -> drops some but not all duplicates
2. `distinct()`, PySpark -> drops some but not all duplicates, with a different row count than method 1
3. `dropDuplicates([primary_key_I_created])`, PySpark -> works
4. `dropDuplicates(dataset.columns())`, Apache Spark Java API -> works
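For reference, here is a minimal PySpark sketch of attempts 1-3; the input path, the DataFrame name `df`, and the `primary_key` column name are placeholders, not my actual identifiers:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedupe-test").getOrCreate()

# Hypothetical input path; the real dataset is ~125 million rows x 200 columns.
df = spark.read.parquet("/path/to/input")

deduped_1 = df.dropDuplicates(df.columns)       # attempt 1: drops some but not all duplicates
deduped_2 = df.distinct()                       # attempt 2: drops some but not all duplicates
deduped_3 = df.dropDuplicates(["primary_key"])  # attempt 3: works ("primary_key" stands in for the key I created)
```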
I inspected the physical plans (via explain(); a snippet follows the plan below), and methods 1 and 4 produce identical plans. They are roughly as follows:
+- HashAggregate(keys=[column_1, column_2, ... 198 more fields], functions=[], output=[column_1, column_2, ... 198 more fields])
   +- Exchange hashpartitioning(column_1, column_2, ... 198 more fields)
      +- HashAggregate(keys=[column_1, column_2, ... 198 more fields], functions=[], output=[column_1, column_2, ... 198 more fields])
         +- FileScan parquet ...
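For completeness, this is roughly how I pulled the PySpark plan shown above (`df` is the same DataFrame as in the earlier sketch); the Java side is the equivalent explain() call on the Dataset:

```python
# Print the physical plan for the full-column dedupe (PySpark side).
df.dropDuplicates(df.columns).explain()
```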
Below is an example of a pair of duplicate rows that did not get dropped. I confirmed there are no hidden whitespace differences by running dropDuplicates() on JUST those two rows; that run behaved as expected and returned a single row (see the sketch after the sample rows).
column_1 | column_2 | column_3 | column_4 | column_5 | ... | column_200
bob      | jones    | null     | null     | 200.00   | ... | 30
bob      | jones    | null     | null     | 200.00   | ... | 30
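And roughly how I ran the two-row check; the filter values are taken from the sample rows above for illustration and are not necessarily how I actually isolated the pair:

```python
# Isolate the suspect pair and deduplicate just those two rows.
pair = df.filter((df["column_1"] == "bob") & (df["column_2"] == "jones"))
pair.dropDuplicates().count()  # returns 1 as expected, so no hidden whitespace differences
```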
Is there something happening under the hood that would cause PySpark to fail where the Java API succeeds (apologies for my vague jargon here)? Thanks in advance.