
I have a single transformation whose sole purpose is to drop duplicates. When I run it with PySpark 2.x, the output still contains duplicates (some are removed, but not all). When I run it with the Apache Spark Java API 2.x, the output is as expected, with all duplicates removed.

I am currently running Spark on YARN. My dataset is roughly 125 million rows by 200 columns. Certain columns are expected to contain null values. For my use case, I do indeed have pure duplicates (the reasons for this are out of scope). So far, I have tried the following (a minimal code sketch follows the list):

  1. dropDuplicates(df.columns) / dropDuplicates(), PySpark -> drops some but not all duplicates
  2. distinct(), PySpark -> drops some but not all duplicates, with a different row count than attempt 1
  3. dropDuplicates([primary_key_I_created]), PySpark -> works
  4. dropDuplicates(dataset.columns()), Apache Spark Java -> works
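
For concreteness, here is roughly what those attempts look like. This is a minimal sketch rather than my exact job: the input path is a placeholder, and the surrogate key in attempt 3 is shown as a hash over all columns, which is one way to build such a key.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedup-check").getOrCreate()
df = spark.read.parquet("/path/to/dataset")  # placeholder path; ~125 million rows x 200 columns

# 1. Explicit column list (the no-arg variant behaves the same) -> drops some but not all duplicates
deduped_1 = df.dropDuplicates(df.columns)

# 2. distinct() -> also drops some but not all duplicates, with a different count than attempt 1
deduped_2 = df.distinct()

# 3. Surrogate primary key over all columns -> works.
#    coalesce() replaces nulls with a sentinel so they still contribute to the key;
#    this is an illustrative construction, not necessarily the exact key I used.
key = F.sha2(
    F.concat_ws("||", *[F.coalesce(F.col(c).cast("string"), F.lit("<NULL>")) for c in df.columns]),
    256,
)
deduped_3 = df.withColumn("pk", key).dropDuplicates(["pk"]).drop("pk")

print(deduped_1.count(), deduped_2.count(), deduped_3.count())
```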

I inspected the physical plans, and both method 1 and method 4 produce identical plans. They are roughly as follows:

+- HashAggregate(keys=[column_1, column_2, ... 198 more fields], functions=[], output=[column_1, column_2, ... 198 more fields])
  +- Exchange hashpartitioning(column_1, column_2, ... 198 more fields)
    +- HashAggregate(keys=[column_1, column_2, ... 198 more fields], functions=[], output=[column_1, column_2, ... 198 more fields])
      +- FileScan parquet ...
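
For reference, this kind of plan can be printed with explain(); the PySpark side of the comparison is simply:

```python
# Physical plan for method 1; the Java side,
# dataset.dropDuplicates(dataset.columns()).explain(), printed an identical plan.
df.dropDuplicates(df.columns).explain()
```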

Below is an example of a pair of duplicate rows that did not get dropped. I confirmed there are no stray whitespace differences by running dropDuplicates() on JUST those two rows (see the sketch after the table); that run worked as expected and returned a single row.

column_1 | column_2 | column_3 | column_4 | column_5 | ... | column_200
bob      | jones    | **null** | **null** | 200.00   | ... | 30
bob      | jones    | **null** | **null** | 200.00   | ... | 30
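
That two-row check was essentially the following. The literal values are stand-ins mirroring the table above, and an explicit schema is supplied so the all-null columns don't break type inference:

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

# Stand-in schema for the handful of columns shown above.
schema = StructType([
    StructField("column_1", StringType()),
    StructField("column_2", StringType()),
    StructField("column_3", StringType()),
    StructField("column_4", StringType()),
    StructField("column_5", DoubleType()),
    StructField("column_200", IntegerType()),
])

pair = spark.createDataFrame([
    ("bob", "jones", None, None, 200.00, 30),
    ("bob", "jones", None, None, 200.00, 30),
], schema)

print(pair.dropDuplicates().count())  # returns 1, as expected
```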

Is there something happening under the hood that would cause PySpark to fail, but Spark Java to succeed (apologies for my vague jargon here)? Thanks in advance.

Jesse
  • Does `df.dropDuplicates()` (i.e. without parameter) also work? – cronoik Oct 26 '19 at 20:54
  • Neither df.dropDuplicates() nor df.distinct() works; they both drop some but not all duplicates. – Jesse Oct 27 '19 at 00:27
  • What do you mean by "all duplicates"? – thebluephantom Oct 27 '19 at 09:31
  • In addition to thebluephantom's question: can you please give us an example of rows which weren't dropped? – cronoik Oct 27 '19 at 10:36
  • absolutely, I updated the question – Jesse Oct 27 '19 at 14:45
  • The [source](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.dropDuplicates) is calling `jdf = self._jdf.dropDuplicates()` directly without any modification. How do you determine that one of the listed ways doesn't contain duplicates while another does? By calling df.count()? – cronoik Oct 28 '19 at 20:35
  • It may not be related to this problem, but here's a good case study on dropping duplicates: https://stackoverflow.com/questions/38687212/spark-dataframe-drop-duplicates-and-keep-first/54738843# – vikrant rana Nov 11 '19 at 03:34
  • *A rather unsatisfying answer:* I needed to upgrade to a later version of PySpark, and that seemed to resolve the issue. But to answer the questions above: yes, I was calling count to verify row counts, because I was seeing weird data issues downstream. – Jesse Nov 12 '19 at 22:13
  • @Jesse, would you like to share your findings as an answer here? Thanks. – vikrant rana Nov 22 '19 at 06:10

0 Answers