
Given a PySpark dataframe, I want to split it into two dataframes based on two columns: one where the combination of ColA and ColB is unique, and one where it is non-unique.

So for example:

ColA | ColB | ColCDoesntMatter
1    | 2    | 4
1    | 3    | 3
2    | 2    | 2
1    | 2    | 6
2    | 2    | 8
2    | 3    | 9

The unique A/B pairings are:

1-3
2-3

The nonunique A/B pairings are:

1-2
2-2

So one dataframe would have all rows with the unique AB values and the other would contain the nonunique AB values. ColC doesn't matter for the filter, but needs to be retained. How can I run this filter?

Edit: Note that I can't use dropDuplicates because it's really the combination of A and B that needs to be unique, not merely A or B.

nao
  • Use `groupBy` (or `Window`) and `count`. Take a look at [this post](https://stackoverflow.com/questions/48554619/count-number-of-duplicate-rows-in-sparksql/48554666#48554666). – pault Feb 21 '19 at 20:31
  • Possible duplicate of [Keep only duplicates from a DataFrame regarding some field](https://stackoverflow.com/questions/49559994/keep-only-duplicates-from-a-dataframe-regarding-some-field) – pault Feb 21 '19 at 20:40

1 Answer

# count rows per (ColA, ColB) pair with a window function
dfWithRowCount = df.selectExpr("*", "count(*) over (partition by ColA, ColB) as nrows")

# filter
uniqueAB = dfWithRowCount.filter(dfWithRowCount.nrows == 1).drop('nrows')
nonUniqueAB = dfWithRowCount.filter(dfWithRowCount.nrows > 1).drop('nrows')

nonUniqueAB.show()
+----+----+----------------+
|ColA|ColB|ColCDoesntMatter|
+----+----+----------------+
|   2|   2|               2|
|   2|   2|               8|
|   1|   2|               4|
|   1|   2|               6|
+----+----+----------------+

uniqueAB.show()
+----+----+----------------+
|ColA|ColB|ColCDoesntMatter|
+----+----+----------------+
|   1|   3|               3|
|   2|   3|               9|
+----+----+----------------+
Psidom