
I am using PySpark for this:

When applying dropDuplicates, I want to remove both occurrences of a matched row, not keep one of them.

The dataset:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   1|   A|
|   1|   1|   A|
|   2|   1|   C|
|   1|   2|   D|
|   3|   5|   E|
|   3|   5|   E|
|   4|   3|   G|
+----+----+----+

What I need:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   2|   1|   C|
|   1|   2|   D|
|   4|   3|   G|
+----+----+----+

I have tried to use unique, but unique applies across all of the columns.

diff_df = source_df.union(target_df).dropDuplicates(columns_list)

sitaram chikkala
    Possible duplicate of [Remove all rows that are duplicates with respect to some rows](https://stackoverflow.com/questions/51522719/remove-all-rows-that-are-duplicates-with-respect-to-some-rows) and some good information in this related post: [Keep only duplicates from a DataFrame regarding some field](https://stackoverflow.com/questions/49559994/keep-only-duplicates-from-a-dataframe-regarding-some-field) – pault Nov 14 '19 at 15:05

3 Answers


This is not an elegant approach, but it gives an idea:

>>> from pyspark.sql.functions import count
>>> df = spark.createDataFrame([(1,25),(1,20),(1,20),(2,26)],['id','age'])

>>> df.show()
+---+---+
| id|age|
+---+---+
|  1| 25|
|  1| 20|
|  1| 20|
|  2| 26|
+---+---+

>>> df.groupBy([c for c in df.columns]).agg(count('id').alias('c')).show()
+---+---+---+
| id|age|  c|
+---+---+---+
|  1| 25|  1|
|  1| 20|  2|
|  2| 26|  1|
+---+---+---+

>>> df.groupBy([c for c in df.columns]).agg(count('id').alias('c')).filter('c=1').show()
+---+---+---+
| id|age|  c|
+---+---+---+
|  1| 25|  1|
|  2| 26|  1|
+---+---+---+
Bala

You can use a window function to avoid aggregating multiple times. Here is the code snippet:

from pyspark.sql.window import Window
import pyspark.sql.functions as f
df = spark.createDataFrame([(1,25),(1,20),(1,20),(2,26)],['id','age'])
w = Window.partitionBy("id", "age").orderBy("id", "age")
df.withColumn("test", f.count("*").over(w)).filter(f.col("test") == 1).show()
+---+---+----+
| id|age|test|
+---+---+----+
|  1| 25|   1|
|  2| 26|   1|
+---+---+----+

Let me know if you have any doubts about this.

Mahesh Gupta

I think this can be achieved with a full outer join on the list of key columns, then filtering out the matched rows.

diff_df = source_df.join(target_df, columns_list, how='full')

filter:

diff_df.filter(source_df.col3.isNull() | target_df.col3.isNull())
sitaram chikkala