I am using PySpark for this.
When applying dropDuplicates, I want to remove both occurrences of a matched row, not keep one of them.
The dataset:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   1|   A|
|   1|   1|   A|
|   2|   1|   C|
|   1|   2|   D|
|   3|   5|   E|
|   3|   5|   E|
|   4|   3|   G|
+----+----+----+
What I need:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   2|   1|   C|
|   1|   2|   D|
|   4|   3|   G|
+----+----+----+
I have tried to use unique, but it applies to all of the columns. My current code:
diff_df = source_df.union(target_df).dropDuplicates(columns_list)