I have a dataset with 3 columns (T, S, and A). I need to filter records in such a way that the T and S columns have a one-to-one match, e.g. if T1 is matched with S1, then the T2 row with the S1 value should be filtered out.
I am able to achieve this with two window functions, but the second window function causes a lot of shuffling in the cluster (the first window's shuffle I can control with df.sort/repartition, as shown at the end). Is there a way to get the same result with less shuffling?
from pyspark.sql import functions as f
from pyspark.sql import Window as w

l = [('T1', 'S1', 10), ('T2', 'S1', 10), ('T1', 'S2', 10), ('T2', 'S2', 10)]
df = spark.createDataFrame(l).toDF('T', 'S', 'A')
df.show()
+---+---+---+
| T| S| A|
+---+---+---+
| T1| S1| 10|
| T2| S1| 10|
| T1| S2| 10|
| T2| S2| 10|
+---+---+---+
w1 = w.partitionBy('T').orderBy('A')
w2 = w.partitionBy('S').orderBy('A', 'T')

# r1 ranks rows within each T partition, r2 within each S partition
df = df.withColumn('r1', f.row_number().over(w1)) \
    .withColumn('r2', f.row_number().over(w2))
df.show()
This gives the result below, so I can filter the records where r1 == r2 and get the expected output:
+---+---+---+---+---+
| T| S| A| r1| r2|
+---+---+---+---+---+
| T1| S2| 10| 2| 1|
| T2| S2| 10| 2| 2|
| T1| S1| 10| 1| 1|
| T2| S1| 10| 1| 2|
+---+---+---+---+---+
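
The filter step itself, using the r1 == r2 condition from above:

df.filter(f.col('r1') == f.col('r2')).show()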
Expected result:
+---+---+---+---+---+
| T| S| A| r1| r2|
+---+---+---+---+---+
| T2| S2| 10| 2| 2|
| T1| S1| 10| 1| 1|
+---+---+---+---+---+
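
For reference, this is roughly what I mean by controlling the first window's shuffle with df.sort/repartition: repartitioning by T and pre-sorting within partitions by (T, A), so w1's required distribution and ordering are already satisfied. This is a sketch; whether the exchange is actually elided depends on the physical plan Spark produces.

# Applied before the withColumn step: pre-partition by T and sort within
# partitions by (T, A) so w1 = partitionBy('T').orderBy('A') can reuse this
# layout instead of introducing its own exchange and sort.
df = df.repartition('T').sortWithinPartitions('T', 'A')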