
I have a dataset with 3 columns (T, S, and A). I need to filter the records in such a way that the T and S columns have a one-to-one match.

E.g., if T1 is matched with S1, then the T2 row with the S1 value should be filtered out.

I am able to achieve this with two window functions, but the second window function causes a lot of shuffling in the cluster (the shuffle for the first window I can control with df.sort/repartition).
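For context, this is roughly what I mean by controlling the first window's shuffle (a minimal sketch on the df defined below; I'm assuming the hash partitioning from repartition('T') matches what w1 requires, so Spark skips a second exchange):

# One up-front shuffle and in-partition sort that w1 can reuse.
df = df.repartition('T').sortWithinPartitions('T', 'A')

Here is a reproducible example: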

from pyspark.sql import functions as f
from pyspark.sql.window import Window as w

l = [('T1', 'S1', 10), ('T2', 'S1', 10), ('T1', 'S2', 10), ('T2', 'S2', 10)]
df = spark.createDataFrame(l).toDF('T', 'S', 'A')
df.show()
+---+---+---+
|  T|  S|  A|
+---+---+---+
| T1| S1| 10|
| T2| S1| 10|
| T1| S2| 10|
| T2| S2| 10|
+---+---+---+

w1 = w.partitionBy('T').orderBy('A')       # rank rows within each T
w2 = w.partitionBy('S').orderBy('A', 'T')  # rank rows within each S

df.withColumn('r1', f.row_number().over(w1)) \
  .withColumn('r2', f.row_number().over(w2)) \
  .show()

This gives the result below, so I can filter the records where r1 == r2 and get the expected output.

+---+---+---+---+---+                                                           
|  T|  S|  A| r1| r2|
+---+---+---+---+---+
| T1| S2| 10|  2|  1|
| T2| S2| 10|  2|  2|
| T1| S1| 10|  1|  1|
| T2| S1| 10|  1|  2|
+---+---+---+---+---+

Expected result:

+---+---+---+---+---+                                                           
|  T|  S|  A| r1| r2|
+---+---+---+---+---+
| T2| S2| 10|  2|  2|
| T1| S1| 10|  1|  1|
+---+---+---+---+---+
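Spelled out, the filter that produces this (df_matched is just an illustrative name):

df_matched = (df
    .withColumn('r1', f.row_number().over(w1))
    .withColumn('r2', f.row_number().over(w2))
    .filter(f.col('r1') == f.col('r2')))  # keep only the one-to-one matches
df_matched.show()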
  • Can you expand a bit on the rules here? If T1 is matched with S1 then T2 with S1 is removed. What about T1 with S2, is that match allowed? (And if so, is T2 with S2 prioritized?) In other words, can the same T be matched with multiple Ss? Can the same S be matched with multiple Ts? – Shaido Aug 07 '19 at 05:48
  • Hi @Shaido, T cannot be matched with multiple Ss. Priority-wise, any match is fine; I mean T1-S1, T2-S2 or T1-S2, T2-S1 can match. The rule is that one T should match with one and only one S, and vice versa: one S to one T. – SMaZ Aug 07 '19 at 12:38
  • No, I don't need the `r1` and `r2` columns. Those are just to identify the one-to-one match. – SMaZ Aug 07 '19 at 14:01
  • 1
  • @SmitenShah Your method seems to assume that if you have N Tx, then you will also have N Sy and N*N rows, so that all the couples (Tx, Sy) are present. Wondering because if you remove the row T2, S1 for example, then your method will not keep the row T2, S2 even though it is a unique match? – Ben.T Aug 07 '19 at 20:21
  • @Ben.T Yes, if we have N Ts then we will have N Ss too. Here we don't need unique matches on the T+S combination. The rule is: if S1 is paired with T1, then it should never pair with any other Ts. In other words, the valid pairs can be T1-S1, T2-S2 OR T1-S2, T2-S1 only. – SMaZ Aug 07 '19 at 22:12
  • @SmitenShah OK, then maybe mapping the Ss could be a way. Collect all the Ss, create a dictionary with something like `{key[0]: val+1 for val, key in enumerate(df.filter(f.col('T') == 'T1').select('S').collect())}`, and then use [this](https://stackoverflow.com/a/42983199/9274732) with this dictionary to create the r2 column (a sketch of this idea appears after these comments). – Ben.T Aug 07 '19 at 22:57
  • @SmitenShah Sorry, I meant map the Ts to create r2, but you get the idea? – Ben.T Aug 07 '19 at 23:22
  • 1
  • Thanks @Ben.T, yes, got it, but I have a huge dataset (~50 million records) and collecting a column would take a lot of resources. Appreciate your response. – SMaZ Aug 08 '19 at 03:30
  • Can't you apply an inner join of the dataframe with itself, specifying the two columns as keys? – ndricca Aug 08 '19 at 14:10
  • @ndricca - I'm afraid not; it's a huge dataset. – SMaZ Aug 09 '19 at 01:22
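A minimal sketch of the mapping idea from the comments above, assuming the set of distinct Ts is small enough to collect to the driver (mapping and mapping_expr are illustrative names):

from itertools import chain

# Driver-side dict T -> rank (1-based), built once.
mapping = {row['T']: i + 1
           for i, row in enumerate(df.select('T').distinct().orderBy('T').collect())}

# Turn the dict into a map expression so r2 comes from a lookup
# instead of a second window function (and its shuffle).
mapping_expr = f.create_map(*[f.lit(x) for x in chain(*mapping.items())])
df.withColumn('r2', mapping_expr[f.col('T')]).show()

Whether this pays off depends on the number of distinct Ts; as noted in the comments, collecting a full column at ~50 million records is costly.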

0 Answers