I am trying to get the pairwise combinations of the rows of a DataFrame. For example, my input data looks like this:
+-------+---+
|kc_pair|h_v|
+-------+---+
| [a, 1]|123|
+-------+---+
| [b, 2]|123|
+-------+---+
| [c, 3]|123|
+-------+---+
| [b, 2]|234|
+-------+---+
| [c, 3]|234|
+-------+---+
The output DataFrame should contain every 2-element combination of the kc_pair values within each h_v group, like this:
+---------------+---+
|       kc_pairs|h_v|
+---------------+---+
| [a, 1], [b, 2]|123|
+---------------+---+
| [a, 1], [c, 3]|123|
+---------------+---+
| [b, 2], [c, 3]|123|
+---------------+---+
| [b, 2], [c, 3]|234|
+---------------+---+
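To make the expected result concrete, here is a minimal plain-Python sketch (no Spark involved) of the transformation I am after. The `rows` list and the `pairs_per_group` helper are hypothetical names used only for illustration, and I am assuming each kc_pair can be treated as a simple tuple.

```python
from itertools import combinations, groupby

# Hypothetical in-memory copy of the input rows: (kc_pair, h_v).
rows = [
    (("a", 1), 123),
    (("b", 2), 123),
    (("c", 3), 123),
    (("b", 2), 234),
    (("c", 3), 234),
]

def pairs_per_group(rows):
    """Return (pair_of_kc_pairs, h_v) for every 2-combination within each h_v group."""
    out = []
    # groupby only merges adjacent items, so sort by h_v first (sorted is stable).
    ordered = sorted(rows, key=lambda r: r[1])
    for h_v, group in groupby(ordered, key=lambda r: r[1]):
        kc_pairs = [kc for kc, _ in group]
        for combo in combinations(kc_pairs, 2):
            out.append((combo, h_v))
    return out

result = pairs_per_group(rows)
for combo, h_v in result:
    print(combo, h_v)
```

This yields three pairs for h_v 123 and one pair for h_v 234, matching the table above.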
I've tried applying itertools.combinations as a UDF. First, I aggregated the kc_pair values that share the same h_v value into a list, something like this:
+----------------------+---+
|              kc_pairs|h_v|
+----------------------+---+
|[a, 1], [b, 2], [c, 3]|123|
+----------------------+---+
|        [b, 2], [c, 3]|234|
+----------------------+---+
Then I applied this UDF to the kc_pairs column:
F.udf(lambda x: list(itertools.combinations(x, 2)),
returnType=ArrayType(ArrayType(StringType())))
The critical issue is that this approach cannot cope with data skew: if a single cell in kc_pairs contains more than 10,000 elements, the UDF has to build all of that cell's pairs within one task, and the worker may fail it. Any ideas on how to solve this problem?
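To put a rough number on the skew, the count of 2-element combinations grows quadratically with the size of the list. The sketch below is plain Python, not Spark, and `pair_count` is a hypothetical helper name; it only shows how many pairs a single cell of a given size would expand into.

```python
from math import comb

def pair_count(n):
    """Number of 2-element combinations of a list with n elements: n * (n - 1) / 2."""
    return comb(n, 2)

for n in (3, 100, 10_000):
    print(n, pair_count(n))
# 3 3
# 100 4950
# 10000 49995000
```

So a single h_v group with 10,000 elements turns into almost 50 million pairs that one UDF call has to produce.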