I have a PySpark DataFrame:

+---+----+----+
|key|col1|col2|
+---+----+----+
|a  |5.4 |   1|
|a  |6.5 |   2|
|b  |7.5 |   3|
|b  |4.5 |   4|
|c  |6.4 |   1|
+---+----+----+
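
For reference, a minimal sketch of how this example DataFrame can be built (assuming an active SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", 5.4, 1), ("a", 6.5, 2), ("b", 7.5, 3), ("b", 4.5, 4), ("c", 6.4, 1)],
    ["key", "col1", "col2"],
)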

I want to do a Cartesian product, not between individual rows, but between the groups produced by groupby("key"), and then apply some Python function to each pair. That is, group by "key" and then take the Cartesian product (crossJoin) of the resulting groups (a with b, a with c, b with c).

The expected output should be a DataFrame with a predefined schema.

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("some_col_1", StringType(), False),
    StructField("some_col_2", StringType(), False)
])

So the custom function should have a signature like:

def custom_func(df_1: pd.DataFrame, df_2: pd.DataFrame) -> pd.DataFrame

or (the arguments can be Spark DataFrames instead of pandas DataFrames):

def custom_func(df_1: DataFrame, df_2: DataFrame) -> DataFrame
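
For illustration only, a minimal placeholder body for the pandas version (the computation here is made up just to show the expected shape; the real logic is more complicated):

import pandas as pd

def custom_func(df_1: pd.DataFrame, df_2: pd.DataFrame) -> pd.DataFrame:
    # Dummy logic: return the pair of keys that were combined.
    # The real calculation over the two groups would go here.
    return pd.DataFrame({
        "some_col_1": [str(df_1["key"].iloc[0])],
        "some_col_2": [str(df_2["key"].iloc[0])],
    })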

I tried doing two groupbys and then using cogroup:

group1 = df.groupby("key")
group2 = df.groupby("key")
res = group1.cogroup(group2).applyInPandas(custom_func, schema)

But that doesn't produce a Cartesian product: cogroup only pairs groups that share the same key, so group a is combined with group a, never with b or c. I tried using crossJoin, but it only works on DataFrames. How can I apply it to GroupedData? Is there any way of doing it?

edit: adding a picture to better explain the question

nogmos
  • What's your expected output? And how are you going to apply the custom function on the two groups? Can you provide more details? – mck May 23 '21 at 08:53
  • @mck This is a dummy example, my custom function is complicated so I didn't include it in the question. From each crossJoin I need to get a DataFrame back (with a predefined schema). Not sure what other details to give ^^ – nogmos May 23 '21 at 09:03
  • please read how to create a [mcve] – mck May 23 '21 at 09:36
  • ok, read it, but still, what information is missing? I answered what the expected output is; how to apply the custom function on the two groups is part of the question. Using cogroup is an option, but any other way can work. – nogmos May 23 '21 at 09:43
  • you did not provide an expected output, nor the function. – mck May 23 '21 at 09:45
  • @mck I edited my post, hope it's more understandable now – nogmos May 23 '21 at 09:52
  • can you just do a cross join and filter the rows where the keys are not equal? – mck May 23 '21 at 10:31
  • I need to run some_func on each "pair" of keys, how can I do it with cross_join and filter? – nogmos May 23 '21 at 10:34
  • just apply the function after cross join and filter – mck May 23 '21 at 10:34
  • sorry, can't see how I can do it with just cross_join. I added a picture to maybe try and better explain the question. If you think it's possible with cross_join, please write it as an answer and I'll accept it if it works :) – nogmos May 23 '21 at 11:14

1 Answer

As a first step, you can do a cross join and keep only the rows where the two keys differ (a.key < b.key ensures each unordered pair of keys appears once):

from pyspark.sql import functions as F

df2 = df.alias('a').join(
    df.alias('b'),
    F.expr('a.key < b.key')
).toDF(
    *[c+'_a' for c in df.columns],
    *[c+'_b' for c in df.columns])

df2.show()
+-----+------+------+-----+------+------+
|key_a|col1_a|col2_a|key_b|col1_b|col2_b|
+-----+------+------+-----+------+------+
|    a|   5.4|     1|    b|   7.5|     3|
|    a|   5.4|     1|    b|   4.5|     4|
|    a|   5.4|     1|    c|   6.4|     1|
|    a|   6.5|     2|    b|   7.5|     3|
|    a|   6.5|     2|    b|   4.5|     4|
|    a|   6.5|     2|    c|   6.4|     1|
|    b|   7.5|     3|    c|   6.4|     1|
|    b|   4.5|     4|    c|   6.4|     1|
+-----+------+------+-----+------+------+

Then you can group by the pair of keys and apply your custom function to each group, e.g.

df2.groupBy('key_a', 'key_b').applyInPandas(...)
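
For illustration, a sketch of how that custom function could look. The name pair_func and its body are hypothetical; it just splits the suffixed columns back into the two original per-key frames and returns the pair of keys:

import pandas as pd

def pair_func(pdf: pd.DataFrame) -> pd.DataFrame:
    # Rebuild the two original groups from the suffixed columns.
    # drop_duplicates() is needed because the cross join repeats each
    # row of one group once per row of the other group.
    df_a = pdf[[c for c in pdf.columns if c.endswith('_a')]] \
        .rename(columns=lambda c: c[:-2]).drop_duplicates()
    df_b = pdf[[c for c in pdf.columns if c.endswith('_b')]] \
        .rename(columns=lambda c: c[:-2]).drop_duplicates()
    # Placeholder result: the pair of keys that were combined.
    return pd.DataFrame({
        "some_col_1": [str(df_a["key"].iloc[0])],
        "some_col_2": [str(df_b["key"].iloc[0])],
    })

res = df2.groupBy('key_a', 'key_b').applyInPandas(pair_func, schema)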
mck
  • The problem is that I don't need it to be done on every row, the function needs to get the whole GroupedData (see the picture I added). – nogmos May 23 '21 at 12:01
  • then you can do a group by on `a.key` and `b.key` and apply your function per group – mck May 23 '21 at 12:26
  • you mean to use now: res = group1.cogroup(group2).applyInPandas(custom_func, schema), right? feels like a somewhat inefficient solution. My dataset in reality is not so small, I have about 300 distinct keys and over 100 columns. Can you think of another solution? I think it's best to group the data and then do the Cartesian product, if possible. – nogmos May 23 '21 at 12:39
  • no, just a simple group by both a.key and b.key and then applyinpandas – mck May 23 '21 at 12:57
  • I did: df2 = df.alias('a').join(df.alias('b'), F.expr('a.key < b.key')) grouped_data = df2.groupby("a.key", "b.key") res = grouped_data.applyInPandas(some_func, schema) but I'm getting "pyspark.sql.utils.AnalysisException: Reference 'key' is ambiguous, could be: a.key, b.key." – nogmos May 23 '21 at 13:05
  • see edited answer - you might need to change the column names – mck May 23 '21 at 13:15
  • I'm sorry, I don't think it's a good solution for my problem... my dataset is just too big for that, and it's very complicated to multiply it like that and change all the column names (I need all of them to do some calculations). I'm upvoting your answer but I can't accept it. – nogmos May 23 '21 at 13:52
  • I'm just wondering how your function could work without renaming the columns, because there will be duplicate column names, and your function can't identify which is which? – mck May 23 '21 at 13:55
  • you're absolutely right, that's why I thought the solution has to be to group the data first by the key, and then somehow do the Cartesian product on all the groups (so the function some_func will get two dataframes, one for each key, with all the columns). I just don't know how to do it. – nogmos May 23 '21 at 13:58