I have a PySpark DataFrame:
+---+----+----+
|key|col1|col2|
+---+----+----+
|  a| 5.4|   1|
|  a| 6.5|   2|
|  b| 7.5|   3|
|  b| 4.5|   4|
|  c| 6.4|   1|
+---+----+----+
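For reference, the DataFrame can be built like this (I'm assuming col1 is a double and col2 an int):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the table above; the column types are assumptions
df = spark.createDataFrame(
    [("a", 5.4, 1), ("a", 6.5, 2), ("b", 7.5, 3), ("b", 4.5, 4), ("c", 6.4, 1)],
    ["key", "col1", "col2"],
)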
I want to do a Cartesian product, not between individual rows, but between the groups produced by groupby("key"), and then apply some Python function to each pair. In other words: group by "key" and then cross-join (Cartesian product) each group with every other group (a with b, a with c, b with c).
The expected output is a DataFrame with a predefined schema:
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("some_col_1", StringType(), False),
    StructField("some_col_2", StringType(), False)
])
So the custom function should have a signature like:
def custom_func(df_1: pd.DataFrame, df_2: pd.DataFrame) -> pd.DataFrame: ...
or, with Spark DataFrames instead of pandas DataFrames:
def custom_func(df_1: DataFrame, df_2: DataFrame) -> DataFrame: ...
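For example, a toy pandas version that matches the schema above could look like this (the column contents are purely illustrative, and it assumes each group DataFrame still contains the "key" column):

import pandas as pd

def custom_func(df_1: pd.DataFrame, df_2: pd.DataFrame) -> pd.DataFrame:
    # Toy example: just pair the keys of the two groups; the real logic would go here
    return pd.DataFrame({
        "some_col_1": [df_1["key"].iloc[0]],
        "some_col_2": [df_2["key"].iloc[0]],
    })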
I tried creating two groupbys and then using cogroup:
group1 = df.groupby("key")
group2 = df.groupby("key")
res = group1.cogroup(group2).applyInPandas(custom_func, schema)
But this doesn't produce a Cartesian product; cogroup only pairs groups that share the same key. I tried using crossJoin, but it only works on DataFrames. How can I apply it to GroupedData? Is there any way of doing this?
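The only fallback I can think of is to collect the distinct keys to the driver and loop over the key pairs myself, roughly like the sketch below (it assumes the number of distinct keys is small and uses the Spark-DataFrame variant of custom_func), but I'd hope there is a more idiomatic way:

from itertools import combinations
from functools import reduce
from pyspark.sql import DataFrame

# Collect the distinct keys to the driver (assumes there are only a few keys)
keys = [r["key"] for r in df.select("key").distinct().collect()]

results = []
for k1, k2 in combinations(keys, 2):  # (a, b), (a, c), (b, c)
    df_1 = df.filter(df["key"] == k1)
    df_2 = df.filter(df["key"] == k2)
    results.append(custom_func(df_1, df_2))  # Spark-DataFrame variant of custom_func

# Combine the per-pair results into a single DataFrame with the predefined schema
res = reduce(DataFrame.union, results)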