Separating dates and getting all permutations of products in Pandas UDF

Question

I am trying to get every possible pair of dates (the product of two comma-separated lists) using a pandas_udf. As I understand it, the DataFrame has to be grouped before it is passed to a grouped-map pandas_udf, so I am adding an ID column and grouping by it, but I get an error. Here is a small example that reproduces the error:

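import itertools

import pandas as pd
from pyspark.sql.functions import monotonically_increasing_id, pandas_udf, PandasUDFType
from pyspark.sql.types import StringType, StructField, StructType

# `spark` is assumed to be an existing SparkSession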
df = spark.createDataFrame([('11/30/15,11/30/18','11/30/18,11/30/18'), ('11/30/15,11/30/18','11/30/15,11/30/18')], ['colname1', 'colname2'])

schema = StructType([StructField('Product', StringType(), True)])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def calculate_courses_final_df(this_row):
    this_row_course_date_obj_list = this_row['colname1']
    this_row_course_date_obj_list1 = this_row['colname2']
    return pd.DataFrame(list(itertools.product(this_row_course_date_obj_list.str.split(','),this_row_course_date_obj_list1.str.split(','))))
df1 = df.withColumn("id", monotonically_increasing_id())
df2 = df1.groupby('id')
df3 = df2.apply(calculate_courses_final_df)
df3.show()

Here is what df1 looks like:

+-----------------+-----------------+-----------+
|         colname1|         colname2|         id|
+-----------------+-----------------+-----------+
|11/30/15,11/30/18|11/30/18,11/30/18|25769803776|
|11/30/15,11/30/18|11/30/15,11/30/18|60129542144|
+-----------------+-----------------+-----------+

So the output should look like the following:

+----------------------------------------------------------------+
|[('11/30/15', '11/30/15'),('11/30/15', '11/30/18'), ('11/30/18',|
|'11/30/15'),('11/30/18', '11/30/18')]                           |
+----------------------------------------------------------------+
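For a single row, the intended result can be sketched in plain Python, without Spark. This is only an illustration, assuming each cell is split on commas and the two resulting lists are combined with itertools.product. The helper name `pairs_for_row` is hypothetical, and the input values are taken from the second row of df1.

```python
import itertools


def pairs_for_row(left: str, right: str) -> list:
    """Return every (left_date, right_date) pair for two comma-separated strings."""
    left_dates = left.split(",")
    right_dates = right.split(",")
    # itertools.product yields the Cartesian product of the two lists
    return list(itertools.product(left_dates, right_dates))


# Values from the second row of df1
pairs = pairs_for_row("11/30/15,11/30/18", "11/30/15,11/30/18")
print(pairs)
```

Applied to the second row, this gives the four pairs shown in the desired output above.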

Here is the error that I'm getting:

PythonException: An exception was thrown from a UDF: 'pyarrow.lib.ArrowTypeError: Expected bytes, got a 'list' object'. Full traceback below:
Traceback (most recent call last):
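One plausible reading of this message can be checked in pandas alone, without Spark. `Series.str.split` returns a Series whose elements are lists, and `itertools.product` over two such Series pairs whole elements rather than individual dates, so the returned DataFrame holds list objects where the schema declares a string column. The one-row `group` frame below is an assumption standing in for what the UDF receives for a single id.

```python
import itertools

import pandas as pd

# Hypothetical stand-in for the one-row group passed to the UDF
group = pd.DataFrame(
    {"colname1": ["11/30/15,11/30/18"], "colname2": ["11/30/18,11/30/18"]}
)

# Each Series has one element, and that element is a Python list
left = group["colname1"].str.split(",")
right = group["colname2"].str.split(",")

# product() iterates over the Series elements (the lists), not over the dates
result = pd.DataFrame(list(itertools.product(left, right)))

print(result.shape)
print(type(result.iloc[0, 0]))
```

If this reading is right, the frame returned by the UDF also has two columns where the schema declares only 'Product', so both the cell type and the column count would need to match the schema before Arrow can convert it.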

The desired output is not clear. Does it represent just one row, with the second row missing? Why do you get `('11/30/15', '11/30/15')` if `11/30/15` does not exist in colname2 in the first row? What should the result for the 2nd row look like? - ZygD, Oct 03 '22 at 07:34
The output is actually one column. I think I should edit the code to make that clearer. The output is an itertools product of the 2 lists, which are comma-separated, and I use split to turn each string into a list. - Matt, Oct 05 '22 at 02:34
Please edit the question as much as needed to make it answerable. - ZygD, Oct 05 '22 at 07:56
