I am trying to get all possible pairs of dates (the Cartesian product of two comma-separated date lists) using a pandas_udf. As I understand it, the DataFrame has to be grouped before it can be passed to a grouped-map pandas_udf, so I am adding an ID column and grouping by it, but I get an error. Here is a small example that reproduces the error:
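# Assumes an active SparkSession named `spark`.
import itertools
import pandas as pd
from pyspark.sql.functions import PandasUDFType, monotonically_increasing_id, pandas_udf
from pyspark.sql.types import StringType, StructField, StructType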
df = spark.createDataFrame([('11/30/15,11/30/18','11/30/18,11/30/18'), ('11/30/15,11/30/18','11/30/15,11/30/18')], ['colname1', 'colname2'])
schema = StructType([StructField('Product', StringType(), True)])
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def calculate_courses_final_df(this_row):
    this_row_course_date_obj_list = this_row['colname1']
    this_row_course_date_obj_list1 = this_row['colname2']
    return pd.DataFrame(list(itertools.product(this_row_course_date_obj_list.str.split(','),this_row_course_date_obj_list1.str.split(','))))
df1 = df.withColumn("id", monotonically_increasing_id())
df2 = df1.groupby('id')
df3 = df2.apply(calculate_courses_final_df)
df3.show()
Here is what df1 looks like:
+-----------------+-----------------+-----------+
|         colname1|         colname2|         id|
+-----------------+-----------------+-----------+
|11/30/15,11/30/18|11/30/18,11/30/18|25769803776|
|11/30/15,11/30/18|11/30/15,11/30/18|60129542144|
+-----------------+-----------------+-----------+
So the output should look like the following:
+----------------------------------------------------------------+
|[('11/30/15', '11/30/15'),('11/30/15', '11/30/18'), ('11/30/18',|
|'11/30/15'),('11/30/18', '11/30/18')]                           |
+----------------------------------------------------------------+
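To make the expected result concrete, here is a minimal plain-Python sketch, outside Spark, of the pairing I want for a single row. The helper name `date_pairs` is hypothetical and used only for illustration. Turning the list into a string at the end is an assumption about how the value could fit the StringType `Product` column, not something I have confirmed works inside the UDF.

```python
import itertools


def date_pairs(left, right):
    """Return every (left_date, right_date) pair from two comma-separated strings.

    `date_pairs` is a hypothetical helper, not part of pandas or PySpark.
    """
    return list(itertools.product(left.split(','), right.split(',')))


# Values taken from the second row of the example DataFrame.
pairs = date_pairs('11/30/15,11/30/18', '11/30/15,11/30/18')
print(pairs)

# The schema declares a single StringType column, so each row needs one string,
# not a Python list. Serializing with str() is one possible option (assumption).
as_text = str(pairs)
print(as_text)
```

Note that `pairs` is a Python list of tuples, while the schema above declares a string column. That difference may be related to the "Expected bytes, got a 'list' object" message in the error.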
Here is the error that I'm getting:
PythonException: An exception was thrown from a UDF: 'pyarrow.lib.ArrowTypeError: Expected bytes, got a 'list' object'. Full traceback below:
Traceback (most recent call last):