I want to use data.groupBy().apply()
to apply a function to each group of my PySpark DataFrame.
I am using a Grouped Map Pandas UDF, but I can't figure out how to pass another argument to my function.
I tried making the argument a global variable, but the function doesn't recognize it (my argument is a PySpark DataFrame).
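Roughly, that attempt looked like this (a simplified sketch; interval stands for my second PySpark DataFrame, and schema is the same output schema used in the snippets below):

from pyspark.sql.functions import pandas_udf, PandasUDFType

interval = spark.read.parquet("...")  # my second PySpark DataFrame (placeholder source)

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def calc_diff(key, data):
    # interval is referenced as a global here; I suspect this fails because
    # a PySpark DataFrame can't be used inside a UDF running on the executors
    interval_df = interval.filter(interval["var"] == key[0]).toPandas()
    # Apply some operations
    return data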
I also tried the solutions proposed in this question (for a pandas DataFrame): Use Pandas groupby() + apply() with arguments.
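For an ordinary pandas DataFrame, those solutions boil down to something like this (simplified; pdf is a hypothetical pandas DataFrame):

import pandas as pd

def calc_diff(group, arg1):
    # with plain pandas, the extra argument is just another parameter
    # (apply some operations)
    return group

pdf.groupby("msn").apply(calc_diff, 'arg1')
# or equivalently
pdf.groupby("msn").apply(lambda g: calc_diff(g, 'arg1'))

Adapted to PySpark, my attempts look like this: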
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def calc_diff(key, data, interval):
    interval_df = interval.filter(interval["var"] == key).toPandas()
    for value in interval_df:
        # Apply some operations
        ...

return Data.groupBy("msn").apply(calc_diff, ('arg1'))
Or:
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def calc_diff(key, data, interval):
    interval_df = interval.filter(interval["var"] == key).toPandas()
    for value in interval_df:
        # Apply some operations
        ...

return Data.groupBy("msn").apply(lambda x: calc_diff(x, 'arg1'))
But I get this error:
ValueError: Invalid function: pandas_udfs with function type GROUPED_MAP must take either one argument (data) or two arguments (key, data).
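From the linked pandas question, I suspect the pattern should be a wrapper that closes over the extra argument, so the UDF itself keeps the required (key, data) signature. Something like this sketch (untested; interval_pdf is a hypothetical pandas version of my second DataFrame, collected with toPandas() up front because, as far as I understand, a PySpark DataFrame can't be used inside the UDF):

from pyspark.sql.functions import pandas_udf, PandasUDFType

def make_calc_diff(interval_pdf):
    # build the GROUPED_MAP UDF as a closure over the extra argument,
    # so apply() still sees only the (key, data) signature
    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def calc_diff(key, data):
        # key is a tuple of the grouping values, so key[0] is the msn
        interval_df = interval_pdf[interval_pdf["var"] == key[0]]
        # Apply some operations
        return data
    return calc_diff

interval_pdf = interval.toPandas()  # bring the argument to the driver first
result = Data.groupBy("msn").apply(make_calc_diff(interval_pdf))

But I'm not sure this is the intended approach.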
Could anyone help me with this issue?
Thanks.