from pyspark.sql import functions as func
I have a PySpark DataFrame called df. It has the following schema:
id: string
item: string
data: double
I apply the following operation to it:
grouped_df = df.groupBy(["id", "item"]).agg(func.collect_list(df.data).alias("dataList"))
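For concreteness, here is a toy df that the snippet above can be run against (the values are invented for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", "x", 1.0), ("a", "x", 100.0), ("a", "x", 2.0), ("a", "x", None)],
    schema="id string, item string, data double",
)
# After the groupBy above, grouped_df.collect() gives (roughly):
# [Row(id='a', item='x', dataList=[1.0, 100.0, 2.0])]
# Note the null in df.data is already gone -- see the bonus question at the end.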
I also defined the user-defined function iqrOnList:
from pyspark.sql.functions import udf

@udf
def iqrOnList(accumulatorsList: list):
    import numpy as np
    # Tukey's fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is an outlier
    Q1 = np.percentile(accumulatorsList, 25)
    Q3 = np.percentile(accumulatorsList, 75)
    IQR = Q3 - Q1
    lowerFence = Q1 - (1.5 * IQR)
    upperFence = Q3 + (1.5 * IQR)
    # Keep in-fence values, replace outliers with None
    return [elem if lowerFence <= elem <= upperFence else None for elem in accumulatorsList]
I applied this UDF as follows:
grouped_df = grouped_df.withColumn("SecondList", iqrOnList(grouped_df.dataList))
These operations produce the DataFrame grouped_df, whose schema is:
id: string
item: string
dataList: array
SecondList: string
Problem:
SecondList contains exactly the values I expect (for example [1, 2, 3, null, 3, null, 2]), but with the wrong return type: string instead of array, even though it keeps the textual form of an array. I need it stored as an array, exactly as dataList is.
Questions:
1) How can I save it with the correct type? (I sketch a guess just below.)
2) This UDF is expensive in terms of performance. I read here that Pandas UDFs perform much better than ordinary UDFs. What is the equivalent of this method as a Pandas UDF? (My rough attempt is also sketched below.)
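For question 1, my current guess is that @udf defaults to StringType when no return type is declared, so declaring one explicitly might be enough; a minimal sketch (ArrayType and DoubleType are from pyspark.sql.types), though I'm not sure this is the whole story:

from pyspark.sql.types import ArrayType, DoubleType

@udf(returnType=ArrayType(DoubleType()))
def iqrOnList(accumulatorsList: list):
    ...  # body unchanged from the definition above

For question 2, here is my rough attempt at a scalar Pandas UDF (this assumes Spark 3.x with the type-hint pandas_udf API and PyArrow installed); I don't know whether it is correct or idiomatic:

import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, DoubleType

@pandas_udf(ArrayType(DoubleType()))
def iqrOnListPandas(lists: pd.Series) -> pd.Series:
    # Each element of `lists` is one group's dataList (a sequence of doubles)
    def fence(values):
        arr = np.asarray(values, dtype=float)
        q1, q3 = np.percentile(arr, [25, 75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        return [v if lower <= v <= upper else None for v in arr.tolist()]
    return lists.apply(fence)

grouped_df = grouped_df.withColumn("SecondList", iqrOnListPandas(grouped_df.dataList))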
Bonus question (less priority): func.collect_list(df.data) does not collect null values, which df.data contains. I'd like to collect them too. How can I do that without replacing all nulls with some default value?
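One workaround I can think of (assuming Spark 2.4+, since it relies on func.flatten) is to wrap each value in a one-element array before collecting -- array(null) is itself non-null, so collect_list keeps it -- and flatten afterwards; I'm not sure whether this is the idiomatic approach:

grouped_df = df.groupBy(["id", "item"]).agg(
    func.flatten(func.collect_list(func.array(df.data))).alias("dataList")
)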