from pyspark.sql import functions as func
I have a PySpark DataFrame called df. It has the following schema:
id: string
item: string
data: double
I apply the following operation to it:
grouped_df = df.groupBy(["id", "item"]).agg(func.collect_list(df.data).alias("dataList"))
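For concreteness, here is a toy df that the snippet above can be run against (the values are invented for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", "x", 1.0), ("a", "x", 100.0), ("a", "x", 2.0), ("a", "x", None)],
    schema="id string, item string, data double",
)
# After the groupBy above, grouped_df.collect() gives (roughly):
# [Row(id='a', item='x', dataList=[1.0, 100.0, 2.0])]
# Note the null in df.data is already gone -- see the bonus question at the end.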
I also defined the user-defined function iqrOnList:
from pyspark.sql.functions import udf

@udf
def iqrOnList(accumulatorsList: list):
    import numpy as np
    # Tukey's fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is an outlier
    Q1 = np.percentile(accumulatorsList, 25)
    Q3 = np.percentile(accumulatorsList, 75)
    IQR = Q3 - Q1
    lowerFence = Q1 - (1.5 * IQR)
    upperFence = Q3 + (1.5 * IQR)
    # Keep in-fence values, replace outliers with None
    return [elem if lowerFence <= elem <= upperFence else None for elem in accumulatorsList]
I applied this UDF as follows:
grouped_df = grouped_df.withColumn("SecondList", iqrOnList(grouped_df.dataList))
These operations produce the DataFrame grouped_df, whose schema is:
id: string
item: string
dataList: array
SecondList: string
Problem:
SecondList contains exactly the values I expect (for example [1, 2, 3, null, 3, null, 2]), but with the wrong return type: string instead of array, even though it keeps the textual form of an array. I need it stored as an array, exactly as dataList is.
Questions:
1) How can I save it with the correct type? (I sketch a guess just below.)
2) This UDF is expensive in terms of performance. I read here that Pandas UDFs perform much better than ordinary UDFs. What is the equivalent of this method as a Pandas UDF? (My rough attempt is also sketched below.)
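For question 1, my current guess is that @udf defaults to StringType when no return type is declared, so declaring one explicitly might be enough; a minimal sketch (ArrayType and DoubleType are from pyspark.sql.types), though I'm not sure this is the whole story:

from pyspark.sql.types import ArrayType, DoubleType

@udf(returnType=ArrayType(DoubleType()))
def iqrOnList(accumulatorsList: list):
    ...  # body unchanged from the definition above

For question 2, here is my rough attempt at a scalar Pandas UDF (this assumes Spark 3.x with the type-hint pandas_udf API and PyArrow installed); I don't know whether it is correct or idiomatic:

import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, DoubleType

@pandas_udf(ArrayType(DoubleType()))
def iqrOnListPandas(lists: pd.Series) -> pd.Series:
    # Each element of `lists` is one group's dataList (a sequence of doubles)
    def fence(values):
        arr = np.asarray(values, dtype=float)
        q1, q3 = np.percentile(arr, [25, 75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        return [v if lower <= v <= upper else None for v in arr.tolist()]
    return lists.apply(fence)

grouped_df = grouped_df.withColumn("SecondList", iqrOnListPandas(grouped_df.dataList))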
Bonus question (less priority): func.collect_list(df.data) does not collect null values, which df.data contains. I'd like to collect them too. How can I do that without replacing all nulls with some default value?
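One workaround I can think of (assuming Spark 2.4+, since it relies on func.flatten) is to wrap each value in a one-element array before collecting -- array(null) is itself non-null, so collect_list keeps it -- and flatten afterwards; I'm not sure whether this is the idiomatic approach:

grouped_df = df.groupBy(["id", "item"]).agg(
    func.flatten(func.collect_list(func.array(df.data))).alias("dataList")
)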