I have a DataFrame like this:

data_df = spark.createDataFrame([([1,2,3],'val1'),([4,5,6],'val2')],['col1','col2'])

col1    col2
[1,2,3] val1
[4,5,6] val2

I want to get the minimum value from each array in col1. The expected result looks like:

Col1
1
4

I implemented the following Pandas UDF, but I got an error:

An exception was thrown from a UDF: 'AssertionError: Pandas SCALAR_ITER UDF outputted more rows than input rows.'

I don't know what is wrong.

from typing import Iterator

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

def generate_min(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for x in batch_iter:
        yield min(x)

generate_min_udf = pandas_udf(generate_min, returnType=IntegerType())

data_df.select(generate_min_udf(F.col('col1'))).show()

1 Answer

Your DataFrame (data_df):

+---------+----+
|     col1|col2|
+---------+----+
|[1, 2, 3]|val1|
|[4, 5, 6]|val2|
+---------+----+

Use the built-in PySpark function array_min() (available since Spark 2.4) to get the minimum element of an array column.

from pyspark.sql.functions import array_min

data_df.select(
    array_min("col1").alias("Col1")
).show()

Output

+----+
|Col1|
+----+
|   1|
|   4|
+----+
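
If you do want to stick with a Pandas UDF, the AssertionError comes from yield min(x): a SCALAR_ITER UDF must yield one pd.Series per input batch, with output rows matching input rows one-to-one, while min(x) collapses the whole batch into a single value. Below is a minimal sketch of a corrected version, assuming Spark 3.x (where the Iterator[pd.Series] type hints select the iterator UDF variant):

from typing import Iterator

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

@pandas_udf(IntegerType())
def generate_min(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for x in batch_iter:
        # x holds one array per row; x.map(min) computes one minimum
        # per row, so the yielded Series has as many rows as the batch.
        yield x.map(min)

data_df.select(generate_min(F.col('col1')).alias('Col1')).show()

That said, array_min is the better choice here: it runs entirely inside the JVM and avoids the Arrow serialization overhead that a Pandas UDF adds.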