I have a DataFrame like this:
data_df = spark.createDataFrame([([1,2,3],'val1'),([4,5,6],'val2')],['col1','col2'])
col1     col2
[1,2,3]  val1
[4,5,6]  val2
I want to get the minimum value from each array in col1. The expected result looks like:
col1
1
4
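For reference, I believe the built-in F.array_min (available since Spark 2.4, if I'm not mistaken) gives exactly this output; I'm trying to get the same result with a Pandas UDF:

from pyspark.sql import functions as F

# reference only: built-in array minimum, assuming Spark 2.4+
data_df.select(F.array_min(F.col('col1')).alias('col1')).show()
# +----+
# |col1|
# +----+
# |   1|
# |   4|
# +----+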
I implemented a Pandas UDF for this, but I get the following error:
**An exception was thrown from a UDF: 'AssertionError: Pandas SCALAR_ITER UDF outputted more rows than input rows.'**
I can't figure out what is wrong. Here is my code:
from typing import Iterator
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

def generate_min(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for x in batch_iter:
        yield min(x)

generate_min_udf = pandas_udf(generate_min, returnType=IntegerType())
data_df.select(generate_min_udf(F.col('col1'))).show()