
I am trying to apply a UDF to a column of SparseVectors in a PySpark DataFrame (the column was created with pyspark.ml.feature.IDF). Originally I was applying a more involved function, but I get the same error no matter what function I apply. So, for the sake of a minimal example:

import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
udfSum = udf(lambda x: np.sum(x.values), FloatType())
df = df.withColumn("vec_sum", udfSum(df.idf))
df.take(10)

I am getting this error:

Py4JJavaError: An error occurred while calling 
z:org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe. 
: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 55.0 failed 4 times, most recent failure: Lost task 0.3 
in stage 55.0 (TID 111, 10.0.11.102): net.razorvine.pickle.PickleException:
expected zero arguments for construction of ClassDict (for numpy.dtype)

If I convert the df to Pandas and apply the function there, I can confirm that FloatType() is the correct return type. This question may be related, but the error is different: Issue with UDF on a column of Vectors in PySpark DataFrame.
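(For reference, the Pandas check was roughly the following, assuming the df is small enough that toPandas() is safe:)

pdf = df.select("idf").toPandas()
# Each cell is still a SparseVector; summing its values yields ordinary float values,
# which is why FloatType() looks like the right return type.
pdf["idf"].apply(lambda v: np.sum(v.values))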

Thanks!


1 Answer


np.sum returns a numpy.float64, which Spark's pickler cannot map to a FloatType column (hence the ClassDict error). Convert the output to a plain Python float:

udf(lambda x: float(np.sum(x.values)), FloatType())
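For example, the snippet from the question with the cast applied (same df and idf column) would look like:

import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# float() turns the numpy.float64 into a plain Python float that Spark can store in a FloatType column
udfSum = udf(lambda x: float(np.sum(x.values)), FloatType())
df = df.withColumn("vec_sum", udfSum(df.idf))
df.take(10)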