I have a dataframe and I apply a function to it. This function returns a numpy array.
The code looks like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType
create_vector_udf = udf(create_vector, ArrayType(FloatType()))
dmoz_spark_df = dmoz_spark_df.withColumn('vector', create_vector_udf('text'))
dmoz_spark_df.select('lang','url','vector').show(20)
Spark does not seem to be happy with this: it does not accept the returned numpy array for the declared ArrayType(FloatType()) return type.
I get the following error message:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
I could just call numpyarray.tolist() and return a list version of it, but then I would have to recreate the array every time I want to use it with numpy.
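For reference, the list workaround might look roughly like the sketch below. The body of create_vector is a hypothetical stand-in for my real function, and create_vector_as_list and to_numpy are helper names made up for this illustration. The sketch only shows the conversion in both directions, without Spark itself.

```python
import numpy as np


def create_vector(text):
    # Hypothetical stand-in for the real function: it just returns a NumPy array.
    return np.array([float(len(word)) for word in text.split()], dtype=np.float32)


def create_vector_as_list(text):
    # tolist() turns the array into a plain Python list of floats,
    # which is what an ArrayType(FloatType()) column can hold.
    return create_vector(text).tolist()


def to_numpy(values):
    # Rebuild the array from the list read back out of the column.
    return np.asarray(values, dtype=np.float32)


stored = create_vector_as_list("store numpy arrays")
restored = to_numpy(stored)
```

With this approach, create_vector_as_list would be the function passed to udf instead of create_vector, and to_numpy would have to be called on every value read back from the column.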
So is there a way to store a numpy array in a dataframe column?