
I am new to PySpark and am trying to append a numpy array to a dataframe.

I have a numpy array as:

print(category_dimension_vectors)

[[ 5.19333403e-01 -3.36615935e-01 -6.93262848e-02  2.37293671e-01]
 [ 4.45220874e-01  1.30108798e-01  1.12913839e-01  1.87161517e-01]]

I would like to append this to a PySpark dataframe as a new column, where each row of the array is stored in the corresponding row of the dataframe.

The number of rows in the array and the number of rows in the dataframe are equal.

This is what I tried first:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DecimalType

arr_rows = udf(lambda row: category_dimension_vectors[row,:], ArrayType(DecimalType()))

df = df.withColumn("category_dimensions_reduced", arr_rows(df))

Getting the error:

TypeError: Invalid argument, not a string or column
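
I think this fails because a udf has to be applied to one or more columns, not to the dataframe itself. A minimal sketch of the column-based pattern, assuming a hypothetical integer index column idx (which df does not yet have) and using DoubleType, since DecimalType() with its default scale of 0 would drop the fractional part:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# look up the array row by index; tolist() turns numpy floats into plain Python floats
arr_rows = udf(lambda i: category_dimension_vectors[i, :].tolist(),
               ArrayType(DoubleType()))

df = df.withColumn("category_dimensions_reduced", arr_rows(df["idx"]))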

Then I tried:

df = df.withColumn("category_dimensions_reduced", lit(category_dimension_vectors))

But got the error:

org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] 
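
As far as I can tell, lit() builds a literal column from a single scalar value, repeated on every row, for example:

from pyspark.sql.functions import lit

df = df.withColumn("constant_col", lit(0.5))  # illustrative column name; the same 0.5 on every row

so a 2-D numpy array is rejected with the LITERAL_TYPE error above.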

What I am trying to achieve is:

+----+----+-----------------------------------------------------------------+
|   a|   b|category_dimension_vectors                                       |
+----+----+-----------------------------------------------------------------+
|foo |   1|[5.19333403e-01,-3.36615935e-01,-6.93262848e-02,2.37293671e-01]  |       
|bar |   2|[4.45220874e-01,1.30108798e-01,1.12913839e-01,1.87161517e-01]    |             
+----+----+-----------------------------------------------------------------+

How should I approach this problem?

  • Hey, unless you have a more complicated usage in mind I think your answer is to convert the numpy array into a dataframe first like in this question https://stackoverflow.com/questions/45063591/creating-spark-dataframe-from-numpy-matrix – FJ_OC Feb 02 '23 at 14:29
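
A minimal sketch of the approach suggested in that comment, assuming an active SparkSession named spark, that the rows of df are in the same order as the array, and using row_number() over monotonically_increasing_id() to manufacture a positional join key (untested against the real data):

from pyspark.sql import Window
from pyspark.sql.functions import row_number, monotonically_increasing_id

# build a one-column dataframe from the numpy array; tolist() yields plain Python floats
vec_df = spark.createDataFrame(
    [(v.tolist(),) for v in category_dimension_vectors],
    ["category_dimension_vectors"],
)

# give both dataframes a matching positional index, then join on it
w = Window.orderBy(monotonically_increasing_id())
df = df.withColumn("row_idx", row_number().over(w))
vec_df = vec_df.withColumn("row_idx", row_number().over(w))

df = df.join(vec_df, on="row_idx").drop("row_idx")

Note that Spark does not guarantee a stable row order, so if the rows carry a natural key it would be safer to join on that instead of a positional index.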
