I am new to PySpark and am trying to add a NumPy array to a dataframe.
I have the following numpy array:
print(category_dimension_vectors)
[[ 5.19333403e-01 -3.36615935e-01 -6.93262848e-02 2.37293671e-01]
[ 4.45220874e-01 1.30108798e-01 1.12913839e-01 1.87161517e-01]]
I would like to append this to a PySpark dataframe as a new column, where each row of the array is stored in the corresponding row of the dataframe.
The array and the dataframe have the same number of rows.
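For reference, here is a minimal setup that reproduces my situation (the column names and values match the expected output at the end of the question; how df was originally built should not matter here):

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# dataframe with the same two rows as in the expected output below
df = spark.createDataFrame([("foo", 1), ("bar", 2)], ["a", "b"])

# the reduced vectors, one array row per dataframe row
category_dimension_vectors = np.array([
    [ 5.19333403e-01, -3.36615935e-01, -6.93262848e-02,  2.37293671e-01],
    [ 4.45220874e-01,  1.30108798e-01,  1.12913839e-01,  1.87161517e-01],
])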
This is what I tried first:
arr_rows = udf(lambda row: category_dimension_vectors[row,:], ArrayType(DecimalType()))
df = df.withColumn("category_dimensions_reduced", arr_rows(df))
Getting the error:
TypeError: Invalid argument, not a string or column
Then I tried:
df = df.withColumn("category_dimensions_reduced", lit(category_dimension_vectors))
But got the error:
org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE]
What I am trying to achieve is:
+----+----+-----------------------------------------------------------------+
| a| b|category_dimension_vectors |
+----+----+-----------------------------------------------------------------+
|foo | 1|[5.19333403e-01,-3.36615935e-01,-6.93262848e-02,2.37293671e-01] |
|bar | 2|[4.45220874e-01,1.30108798e-01,1.12913839e-01,1.87161517e-01] |
+----+----+-----------------------------------------------------------------+
How should I approach this problem?
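One direction that seems plausible to me is to give the dataframe a positional row index, build a second dataframe from the array with the same index, and join on it, along these lines (this assumes the dataframe's current row order is the one the array rows should line up with):

from pyspark.sql.types import StructType, StructField, LongType, ArrayType, DoubleType

# attach a consecutive positional index to the existing dataframe
# (zipWithIndex preserves the current row order)
df_indexed = (
    df.rdd.zipWithIndex()
      .map(lambda pair: pair[0] + (pair[1],))
      .toDF(df.columns + ["row_idx"])
)

# build a second dataframe from the numpy array, keyed by the same index
vec_rows = [(i, [float(x) for x in row]) for i, row in enumerate(category_dimension_vectors)]
vec_schema = StructType([
    StructField("row_idx", LongType(), False),
    StructField("category_dimension_vectors", ArrayType(DoubleType()), False),
])
df_vecs = spark.createDataFrame(vec_rows, vec_schema)

# join on the index and drop it again
result = df_indexed.join(df_vecs, on="row_idx").drop("row_idx")
result.show(truncate=False)

Would something like this be a reasonable way to go, or is there a more idiomatic approach?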