
In order to apply PCA from pyspark.ml.feature, I need to convert a column of type org.apache.spark.sql.types.ArrayType (array&lt;float&gt;) to org.apache.spark.ml.linalg.VectorUDT. Say I have the following dataframe:

df = spark.createDataFrame([
    ('string1',[5.0,4.0,0.5]),
    ('string2',[2.0,0.76,7.54]),
], schema='a string, b array<float>')

Whereas a = Vectors.dense(df.select('b').head(1)[0][0]) seems to work for one row, I was wondering how I can apply this conversion to all rows.


1 Answer


You'd have to map it back to an RDD and manually create a Vector with a lambda function:

from pyspark.ml.linalg import Vectors

# df = ... # your df

df2 = df.rdd.map(lambda x: (x['a'], Vectors.dense(x['b']))).toDF(['a', 'b'])
df2.show()
df2.printSchema()

+-------+------------------------------------------+
|a      |b                                         |
+-------+------------------------------------------+
|string1|[5.0,4.0,0.5]                             |
|string2|[2.0,0.7599999904632568,7.539999961853027]|
+-------+------------------------------------------+

root
 |-- a: string (nullable = true)
 |-- b: vector (nullable = true)