
In order to apply PCA from pyspark.ml.feature, I need to convert a column of type org.apache.spark.sql.types.ArrayType (array&lt;float&gt;) to org.apache.spark.ml.linalg.VectorUDT. Say I have the following dataframe:

df = spark.createDataFrame([
    ('string1',[5.0,4.0,0.5]),
    ('string2',[2.0,0.76,7.54]),
], schema='a string, b array<float>')

Whereas a = Vectors.dense(df.select('b').head(1)[0][0]) seems to work for one row, I was wondering how I can apply this conversion to all rows.


1 Answer


You'd have to map it back to an RDD and manually create a Vector with a lambda function:

from pyspark.ml.linalg import Vectors

# df = ... # your df

df2 = df.rdd.map(lambda x: (x['a'], Vectors.dense(x['b']))).toDF(['a', 'b'])
df2.show()
df2.printSchema()

+-------+------------------------------------------+
|a      |b                                         |
+-------+------------------------------------------+
|string1|[5.0,4.0,0.5]                             |
|string2|[2.0,0.7599999904632568,7.539999961853027]|
+-------+------------------------------------------+

root
 |-- a: string (nullable = true)
 |-- b: vector (nullable = true)