What type should a dense vector be when returned from a UDF in PySpark?

Question

I want to convert a list column to a vector column in PySpark, and then use that column to train a machine learning model. But my Spark version is 1.6.0, which does not have VectorUDT(). So what type should I return from my UDF?

from pyspark.sql import SQLContext
from pyspark import SparkContext, SparkConf
from pyspark.sql.functions import *
from pyspark.mllib.linalg import DenseVector
from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import *


conf = SparkConf().setAppName('rank_test')
sc = SparkContext(conf=conf)
spark = SQLContext(sc)


df = spark.createDataFrame([[[0.1,0.2,0.3,0.4,0.5]]],['a'])
print '???'
df.show()
def list2vec(column):
    print '?????',column
    return Vectors.dense(column)
getVector = udf(lambda y: list2vec(y),DenseVector() )
df.withColumn('b',getVector(col('a'))).show()

I have tried many types, and this one, DenseVector(), gives me an error:

Traceback (most recent call last):
  File "t.py", line 21, in <module>
    getVector = udf(lambda y: list2vec(y),DenseVector() )
TypeError: __init__() takes exactly 2 arguments (1 given)

Help me, please.

score 15 · Accepted Answer · answered Apr 03 '18 at 07:19

You can use Vectors and VectorUDT with a UDF. The return type passed to udf must be an instance of VectorUDT, not DenseVector:

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F

ud_f = F.udf(lambda r : Vectors.dense(r),VectorUDT())
df = df.withColumn('b',ud_f('a'))
df.show()
+-------------------------+---------------------+
|a                        |b                    |
+-------------------------+---------------------+
|[0.1, 0.2, 0.3, 0.4, 0.5]|[0.1,0.2,0.3,0.4,0.5]|
+-------------------------+---------------------+

df.printSchema()
root
  |-- a: array (nullable = true)
  |    |-- element: double (containsNull = true)
  |-- b: vector (nullable = true)

For more about VectorUDT, see http://spark.apache.org/docs/2.2.0/api/python/_modules/pyspark/ml/linalg.html

Suresh

Thank you, but my Spark version is 1.6.0, which does not have VectorUDT; that's why I asked this question – nick_liu Apr 03 '18 at 07:24
1.6 has VectorUDT in mllib. Just import it with: from pyspark.mllib.linalg import Vectors, VectorUDT. See http://spark.apache.org/docs/1.6.0/api/python/_modules/pyspark/mllib/linalg.html – Suresh Apr 03 '18 at 07:33
You are right, thank you – nick_liu Apr 04 '18 at 02:33
What is the equivalent in Scala of ud_f = F.udf(lambda r : Vectors.dense(r),VectorUDT())? – sri hari kali charan Tummala May 29 '19 at 04:44
