
Consider the following DataFrame in PySpark:

+-----------+
|      Col A|
+-----------+
| [0.5, 0.6]|                  
| [0.7, 0.8]|                   
| [1.1, 1.5]|                                 
+-----------+

Col A is of type vector. How can I create a new column that has the values of Col A but is of type array (or string)?

Desired output:

+-----------+-----------+
|Col A      |new_column |
+-----------+-----------+
| [0.5, 0.6]|  0.5, 0.6 |               
| [0.7, 0.8]|  0.7, 0.8 |            
| [1.1, 1.5]|  1.1, 1.5 |                         
+-----------+-----------+

Thanks in advance!

tia

2 Answers


If you just want to convert a Vector into an Array[Double], this is fairly simple with a UDF:

import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.functions.{col, udf}

val toArr: Any => Array[Double] = _.asInstanceOf[DenseVector].toArray
val toArrUdf = udf(toArr)
// The column name contains a space, so use col("Col A") rather than a Symbol
val dataWithFeaturesArr = dataWithFeatures.withColumn("A_arr", toArrUdf(col("Col A")))
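Since the question asks for PySpark, the same UDF idea can be sketched in Python. A Spark DenseVector iterates like a sequence of floats, so the core conversion functions below also work on a plain list; the UDF registration itself (shown in comments) requires a SparkSession, and the names are illustrative:

```python
def vector_to_list(v):
    """Convert a vector-like sequence to a list of doubles."""
    return [float(x) for x in v]

def vector_to_string(v):
    """Convert a vector-like sequence to a comma-separated string."""
    return ", ".join(str(float(x)) for x in v)

# Wrapping them as UDFs (requires a running SparkSession; df is the
# DataFrame from the question):
#
# from pyspark.sql.functions import col, udf
# from pyspark.sql.types import ArrayType, DoubleType, StringType
#
# to_list_udf = udf(vector_to_list, ArrayType(DoubleType()))
# to_str_udf = udf(vector_to_string, StringType())
#
# df = df.withColumn("new_column", to_str_udf(col("Col A")))
```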
Vijay

A possible solution could be:

scala> output.show
+---+---------+
| id|vectorCol|
+---+---------+
|  0|[1.2,1.3]|
|  1|[2.2,2.3]|
|  2|[3.2,3.3]|
+---+---------+


scala> output.printSchema
root
 |-- id: integer (nullable = false)
 |-- vectorCol: vector (nullable = true)


scala> import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.ml.linalg.DenseVector

scala> val toArr: Any => Array[Double] = _.asInstanceOf[DenseVector].toArray
toArr: Any => Array[Double] = <function1>

scala> val toArrUdf = udf(toArr)
toArrUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(DoubleType,false),None)

scala> val df1 = output.withColumn("features_arr",toArrUdf('vectorCol))

scala> df1.show
+---+---------+------------+
| id|vectorCol|features_arr|
+---+---------+------------+
|  0|[1.2,1.3]|  [1.2, 1.3]|
|  1|[2.2,2.3]|  [2.2, 2.3]|
|  2|[3.2,3.3]|  [3.2, 3.3]|
+---+---------+------------+

scala> df1.printSchema
root
 |-- id: integer (nullable = false)
 |-- vectorCol: vector (nullable = true)
 |-- features_arr: array (nullable = true)
 |    |-- element: double (containsNull = false)

A possible implementation in PySpark can be seen in this link.

Let me know if it helps!

Anand Sai
  • Hey, thanks for your reply. The link for PySpark uses the vector_to_array function, which doesn't work for me, and I can't find any documentation for it. – tia Mar 03 '20 at 21:18
  • @tia In the same link, there are other possible answers. Let me know if you are able to find one for Spark < 3.0.0 – Anand Sai Mar 03 '20 at 21:21