
I have a dataframe with multiple columns that contain vectors (the number of vector columns is dynamic). I need to create a new column that is the sum of all the vector columns. I'm having a hard time getting this done. Here is some code to generate a sample dataset that I'm testing on.

import org.apache.spark.ml.feature.VectorAssembler

val temp1 = spark.createDataFrame(Seq(
                                    (1,1.0,0.0,4.7,6,0.0),
                                    (2,1.0,0.0,6.8,6,0.0),
                                    (3,1.0,1.0,7.8,5,0.0),
                                    (4,0.0,1.0,4.1,7,0.0),
                                    (5,1.0,0.0,2.8,6,1.0),
                                    (6,1.0,1.0,6.1,5,0.0),
                                    (7,0.0,1.0,4.9,7,1.0),
                                    (8,1.0,0.0,7.3,6,0.0)))
                                    .toDF("id", "f1","f2","f3","f4","label")

val assembler1 = new VectorAssembler()
    .setInputCols(Array("f1","f2","f3"))
    .setOutputCol("vec1")

val temp2 = assembler1.setHandleInvalid("skip").transform(temp1)

val assembler2 = new VectorAssembler()
    .setInputCols(Array("f2","f3", "f4"))
    .setOutputCol("vec2")

val df = assembler2.setHandleInvalid("skip").transform(temp2)

This gives me the following dataset:

+---+---+---+---+---+-----+-------------+-------------+
| id| f1| f2| f3| f4|label|         vec1|         vec2|
+---+---+---+---+---+-----+-------------+-------------+
|  1|1.0|0.0|4.7|  6|  0.0|[1.0,0.0,4.7]|[0.0,4.7,6.0]|
|  2|1.0|0.0|6.8|  6|  0.0|[1.0,0.0,6.8]|[0.0,6.8,6.0]|
|  3|1.0|1.0|7.8|  5|  0.0|[1.0,1.0,7.8]|[1.0,7.8,5.0]|
|  4|0.0|1.0|4.1|  7|  0.0|[0.0,1.0,4.1]|[1.0,4.1,7.0]|
|  5|1.0|0.0|2.8|  6|  1.0|[1.0,0.0,2.8]|[0.0,2.8,6.0]|
|  6|1.0|1.0|6.1|  5|  0.0|[1.0,1.0,6.1]|[1.0,6.1,5.0]|
|  7|0.0|1.0|4.9|  7|  1.0|[0.0,1.0,4.9]|[1.0,4.9,7.0]|
|  8|1.0|0.0|7.3|  6|  0.0|[1.0,0.0,7.3]|[0.0,7.3,6.0]|
+---+---+---+---+---+-----+-------------+-------------+

If I needed to take the sum of regular columns, I could do it with something like:

import org.apache.spark.sql.functions.col

df.withColumn("sum", namesOfColumnsToSum.map(col).reduce((c1, c2) => c1 + c2))

I know I can sum Breeze DenseVectors just by using the "+" operator:

import breeze.linalg._
val v1 = DenseVector(1,2,3)
val v2 = DenseVector(5,6,7)
v1 + v2  // DenseVector(6, 8, 10)

The above code gives me the expected vector, but I'm not sure how to apply the same idea to the vector columns and sum vec1 and vec2 into a new column.

I tried the suggestions mentioned here, but had no luck.


1 Answer


Here's my take, coded in PySpark. Someone can probably help translate this to Scala:

from pyspark.ml.linalg import Vectors, VectorUDT
import numpy as np
from pyspark.sql.functions import udf, array

# Sum a list of vectors element-wise and return a new DenseVector
def vector_sum(arr):
    return Vectors.dense(np.sum(arr, axis=0))

vector_sum_udf = udf(vector_sum, VectorUDT())

# Collect the vector columns into an array column and sum them with the UDF
df = df.withColumn('sum', vector_sum_udf(array(['vec1', 'vec2'])))
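
For reference, a rough Scala sketch of the same idea (untested; vecCols is a placeholder for whatever dynamic list of vector column names you have):

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{array, col, udf}

// Placeholder for the dynamic list of vector columns to sum
val vecCols = Seq("vec1", "vec2")

// Element-wise sum of all vectors in the array column
// (assumes every vector has the same length)
val vectorSum = udf { vs: Seq[Vector] =>
  Vectors.dense(vs.map(_.toArray).reduce((a, b) => a.zip(b).map { case (x, y) => x + y }))
}

val withSum = df.withColumn("sum", vectorSum(array(vecCols.map(col): _*)))

The array of vector columns arrives in the UDF as a Seq[Vector], so the reduce mirrors the map(col).reduce pattern you already use for scalar columns.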