I have a dataframe where I have multiple columns that contain vectors (number of vector columns is dynamic). I need to create a new column taking the sum of all the vector columns. I'm having a hard time getting this done. here is a code to generate a sample dataset that I'm testing on.
import org.apache.spark.ml.feature.VectorAssembler
val temp1 = spark.createDataFrame(Seq(
(1,1.0,0.0,4.7,6,0.0),
(2,1.0,0.0,6.8,6,0.0),
(3,1.0,1.0,7.8,5,0.0),
(4,0.0,1.0,4.1,7,0.0),
(5,1.0,0.0,2.8,6,1.0),
(6,1.0,1.0,6.1,5,0.0),
(7,0.0,1.0,4.9,7,1.0),
(8,1.0,0.0,7.3,6,0.0)))
.toDF("id", "f1","f2","f3","f4","label")
val assembler1 = new VectorAssembler()
.setInputCols(Array("f1","f2","f3"))
.setOutputCol("vec1")
val temp2 = assembler1.setHandleInvalid("skip").transform(temp1)
val assembler2 = new VectorAssembler()
.setInputCols(Array("f2","f3", "f4"))
.setOutputCol("vec2")
val df = assembler2.setHandleInvalid("skip").transform(temp2)
This gives me the following dataset
+---+---+---+---+---+-----+-------------+-------------+
| id| f1| f2| f3| f4|label| vec1| vec2|
+---+---+---+---+---+-----+-------------+-------------+
| 1|1.0|0.0|4.7| 6| 0.0|[1.0,0.0,4.7]|[0.0,4.7,6.0]|
| 2|1.0|0.0|6.8| 6| 0.0|[1.0,0.0,6.8]|[0.0,6.8,6.0]|
| 3|1.0|1.0|7.8| 5| 0.0|[1.0,1.0,7.8]|[1.0,7.8,5.0]|
| 4|0.0|1.0|4.1| 7| 0.0|[0.0,1.0,4.1]|[1.0,4.1,7.0]|
| 5|1.0|0.0|2.8| 6| 1.0|[1.0,0.0,2.8]|[0.0,2.8,6.0]|
| 6|1.0|1.0|6.1| 5| 0.0|[1.0,1.0,6.1]|[1.0,6.1,5.0]|
| 7|0.0|1.0|4.9| 7| 1.0|[0.0,1.0,4.9]|[1.0,4.9,7.0]|
| 8|1.0|0.0|7.3| 6| 0.0|[1.0,0.0,7.3]|[0.0,7.3,6.0]|
+---+---+---+---+---+-----+-------------+-------------+
If I needed to taek sum of regular columns, I can do it using something like,
import org.apache.spark.sql.functions.col
df.withColumn("sum", namesOfColumnsToSum.map(col).reduce((c1, c2)=>c1+c2))
I know I can use breeze to sum DenseVectors just using "+" operator
import breeze.linalg._
val v1 = DenseVector(1,2,3)
val v2 = DenseVector(5,6,7)
v1+v2
So, the above code gives me the expected vector. But I'm not sure how to take the sum of the vector columns and sum vec1
and vec2
columns.
I did try the suggestions mentioned here, but had no luck