How to replace nulls in Vector column?

Question

I have a column of type Vector that contains null values I can't get rid of. Here's an example:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

val sv1: Vector = Vectors.sparse(58, Array(8, 45), Array(1.0, 1.0))
val df_1 = sc.parallelize(List(("id_1", sv1))).toDF("id", "feature_vector")
val df_2 = sc.parallelize(List(("id_1", 10.0), ("id_2", 10.0))).toDF("id", "numeric_feature")

val df_joined = df_1.join(df_2, Seq("id"), "right")

df_joined.show()

+----+--------------------+---------------+
|  id|      feature_vector|numeric_feature|
+----+--------------------+---------------+
|id_1|(58,[8,45],[1.0,1...|           10.0|
|id_2|                null|           10.0|
+----+--------------------+---------------+

What I'd like to do:

val map = Map("feature_vector" -> sv1)
val result = df_joined.na.fill(map)

But that throws an error:

Message: Unsupported value type org.apache.spark.mllib.linalg.SparseVector ((58,[8,45],[1.0,1.0])).

Other things I've tried:

df_joined.withColumn("feature_vector", when(col("feature_vector").isNull, sv1).otherwise(sv1)).show

(adapted from the question "how to filter out a null value from spark dataframe")

I'm struggling to find a solution that works on Spark 1.6.

To add to your problems, I don't think you can return a vector from a UDF in 1.6. — philantrovert, Jun 07 '18 at 13:50
@philantrovert I think I ran into that wall during one of my attempts, too. Luckily, user8371915's suggestion worked! — Alexvonrass, Jun 07 '18 at 14:07
The answer by @user8371915 is definitely better and doesn't require switching between RDD and DF. Please accept that. — philantrovert, Jun 07 '18 at 14:09
@philantrovert My bad, for some reason I thought you could accept multiple solutions. Thank you! — Alexvonrass, Jun 07 '18 at 14:14

score 4 · Accepted Answer · answered Jun 07 '18 at 13:55

Coalesce and join should do the trick:

import org.apache.spark.sql.functions.{coalesce, broadcast}

val fill = Seq(
  Tuple1(Vectors.sparse(58, Array(8, 45), Array(1.0, 1.0)))
).toDF("fill")


df_joined
  .join(broadcast(fill))
  .withColumn("feature_vector", coalesce($"feature_vector", $"fill"))
  .drop("fill")

Alper t. Turker


In Spark 2.x and later you need to use crossJoin instead of join. – glisu, Aug 05 '20 at 16:18
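
As a rough sketch of what that comment describes, the same coalesce-and-join idea could look like this on Spark 2.1 or later, where `crossJoin` is available on `Dataset`. The object name `FillNullVectors`, the helper `fillNullVectors`, and the temporary column name `fill` are hypothetical and chosen only for illustration. The switch to `org.apache.spark.ml.linalg` is also an assumption about the setup and is not part of the original answer.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, coalesce, col}

object FillNullVectors {

  // Hypothetical helper: replaces nulls in `column` with `default`.
  // Assumes `df` has no existing column named "fill".
  def fillNullVectors(df: DataFrame, column: String, default: Vector): DataFrame = {
    val spark = df.sparkSession
    import spark.implicits._

    // One-row DataFrame that carries the replacement vector.
    val fill = Seq(Tuple1(default)).toDF("fill")

    df.crossJoin(broadcast(fill))                            // attach the fill value to every row
      .withColumn(column, coalesce(col(column), col("fill"))) // keep existing vectors, fill nulls
      .drop("fill")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fill-null-vectors")
      .getOrCreate()
    import spark.implicits._

    // Same data as in the question.
    val sv1 = Vectors.sparse(58, Array(8, 45), Array(1.0, 1.0))
    val df1 = Seq(("id_1", sv1)).toDF("id", "feature_vector")
    val df2 = Seq(("id_1", 10.0), ("id_2", 10.0)).toDF("id", "numeric_feature")
    val dfJoined = df1.join(df2, Seq("id"), "right")

    fillNullVectors(dfJoined, "feature_vector", sv1).show(false)

    spark.stop()
  }
}
```

Broadcasting the one-row `fill` DataFrame keeps the cross join cheap, since each row of the larger DataFrame is paired with exactly one fill row.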

score 0 · Answer 2 · answered Jun 07 '18 at 14:06

You could also fall back to RDDs here if you like:

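import org.apache.spark.sql.Row

// On Spark 1.6, DataFrame.map returns an RDD[Row].
val naFillRDD = df_joined.map {
  case Row(id, feature_vector: Vector, numeric_feature) => Row(id, feature_vector, numeric_feature)
  // A null vector fails the type pattern above, so it is replaced with sv1.
  case Row(id, _, numeric_feature) => Row(id, sv1, numeric_feature)
}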

And then switch back to a DataFrame:

val naFillDF = sqlContext.createDataFrame(naFillRDD, df_joined.schema)

naFillDF.show(false)
//+----+---------------------+---------------+
//|id  |feature_vector       |numeric_feature|
//+----+---------------------+---------------+
//|id_1|(58,[8,45],[1.0,1.0])|10.0           |
//|id_2|(58,[8,45],[1.0,1.0])|10.0           |
//+----+---------------------+---------------+
