
I am running Spark 2.3. I want to convert the column features in the following DataFrame from ArrayType to a DenseVector. I am using Spark in Java.

+---+--------------------+
| id|            features|
+---+--------------------+
|  0|[4.191401, -1.793...|
| 10|[-0.5674514, -1.3...|
| 20|[0.735613, -0.026...|
| 30|[-0.030161237, 0....|
| 40|[-0.038345724, -0...|
+---+--------------------+

root
 |-- id: integer (nullable = false)
 |-- features: array (nullable = true)
 |    |-- element: float (containsNull = false)

I have written the following UDF but it doesn't seem to be working:

private static UDF1 toVector = new UDF1<Float[], Vector>() {

    private static final long serialVersionUID = 1L;

    @Override
    public Vector call(Float[] t1) throws Exception {
        double[] DoubleArray = new double[t1.length];
        for (int i = 0; i < t1.length; i++) {
            DoubleArray[i] = (double) t1[i];
        }
        Vector vector = (org.apache.spark.mllib.linalg.Vector) Vectors.dense(DoubleArray);
        return vector;
    }
};

I want to extract the features column as a vector so that I can perform clustering on it.

I am also registering the UDF and then proceeding on to call it as follows:

spark.udf().register("toVector", (UserDefinedAggregateFunction) toVector);
df3 = df3.withColumn("featuresnew", callUDF("toVector", df3.col("features")));
df3.show();  

On running this snippet I am facing the following error:

ReadProcessData$1 cannot be cast to org.apache.spark.sql.expressions.UserDefinedAggregateFunction


1 Answer

The problem lies in how you are registering the UDF in Spark. You should not cast to UserDefinedAggregateFunction, which is not a UDF but a UDAF used for aggregations; that cast is exactly what produces the ClassCastException above. Instead, register the UDF together with its return type:

spark.udf().register("toVector", toVector, new VectorUDT());
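For reference, here are the imports this answer assumes. Since the question uses org.apache.spark.mllib.linalg.Vector, the mllib VectorUDT is shown; this is an assumption, and with the newer ml package you would pass org.apache.spark.ml.linalg.SQLDataTypes.VectorType() as the return type instead:

import java.util.List;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.linalg.VectorUDT;
import org.apache.spark.sql.api.java.UDF1;
import scala.collection.Seq;
import static org.apache.spark.sql.functions.callUDF;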

Then to use the registered function, use:

df3 = df3.withColumn("featuresnew", callUDF("toVector", df3.col("features")));

The UDF itself needs a slight adjustment as well: Spark passes an ArrayType column to a Java UDF as a Scala Seq, not as a Java array, so the parameter type should be Seq<Float>:

UDF1 toVector = new UDF1<Seq<Float>, Vector>() {

  @Override
  public Vector call(Seq<Float> t1) throws Exception {
    // Convert the Scala Seq to a Java List for indexed access.
    List<Float> floats = scala.collection.JavaConversions.seqAsJavaList(t1);

    // Widen each float to a double, since Vectors.dense expects double[].
    double[] values = new double[floats.size()];
    for (int i = 0; i < floats.size(); i++) {
      values[i] = floats.get(i);
    }
    return Vectors.dense(values);
  }
};
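Putting the pieces together, here is a minimal end-to-end sketch under the same assumptions (a SparkSession named spark and a Dataset<Row> named df3 with the schema shown in the question):

// Register the UDF with its return type, then apply it by name.
spark.udf().register("toVector", toVector, new VectorUDT());
Dataset<Row> df4 = df3.withColumn("featuresnew", callUDF("toVector", df3.col("features")));
df4.printSchema(); // featuresnew is now of type vector
df4.show(5);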

Note that in Spark 2.3+ you can create a Scala-style UDF that can be invoked directly, without registering it first. From this answer:

UserDefinedFunction toVector = udf(
  (Seq<Float> array) -> /* udf code or method to call */, new VectorUDT()
);

df3 = df3.withColumn("featuresnew", toVector.apply(col("features")));
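For illustration, a sketch of what the lambda body could look like, reusing the same Seq<Float>-to-double[] conversion as the UDF1 version above (assumes static imports of udf and col from org.apache.spark.sql.functions, plus org.apache.spark.sql.expressions.UserDefinedFunction):

UserDefinedFunction toVector = udf(
  (Seq<Float> array) -> {
    // Same conversion as before: Seq<Float> -> double[] -> dense vector.
    List<Float> floats = scala.collection.JavaConversions.seqAsJavaList(array);
    double[] values = new double[floats.size()];
    for (int i = 0; i < floats.size(); i++) {
      values[i] = floats.get(i);
    }
    return Vectors.dense(values);
  }, new VectorUDT()
);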
@BdEngineer: For machine learning in Spark, Vectors (`DenseVector`, `SparseVector`) are used for input instead of arrays. There could be other use cases as well. – Shaido Mar 13 '20 at 01:11