I am running Spark 2.3 in Java. I want to convert the features column in the following DataFrame from ArrayType to a DenseVector.
+---+--------------------+
| id| features|
+---+--------------------+
| 0|[4.191401, -1.793...|
| 10|[-0.5674514, -1.3...|
| 20|[0.735613, -0.026...|
| 30|[-0.030161237, 0....|
| 40|[-0.038345724, -0...|
+---+--------------------+
root
|-- id: integer (nullable = false)
|-- features: array (nullable = true)
| |-- element: float (containsNull = false)
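In case a reproducible example helps: a DataFrame with the same schema can be built like this (the sample values here are made up, and spark is an existing SparkSession):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Schema matching the printSchema output above:
// id: integer (not nullable), features: array<float>
StructType schema = new StructType()
        .add("id", DataTypes.IntegerType, false)
        .add("features", DataTypes.createArrayType(DataTypes.FloatType, false), true);

// Made-up sample rows, just for reproduction
List<Row> rows = Arrays.asList(
        RowFactory.create(0, Arrays.asList(4.191401f, -1.793f)),
        RowFactory.create(10, Arrays.asList(-0.5674514f, -1.3f)));

Dataset<Row> df3 = spark.createDataFrame(rows, schema);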
I have written the following UDF, but it doesn't seem to be working:
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.sql.api.java.UDF1;

private static UDF1<Float[], Vector> toVector = new UDF1<Float[], Vector>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Vector call(Float[] t1) throws Exception {
        // Copy the boxed floats into a primitive double array
        double[] doubleArray = new double[t1.length];
        for (int i = 0; i < t1.length; i++) {
            doubleArray[i] = (double) t1[i];
        }
        // Note: this builds an mllib (not ml) dense vector
        return Vectors.dense(doubleArray);
    }
};
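I suspect two things might also be off in the UDF itself: Spark may hand an ArrayType column to a Java UDF as a scala.collection.Seq (a WrappedArray) rather than a Float[], and the DataFrame-based clustering APIs expect the newer org.apache.spark.ml.linalg.Vector rather than the mllib one. This is the variant I sketched under those assumptions (untested; the toVectorMl name is just mine):

import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.api.java.UDF1;
import scala.collection.Seq;

// Assumes the ArrayType column arrives as a Scala Seq (WrappedArray)
// and builds an ml (not mllib) dense vector from it.
private static UDF1<Seq<Float>, Vector> toVectorMl = new UDF1<Seq<Float>, Vector>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Vector call(Seq<Float> t1) throws Exception {
        double[] values = new double[t1.size()];
        for (int i = 0; i < t1.size(); i++) {
            values[i] = t1.apply(i); // unbox Float and widen to double
        }
        return Vectors.dense(values);
    }
};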
I want to extract the features column as a vector so that I can run clustering on it. I register the UDF and then call it as follows:
spark.udf().register("toVector", (UserDefinedAggregateFunction) toVector);
df3 = df3.withColumn("featuresnew", callUDF("toVector", df3.col("features")));
df3.show();
When I run this snippet, I get the following error:
ReadProcessData$1 cannot be cast to org.apache.spark.sql.expressions.UserDefinedAggregateFunction
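Reading the trace, toVector is a UDF1, not a UserDefinedAggregateFunction, so I suspect the cast in register() is what fails at runtime. Would registering it as a plain UDF with an explicit return type, something like the sketch below, be the right approach? (Untested; using SQLDataTypes.VectorType() as the return type, together with the toVectorMl variant above, is my assumption.)

import org.apache.spark.ml.linalg.SQLDataTypes;
import static org.apache.spark.sql.functions.callUDF;

// Register the UDF1 directly, giving Spark the return DataType
// instead of casting to a UserDefinedAggregateFunction
spark.udf().register("toVector", toVectorMl, SQLDataTypes.VectorType());

df3 = df3.withColumn("featuresnew", callUDF("toVector", df3.col("features")));
df3.show();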