I have JSON data set that contains a price in a string like "USD 5.00". I'd like to convert the numeric portion to a Double to use in an MLLIB LabeledPoint, and have managed to split the price string into an array of string. The below creates a data set with the correct structure:
import org.apache.spark.mllib.linalg.{Vector,Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
case class Obs(f1: Double, f2: Double, price: Array[String])
val obs1 = new Obs(1,2,Array("USD", "5.00"))
val obs2 = new Obs(2,1,Array("USD", "3.00"))
val df = sc.parallelize(Seq(obs1,obs2)).toDF()
df.printSchema
df.show()
val labeled = df.map(row => LabeledPoint(row.get(2).asInstanceOf[Array[String]].apply(1).toDouble, Vectors.dense(row.getDouble(0), row.getDouble(1))))
labeled.take(2).foreach(println)
The output looks like:
df: org.apache.spark.sql.DataFrame = [f1: double, f2: double, price: array<string>]
root
|-- f1: double (nullable = false)
|-- f2: double (nullable = false)
|-- price: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---+-----------+
| f1| f2| price|
+---+---+-----------+
|1.0|2.0|[USD, 5.00]|
|2.0|1.0|[USD, 3.00]|
+---+---+-----------+
but then I wind up getting a ClassCastException:
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
I think the ClassCastException is due to the println. But I didn't expect it; how can I handle this situation?
The potential duplicate solved one part of my question (thanks), but the deeper question of "promoting elements of a struct in a dataframe remain". I'll let the mods determine if this is truly a dupe.