
I have a JSON data set that contains a price in a string like "USD 5.00". I'd like to convert the numeric portion to a Double to use in an MLlib LabeledPoint, and have managed to split the price string into an array of strings.
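For reference, the split itself can be as simple as this (a minimal sketch; priceString is a stand-in name):

val priceString = "USD 5.00"
// "USD 5.00" -> Array("USD", "5.00")
val parts: Array[String] = priceString.split(" ")

The code below then creates a data set with the correct structure: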

import org.apache.spark.mllib.linalg.{Vector,Vectors}
import org.apache.spark.mllib.regression.LabeledPoint


case class Obs(f1: Double, f2: Double, price: Array[String])

val obs1 = Obs(1.0, 2.0, Array("USD", "5.00"))
val obs2 = Obs(2.0, 1.0, Array("USD", "3.00"))

val df = sc.parallelize(Seq(obs1,obs2)).toDF()
df.printSchema
df.show()

val labeled = df.map(row => LabeledPoint(
  row.get(2).asInstanceOf[Array[String]].apply(1).toDouble,
  Vectors.dense(row.getDouble(0), row.getDouble(1))))

labeled.take(2).foreach(println)

The output looks like:

df: org.apache.spark.sql.DataFrame = [f1: double, f2: double, price: array<string>]
root
 |-- f1: double (nullable = false)
 |-- f2: double (nullable = false)
 |-- price: array (nullable = true)
 |    |-- element: string (containsNull = true)

+---+---+-----------+
| f1| f2|      price|
+---+---+-----------+
|1.0|2.0|[USD, 5.00]|
|2.0|1.0|[USD, 3.00]|
+---+---+-----------+

but then I wind up getting a ClassCastException:

java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;

I suspect the ClassCastException only surfaces at the take(2)/println step because the map is evaluated lazily, but I didn't expect it at all. How can I handle this situation?

The potential duplicate solved one part of my question (thanks), but the deeper question of "promoting elements of a struct in a DataFrame" remains. I'll let the mods determine if this is truly a dupe.
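For what it's worth, promoting an array element to a top-level column can be done with the Column API (a sketch against the df above; promoted is my name for the result):

import org.apache.spark.sql.functions.col

// Pull the second array element up into its own double column
val promoted = df.select(
  col("f1"),
  col("f2"),
  col("price").getItem(1).cast("double").alias("price"))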


2 Answers


I think the problem is here:

.asInstanceOf[Array[String]]
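A quick check against the df above makes the runtime type visible (a small sketch; the class name matches the one in the exception):

// The array column comes back as a Scala WrappedArray, not Array[String]
println(df.first.get(2).getClass)
// class scala.collection.mutable.WrappedArray$ofRef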
  • can you clarify? A row consists of (Double, Double, Array[String]), so getting the third element as an Array[String] seems like the right thing to do. – schnee Oct 23 '15 at 12:18

Let me propose an alternative solution, which I believe is much cleaner than playing with asInstanceOf casts:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.Row

// Assemble the two feature columns into a single vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

val labeled = assembler.transform(df)
  .select($"price".getItem(1).cast("double"), $"features")
  .map { case Row(price: Double, features: Vector) =>
    LabeledPoint(price, features) }
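Run against the example df, this should print LabeledPoints along the lines of (a sketch; formatting per LabeledPoint.toString):

labeled.take(2).foreach(println)
// (5.0,[1.0,2.0])
// (3.0,[2.0,1.0])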

Regarding your problem: an ArrayType column is stored in a Row as a WrappedArray, hence the error you see. You can either use

import scala.collection.mutable.WrappedArray

row.getAs[WrappedArray[String]](2)

or simply

row.getAs[Seq[String]](2)
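Either way, the question's original map can then be written as, for example (a sketch reusing the question's imports):

val labeled = df.map { row =>
  LabeledPoint(
    row.getAs[Seq[String]](2)(1).toDouble,
    Vectors.dense(row.getDouble(0), row.getDouble(1)))
}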