
I have a Dataset that I wish to convert to a typed Dataset, where the type is a case class with Option for several parameters. For example, in the spark-shell I create a case class, an encoder, and a (raw) Dataset:

import org.apache.spark.sql.{Encoder, Encoders}

case class Analogue(id: Long, t1: Option[Double] = None, t2: Option[Double] = None)
val df = Seq((1, 34.0), (2, 3.4)).toDF("id", "t1")
implicit val analogueChannelEncoder: Encoder[Analogue] = Encoders.product[Analogue]

I want to create a Dataset[Analogue] from df, so I try:

df.as(analogueChannelEncoder)

But this results in the error:

org.apache.spark.sql.AnalysisException: cannot resolve '`t2`' given input columns: [id, t1];

Looking at the schemas of df and analogueChannelEncoder, the difference is apparent:

scala> df.schema
res3: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,false), StructField(t1,DoubleType,false))

scala> analogueChannelEncoder.schema
res4: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false), StructField(t1,DoubleType,true), StructField(t2,DoubleType,true))

I have seen this answer, but it will not work for me because my Dataset is assembled programmatically and is not a straightforward load from a data source.

How can I cast my untyped Dataset[Row] to Dataset[Analogue]?

D-Dᴙum

2 Answers


Your case class:

  case class Analogue(id: Long, t1: Option[Double] = None, t2: Option[Double] = None)

Your conversion code:

  import org.apache.spark.sql.{Dataset, Encoders, Row}
  import spark.implicits._

  val encoderSchema = Encoders.product[Analogue].schema
  val df1: Dataset[Row] = spark.createDataset(Seq((1, 34.0), (2, 3.4)))
    .map(x => Analogue(x._1, Option(x._2), None))
    .toDF("id", "t1", "t2")
  df1.show()

  df1.printSchema()
  encoderSchema.printTreeString()

Result:

+---+----+----+
| id|  t1|  t2|
+---+----+----+
|  1|34.0|null|
|  2| 3.4|null|
+---+----+----+

root
 |-- id: long (nullable = false)
 |-- t1: double (nullable = true)
 |-- t2: double (nullable = true)

root
 |-- id: long (nullable = false)
 |-- t1: double (nullable = true)
 |-- t2: double (nullable = true)

Update (more Option columns added to the case class):

I am assuming that your case class has many fields (for example, 5 fields). If the Option values are None, it works as below.

Below is the example:

  case class Analogue(id: Long, t1: Option[Double] = None, t2: Option[Double] = None, t3: Option[Double] = None, t4: Option[Double] = None, t5: Option[Double] = None)

  val encoderSchema = Encoders.product[Analogue].schema
  println(encoderSchema.toSeq)
  val df1 = spark.createDataset(Seq((1, 34.0), (2, 3.4)))
    .map(x => Analogue(x._1, Option(x._2)))
    .as[Analogue].toDF()
  df1.show()
  df1.printSchema()
  encoderSchema.printTreeString()

If you set only the fields that are present, the remaining fields are taken as None.

StructType(StructField(id,LongType,false), StructField(t1,DoubleType,true), StructField(t2,DoubleType,true), StructField(t3,DoubleType,true), StructField(t4,DoubleType,true), StructField(t5,DoubleType,true))
+---+----+----+----+----+----+
| id|  t1|  t2|  t3|  t4|  t5|
+---+----+----+----+----+----+
|  1|34.0|null|null|null|null|
|  2| 3.4|null|null|null|null|
+---+----+----+----+----+----+

root
 |-- id: long (nullable = false)
 |-- t1: double (nullable = true)
 |-- t2: double (nullable = true)
 |-- t3: double (nullable = true)
 |-- t4: double (nullable = true)
 |-- t5: double (nullable = true)

root
 |-- id: long (nullable = false)
 |-- t1: double (nullable = true)
 |-- t2: double (nullable = true)
 |-- t3: double (nullable = true)
 |-- t4: double (nullable = true)
 |-- t5: double (nullable = true)

If it is not working this way, please consider the broadcast idea from my comment and work further on it.

Ram Ghadiyaram
  • This is a nice answer; however, and this is my fault, in my example I didn't make it clear that it could be that t2 is specified while t1 is not. Your solution is unfortunately hard-coded for one variation of the available data (on the second line). In my actual problem I have 50 or more variables that are optional, so I need to avoid hard-coding combinations of data. – D-Dᴙum May 18 '20 at 21:46
  • I think you could broadcast `encoderSchema.toSeq` (which is the target schema), and then, based on position and value, you can replace the Option with None or whatever is relevant there. For this you need `val mytargetschema = spark.sparkContext.broadcast(encoderSchema.toSeq)`; you will get it within your map (shown above) as `mytargetschema.value`, and then you can play around with it without hard-coding. It needs a bit more hard work. :-) (A sketch of this idea appears after these comments.) – Ram Ghadiyaram May 18 '20 at 21:58
  • @RamGhadiyaram I get this error `java.lang.ClassCastException: $line9.$read$$iw$$iw$Analogue cannot be cast to $line9.$read$$iw$$iw$Analogue` in Spark 2.4. – stack0114106 May 20 '20 at 04:04
  • The above code is perfectly fine; I added it here after testing. Please check, you might have made some silly mistake. – Ram Ghadiyaram May 20 '20 at 04:20
  • @Kerry: was it useful? – Ram Ghadiyaram May 20 '20 at 04:21
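A minimal sketch of the broadcast idea from the comments, assuming the incoming data can be represented as (id, Map of present field names to values) pairs (a hypothetical input shape) and a SparkSession named spark; note that RowEncoder lives in Spark's internal catalyst package:

  import org.apache.spark.sql.{Encoders, Row}
  import org.apache.spark.sql.catalyst.encoders.RowEncoder
  import spark.implicits._

  // Broadcast the target field names so executors can line incoming
  // values up against them without hard-coding any combination.
  val schema = Encoders.product[Analogue].schema
  val optionalFields = spark.sparkContext.broadcast(schema.fieldNames.drop(1)) // t1, t2, ...

  val typed = spark.createDataset(Seq(
      (1L, Map("t1" -> 34.0)),
      (2L, Map("t2" -> 3.4))
    ))
    .map { case (id, values) =>
      // For each target field, take the value if present, otherwise null,
      // boxing the Double so that null is representable.
      Row.fromSeq(id +: optionalFields.value.toSeq.map(n => values.get(n).map(Double.box).orNull))
    }(RowEncoder(schema))
    .as[Analogue]

  typed.show()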

I have resolved the issue by inspecting the incoming Dataset[Row] for its columns and comparing them to the columns of Dataset[Analogue]. I use the resulting difference to append new columns to my Dataset[Row] before casting it to Dataset[Analogue].
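
A minimal sketch of that approach, assuming the Analogue case class and the df from the question (the null-literal cast is one way to append the missing columns):

  import org.apache.spark.sql.{Dataset, Encoders}
  import org.apache.spark.sql.functions.lit
  import spark.implicits._

  val targetSchema = Encoders.product[Analogue].schema

  // Append every column the target schema has but df lacks, as a null
  // literal cast to the target type, then cast to the case class.
  val ds: Dataset[Analogue] = targetSchema.fields
    .foldLeft(df) { (acc, field) =>
      if (acc.columns.contains(field.name)) acc
      else acc.withColumn(field.name, lit(null).cast(field.dataType))
    }
    .as[Analogue]

Here lit(null).cast(field.dataType) keeps the appended columns nullable, which matches the Option fields of the case class.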

D-Dᴙum