I have data in a JavaPairRDD of the form

JavaPairRDD<String, Tuple2<String, String>>

I tried the following code:

 Encoder<Tuple2<String, Tuple2<String, String>>> encoder2 =
     Encoders.tuple(Encoders.STRING(), Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
 Dataset<Row> userViolationsDetails =
     spark.createDataset(JavaPairRDD.toRDD(MY_RDD), encoder2).toDF("value1", "value2");

But how do I generate a Dataset with three columns? The code above gives me the data in only two columns. Any pointers or suggestions?

Jack

1 Answer

Try running printSchema: you will see that value2 is a complex type (a struct).

With that information, you can write:

Dataset<Row> uvd = userViolationsDetails.selectExpr("value1", "value2._1 as value2", "value2._2 as value3");

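value2._1 refers to the first element of the tuple stored in the current value2 field, and value2._2 to the second. The value2 column is overwritten so that it holds a single value, and the second element goes into the new value3 column.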

Note that this will only work once https://issues.apache.org/jira/browse/SPARK-24548 is merged into the master branch. Currently there is a bug in Spark: the tuple is converted to a struct whose two fields are both named value.
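One way to avoid depending on the struct field names at all is to flatten the nested tuple yourself and build the DataFrame from rows with an explicit schema. The following is only a sketch under assumptions: the class name FlattenPairRdd, the helper toThreeColumns and the sample data are hypothetical and not from the question, and it assumes Spark SQL is on the classpath and that all three values are strings.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import scala.Tuple2;

public class FlattenPairRdd {

    // Hypothetical helper: turns (k, (a, b)) pairs into a DataFrame
    // with three flat string columns named value1, value2 and value3.
    static Dataset<Row> toThreeColumns(SparkSession spark,
                                       JavaPairRDD<String, Tuple2<String, String>> pairs) {
        // Unpack the nested tuple so that each Row holds three plain values.
        JavaRDD<Row> rows =
            pairs.map(t -> RowFactory.create(t._1(), t._2()._1(), t._2()._2()));

        // Explicit schema, so no column name is derived from tuple fields.
        StructType schema = DataTypes.createStructType(Arrays.asList(
            DataTypes.createStructField("value1", DataTypes.StringType, true),
            DataTypes.createStructField("value2", DataTypes.StringType, true),
            DataTypes.createStructField("value3", DataTypes.StringType, true)));

        return spark.createDataFrame(rows, schema);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .appName("flatten-pair-rdd")
            .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Made-up sample data standing in for MY_RDD from the question.
        List<Tuple2<String, Tuple2<String, String>>> data = Arrays.asList(
            new Tuple2<>("user1", new Tuple2<>("ruleA", "high")),
            new Tuple2<>("user2", new Tuple2<>("ruleB", "low")));
        JavaPairRDD<String, Tuple2<String, String>> pairs = jsc.parallelizePairs(data);

        Dataset<Row> df = toThreeColumns(spark, pairs);
        df.printSchema();
        df.show();

        spark.stop();
    }
}
```

Because the schema is written out by hand, this approach should not be affected by how the tuple encoder names struct fields, at the cost of going through untyped Row objects instead of an Encoder.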

T. Gawęda
  • Thanks for your reply, but when I printed the schema I got this: `root |-- value1: string (nullable = true) |-- value2: struct (nullable = true) | |-- value: string (nullable = true) | |-- value: string (nullable = true)`. Here both fields of value2 are named value, and when I ran [Dataset].selectExpr("value1", "value2._1 as value2").show(); I got the exception "No such struct field _1 in value, value; line 1 pos 0". – Jack Jun 13 '18 at 11:06
  • @Jack Thanks for the response. Probably some typo, I'm investigating it – T. Gawęda Jun 13 '18 at 11:08