JavaPairRDD to Dataset in SPARK

Question

I have data in a JavaPairRDD of the following type:

JavaPairRDD<String, Tuple2<String, String>>

I tried the code below:

 Encoder<Tuple2<String, Tuple2<String,String>>> encoder2 =
 Encoders.tuple(Encoders.STRING(), Encoders.tuple(Encoders.STRING(),Encoders.STRING()));
 Dataset<Row> userViolationsDetails = spark.createDataset(JavaPairRDD.toRDD(MY_RDD),encoder2).toDF("value1","value2");

But how do I generate a Dataset with 3 columns? The code above gives me the data in 2 columns. Any pointers or suggestions?

What do you want, to flatten the tuple? Try `toDF("value1","value2","value3")` – jojo_Berlin, Jun 13 '18 at 10:13
It seems that it's a bug: the values in the tuple should have distinct field names. Feel free to create a Jira ticket – T. Gawęda, Jun 13 '18 at 11:21
Thanks @T.Gawęda for your reply. I have created a Jira ticket, https://issues.apache.org/jira/browse/SPARK-24548; let's see when someone picks it up – Jack, Jun 13 '18 at 11:58

T. Gawęda · Answer 1 · 2018-06-15T11:41:55.880


Run printSchema and you will see that value2 is a complex type (a struct).

With that information, you can write:

Dataset<Row> uvd = userViolationsDetails.selectExpr("value1", "value2._1 as value2", "value2._2 as value3");

value2._1 refers to the first element of the tuple inside the current "value2" field. We overwrite the value2 field so that it holds a single value.

Note that this will only work once https://issues.apache.org/jira/browse/SPARK-24548 is merged into the master branch. Currently there is a bug in Spark: the nested tuple is converted to a struct with two fields that are both named value.
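Until that fix is available, one possible workaround is to flatten the nested tuple on the RDD side and build the DataFrame from Row objects with an explicit schema, which should not depend on how the tuple encoder names nested fields. The following is only a sketch under assumptions: the class name FlattenPairRdd, the helper flatten, and the sample data are hypothetical and stand in for your own MY_RDD; the column names are the ones used in the question.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import scala.Tuple2;

public class FlattenPairRdd {

    // Hypothetical helper: turns (k, (a, b)) pairs into rows with three string columns.
    static Dataset<Row> flatten(SparkSession spark,
                                JavaPairRDD<String, Tuple2<String, String>> pairs) {
        // Pull the two inner tuple elements up next to the key.
        JavaRDD<Row> rows = pairs.map(
                t -> RowFactory.create(t._1(), t._2()._1(), t._2()._2()));

        // Explicit schema, so no field names are derived from a tuple encoder.
        StructType schema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("value1", DataTypes.StringType, true),
                DataTypes.createStructField("value2", DataTypes.StringType, true),
                DataTypes.createStructField("value3", DataTypes.StringType, true)
        });

        return spark.createDataFrame(rows, schema);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("FlattenPairRdd")
                .master("local[*]")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Made-up sample data standing in for MY_RDD from the question.
        List<Tuple2<String, Tuple2<String, String>>> sample = Arrays.asList(
                new Tuple2<>("a", new Tuple2<>("b", "c")),
                new Tuple2<>("d", new Tuple2<>("e", "f")));
        JavaPairRDD<String, Tuple2<String, String>> pairs = jsc.parallelizePairs(sample);

        Dataset<Row> flat = flatten(spark, pairs);
        flat.printSchema();
        flat.show();

        spark.stop();
    }
}
```

If you would rather stay with the encoder-based code from the question, the selectExpr call above remains the shorter route once the Jira issue is resolved.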

edited Jun 15 '18 at 11:41

answered Jun 13 '18 at 10:20

T. Gawęda


Thanks for your reply, but when I printed the schema I got the following:

 root
  |-- value1: string (nullable = true)
  |-- value2: struct (nullable = true)
  |    |-- value: string (nullable = true)
  |    |-- value: string (nullable = true)

Here both fields inside value2 have the same name, value. When I ran the select [Dataset].selectExpr("value1", "value2._1 as value2").show(); I got the exception "Exception :: No such struct field _1 in value, value; line 1 pos 0". – Jack Jun 13 '18 at 11:06
@Jack Thanks for the response. Probably some typo, I'm investigating it – T. Gawęda Jun 13 '18 at 11:08
