
I have data in Cassandra (3.11.2), which is also my df:

Data in Cassandra:

id | some_data  
-- | ---------  
1  | [{s1:"str11", s2:"str12"},{s1:"str13", s2:"str14"}]
2  | [{s1:"str21", s2:"str22"},{s1:"str23", s2:"str24"}]
3  | [{s1:"str31", s2:"str32"},{s1:"str33", s2:"str44"}]

df details:

 df.printSchema() 
    //|  |-- id: integer (nullable = true)
    //|  |-- some_data: array (nullable = true)
    //|  |    |-- element: struct (containsNull = true)
    //|  |    |    |-- s1: string (nullable = true)
    //|  |    |    |-- s2: string (nullable = true)

Here the Cassandra schema is defined as:

id        : int
some_data : list<frozen<test_udt>>, where the UDT is created as:
            CREATE TYPE test.test_udt ( s1 text, s2 text );
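For reference, a sketch of the full DDL implied by the description above (the keyspace `test` comes from the UDT definition; the table name `my_table` is an assumption, and `id` is typed `int` to match the integer shown by `printSchema()`):

```
CREATE TYPE test.test_udt (
  s1 text,
  s2 text
);

CREATE TABLE test.my_table (
  id int PRIMARY KEY,
  some_data list<frozen<test_udt>>
);
```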

I'm using spark-cassandra-connector 2.0 to pull data from Cassandra for processing on Spark 2.2.1.

Required Output

The output is the exploded form of df

id | some_data                                          | s1    | s2  
-- | ---------------------------------------------------| ----- | ---- 
1  | [{s1:"str11", s2:"str12"},{s1:"str13", s2:"str14"}]| str11 | str12
1  | [{s1:"str11", s2:"str12"},{s1:"str13", s2:"str14"}]| str13 | str14 
2  | [{s1:"str21", s2:"str22"},{s1:"str23", s2:"str24"}]| str21 | str22
2  | [{s1:"str21", s2:"str22"},{s1:"str23", s2:"str24"}]| str23 | str24
3  | [{s1:"str31", s2:"str32"},{s1:"str33", s2:"str44"}]| str31 | str32
3  | [{s1:"str31", s2:"str32"},{s1:"str33", s2:"str44"}]| str33 | str44

My approach in the past

I used spark-cassandra-connector 1.6 on Spark 1.6, where I had a neat solution to the above problem:

import org.apache.spark.sql.functions._    
case class my_data(s1 : String, s2 : String)

val flatData = df.explode(df("some_data")) {
  case Row(x: Seq[Row]) =>
    x.map(r =>
      my_data(
        r.apply(0).asInstanceOf[String],
        r.apply(1).asInstanceOf[String]
      ))
}
flatData.show()

After we upgraded to 2.x, I get an error when using the explode function. The Spark documentation says explode is deprecated, and flatMap is suggested as the alternative.

Questions:

  1. How do I explode the DataFrame in Scala to get the same results as before?
  2. How do I translate my old code using flatMap?
  • Try using `df.col("some_data").as[my_data]` – vindev Apr 25 '18 at 06:41
  • @vindev: Is this like converting to Dataset? I did this : `val b = df.col("some_data").as[my_data]` and I got an error : `Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. not enough arguments for method as: (implicit evidence$1: org.apache.spark.sql.Encoder[U])org.apache.spark.sql.TypedColumn[Any,U]. Unspecified value parameter evidence$1.` – underwood Apr 25 '18 at 14:16
  • As the error says, you need to import `spark.implicits._` – vindev Apr 26 '18 at 04:38
  • @vindev : I already have `spark.implicits._` in my code. My bad.. I should have mentioned it here. Despite adding it, I see the same error. The below solution worked for me. Thank you for replying to my question though :) – underwood Apr 27 '18 at 03:14

1 Answer

You can use the explode function, which is suggested as an alternative to the deprecated explode method. getItem is used to get a field from a struct by its name.

df.withColumn("exploded" , explode($"some_data"))
  .withColumn("s1" , $"exploded".getItem("s1"))
  .withColumn("s2" , $"exploded".getItem("s2"))
  .drop("exploded")
  .show(false)

//+---+------------------------------+-----+-----+
//|id |some_data                     |s1   |s2   |
//+---+------------------------------+-----+-----+
//|1  |[[str11,str12], [str13,str14]]|str11|str12|
//|1  |[[str11,str12], [str13,str14]]|str13|str14|
//|2  |[[str21,str22], [str23,str24]]|str21|str22|
//|2  |[[str21,str22], [str23,str24]]|str23|str24|
//|3  |[[str31,str32], [str33,str44]]|str31|str32|
//|3  |[[str31,str32], [str33,str44]]|str33|str44|
//+---+------------------------------+-----+-----+
  • This worked !!! Thank you. How do you reckon performance will be when using multiple `explode` in a statement vs using `flatmap`? – underwood Apr 25 '18 at 14:24
  • You can't use multiple `explode`s in a single statement. Performance should be equivalent, as the function is suggested as an alternative to the `flatMap` or `explode` methods. – philantrovert Apr 25 '18 at 14:52
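Since the question also asks how the old code translates to flatMap, here is a minimal Spark-free sketch of the same flattening logic on plain Scala collections (the names `MyData`, `Rec`, and `flatten` are illustrative, not from the original code): each record with n list elements becomes n output rows, repeating the id.

```scala
// MyData mirrors the UDT fields (s1, s2); Rec mirrors one df row.
case class MyData(s1: String, s2: String)
case class Rec(id: Int, someData: Seq[MyData])

object FlatMapSketch {
  // flatMap expands every list element into its own (id, s1, s2) tuple.
  def flatten(rows: Seq[Rec]): Seq[(Int, String, String)] =
    rows.flatMap(r => r.someData.map(d => (r.id, d.s1, d.s2)))
}

// Example: the first row of the sample data
val sample = Seq(Rec(1, Seq(MyData("str11", "str12"), MyData("str13", "str14"))))
FlatMapSketch.flatten(sample).foreach(println)
// (1,str11,str12)
// (1,str13,str14)
```

On Spark 2.x the same shape would be roughly `df.as[Rec].flatMap(r => r.someData.map(d => (r.id, d.s1, d.s2))).toDF("id", "s1", "s2")`, assuming `spark.implicits._` is in scope and the case-class field names match the DataFrame columns.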