2

I am using Spark Shell v_1.6.1.5.

I have the following Spark Scala Dataframe:

val data = sqlContext.read.avro("/my/location/*.avro")
data.printSchema
root
 |-- id: long (nullable = true)
 |-- stuff: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = false)
 |    |    |-- created: long (nullable = false)
 |    |    |-- lastModified: long (nullable = false)
 |    |    |-- . . .

What's the exact syntax to 'explode it' into the following flat format (discarding possible null values): [id, key, value] ?

Marsellus Wallace
  • 17,991
  • 25
  • 90
  • 154

1 Answers1

2

This can be accomplished using a udf and explode.

However we do not know the class of the values in the map because this information is inferred and is not available as an explicit class. In order to overcome this issue we can "shadow" the inferred class by making a case class with an identical class signature. Spark then treats these classes identically because both the inferred class and our shadow class are converted into an identical StructType

Here is an example (the case class value is a standin for the inferred class that we do not know).

scala> case class value(created: Long, lastModified: Long)
defined class value

scala> val myDF = Seq((1, Map("a" -> value(1L,2L), "b" -> value(3L,4L))), (2, Map("c" -> value(5L,6L), "d" -> value(6L,7L)))).toDF("id", "stuff")
myDF: org.apache.spark.sql.DataFrame = [id: int, stuff: map<string,struct<created:bigint,lastModified:bigint>>]

scala> myDF.show
+---+--------------------+
| id|               stuff|
+---+--------------------+
|  1|Map(a -> [1,2], b...|
|  2|Map(c -> [5,6], d...|
+---+--------------------+


scala> myDF.printSchema
root
 |-- id: integer (nullable = false)
 |-- stuff: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- created: long (nullable = false)
 |    |    |-- lastModified: long (nullable = false)


scala> case class shadowOfValue(created: Long, lastModified: Long)
defined class shadowOfValue

scala> val explodeUDF = udf( (map: Map[String, shadowOfValue]) => map.toVector)
explodeUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StructType(StructField(_1,StringType,true), StructField(_2,StructType(StructField(created,LongType,false), StructField(lastModified,LongType,false)),true)),true),Some(List(MapType(StringType,StructType(StructField(created,LongType,false), StructField(lastModified,LongType,false)),true))))

scala> var newDF = myDF.withColumn("TMP", explode(explodeUDF($"stuff"))).drop("stuff")
newDF: org.apache.spark.sql.DataFrame = [id: int, TMP: struct<_1: string, _2: struct<created: bigint, lastModified: bigint>>]

scala> newDF = newDF.withColumn("key", $"TMP".apply("_1")).withColumn("value", $"TMP".apply("_2"))
newDF: org.apache.spark.sql.DataFrame = [id: int, TMP: struct<_1: string, _2: struct<created: bigint, lastModified: bigint>> ... 2 more fields]

scala> newDF = newDF.drop("TMP")
newDF: org.apache.spark.sql.DataFrame = [id: int, key: string ... 1 more field]

scala> newDF.show
+---+---+-----+
| id|key|value|
+---+---+-----+
|  1|  a|[1,2]|
|  1|  b|[3,4]|
|  2|  c|[5,6]|
|  2|  d|[6,7]|
+---+---+-----+


scala> newDF.printSchema
root
 |-- id: integer (nullable = false)
 |-- key: string (nullable = true)
 |-- value: struct (nullable = true)
 |    |-- created: long (nullable = false)
 |    |-- lastModified: long (nullable = false)
evan.oman
  • 5,922
  • 22
  • 43
  • thanks! I updated my question while you were typing your answer. I have a Map[string, struct], what should I use instead of String? – Marsellus Wallace Nov 15 '16 at 20:44
  • Hmm, just had to make it harder :P Let me think about it – evan.oman Nov 15 '16 at 20:46
  • @Gevorg The custom struct adds a kink. Is it defined anywhere in your code? I need to have the value type in my `udf`. – evan.oman Nov 15 '16 at 21:00
  • 1
    It is not defined explicitely. I believe that it is derived automatically from the avro schema of the input files. – Marsellus Wallace Nov 15 '16 at 21:04
  • @Gevorg yeah I was hoping it wouldn't be derived – evan.oman Nov 15 '16 at 21:05
  • In the Java generated Avro class it's a custom Object (think of a generic class Person for instance) – Marsellus Wallace Nov 15 '16 at 21:09
  • @Gevorg Yeah I need to specify the type in the parameter list of the UDF (`Map[String, NEEDTHISTYPE]`). If the class were defined locally (like a Person class), I could plug that in but since I can't pull the actual type from the schema I don't think that a UDF will work here – evan.oman Nov 15 '16 at 21:14
  • Is there anything that can be done if we imagine that the custom class contains only 2 fields 'created', and 'lastModified'? – Marsellus Wallace Nov 15 '16 at 21:17
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/128206/discussion-between-evan058-and-gevorg). – evan.oman Nov 15 '16 at 21:20