
I am trying to flatten the schema of an existing DataFrame with nested fields. The structure of my DataFrame is something like this:

root  
|-- Id: long (nullable = true)  
|-- Type: string (nullable = true)  
|-- Uri: string (nullable = true)    
|-- Type: array (nullable = true)  
|    |-- element: string (containsNull = true)  
|-- Gender: array (nullable = true)  
|    |-- element: string (containsNull = true)

Type and Gender can contain an array of elements, a single element, or a null value. I tried to use the following code:

var resDf = df.withColumn("FlatType", explode(df("Type")))

But as a result, the resulting DataFrame loses the rows where the Type column is null. For example, if I have 10 rows and Type is null in 7 of them, after the explode the resulting DataFrame has only three rows.
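To make the behavior easy to reproduce, here is a minimal sketch with made-up data and a local SparkSession (the column names mirror my schema; Id 2 has a null Type array):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().master("local[*]").appName("explodeDropsNulls").getOrCreate()
import spark.implicits._

// Made-up sample mirroring the problem: Id 2 has a null Type array
val df = Seq(
  (1L, Seq("a", "b")),
  (2L, null.asInstanceOf[Seq[String]]),
  (3L, Seq("c"))
).toDF("Id", "Type")

// explode silently drops the row where Type is null: Id 2 disappears
val resDf = df.withColumn("FlatType", explode($"Type"))
```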

How can I keep rows with null values but explode array of values?

I found a workaround, but I am still stuck in one place. For standard types we can do the following:

def customExplode(df: DataFrame, field: String, colType: String): org.apache.spark.sql.Column = {
  colType.toLowerCase() match {
    case "string" =>
      // replace a null array with Seq(null) so explode keeps the row
      val avoidNull = udf((xs: Seq[String]) =>
        if (xs == null) Seq[String](null) else xs)
      explode(avoidNull(df(field)))
    case "boolean" =>
      // replace a null array with an empty array
      val avoidNull = udf((xs: Seq[Boolean]) =>
        if (xs == null) Seq[Boolean]() else xs)
      explode(avoidNull(df(field)))
    case _ =>
      explode(df(field))
  }
}

And after that just use it like this:

val explodedField = customExplode(resultDf, fieldName, fieldTypeMap(field))
resultDf = resultDf.withColumn(newName, explodedField)

However, I have a problem with struct types, for the following kind of structure:

 |-- Address: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- AddressType: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true) 
 |    |    |-- DEA: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- Number: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |-- ExpirationDate: array (nullable = true)
 |    |    |    |    |    |-- element: timestamp (containsNull = true)
 |    |    |    |    |-- Status: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)

How can we process that kind of schema when DEA is null?

Thank you in advance.

P.S. I tried to use Lateral views but result is the same.

Artem

2 Answers


Maybe you can try using when:

val resDf = df.withColumn("FlatType", when(df("Type").isNotNull, explode(df("Type"))))

As shown in the when function's documentation, the value null is inserted for the values that do not match the conditions.

Daniel de Paula
  • Sorry, but when I try this solution I have the following exception: java.lang.UnsupportedOperationException. If I replace explode(df("Type")) with just some value - it works fine. I suppose when function does not support exploded column as value – Artem Sep 11 '16 at 22:00
  • @Artem, you are correct, I'm sorry. Is a `union` an option for you? You could do `df.where($"Type".isNull).withColumn("FlatType", lit(null)).unionAll(df.withColumn("FlatType", explode($"Type")))` – Daniel de Paula Sep 12 '16 at 08:21
  • yes, thank you, I thought about this option but I am building a generic algorithm for flattening schema and I am afraid that union may be really slow. I hope to find a better solution but union is a backup option for me. – Artem Sep 12 '16 at 11:21
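Spelling out the union workaround from the comment above as a self-contained sketch (made-up data, local SparkSession; I use `union`, since `unionAll` is deprecated in Spark 2.x and behaves the same here):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, lit}

val spark = SparkSession.builder().master("local[*]").appName("unionWorkaround").getOrCreate()
import spark.implicits._

// Made-up sample: Ids 2 and 3 have a null Type array
val df = Seq(
  (1L, Seq("a", "b")),
  (2L, null.asInstanceOf[Seq[String]]),
  (3L, null.asInstanceOf[Seq[String]])
).toDF("Id", "Type")

// Null rows get a literal null FlatType; the remaining rows are exploded
// (explode drops the null rows from the second branch on its own).
val result = df.where($"Type".isNull)
  .withColumn("FlatType", lit(null).cast("string"))
  .union(df.withColumn("FlatType", explode($"Type")))
```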

I think what you want is explode_outer instead of explode.

See the Apache docs: explode and explode_outer.
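A quick sketch of the difference, with made-up data and a local SparkSession (explode_outer is available since Spark 2.2):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode_outer

val spark = SparkSession.builder().master("local[*]").appName("explodeOuter").getOrCreate()
import spark.implicits._

// Made-up sample: Id 2 has a null Type array
val df = Seq(
  (1L, Seq("a", "b")),
  (2L, null.asInstanceOf[Seq[String]])
).toDF("Id", "Type")

// Unlike explode, explode_outer keeps the row for a null or empty
// array, emitting a single row with a null FlatType
val flat = df.withColumn("FlatType", explode_outer($"Type"))
```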

Hugo