
Take the following example dataframe:

val df = Seq(Seq("xxx")).toDF("a")

Schema:

root
 |-- a: array (nullable = true)
 |    |-- element: string (containsNull = true)

How can I modify df in-place so that the resulting dataframe is not nullable anywhere, i.e. has the following schema:

root
 |-- a: array (nullable = false)
 |    |-- element: string (containsNull = false)

I understand that I can re-create another dataframe with a non-nullable schema enforced, as in [Change nullable property of column in spark dataframe](https://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe):

spark.createDataFrame(df.rdd, StructType(StructField("a", ArrayType(StringType, false), false) :: Nil))

But this is not an option under structured streaming, so I want it to be some kind of in-place modification.

Naitree
  • Does this answer your question? [Change nullable property of column in spark dataframe](https://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe) – Lamanus Aug 21 '20 at 10:53
  • So what do you want to happen when you try to convert an array with a null element to a DataFrame? – kfkhalili Aug 21 '20 at 13:23
  • @Lamanus If I understand it correctly, the answers under that question do not address my situation. As I mentioned in the question description, `createDataFrame` is not possible in structured streaming. Are you suggesting re-creating the dataframe in a `foreachBatch` sink for each micro-batch dataframe? – Naitree Aug 22 '20 at 11:16
  • @kfkhalili I can make sure all null elements have already been filtered out in previous stages of the dataframe transformation. – Naitree Aug 22 '20 at 11:17

1 Answer


The way to achieve this is with a `UserDefinedFunction`:

// Problem setup
val df = Seq(Seq("xxx")).toDF("a")

df.printSchema
root
 |-- a: array (nullable = true)
 |    |-- element: string (containsNull = true)

Onto the solution:

import org.apache.spark.sql.types.{ArrayType, StringType}
import org.apache.spark.sql.functions.{udf, col}

// We define a sub schema with the appropriate data type and null condition
val subSchema = ArrayType(StringType, containsNull = false)

// We create a UDF that applies this sub schema
// while specifying the output of the UDF to be non-nullable
val applyNonNullableSchemaUdf = udf((x: Seq[String]) => x, subSchema).asNonNullable

// We apply the UDF
val newSchemaDF = df.withColumn("a", applyNonNullableSchemaUdf(col("a")))

And there you have it.

// Check new schema
newSchemaDF.printSchema
root
 |-- a: array (nullable = false)
 |    |-- element: string (containsNull = false)

// Check that it actually works
newSchemaDF.show
+-----+
|    a|
+-----+
|[xxx]|
+-----+
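Since the question's motivation was structured streaming, here is a sketch of how the same UDF could be dropped into a streaming query. This is an untested illustration: the `rate` source and `console` sink are stand-ins for whatever source and sink you actually use, and it assumes a running `SparkSession`. The key point is that the UDF is an ordinary column transformation, so unlike `spark.createDataFrame(df.rdd, ...)` it applies per micro-batch with no extra work.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, udf}
import org.apache.spark.sql.types.{ArrayType, StringType}

val spark = SparkSession.builder.getOrCreate()

// Same non-nullable sub schema and UDF as above
val subSchema = ArrayType(StringType, containsNull = false)
val applyNonNullableSchemaUdf = udf((x: Seq[String]) => x, subSchema).asNonNullable

// Hypothetical streaming source producing an array<string> column "a"
val streamingDF = spark.readStream
  .format("rate") // placeholder source; substitute your own
  .load()
  .select(array(col("value").cast("string")).as("a"))

// The UDF is applied exactly as in the batch case
val query = streamingDF
  .withColumn("a", applyNonNullableSchemaUdf(col("a")))
  .writeStream
  .format("console") // placeholder sink
  .start()
```

Note that the two-argument `udf(f, dataType)` overload is deprecated in Spark 3.x in favour of typed UDFs, so depending on your Spark version you may need to adjust this call.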
kfkhalili