
So for something like this:

case class RandomClass(stringOne: String, stringTwo: String, numericOne: Int)
val ds = Seq(
  RandomClass("a", null, 1),
  RandomClass("a", "x", 3),
  RandomClass("a", "y", 4),
  RandomClass("a", null, 5)
).toDS()

ds.printSchema()

results in

root
 |-- stringOne: string (nullable = true)
 |-- stringTwo: string (nullable = true)
 |-- numericOne: integer (nullable = false)

Why would stringOne be nullable, when it contains no null values? Strangely, numericOne is inferred correctly. I assume I am just missing something about the relationship between the Dataset and DataFrame APIs?

asked by hiroprotagonist, edited by Jacek Laskowski

2 Answers


why would stringOne be nullable

Because a Scala String is just a Java String and, unlike a Scala Int, can be null. The actual content (the presence or absence of null values) simply doesn't matter: nullability is inferred from the type, not from the data.
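The distinction is visible in plain Scala, without Spark at all. A minimal sketch:

```scala
// String is an AnyRef (a plain java.lang.String), so it can hold null;
// Int is an AnyVal and cannot, which is why Spark infers nullable = false for it.
val s: String = null   // compiles fine: reference types admit null
// val i: Int = null   // does not compile: value types cannot be null

assert(s == null)
```

Since the encoder only sees the types in the case class, any String field gets nullable = true regardless of what values the data actually contains.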

See also spark why do columns change to nullable true

answered by zero323

It is true that Spark makes a best guess at nullability depending on whether the inferred type lies on the AnyRef or AnyVal side of the Scala type hierarchy, but note that it can be more complicated than that. For example, when reading Parquet files, every column is inferred to be nullable for compatibility purposes.

Meanwhile, when you create a schema yourself, you can set nullable explicitly for each field:

StructField(fieldName, LongType, nullable = true)

// or using a "DSL"
$"fieldName".long.copy(nullable = false)
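Putting the pieces together, a hedged sketch of supplying an explicit schema on read (assuming a running SparkSession named spark; the file path here is a placeholder, and note that Spark may still relax nullability for Parquet sources regardless of what the schema declares):

import org.apache.spark.sql.types._

// Declare nullability per field rather than relying on inference
val schema = StructType(Seq(
  StructField("stringOne", StringType, nullable = true),
  StructField("stringTwo", StringType, nullable = true),
  StructField("numericOne", IntegerType, nullable = false)
))

// "data.parquet" is a placeholder path for illustration
val df = spark.read.schema(schema).parquet("data.parquet")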
answered by Vidya, edited by Jacek Laskowski