
So for something like this:

case class RandomClass(stringOne: String, stringTwo: String, numericOne: Int)
val ds = Seq(
  RandomClass("a", null, 1),
  RandomClass("a", "x", 3),
  RandomClass("a", "y", 4),
  RandomClass("a", null, 5)
).toDS()

ds.printSchema()

results in

root
 |-- stringOne: string (nullable = true)
 |-- stringTwo: string (nullable = true)
 |-- numericOne: integer (nullable = false)

Why would stringOne be nullable, when it contains no null values? Strangely, numericOne is inferred correctly. I assume I am just missing something about the relationship between the Dataset and DataFrame APIs?

asked by hiroprotagonist, edited by Jacek Laskowski

2 Answers


why would stringOne be nullable

Because a Scala String is just a Java String and, unlike a Scala Int, can be null. The actual content (the presence or absence of null values) simply doesn't matter: nullability is inferred from the type, not from the data.
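The distinction is visible in plain Scala, without Spark at all. A minimal sketch:

```scala
// String is an AnyRef (a plain java.lang.String), so it can hold null;
// Int is an AnyVal and cannot, which is why Spark infers nullable = false for it.
val s: String = null   // compiles fine: reference types admit null
// val i: Int = null   // does not compile: value types cannot be null

assert(s == null)
```

Since the encoder only sees the types in the case class, any String field gets nullable = true regardless of what values the data actually contains.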

See also spark why do columns change to nullable true

answered by zero323

It is true that Spark makes a best guess at nullability depending on whether the inferred type lies on the AnyRef or AnyVal side of the Scala type hierarchy, but note that it can be more complicated than that. For example, when reading Parquet files, every column is inferred to be nullable for compatibility purposes.

Meanwhile, when you create a schema yourself, you can set nullable explicitly for each field:

StructField(fieldName, LongType, nullable = true)

// or using a "DSL"
$"fieldName".long.copy(nullable = false)
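Putting the pieces together, a hedged sketch of supplying an explicit schema on read (assuming a running SparkSession named spark; the file path here is a placeholder, and note that Spark may still relax nullability for Parquet sources regardless of what the schema declares):

import org.apache.spark.sql.types._

// Declare nullability per field rather than relying on inference
val schema = StructType(Seq(
  StructField("stringOne", StringType, nullable = true),
  StructField("stringTwo", StringType, nullable = true),
  StructField("numericOne", IntegerType, nullable = false)
))

// "data.parquet" is a placeholder path for illustration
val df = spark.read.schema(schema).parquet("data.parquet")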
answered by Vidya, edited by Jacek Laskowski