In my Spark Streaming application, if every record in a Kafka batch is missing a common field, the inferred DataFrame schema changes from batch to batch. I need a fixed DataFrame schema for downstream processing and transformation operations.
I have a pre-defined schema like this:
root
|-- PokeId: string (nullable = true)
|-- PokemonName: string (nullable = true)
|-- PokemonWeight: integer (nullable = false)
|-- PokemonType: string (nullable = true)
|-- PokemonEndurance: float (nullable = false)
|-- Attacks: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- AttackName: string (nullable = true)
| | |-- AttackImpact: long (nullable = true)
But for some streaming sessions I don't receive all the columns, and the input schema then looks like:
root
|-- PokeId: string (nullable = true)
|-- PokemonName: string (nullable = true)
|-- PokemonWeight: integer (nullable = false)
|-- PokemonEndurance: float (nullable = false)
I am defining my schema like this:
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("PokeId", StringType, true),
  StructField("PokemonName", StringType, true),
  StructField("PokemonWeight", IntegerType, false),
  StructField("PokemonType", StringType, true),
  StructField("PokemonEndurance", FloatType, false),
  StructField("Attacks", ArrayType(StructType(Array(
    StructField("AttackName", StringType),
    StructField("AttackImpact", LongType)
  ))))
))
Now, how do I add the missing columns (with null values) to the input DataFrame based on this schema?
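For reference, this is roughly the shape of helper I am hoping to end up with (just a sketch of my idea, conformToSchema is a name I made up, not an existing API): fold over the expected schema's top-level fields, add any absent column as a null literal cast to its type, and then select the columns in schema order so every batch has the same layout. I realize this would leave the added columns nullable even where my schema says nullable = false.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StructType

// Sketch: add any missing top-level columns as typed nulls,
// then select in schema order so the batch schema is stable.
def conformToSchema(df: DataFrame, schema: StructType): DataFrame = {
  val withMissing = schema.fields.foldLeft(df) { (acc, field) =>
    if (acc.columns.contains(field.name)) acc
    else acc.withColumn(field.name, lit(null).cast(field.dataType))
  }
  withMissing.select(schema.fields.map(f => col(f.name)): _*)
}

Is something like this the right approach for a streaming DataFrame, or is there a built-in way to enforce the schema?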
I have tried spark-daria for DataFrame validation, but it only reports the missing columns as part of a descriptive error message. How can I get the missing column names from it?
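If spark-daria doesn't expose the missing columns as a value, I assume I could compute them myself by diffing the expected field names against the batch DataFrame's columns, something like the line below, but I'd prefer to reuse the library's check if possible:

// Assumption: derive the missing column names by hand instead of
// parsing spark-daria's error message.
val missingColumns: Seq[String] = schema.fieldNames.filterNot(df.columns.contains)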