In my Spark Streaming application, if every record in a Kafka batch is missing a common field, the inferred DataFrame schema changes from batch to batch. I need a fixed DataFrame schema for downstream processing and transformation operations.
I have a pre-defined schema like this:
root
|-- PokeId: string (nullable = true)
|-- PokemonName: string (nullable = true)
|-- PokemonWeight: integer (nullable = false)
|-- PokemonType: string (nullable = true)
|-- PokemonEndurance: float (nullable = false)
|-- Attacks: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- AttackName: string (nullable = true)
| | |-- AttackImpact: long (nullable = true)
But for some streaming sessions I don't receive all the columns, and the input schema then looks like:
root
|-- PokeId: string (nullable = true)
|-- PokemonName: string (nullable = true)
|-- PokemonWeight: integer (nullable = false)
|-- PokemonEndurance: float (nullable = false)
I am defining my schema like this:
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("PokeId", StringType, true),
  StructField("PokemonName", StringType, true),
  StructField("PokemonWeight", IntegerType, false),
  StructField("PokemonType", StringType, true),
  StructField("PokemonEndurance", FloatType, false),
  StructField("Attacks", ArrayType(StructType(Array(
    StructField("AttackName", StringType),
    StructField("AttackImpact", LongType)
  ))))
))
Now, how do I add the missing columns (with null values) to the input DataFrame based on this schema?
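For reference, this is roughly the shape of helper I am hoping to end up with (just a sketch of my idea, conformToSchema is a name I made up, not an existing API): fold over the expected schema's top-level fields, add any absent column as a null literal cast to its type, and then select the columns in schema order so every batch has the same layout. I realize this would leave the added columns nullable even where my schema says nullable = false.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StructType

// Sketch: add any missing top-level columns as typed nulls,
// then select in schema order so the batch schema is stable.
def conformToSchema(df: DataFrame, schema: StructType): DataFrame = {
  val withMissing = schema.fields.foldLeft(df) { (acc, field) =>
    if (acc.columns.contains(field.name)) acc
    else acc.withColumn(field.name, lit(null).cast(field.dataType))
  }
  withMissing.select(schema.fields.map(f => col(f.name)): _*)
}

Is something like this the right approach for a streaming DataFrame, or is there a built-in way to enforce the schema?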
I have tried spark-daria for DataFrame validation, but it only reports the missing columns as part of a descriptive error message. How can I get the missing column names from it?
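If spark-daria doesn't expose the missing columns as a value, I assume I could compute them myself by diffing the expected field names against the batch DataFrame's columns, something like the line below, but I'd prefer to reuse the library's check if possible:

// Assumption: derive the missing column names by hand instead of
// parsing spark-daria's error message.
val missingColumns: Seq[String] = schema.fieldNames.filterNot(df.columns.contains)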