
I have a Spark job (in CDH 5.5.1) that loads two Avro files (both with the same schema), combines them into a single DataFrame (also with the same schema) and then writes the result back out to Avro.

The job explicitly compares the two input schemas to ensure they are the same.

This is used to combine existing data with a few updates (since the files are immutable). I then replace the original file with the new combined file by renaming it in HDFS.
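In outline, the job does something like this (a simplified sketch, not the exact code; the paths and app name are placeholders, and it assumes the Databricks spark-avro data source):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object AvroMerge {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("avro-merge"))
        val sqlContext = new SQLContext(sc)

        // Read both inputs through the Databricks Avro data source
        val existing = sqlContext.read
          .format("com.databricks.spark.avro").load("/data/current")
        val updates = sqlContext.read
          .format("com.databricks.spark.avro").load("/data/updates")

        // Explicit check that the two input schemas match
        require(existing.schema == updates.schema,
          s"schemas differ:\n${existing.schema}\n${updates.schema}")

        // Combine and write back out as Avro; the output is later
        // renamed over the original file in HDFS
        existing.unionAll(updates)
          .write.format("com.databricks.spark.avro")
          .save("/data/combined")
      }
    }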

However, if I repeat the update process (i.e. try to add some further updates to the previously updated file), the job fails because the schemas are now different! What is going on?


1 Answer


This is due to the behaviour of the spark-avro package.

When writing to Avro, spark-avro writes everything as unions of the given type along with a null option.

In other words, "string" becomes ["string", "null"], so every field becomes nullable.

If your input schema already contains only nullable fields, then this problem doesn't become apparent.
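You can make the effect visible with a simple round trip (a sketch; the paths are placeholders, and the input needs at least one non-nullable field for the difference to show up):

    // Read a file, write it straight back out, read the result,
    // then compare nullability field by field
    val original = sqlContext.read
      .format("com.databricks.spark.avro").load("/data/original")

    original.write
      .format("com.databricks.spark.avro").save("/data/roundtrip")

    val roundTrip = sqlContext.read
      .format("com.databricks.spark.avro").load("/data/roundtrip")

    // Every field comes back nullable after the round trip
    original.schema.fields.zip(roundTrip.schema.fields).foreach {
      case (before, after) =>
        println(s"${before.name}: nullable ${before.nullable} -> ${after.nullable}")
    }
    println(s"schemas equal: ${original.schema == roundTrip.schema}")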

This isn't mentioned on the spark-avro page, but is described as one of the limitations of spark-avro in some Cloudera documentation:

Because Spark is converting data types, watch for the following:

  • Enumerated types are erased - Avro enumerated types become strings when they are read into Spark because Spark does not support enumerated types.
  • Unions on output - Spark writes everything as unions of the given type along with a null option.
  • Avro schema changes - Spark reads everything into an internal representation. Even if you just read and then write the data, the schema for the output will be different.
  • Spark schema reordering - Spark reorders the elements in its schema when writing them to disk so that the elements being partitioned on are the last elements.
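If you still want a sanity check on the inputs, one possible workaround (a sketch, assuming you only care about field names and types, not the original nullability) is to normalise nullability before comparing:

    import org.apache.spark.sql.types.StructType

    // Hypothetical helper: compare schemas with nullability normalised
    // away, since spark-avro only changes the nullable flags.
    // (Top-level fields only; nested structs would need a recursive walk.)
    def normalise(schema: StructType): StructType =
      StructType(schema.fields.map(_.copy(nullable = true)))

    require(normalise(existing.schema) == normalise(updates.schema),
      "input schemas differ in more than nullability")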

See also this GitHub issue: spark-avro #92.
