I have a Spark job (in CDH 5.5.1) that loads two Avro files (both with the same schema), combines them into a DataFrame (also with the same schema), then writes it back out to Avro.
The job explicitly compares the two input schemas to ensure they are the same.
This is used to combine existing data with a few updates (since the files are immutable). I then replace the original file with the new combined file by renaming it in HDFS. A sketch of the job follows.
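For reference, here is a minimal sketch of what the job does, assuming Scala and the Databricks spark-avro package on Spark 1.5 (as shipped with CDH 5.5.1); the paths, the `CombineAvro` object name, and the exact form of the schema check are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._

object CombineAvro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CombineAvro"))
    val sqlContext = new SQLContext(sc)

    // Load the existing data and the updates (paths are hypothetical).
    val existing = sqlContext.read.avro("hdfs:///data/existing.avro")
    val updates  = sqlContext.read.avro("hdfs:///data/updates.avro")

    // Explicitly compare the two input schemas before combining.
    require(existing.schema == updates.schema,
      s"Schema mismatch:\n${existing.schema.treeString}\n${updates.schema.treeString}")

    // Union the two DataFrames and write the result back out as Avro.
    existing.unionAll(updates)
      .write.avro("hdfs:///data/combined.avro")
  }
}
```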
However, if I repeat the update process (i.e. try to add some further updates to the previously updated file), the job fails because the schemas are now different! What is going on?