
I'm using the great Databricks spark-avro connector to read and write Avro files. I have the following code:

df.write.mode(SaveMode.Overwrite).avro(someDirectory)

The problem is that when I try to read this directory using sqlContext.read.avro(someDirectory), it fails with

java.io.IOException: Not an Avro data file

due to the existence of the _SUCCESS file in that directory.
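
For context, here is roughly the full round trip as a self-contained sketch (assuming the com.databricks:spark-avro package is on the classpath; `df`, `sqlContext` and `someDirectory` come from the surrounding job):

    import org.apache.spark.sql.SaveMode
    import com.databricks.spark.avro._  // adds the .avro(...) methods to read/write

    df.write.mode(SaveMode.Overwrite).avro(someDirectory)

    // Fails with "java.io.IOException: Not an Avro data file" because the reader
    // also tries to open the non-Avro _SUCCESS marker in the output directory.
    val readBack = sqlContext.read.avro(someDirectory)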

Setting sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false") solves the issue, but I'd rather avoid doing it.
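
For completeness, this is what that workaround looks like in the job (a sketch; `sc` is the active SparkContext and the property name is the one quoted above):

    // Tell the Hadoop output committer not to write the _SUCCESS marker,
    // so the output directory contains only the Avro part files.
    sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

    df.write.mode(SaveMode.Overwrite).avro(someDirectory)
    val readBack = sqlContext.read.avro(someDirectory)  // succeeds once _SUCCESS is gone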

This sounds like a fairly generic problem, so maybe I'm doing something wrong?

Hagai
  • Ehm, why would you _"rather avoid"_ setting the configuration that solves the problem? The thing is, Spark creates the `_SUCCESS` file by default (and many users are very happy with this), so when you have a specific scenario where you do not want the file, it seems only fair to me that _you_ have to set the configuration to disable it. – Glennie Helles Sindholt Jul 26 '17 at 09:17
  • Just because it sounds like a hack to solve a more common problem. – Hagai Jul 26 '17 at 11:42
  • But it is not necessarily a _"common problem"_. Some people prefer it and some people don't - which is why the configuration option is nice :) – Glennie Helles Sindholt Jul 26 '17 at 11:52
  • What I'm trying to say is that it's not a matter of preference, because leaving the file simply won't work - anyone who reads/writes Avro files must have this setting, which sounds a bit odd. – Hagai Jul 26 '17 at 13:35

0 Answers