I'm opening a bunch of files (around 50) on HDFS like this:
import com.databricks.spark.avro._  // provides sqlContext.read.avro

val PATH = "path_to_files"
val FILE_PATH = PATH + "nt_uuid_2016-03-01.*1*.avro"  // glob matching ~50 files
val df = sqlContext.read.avro(FILE_PATH)
I then do a bunch of operations with df, and at some point I get:
java.io.IOException: Not an Avro data file
at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
at org.apache.avro.mapred.AvroRecordReader.<init>(AvroRecordReader.java:41)
at org.apache.avro.mapred.AvroInputFormat.getRecordReader(AvroInputFormat.java:71)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
I suspect there is an issue with one of the files, but I don't know which one. If I run the job with just one of the files, it finishes correctly.
Is there a way to catch the exception and figure out which is the bad apple?
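
For reference, this is the kind of per-file check I have in mind: expand the same glob myself with the Hadoop FileSystem API and try to read each matched file on its own, catching the failure per file. It's a rough, untested sketch that assumes the sc/sqlContext available in spark-shell and the PATH value from the snippet above:

import scala.util.{Failure, Success, Try}
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
import com.databricks.spark.avro._

// Expand the same glob so each matched file can be tried individually.
val fs = FileSystem.get(sc.hadoopConfiguration)
val matched: Array[String] =
  Option(fs.globStatus(new Path(PATH + "nt_uuid_2016-03-01.*1*.avro")))
    .getOrElse(Array.empty[FileStatus])
    .map(_.getPath.toString)

// Force a full read of each file; a corrupt or non-Avro file should fail here.
val badFiles = matched.filter { file =>
  Try(sqlContext.read.avro(file).count()) match {
    case Success(_) => false
    case Failure(e) => println(s"Failed to read $file: ${e.getMessage}"); true
  }
}

println(s"Suspect files: ${badFiles.mkString(", ")}")

This launches one Spark job per file, which seems tolerable for ~50 files as a one-off debugging step, but is there a cleaner way to get the offending path out of the exception itself?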