
I have a Spark 2.0 Java application that uses Spark's CSV reading utilities to read a CSV file into a DataFrame. The problem is that occasionally 1 out of 100 input files may be invalid (e.g. a corrupt gzip), which causes the job to fail with:

java.lang.IllegalStateException: Error reading from input

When I used to read the files as text files and manually parse the CSV, I was able to write a custom TextInputFormat to handle exceptions. I can't figure out how to specify a custom TextInputFormat when using Spark's CSV reader. Any help would be appreciated.

Current code for reading CSV:

        Dataset<Row> csv = sparkSession.read()
            .option("delimiter", parseSettings.getDelimiter().toString())
            .option("quote", parseSettings.getQuote())
            .option("parserLib", "UNIVOCITY")
            .csv(paths);
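As a workaround I'm considering pre-filtering the path list before handing it to Spark, dropping any file whose gzip stream fails to decompress. This is just a sketch under the assumption that the driver can open each path directly (local or mounted storage); `GzipPathFilter` is a hypothetical helper, not part of the Spark API, and it costs an extra full decompression pass per file:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: validates gzip files up front so Spark never sees a corrupt one.
public class GzipPathFilter {

    // Returns true only if the entire gzip stream decompresses cleanly.
    public static boolean isReadableGzip(Path path) {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(path))) {
            byte[] buf = new byte[8192];
            // Drain the stream; a truncated or corrupt member throws IOException here.
            while (in.read(buf) != -1) { /* discard */ }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Keeps only the paths whose gzip content is fully readable.
    public static List<String> filterValid(List<String> paths) {
        List<String> valid = new ArrayList<>();
        for (String p : paths) {
            if (isReadableGzip(Paths.get(p))) {
                valid.add(p);
            }
        }
        return valid;
    }
}
```

The surviving list could then be passed to `sparkSession.read()...csv(validPaths)` as in the code above, but I'd still prefer a way to make the CSV reader itself tolerate a bad file.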

Thanks, Nathan
