
I have a Spark 2.0 Java application that uses Spark's CSV reading utilities to read a CSV file into a DataFrame. The problem is that occasionally 1 out of 100 input files may be invalid (e.g. a corrupt gzip), which causes the job to fail with:

java.lang.IllegalStateException: Error reading from input

When I used to read the files as text files and manually parse the CSV, I was able to write a custom TextInputFormat to handle exceptions. I can't figure out how to specify a custom TextInputFormat when using Spark's CSV reader. Any help would be appreciated.

Current code for reading CSV:

        Dataset<Row> csv = sparkSession.read()
            .option("delimiter", parseSettings.getDelimiter().toString())
            .option("quote", parseSettings.getQuote())
            .option("parserLib", "UNIVOCITY")
            .csv(paths);
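As a workaround I'm considering pre-filtering the path list before handing it to Spark, dropping any file whose gzip stream fails to decompress. This is just a sketch under the assumption that the driver can open each path directly (local or mounted storage); `GzipPathFilter` is a hypothetical helper, not part of the Spark API, and it costs an extra full decompression pass per file:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: validates gzip files up front so Spark never sees a corrupt one.
public class GzipPathFilter {

    // Returns true only if the entire gzip stream decompresses cleanly.
    public static boolean isReadableGzip(Path path) {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(path))) {
            byte[] buf = new byte[8192];
            // Drain the stream; a truncated or corrupt member throws IOException here.
            while (in.read(buf) != -1) { /* discard */ }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Keeps only the paths whose gzip content is fully readable.
    public static List<String> filterValid(List<String> paths) {
        List<String> valid = new ArrayList<>();
        for (String p : paths) {
            if (isReadableGzip(Paths.get(p))) {
                valid.add(p);
            }
        }
        return valid;
    }
}
```

The surviving list could then be passed to `sparkSession.read()...csv(validPaths)` as in the code above, but I'd still prefer a way to make the CSV reader itself tolerate a bad file.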

Thanks, Nathan
