Apache Commons CSV: ignore corrupted or invalid records in a CSV file and continue parsing

Question

I am trying to parse an almost valid CSV file: 99.9% of the data is correct and valid. However, halfway through there are a couple of invalid records (too many quotes), e.g.

a,b,"c",d 
a,b,""c""",d

My code

    try (Reader reader = new BufferedReader(new FileReader(file), BUFFERED_READER_SIZE);
         CSVParser csvParser = new CSVParser(reader, CSVFormat.EXCEL)
    ) {
        Iterator<CSVRecord> iterator = csvParser.iterator();
        CSVRecord record;
        while (iterator.hasNext()) {
            try {
                record = iterator.next();
            } catch (IllegalStateException e) {
            }
        }
    } catch (IOException e) {
    }

How do I parse a CSV so that when it encounters an invalid row/record it just skips it and moves on to the next line?

terrorrussia-keeps-killing · Answer 1 · 2020-12-17T15:31:26.590

I don't think you can do much about it. CSVParser is a final class and does not let you control how it parses the underlying stream. However, it is sort of possible to work around this with a custom iterator that does the trick.

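import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Objects;
import java.util.stream.Stream;

import javax.annotation.Nonnull; // assumed: the JSR-305 annotation; any equivalent @Nonnull works

import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public final class Csv {

    private Csv() {
    }

    // Creates a parser for a single line, so the caller decides the CSV format.
    public interface ICsvParserFactory {

        @Nonnull
        CSVParser createCsvParser(@Nonnull Reader lineReader);

    }

    // Parses each physical line with its own parser and drops the lines that fail.
    public static Stream<CSVRecord> tryParseLinesLeniently(final BufferedReader bufferedReader, final ICsvParserFactory csvParserFactory) {
        return bufferedReader.lines()
                .map(line -> {
                    try {
                        return csvParserFactory.createCsvParser(new StringReader(line))
                                .iterator()
                                .next();
                    } catch ( final IllegalStateException ex ) {
                        // invalid line: mark it for removal by the filter below
                        return null;
                    }
                })
                .filter(Objects::nonNull)
                // closing the stream also closes the underlying reader
                .onClose(() -> {
                    try {
                        bufferedReader.close();
                    } catch ( final IOException ex ) {
                        throw new RuntimeException(ex);
                    }
                });
    }

}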

However, I don't think it's a good idea in any case:

It cannot return a CSVParser instance.
It could return an Iterator<CSVRecord> instead of a Stream<CSVRecord> (and save the filter operation), but I find streams simpler to implement.
It creates a new CSV parser for each line, so it allocates many objects for a CSV document with many lines. The string reader could probably be made reusable.
The whole idea of the method is that, not being a CSV parser itself, it assumes that each line holds exactly one record (I don't really remember whether CSV/TSV allow multiline records), so it violates CSV parsing rules by design. It does not support headers yet (but that can be easily added).

final Csv.ICsvParserFactory csvParserFactory = lineReader -> {
    try {
        return new CSVParser(lineReader, CSVFormat.EXCEL);
    } catch ( final IOException ex ) {
        throw new RuntimeException(ex);
    }
};
try ( final Stream<CSVRecord> csvRecords = Csv.tryParseLinesLeniently(new BufferedReader(reader), csvParserFactory) ) {
    csvRecords.forEachOrdered(System.out::println);
}

If possible, feed your CSV parser valid CSV documents instead of relying on workarounds like this one.

Edit 1

There is an implementation flaw in the code above: because every line gets its own parser, all records returned from the stream have recordNumber set to 1.

I now believe this cannot be fixed with the Apache Commons CSV parser, since the only CSVRecord constructor is package-private and cannot be called from outside that package without either reflection or placing code in its declaring package.

Sorry, you have to either fix your CSV documents or use another parser that can parse "more leniently".

CSV (most dialects) allows fields containing a newline if the field is quoted. But respecting that would make the "ignore rows with too many quotes" request impossible to satisfy (since you can't really tell where the end of the row is), so the assumption that this file has no multiline rows is necessary. — rici, Dec 17 '20 at 14:48
@rici Right, chunking the input line by line both violates the CSV grammar and does not respect the record number that I have just found in the `CSVRecord` class. — terrorrussia-keeps-killing, Dec 17 '20 at 15:34
Instead of passing the input a line at a time, you could try to insert a quote at the end of lines with an odd number of quotes. Inserting the quote might be a bit slow but it should be infrequent. — rici, Dec 17 '20 at 15:43
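
The last comment's suggestion can be sketched with the standard library alone. This is only an illustration: QuoteBalancer and its methods are hypothetical names, not part of Apache Commons CSV, and the sketch assumes (as discussed in the comments) that no record spans several lines.

```java
import java.io.BufferedReader;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Apache Commons CSV.
final class QuoteBalancer {

    private QuoteBalancer() {
    }

    // Counts the double-quote characters in a single line.
    static long countQuotes(final String line) {
        return line.chars().filter(ch -> ch == '"').count();
    }

    // Appends one quote when the line contains an odd number of quotes;
    // lines with an even number of quotes are returned unchanged.
    static String balance(final String line) {
        return countQuotes(line) % 2 == 0 ? line : line + '"';
    }

    // Rewrites the input line by line. Assumption: one record per line,
    // so a quoted field never legitimately contains a line break.
    static String balanceAll(final BufferedReader reader) {
        return reader.lines()
                .map(QuoteBalancer::balance)
                .collect(Collectors.joining("\n"));
    }
}
```

The balanced text could then be handed to the parser through a StringReader. Balancing does not make a malformed line valid; it only aims to keep an unterminated quoted field from running into the following lines, so the parser may still reject the line itself.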

score 0 · Answer 2 · answered Jul 01 '22 at 14:12

I am using Apache Commons CSV version 1.9.0 and I am able to continue retrieving rows after the invalid rows by simply "absorbing" the exception and continuing. Keep in mind that the hasNext() method actually pre-fetches the next row, so it can throw the IllegalStateException just as the next() method can.

If you absorb the exception, the next CSVRecord retrieved will be a mangled version of the invalid row, so you will want to skip it. I cannot post my code as it is the IP of my employer, but hopefully this helps.
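
Since this answer comes without code, here is a minimal sketch of the approach it describes, written against plain java.util.Iterator so that it does not depend on a particular library version. SkippingIterator and maxConsecutiveFailures are hypothetical names. Catching IllegalStateException and discarding exactly one record after each failure are assumptions taken from the description above; other versions of the library may signal failures differently.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical wrapper, not part of Apache Commons CSV.
final class SkippingIterator<T> implements Iterator<T> {

    private final Iterator<T> delegate;
    private final int maxConsecutiveFailures;
    private T buffered;
    private boolean ready;

    SkippingIterator(final Iterator<T> delegate, final int maxConsecutiveFailures) {
        this.delegate = delegate;
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    @Override
    public boolean hasNext() {
        if (ready) {
            return true;
        }
        int failures = 0;
        boolean discardNext = false;
        while (true) {
            try {
                // Both calls may throw, because hasNext() pre-fetches.
                if (!delegate.hasNext()) {
                    return false;
                }
                final T candidate = delegate.next();
                failures = 0;
                if (discardNext) {
                    // Assumed to be the mangled remainder of the bad row.
                    discardNext = false;
                    continue;
                }
                buffered = candidate;
                ready = true;
                return true;
            } catch (final IllegalStateException ex) {
                // Give up if the delegate keeps failing without progress.
                if (++failures > maxConsecutiveFailures) {
                    throw ex;
                }
                discardNext = true;
            }
        }
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        ready = false;
        final T result = buffered;
        buffered = null;
        return result;
    }
}
```

The intended use would be to wrap csvParser.iterator() in a SkippingIterator and loop over that instead. Before trusting the skip, check against your own data that the record following a failure really is the mangled one.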
