I am trying to parse a CSV file using Jakarta Commons CSV (now Apache Commons CSV).

Sample input file

Field1,Field2,Field3,Field4,Field5
"Ryan, R"u"bianes","  dummy@gmail.com","29445","626","South delhi, Rohini 122001"

Formatter: CSVFormat.newFormat(',').withIgnoreEmptyLines().withQuote('"'), with ',' as the delimiter.

Expected output

  1. Field1 value after CSV parsing should be : Ryan, R"u"bianes
  2. Field5 value after CSV parsing should be : South delhi, Rohini 122001

Exception: Caused by: java.io.IOException: (line 2) invalid char between encapsulated token and delimiter

Stephen C
Emperor

2 Answers

The problem is that your file does not follow the accepted standard for quoting in CSV files. The correct way to represent a quote inside a quoted string is to repeat the quote. For example:

Field1,Field2,Field3,Field4,Field5
"Ryan, R""u""bianes","  dummy@gmail.com","29445","626","South delhi, Rohini 122001"

If you restrict yourself to the standard form of CSV quoting, the Apache Commons CSV parser should work.
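
If you control the code that writes the file, the doubling rule is easy to apply before the data ever reaches the parser. Here is a minimal sketch using only the JDK; the class and method names (CsvQuoting, quoteField, toCsvLine) are made up for illustration and are not part of Commons CSV:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of any library: writes fields in the
// standard quoted form, where an embedded quote is represented by two quotes.
final class CsvQuoting {

    // Wraps a value in quotes and doubles every quote inside it.
    static String quoteField(String value) {
        return '"' + value.replace("\"", "\"\"") + '"';
    }

    // Joins the quoted fields with commas to form one CSV record.
    static String toCsvLine(List<String> fields) {
        return fields.stream()
                .map(CsvQuoting::quoteField)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        List<String> row = Arrays.asList(
                "Ryan, R\"u\"bianes", "  dummy@gmail.com", "29445", "626",
                "South delhi, Rohini 122001");
        System.out.println(toCsvLine(row));
    }
}
```

Running main prints the corrected data line shown above, with the quotes around u doubled.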

Unfortunately, it is not feasible to write a consistent parser for your variant format, because there is no way to disambiguate an embedded comma from a field separator if you need to represent a field containing "Ryan R","baines".
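
To make the ambiguity concrete: suppose quotes are written without doubling. A one-field record whose value is Ryan R","baines and a two-field record with the values Ryan R and baines then produce exactly the same line, so no parser can tell them apart. A small sketch (JDK only; naiveLine is a hypothetical helper that mimics how the file in the question appears to have been written, which is an assumption):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class AmbiguityDemo {

    // Hypothetical writer that wraps each field in quotes WITHOUT doubling
    // embedded quotes (an assumption about how the broken file was produced).
    static String naiveLine(List<String> fields) {
        return fields.stream()
                .map(f -> "\"" + f + "\"")
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // One field containing: Ryan R","baines
        String oneField = naiveLine(Arrays.asList("Ryan R\",\"baines"));
        // Two separate fields: Ryan R and baines
        String twoFields = naiveLine(Arrays.asList("Ryan R", "baines"));

        System.out.println(oneField);
        System.out.println(twoFields);
        System.out.println(oneField.equals(twoFields)); // true
    }
}
```

Both records serialize to the same text, which is why the doubled-quote rule exists.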

The rules for quoting in CSV files are set out in various places including RFC 4180.

Stephen C
  • [univocity-parsers](https://www.univocity.com/pages/parsers-tutorial) handles this, check my other answer – Jeronimo Backes May 19 '18 at 17:31
  • @stephen-c Is there a way we can skip this particular row and continue the processing for the next set of rows? – Emperor Jun 06 '18 at 08:00
  • Looking at the Commons CSV parser code, I think the answer is no. – Stephen C Jun 06 '18 at 09:11
  • @stephen-c That was my hunch as well from looking at the code, but it's a feature that ideally should have been there. Any suggested workaround? I am thinking of doing something like this: do { try { while (csvRecords.hasNext()) { CSVRecord record = csvRecords.next(); } } catch (Exception e) { log.error("Exception occurred while parsing one of the input record"); } } while (csvRecords.hasNext()); – Emperor Jun 06 '18 at 09:40
  • It might work. (Try it and see!!!) But I doubt it. I didn't notice any code to skip to the end of line before throwing the exception. – Stephen C Jun 06 '18 at 10:06
  • Yes, it seems to partially work, @stephen-c. I didn't follow your comment, however. I guess you mean calling csvRecords.next() inside the catch block. Issue: if I don't call csvRecords.next() inside the catch block, parsing resumes on the same line, from the position where the error occurred. – Emperor Jun 06 '18 at 10:26

The problem here is that the quotes are not properly escaped, and your parser doesn't handle that. Try univocity-parsers, as it is the only parser for Java I know of that can handle unescaped quotes inside a quoted value. It is also 4 times faster than Commons CSV. Try this code:

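    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;
    import com.univocity.parsers.csv.UnescapedQuoteHandling;
    import java.io.StringReader;
    import java.util.List;

    //configure the parser to handle your situation
    CsvParserSettings settings = new CsvParserSettings();
    settings.setHeaderExtractionEnabled(true); //uses first line as headers
    //keep reading a quoted value until a quote followed by a delimiter or line end is found
    settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE);
    settings.trimQuotedValues(true); //trim whitespace around values in quotes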

    //create the parser
    CsvParser parser = new CsvParser(settings);

    String input = "" +
            "Field1,Field2,Field3,Field4,Field5\n" +
            "\"Ryan, R\"u\"bianes\",\"  dummy@gmail.com\",\"29445\",\"626\",\"South delhi, Rohini 122001\"";

    //parse your input
    List<String[]> rows = parser.parseAll(new StringReader(input));

    //print the parsed values
    for(String[] row : rows){
        for(String value : row){
            System.out.println('[' + value + ']');
        }
        System.out.println("-----");
    }

This will print:

[Ryan, R"u"bianes]
[dummy@gmail.com]
[29445]
[626]
[South delhi, Rohini 122001]
-----

Hope it helps.

Disclosure: I'm the author of this library. It's open source and free (Apache 2.0 license).

Jeronimo Backes
  • What does it do for the nasty example in my Answer? – Stephen C May 20 '18 at 00:56
  • Look carefully. It parses your broken input by handling the unescaped quote properly. I used your sample input for testing, as you can see. As it is a plain String, I put backslashes in front of each quote. Give this code a go with your actual input file. – Jeronimo Backes May 20 '18 at 01:52