I am trying to read a comma-separated CSV file that looks like this:

"Row ID","StringCol","idxCol"
"INDEX","object","float64"
"Row3","carriage return 
 carriage return",0.0
"Row4","new line 
 new line",1.0
"Row5","carriage return and new line 
 carriage return and new line",2.0
"Row10","",3.0
  • All strings are quoted with "
  • The separator is a comma
  • The line ending is carriage return + line feed (\r\n)
  • Line breaks (\r or \n) within quotes should be left untouched

The following code fails to read it in correctly:

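import com.opencsv.CSVParser;
import com.opencsv.CSVParserBuilder;
import com.opencsv.CSVReader;
import com.opencsv.CSVReaderBuilder;

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;

// Parser with the default escape, separator and quote characters.
CSVParser parser = new CSVParserBuilder()
        .withEscapeChar(CSVParser.DEFAULT_ESCAPE_CHARACTER)
        .withSeparator(CSVParser.DEFAULT_SEPARATOR)
        .withQuoteChar(CSVParser.DEFAULT_QUOTE_CHARACTER)
        .withStrictQuotes(false)
        .build();

File tempFile = new File("test.csv");

// Keep carriage returns so that \r inside quoted fields is preserved.
try (BufferedReader br = Files.newBufferedReader(tempFile.toPath(), StandardCharsets.UTF_8);
        CSVReader reader = new CSVReaderBuilder(br).withCSVParser(parser)
                .withKeepCarriageReturn(true)
                .build()) {

    // Print every parsed row to make stray characters visible.
    for (String[] line : reader) {
        System.out.println(Arrays.toString(line));
    }

} catch (IOException e) {
    e.printStackTrace();
}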

The output looks like this:

[Row ID, StringCol, idxCol"
]
[INDEX, object, float64"
]
[Row3, carriage return 
 carriage return, 0.0
]
[Row4, new line 
 new line, 1.0
]
[Row5, carriage return and new line 
 carriage return and new line, 2.0
]
[Row10, , 3.0
]

As you can see, the \r at the end of each line is kept as part of the last entry, even though it is not within the quotes. In addition, if the last entry is quoted, its closing quote character is also kept as part of the string. This is weird behavior, as it ignores the quoting of that entry.

Basically, I see no way to keep carriage returns within quotes and still read the last entry correctly. I would not mind removing the carriage return at the end of the line, but I cannot always expect a quote character before it. Alternatively, I could remove both with a regex that expects the carriage return at the end of the line with an optional quote character before it, but I might get into trouble if this strange behavior changes in the future.
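For illustration, the regex workaround mentioned above could be sketched roughly as follows. This is only a sketch under assumptions: the class name TrailingCrCleanup and the method cleanLastField are hypothetical helpers, not part of OpenCSV, and the code assumes that only the last field of each row is affected. It cannot tell a stray trailing quote apart from a quoted value that legitimately ends in a quote followed by \r.

```java
import java.util.regex.Pattern;

public class TrailingCrCleanup {

    // An optional quote character followed by a carriage return
    // at the very end of the field (\z = end of input).
    private static final Pattern TRAILING_QUOTE_CR = Pattern.compile("\"?\\r\\z");

    /**
     * Returns a cleaned copy of the row in which the stray ending is removed
     * from the last field only. Null or empty rows are returned unchanged.
     */
    public static String[] cleanLastField(String[] row) {
        if (row == null || row.length == 0) {
            return row;
        }
        String[] cleaned = row.clone();
        int last = cleaned.length - 1;
        if (cleaned[last] != null) {
            // Line breaks inside other fields are not touched.
            cleaned[last] = TRAILING_QUOTE_CR.matcher(cleaned[last]).replaceFirst("");
        }
        return cleaned;
    }
}
```

Each String[] returned by the reader would be passed through cleanLastField before use. Because the quote is optional in the pattern, rows whose last entry ends in just \r are handled as well, and rows that are already clean pass through unchanged.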

Antje Janosch