Parsing CSV in groovy with exception tolerance

Question

I've been trying to parse a csv file in groovy, currently using the library org.apache.commons.csv 2.4. The requirement I have is that there are invalid data value in csv cells, such as invalid characters, and instead of throwing an exception on first invalid row/cell, I want to collect these exceptions and keep iterating in the csv file until the end, then I will have a full list of invalid data this csv file has.

With that purpose, I've tried multiple ways of using this apache lib, but unfortunately as long as it uses the CSVParser.getNextRecord() for iteration, the iterator will just abort.

put in code, something like this:

    def  records = new CSVParser(reader, CSVFormat.EXCEL.withHeader().withIgnoreSurroundingSpaces())

     // at this line, the iterator() inside CSVParser is always using getNextRecord() for its next() implementation, and it may throw exception on invalid char
     records.each {record->
         // if the exception is thrown from .each, that makes below try/catch in vain
         try{

         }catch(e){ //want collect Errors here }
     }

So, is there anything else that I should dig in this library? Or could anybody point me to another more viable solution? Many thanks to all!

Update: Sample CSV

"Company code for WBS element","WBS Element","PS: Short description (1st text line)","Responsible Cost Center for WBS Element","OBJNR","WBS Status"

"1001","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"
"1001","RE-01768-011","Opex - To present a paper on "Career con","0000016400","PR00031497","X"

The second data row has invalid char " that makes parser throw exception

Can you give an example of format and "invalid characters"? – jalopaba Nov 11 '15 at 11:40 — jalopaba, Nov 11 '15 at 11:40

score 2 · Accepted Answer · answered Nov 11 '15 at 14:45

The problem you have is that one of the characters in one cell is the quote character used by the parser according with the format selected: CSVFormat.EXCEL.

The quote character is

the character used to encapsulate values containing special characters

so in your example the quote is misused and the parser complains about it.

You can workaround that using a different CSVFormat. For example, one without quote character:

@Grapes(
    @Grab(group='org.apache.commons', module='commons-csv', version='1.2')
)

import java.nio.charset.*
import org.apache.commons.csv.*

def text = '''"Company code for WBS element","WBS Element","PS: Short description (1st text line)","Responsible Cost Center for WBS Element","OBJNR","WBS Status"

"1001","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"
"1002","RE-01768-011","Opex - To present a paper on "Career con","0000016400","PR00031497","X"
"1003","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"'''

def parsed = CSVParser.parse(text, CSVFormat.EXCEL.withHeader().withIgnoreSurroundingSpaces().withQuote(null))

parsed.getRecords().each {
    println it.toMap().values()
}

And the above yields:

[]
["0000016400", "1001", "RE-01768-011", "Opex - To present a paper on Career con", "X", "PR00031497"]
["0000016400", "1002", "RE-01768-011", "Opex - To present a paper on "Career con", "X", "PR00031497"]
["0000016400", "1003", "RE-01768-011", "Opex - To present a paper on Career con", "X", "PR00031497"]

Of course, with the above workaround, you have the quotes (") included in each field.

You can replace them all if you want:

parsed.getRecords().each {
    println it.toMap().values().collect({ it.replace('"', '') })
}

This is better @jalopaba, how about other invalid char though? I mean its the extra quote in cell causing issue this time and there is ignoreQuote in format that we can use, but does it cater for all invalid chars? Or really in the parsing CSV world the only thing annoys parse is extra quotes? I'm pretty new in this area might have just asked a silly question... — James Jiang, Nov 11 '15 at 21:37
You should consider also the escape character in CSVFormat, since it also affects how parsing is done. — jalopaba, Nov 11 '15 at 22:30
Just checked the commons.csv.Lexer class that it validates the chars with the exception only against quotes and escapes. withQuote(null) worked in my case, and another issue is raised immediately... what if there is delimiter char ( `,` in this case) in the cell value, and it will break the whole data row... — James Jiang, Nov 11 '15 at 23:56
The problem with a *wrong* delimiter char is that you cannot know which delimiter is the right one... so you have a malformed CSV file and the parser will fail. Maybe you can pre-process your file (splitting using `",` for example) and try to get a well formed CSV before parsing... — jalopaba, Nov 12 '15 at 08:56

score 0 · Answer 2 · edited Nov 11 '15 at 14:46

0

The problem is that if the csv file has invalid data, meaning data that breaks the rules of the csv format, then the parser cannot... parse. That's why it cannot reliably parse any more than the first error encountered.

edited Nov 11 '15 at 14:46

jalopaba

8,039
2
44
57

answered Nov 11 '15 at 11:29

Emmanuel Rosa

9,697
2
14
20

Parsing CSV in groovy with exception tolerance

2 Answers2