I am using OpenCSV to read CSV files. Looking over the docs, I don't see guidelines on how to handle malformed data.

I have a CSV file with all the expected features: fields are separated by commas, and each field is surrounded by quotes in case a value contains a comma. However, every line (except the header row) is missing its leading quote. Here is an example:

"Header 1","Header2"
value1","value2"
value1","value2"

The CSV parser ended up skipping every other line due to the way the quotes were lined up, which obviously causes problems.

I would consider this an error, since I know what the data should look like and the first column is missing its opening quotation mark. But as far as the CSV spec is concerned, might this be considered valid? If so, I suppose I would have to build extra checks myself to make sure I am not missing any lines, even though the file contains technically valid CSV data.
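A minimal sketch of the kind of extra check I have in mind, using only the Java standard library and not OpenCSV itself. The class and method names (`QuoteBalanceCheck`, `oddQuoteLines`) are made up for illustration. It assumes the file is not supposed to contain multiline quoted fields, so any physical line with an odd number of quote characters is suspicious:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenCSV.
final class QuoteBalanceCheck {

    // Returns the 1-based numbers of lines that contain an odd number of
    // quote characters. Assumption: no field legitimately spans several lines.
    static List<Integer> oddQuoteLines(List<String> lines) {
        List<Integer> suspicious = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            long quotes = lines.get(i).chars().filter(c -> c == '"').count();
            if (quotes % 2 != 0) {
                suspicious.add(i + 1);
            }
        }
        return suspicious;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "\"Header 1\",\"Header2\"",
                "value1\",\"value2\"",
                "value1\",\"value2\"");
        System.out.println(oddQuoteLines(lines)); // prints [2, 3]
    }
}
```

Running this over the file before handing it to the parser would flag both data lines in my example, while the header row passes. It would also flag legitimate multiline fields, so it only fits data where those are not expected.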

Stevoisiak
MxLDevs
  • Unbalanced quotes as in your example are definitely malformed. The lack of a "standard" specification doesn't preclude common sense. – Jim Garrison Jan 23 '18 at 23:11

2 Answers

According to the RFC for CSV files (RFC 4180):

While there are various specifications and implementations for the CSV format, there is no formal specification in existence, which allows for a wide variety of interpretations of CSV files.

So simply put, malformed? No. Informal? No. Even this article (linked in the RFC) mentions that lines can mix and match quoted and unquoted fields.

Community
Blue
  • That is unfortunate. It looks like I'll have to write my own checks to make sure I'm not missing any lines, or picking up lines incorrectly. – MxLDevs Jan 23 '18 at 21:07
  • @MxyL Check out [this post](https://stackoverflow.com/questions/41948442/parse-csv-with-opencsv-with-double-quotes-inside-a-quoted-field) – Blue Jan 23 '18 at 21:08

For the data you show:

"Header 1","Header2"
value1","value2"
value1","value2"

we could argue that the data is not malformed, provided the fields are treated as unquoted, never contain a separator, and never span multiple lines. That reading would give the values:

"Header 1"        "Header2"
value1"           "value2"
value1"           "value2"

Of course it's obvious this data was meant to have quoted fields. In that case the data is certainly malformed, and could be parsed differently with different parsers (maybe even as multiline fields).

Valid options would be:

value1,value2              // no quotes at all
"value1","value2"          // all quoted
value1,"value2,more data"  // only quoted when there is a separator inside
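If you want to reject anything outside those valid forms before parsing, a strict per-line check is one option. The following is only a sketch using the Java standard library. The names `StrictCsvLine` and `isWellFormed` are hypothetical, and it assumes comma separators, quotes escaped by doubling, and no multiline fields:

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenCSV.
final class StrictCsvLine {

    // A field is either fully quoted (inner quotes doubled) or contains
    // no quotes and no commas at all.
    private static final String FIELD = "(?:\"(?:[^\"]|\"\")*\"|[^\",]*)";

    // A line is one field followed by zero or more ",field" groups.
    private static final Pattern LINE =
            Pattern.compile(FIELD + "(?:," + FIELD + ")*");

    static boolean isWellFormed(String line) {
        return LINE.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("value1,value2"));               // true
        System.out.println(isWellFormed("\"value1\",\"value2\""));       // true
        System.out.println(isWellFormed("value1,\"value2,more data\"")); // true
        System.out.println(isWellFormed("value1\",\"value2\""));         // false
    }
}
```

The last call is a data line from the question. It is rejected because a quote appears inside a field that does not start with one.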
Danny_ds