1

I am trying to parse a CSV file using Lumenworks CsvReader. Each data point is wrapped in double quotes, however, some values contain unescaped double quotes within the data, and other values contain commas within the data. The issue I am facing is that when I parse this with CsvReader, extra columns are ending up in my file due to Lumenworks seeing these characters as delimiters.

As you will read below, I've handled the issue with unescaped double quotes using a known solution, but this then results in issue of extra columns being generated for the data with commas inside.

Example: 2 columns (each wrapped in quotes), with unescaped double quotes in one of the data points

"Name","Description"

"Bob","I am a "cool" guy"

When attempting to perform csvReader.ReadNextRecord(), instead of splitting this up into 2 columns, it splits it up into 4 columns:

  • Bob
  • I am a
  • cool
  • guy

I've used the solution provided in Reading csv having double quotes with lumenwork csv reader and it works quite well!

This is how I've implemented it:

Char quotingCharacter = '\0' ;
Char escapeCharacter = quotingCharacter;
Char delimiter = ',';
using (CsvReader csvReader = new CsvReader(reader, false, delimiter, quotingCharacter, escapeCharacter, quotingCharacter, ValueTrimmingOptions.All))
{....
csvReader.ReadNextRecord();
...}

HOWEVER, when I implement this fix for my CSV file, it then creates the same issue with columns that have commas inside:

Example: 2 columns (each wrapped in quotes), with commas in one of the data points, after implementing the double quote workaround

"Name","Description"

"Bob","I am related to Suzie, Betty, and Tommy"

With the aforementioned solution implemented, the csvReader now does not know to read the commas as part of the data. Instead of 2 columns, I am left with 4 columns:

  • Bob
  • I am related to Suzie
  • Betty
  • and Tommy

So the question is: how do I allow Lumenworks CsvReader to work around this bad data and have it interpret unescaped double quotes as the data itself? How can this be done in a way that doesn't then cause the commas within the data to be interpreted as the delimitation?

Community
  • 1
  • 1
  • 1
    An errant, extra quote indicates bad data. – Ňɏssa Pøngjǣrdenlarp Sep 12 '16 at 22:43
  • @Plutonix Yes, the data is bad :) The goal I am trying to achieve is to work around the bad data and successfully bring in 2 columns – Sentient Entities Sep 13 '16 at 00:02
  • 1
    What you are asking for is basically impossible without some out-of-band heuristics on where quotes are allowed and where commas are allowed. The examples you posted are not sufficient for us to create such heuristics. Have whoever is sending you the data fix it. If this is impossible, fix it yourself before sending it to the CsvReader. – Dour High Arch Sep 13 '16 at 00:14
  • @DourHighArch thanks for the points. Yes, I am currently taking the approach of cleaning it before sending it through CsvReader, though I am still hoping to discover if there is an interesting way to use the existing CsvReader functionality in a way that might have been built in...or cheesed in by someone more clever than myself :) – Sentient Entities Sep 13 '16 at 00:23

0 Answers0