2

I am trying to parse a number of CSV files that has double quotes and commas within the fields. I have no control over the format of the CSVs and instead of using "" to escape the quotes it is using \". The files are also extremely large so reading and using regex isn't the best option for me.

I would prefer to use an existing library and notewrite an entirely new parser. Currently I am using CSVHelper

This is an example of the CSV data:

"id","name","notes" "40","Continue","If the message \"Continue\" does not appear restart, and notify your instructor." "41","Restart","If the message \"Restart\" does not appear after 10 seconds, restart manually."

The problem is the double quotes aren't being escaped properly and the , is being read as a delimiter and separating the notes field into 2 separate fields.

This is my current code that is doesn't work.

DataTable csvData = new DataTable();
string csvFilePath = @"C:\Users\" + csvFileName + ".csv";

try
{
    FileInfo file = new FileInfo(csvFilePath);
    using (TextReader reader = file.OpenText())
    using (CsvReader csv = new CsvReader(reader))
    {
        csv.Configuration.Delimiter = ",";
        csv.Configuration.HasHeaderRecord = true;
        csv.Configuration.IgnoreQuotes = false; 
        csv.Configuration.TrimFields = true; 
        csv.Configuration.WillThrowOnMissingField = false;
        string[] colFields = null;
        while(csv.Read())
        {
            if (colFields == null)
            {
                colFields = csv.FieldHeaders;
                foreach (string column in colFields)
                {
                    DataColumn datacolumn = new DataColumn(column);
                    datacolumn.AllowDBNull = true;
                    csvData.Columns.Add(datacolumn);
                }
            }
            string[] fieldData = csv.CurrentRecord;

            for (int i = 0; i < fieldData.Length; i++)
            {
                if (fieldData[i] == "")
                {
                    fieldData[i] = null;
                }
            }
            csvData.Rows.Add(fieldData); 
        }
    }
}

Is there an existing library that lets you specify how to escape quotes or should I just write my own parser?

nastassiar
  • 1,545
  • 2
  • 24
  • 43
  • 1
    http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader – Mitch Wheat Apr 27 '15 at 23:44
  • I should have specified I am using .Net 4.5, and that targets .Net 2.0 – nastassiar Apr 27 '15 at 23:48
  • so download source and re-compile!! – Mitch Wheat Apr 27 '15 at 23:49
  • @DourHighArch: that one is already closed. – abatishchev Apr 28 '15 at 00:25
  • Out of curiosity... you said huge files... how big are they exactly, because if they're big as in 100mb+ big, then you're in for a whole host of fun.... – Aydin Apr 28 '15 at 01:09
  • There are 8 files that range from 2kb to 300mb – nastassiar Apr 28 '15 at 16:56
  • It might interest you and others having this problem, that CsvHelper offers a workaround via `CsvMode` like this: `csvConfig.mode = CsvMode.Escape` [source](https://github.com/JoshClose/CsvHelper/issues/835#issuecomment-763000666). I would like to make this an answer but it seems like I can't because of the "duplicate answer" flag – om-ha Feb 13 '21 at 00:49

1 Answers1

3

You can get quite far when using a very simple linq statement to split and trim and finally Replace for unescaping quotes in the content:

DataTable csvData = new DataTable();
string csvFilePath = @"C:\Users\" + csvFileName + ".csv";
try
{
    string[] seps = { "\",", ",\"" };
    char[] quotes = { '\"', ' ' };
    string[] colFields = null;
    foreach (var line in File.ReadLines(csvFilePath))
    {
        var fields = line
            .Split(seps, StringSplitOptions.None)
            .Select(s => s.Trim(quotes).Replace("\\\"", "\""))
            .ToArray();

        if (colFields == null)
        {
            colFields = fields;
            foreach (string column in colFields)
            {
                DataColumn datacolumn = new DataColumn(column);
                datacolumn.AllowDBNull = true;
                csvData.Columns.Add(datacolumn);
            }
        }
        else
        {
            for (int i = 0; i < fields.Length; i++)
            {
                if (fields[i] == "")
                {
                    fields[i] = null;
                }
            }
            csvData.Rows.Add(fields); 
        }
    }
}

When used in a very simple console app, and OPs original input in the "test.txt" file:

public static void CsvUnescapeSplit()
{
    string[] seps = { "\",", ",\"" };
    char[] quotes = { '\"', ' ' };
    foreach (var line in File.ReadLines(@"c:\temp\test.txt"))
    {
        var fields = line
            .Split(seps, StringSplitOptions.None)
            .Select(s => s.Trim(quotes).Replace("\\\"", "\""))
            .ToArray();
        foreach (var field in fields)
            Console.Write("{0} | ", field);
        Console.WriteLine();
    }
}

This produces the following (correct) output:

id | name | notes |
40 | Continue | If the message "Continue" does not appear restart, and notify your instructor. |
41 | Help | If the message "Restart" does not appear after 10 seconds, manually restart. |

Caveat: If your field separators have spaces, like these:

"40" , "Continue" , "If the message \"Continue\" does not appear restart, and notify your instructor."

Or your content strings contain commas directly after a quote, like here (after "Restart"):

"41","Help","If the message \"Restart\", does not appear after 10 seconds, manually restart."

It will fail.

Alex
  • 13,024
  • 33
  • 62
  • What is colFields? What is csvData? – aquinas Apr 28 '15 at 00:00
  • @aquinas `colFields` was forgotten in the copy & paste edit from OPs code. Added that back in now, `csvData` appears to be a variable used but not declared in OPs code snippet. – Alex Apr 28 '15 at 00:03
  • The code fails when you try to run it with the OPs original data set. – aquinas Apr 28 '15 at 00:25
  • It works when I use string[] seps = { "\",\"" }; and .Select(s => s.Trim(quotes).Replace("\"\"", "'")) Thank you! – nastassiar Apr 28 '15 at 00:51