14

I am trying to import a CSV file using TextFieldParser. A particular CSV file is causing me problems due to its nonstandard formatting. The CSV in question has its fields enclosed in double quotes. The problem appears when there is an additional set of unescaped double quotes within a particular field.

Here is an oversimplified test case that highlights the problem. The actual CSV files I am dealing with are not all formatted the same and have dozens of fields, any of which may contain these possibly tricky formatting issues.

TextReader reader = new StringReader("\"Row\",\"Test String\"\n" +
    "\"1\",\"This is a test string.  It is parsed correctly.\"\n" +
    "\"2\",\"This is a test string with a comma,  which is parsed correctly\"\n" +
    "\"3\",\"This is a test string with double \"\"double quotes\"\". It is parsed correctly\"\n" +
    "\"4\",\"This is a test string with 'single quotes'. It is parsed correctly\"\n" +
    "5,This is a test string with fields that aren't enclosed in double quotes.  It is parsed correctly.\n" +
    "\"6\",\"This is a test string with single \"double quotes\".  It can't be parsed.\"");

using (TextFieldParser parser = new TextFieldParser(reader))
{
    parser.Delimiters = new[] { "," };
    while (!parser.EndOfData)
    {
        string[] fields= parser.ReadFields();
        Console.WriteLine("This line was parsed as:\n{0},{1}",
            fields[0], fields[1]);
    }
}

Is there anyway to properly parse a CSV with this type of formatting using TextFieldParser?

sglantz
  • 2,063
  • 4
  • 20
  • 30
  • 6
    It is pretty important that you don't try to fix it. That will make you responsible for bad data for a long time. Reject the file for being improperly formatted. If they hassle you about it then point out that it is not RFC-4180 compatible. There's another programmer somewhere that can easily fix this. – Hans Passant Apr 26 '13 at 18:00
  • 4
    @HansPassant While that is the ideal and "proper" course of action, many times we don't have a choice, for example when consuming files from an API that we have no control over, or a customer that is important and we need to just "make it work". – richard Sep 05 '15 at 19:41

6 Answers6

8

I agree with Hans Passant's advice that it is not your responsibility to parse malformed data. However, in accord with the Robustness Principle, some one faced with this situation may attempt to handle specific types of malformed data. The code I wrote below works on the data set specified in the question. Basically it detects the parser error on the malformed line, determines if it is double-quote wrapped based on the first character, and then splits/strips all the wrapping double-quotes manually.

using (TextFieldParser parser = new TextFieldParser(reader))
{
    parser.Delimiters = new[] { "," };

    while (!parser.EndOfData)
    {
        string[] fields = null;
        try
        {
            fields = parser.ReadFields();
        }
        catch (MalformedLineException ex)
        {
            if (parser.ErrorLine.StartsWith("\""))
            {
                var line = parser.ErrorLine.Substring(1, parser.ErrorLine.Length - 2);
                fields = line.Split(new string[] { "\",\"" }, StringSplitOptions.None);
            }
            else
            {
                throw;
            }
        }
        Console.WriteLine("This line was parsed as:\n{0},{1}", fields[0], fields[1]);
    }
}

I'm sure it is possible to concoct a pathological example where this fails (e.g. commas adjacent to double-quotes within a field value) but any such examples would probably be unparseable in the strictest sense, whereas the problem line given in the question is decipherable despite being malformed.

Jordan Rieger
  • 3,025
  • 3
  • 30
  • 50
  • 3
    Considering it has been almost exactly two years since I posted this question, I am not sure if this would have solved my initial issue. I ended up going with Hans' suggestion and demanded a file closer to spec. Since this does solve my example cases and I never accepted an answer, I will go ahead and accept your answer. Thanks, you saved me from become another DenverCoder9 - https://xkcd.com/979/ – sglantz Apr 06 '15 at 22:07
  • 3
    Ah, the obligatory XKCD :) Yeah, I knew I was opening an old question, but I had a very similar issue to yours, and when I figured out a solution I thought it would be good to share it. – Jordan Rieger Apr 07 '15 at 14:55
1

Jordan's solution is quite good, but it makes an incorrect assumption that the error line will always begin with a double-quote. My error line was this:

170,"CMS ALT",853,,,NON_MOVEX,COM,NULL,"2014-04-25",""  204 Route de Trays"

Notice the last field had extra/unescaped double quotes, but the first field was fine. So Jordan's solution didn't work. Here is my modified solution based on Jordan's:

using(TextFieldParser parser = new TextFieldParser(new StringReader(csv))) {
 parser.Delimiters = new [] {","};

 while (!parser.EndOfData) {
  string[] fields = null;
  try {
   fields = parser.ReadFields();
  } catch (MalformedLineException ex) {
   string errorLine = SafeTrim(parser.ErrorLine);
   fields = errorLine.Split(',');
  }
 }
}

You may want to handle the catch block differently, but the general concept works great for me.

HerrimanCoder
  • 6,835
  • 24
  • 78
  • 158
0

It may be easier to just do this manually, and it would certainly give you more control:

Edit: For your clarified example, i still suggest manually handling the parsing:

using System.IO;

string[] csvFile = File.ReadAllLines(pathToCsv);
foreach (string line in csvFile)
{
    // get the first comma in the line
    // everything before this index is the row number
    // everything after is the row value
    int firstCommaIndex = line.IndexOf(',');

    //Note: SubString used here is (startIndex, length) 
    string row = line.Substring(0, firstCommaIndex+1);
    string rowValue = line.Substring(firstCommaIndex+1).Trim();

    Console.WriteLine("This line was parsed as:\n{0},{1}",
            row, rowValue);
}

For a generic CSV that does not allow commas in the fields:

using System.IO;

string[] csvFile = File.ReadAllLines(pathToCsv);
foreach (string line in csvFile)
{
    string[] fields = line.Split(',');
    Console.WriteLine("This line was parsed as:\n{0},{1}",
            fields[0], fields[1]);
}
Steven Magana-Zook
  • 2,751
  • 27
  • 41
  • It looks like I might have simplified my example to the point that it isn't clear why I would use the TextFieldParser. A simple split on a comma will end up introducing a variety of different problems that aren't present when using the TextFieldParser. The main example is the existence of a comma within a specific text value. I will update the question with a more complex test string to highlight the benefits of using the TextFieldParser. – sglantz Apr 26 '13 at 16:40
  • @sglantz : Updated code sample. I'm still suggesting that you will get max control doing this manually. I found the problem with CSV and other parsers is that its easy to have data that doesnt work with them even when they are programmed to be pretty generic. I think the new code sample is easier to read, but can be accomplished as well using C# regular expressions and the Match class. – Steven Magana-Zook Apr 26 '13 at 17:07
  • It still looks like the example still doesn't convey the complexity of the CSV files I am dealing with. They are not consistently formatted. Commas and quotes might appear in any of the 20+ fields in the file. TextFieldParser is good at handling that inconsistency while manually splitting and even regex become very complex very quickly when dealing with the variety of different formats. – sglantz Apr 26 '13 at 17:42
0

Working Solution :

using (TextFieldParser csvReader = new TextFieldParser(csv_file_path))
            {
                csvReader.SetDelimiters(new string[] { "," });
                csvReader.HasFieldsEnclosedInQuotes = false;
                string[] colFields = csvReader.ReadFields();

                while (!csvReader.EndOfData)
                {
                    string[] fieldData = csvReader.ReadFields();
                    for (i = 0; i < fieldData.Length; i++)
                    {
                        if (fieldData[i] == "")
                        {
                            fieldData[i] = null;
                        }
                        else
                        {
                            if (fieldData[i][0] == '"' && fieldData[i][fieldData[i].Length - 1] == '"')
                            {
                                fieldData[i] = fieldData[i].Substring(1, fieldData[i].Length - 2);
                            }
                        }
                    }
                    csvData.Rows.Add(fieldData);
                   }
            }
0

If you dont set HasFieldsEnclosedInQuotes = true the resultant list of columns will be more if the data contains (,) comma. e.g "Col1","Col2","Col3" "Test1", 100, "Test1,Test2" "Test2", 200, "Test22" This file should have 3 columns but while parsing you will get 4 fields which is wrong.

NitinK
  • 103
  • 1
  • 7
-1

Please set HasFieldsEnclosedInQuotes = true on TextFieldParser object before you start reading file.

optimusprime
  • 49
  • 1
  • 7
  • 1
    That option can be turned on and the TextParser will still be unable to parse the text. The problem is not that the fields are enclosed in quotes but within the fields there are quotes that are not properly escaped by using two quotes. – sglantz May 16 '13 at 18:48