2

Consider this message:

N,8545,01/02/2011 09:15:01.815,"RASTA OPTSTK 24FEB2011 1,150.00 CE",S,8.80,250,0.00,0

This is just a sample: one row of a CSV file. If I split it on commas, the quoted field breaks apart at the comma inside 1,150.00.

The string inside the double quotes is of variable length, but it should be treated as a single "element" (if I may use the term). The other elements are the ones separated by commas.

How do I parse it (other than with the Ragel parsing engine)?

Soham

  • 1
    What you have is a field that is "optionally" enclosed by quotation marks. See http://programmers.stackexchange.com/questions/26284/best-way-to-handle-delimited-files, http://stackoverflow.com/questions/4479977/csv-parsing-for-embedded-double-quotes, http://stackoverflow.com/questions/736629/parse-delimited-csv-in-net, and many others. – JYelton Feb 15 '11 at 21:11
  • Who the effing hell decided to escape commas and double quotes with double quotes??? Life, if any more interesting, will be a tad boring – Soham Feb 15 '11 at 21:28

4 Answers

4

Break the string into fields separated by commas provided that the commas are not embedded in quoted strings.

A quick way to do this is to use a state machine.

boolean inQuote = false;
StringBuffer buffer = new StringBuffer();
int c;
// readchar() is to be implemented however you read a char; it returns -1 at end of input
while ((c = readchar()) != -1) {
  switch (c) {

    case ',':
      if (!inQuote) {
         // store the field in our parsedLine object for later processing.
         parsedLine.addField(buffer.toString());
         buffer.setLength(0);
      } else {
         // a comma inside quotes is part of the field
         buffer.append((char) c);
      }
      break;

    case '"':
      inQuote = !inQuote;
      // fall through to next target is deliberate.

    default:
      buffer.append((char) c);

  }
}
// the last field has no trailing comma, so store it after the loop
parsedLine.addField(buffer.toString());

Note that this is only an example; real CSV files have more cases to account for, such as embedded quotes within quoted fields, and you must decide whether it is appropriate to strip the outer quotes (as in your example).
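Since the question is about C, here is a minimal sketch of the same state machine in plain C. The function name `split_csv_line` and its in-place design are assumptions made for illustration, not part of the answer above. It strips the outer quotes and treats a doubled quote inside a quoted field as one literal quote.

```c
#include <stddef.h>

/* Hypothetical helper: split one CSV record in place.
 * Stores up to max_fields pointers in fields[] and returns how many were
 * stored. Outer quotes are removed and "" inside a quoted field becomes ".
 * Fields beyond max_fields are dropped. */
size_t split_csv_line(char *line, char **fields, size_t max_fields)
{
    char *src = line;   /* read position */
    char *dst = line;   /* write position, never ahead of src */
    size_t count = 0;
    int in_quote = 0;

    if (max_fields == 0)
        return 0;
    fields[count++] = dst;

    /* Stop at end of string, or at a line ending outside quotes. */
    for (; *src != '\0' && (in_quote || (*src != '\n' && *src != '\r')); src++) {
        if (in_quote) {
            if (*src != '"') {
                *dst++ = *src;          /* ordinary character, commas included */
            } else if (src[1] == '"') {
                *dst++ = '"';           /* doubled quote: keep one */
                src++;
            } else {
                in_quote = 0;           /* closing quote */
            }
        } else if (*src == '"') {
            in_quote = 1;               /* opening quote */
        } else if (*src == ',') {
            *dst++ = '\0';              /* end of the current field */
            if (count == max_fields)
                return count;
            fields[count++] = dst;
        } else {
            *dst++ = *src;
        }
    }
    *dst = '\0';
    return count;
}
```

Called on a writable copy of the sample row, this should yield nine fields, with the fourth holding the whole quoted string including its comma. The pointers in `fields[]` point into `line`, so the buffer has to outlive them.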

Edwin Buck
  • You are talking in .NET lingo, eh? I think the `parsedLine.addField()` stuff doesn't work on Linux – Soham Feb 15 '11 at 21:30
  • Your sample code is in C++ not C, but the concept is still sound. – bta Feb 15 '11 at 21:32
  • 1
    Yes, I agree. A state machine is the way to go; I was just thinking about how to implement it "nicely/sweetly" in C on Linux. – Soham Feb 15 '11 at 21:38
  • I think a break is missing at the end of the *first* case. – CAFxX Feb 15 '11 at 21:45
  • Thank you all for the comments. Yes a break is missing in the first case, it is added now. Also, parsedLine.addField(...) was pseudocode for "do whatever you need to do, or store the result". If it looks like .NET code, that's a coincidence. – Edwin Buck Feb 16 '11 at 10:34
1

A quick and dirty solution if you don't want to add external libraries would be converting the double quotes to \0 (the end of string marker), then parsing the three strings separately using sscanf. Ugly but should work.

Assuming the input is well-formed (otherwise you'll have to add error handling):

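for (i=0; str[i]; i++)
  if (str[i] == '"') str[i] = 0;
// sscanf returns the number of items converted, not the number of characters read,
// so use %n (n is an int) to get the offset of the end of the first part
sscanf(str, "%c,%d,%d/%d/%d %d:%d:%d.%d,%n", &var1, &var2, ..., &var9, &n);
var10 = str + n + 1; // +1 skips the \0 that replaced the opening quote
// the second +1 skips the \0 that replaced the closing quote
sscanf(var10 + strlen(var10) + 1, ",%c,%f,%d,%f,%d", &var11, &var12, ..., &var15);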

You will obviously have to make a copy of var10 if you want to free str immediately.
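For reference, a self-contained sketch of this approach might look like the following. The `struct tick` layout, its field names, and the `parse_tick` function are all assumptions invented for illustration; the question does not say what the columns mean. It also departs slightly from the loop above by locating the two quotes with `strchr`, which gives the start of each of the three parts directly.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for the sample row; the field meanings are guesses. */
struct tick {
    char type;
    int id;
    int d1, d2, year;        /* date parts; day/month order is unknown */
    int hour, min, sec, msec;
    const char *symbol;      /* points into the caller's buffer */
    char side;
    float price;
    int qty;
    float other;
    int flag;
};

/* Parses a row that contains exactly one quoted field, modifying str in
 * place. Returns 1 on success, 0 if the row does not match the layout. */
int parse_tick(char *str, struct tick *t)
{
    char *q1 = strchr(str, '"');
    char *q2 = q1 ? strchr(q1 + 1, '"') : NULL;

    if (q2 == NULL)
        return 0;

    /* Turn both quotes into terminators: str is now three C strings. */
    *q1 = '\0';
    *q2 = '\0';

    /* Part 1: everything before the quoted field. */
    if (sscanf(str, "%c,%d,%d/%d/%d %d:%d:%d.%d,",
               &t->type, &t->id, &t->d1, &t->d2, &t->year,
               &t->hour, &t->min, &t->sec, &t->msec) != 9)
        return 0;

    /* Part 2: the quoted field itself, without its quotes. */
    t->symbol = q1 + 1;

    /* Part 3: everything after the closing quote. */
    if (sscanf(q2 + 1, ",%c,%f,%d,%f,%d",
               &t->side, &t->price, &t->qty, &t->other, &t->flag) != 5)
        return 0;

    return 1;
}
```

Checking the `sscanf` return values is the minimum error handling here: a row with a different number of columns, or with more than one quoted field, is rejected rather than half-parsed.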

CAFxX
1

This is a function to get the next single CSV field from an input file supplied as a FILE *. It expects the file to be opened in text mode, and supports quoted fields with embedded quotes and newlines. Fields longer than the size of the supplied buffer are truncated.

int get_csv_field(FILE *f, char *buf, size_t size)
{
    char *p = buf;
    int c;
    enum { QS_UNQUOTED, QS_QUOTED, QS_GOTQUOTE } quotestate = QS_UNQUOTED;

    if (size < 1)
        return EOF;

    while ((c = getc(f)) != EOF)
    {
        if ((c == '\n' || c == ',') && quotestate != QS_QUOTED)
            break;

        if (c == '"')
        {
            if (quotestate == QS_UNQUOTED)
            {
                quotestate = QS_QUOTED;
                continue;
            }

            if (quotestate == QS_QUOTED)
            {
                quotestate = QS_GOTQUOTE;
                continue;
            }

            if (quotestate == QS_GOTQUOTE)
            {
                quotestate = QS_QUOTED;
            }
        }

        if (quotestate == QS_GOTQUOTE)
        {
            quotestate = QS_UNQUOTED;
        }

        if (size > 1)
        {
            *p++ = c;
            size--;
        }
    }

    *p = '\0';

    return c;
}
caf
0

How about libcsv from our very own Robert Gamble?

John Zwinck
  • Is there any documentation for it? I am not against using external engines like Ragel; it's just that I am a total greenhorn in Ragel (2-3 hours of reading Ragel code) – Soham Feb 16 '11 at 07:22