I am having trouble correctly parsing a CSV file. Some of the values in the data rows can be blank, and my code does not work correctly when I have blank entries in any of the value rows. Without blank entries, the program returns the following results:

Symbol: GOOG
Name: Googl Inc.
Price: $570.25
High Today: $570.25
Low Today: $560.35

Symbol: APPL
Name: Apple Inc.
Price: $123.25
High Today: $124.25
Low Today: $125.35

If I run the same program with the following CSV string, it stops with an assertion error. This happens because the parser skips over adjacent ,, delimiters, so the number of columns in the data row no longer matches the number of columns in the header.

std::stringstream ifs(
    "Symbol,Name,Price,High Today,Low Today\n"
    "GOOG,Googl Inc.,$570.25 ,$570.25 ,$560.35\n"
    "APPL,Apple Inc.,$123.25 ,,$125.35\n");

Here is my code:

#include <iostream>
#include <vector>
#include <string>
#include <sstream>
#include <fstream>
#include <algorithm>
#include <iterator>
#include <cassert>
#include <locale>

// This ctype facet classifies commas and endlines as whitespace
struct csv_whitespace : std::ctype<char> {
    static const mask* make_table() {
        // make a copy of the "C" locale table
        static std::vector<mask> v(classic_table(), classic_table() + table_size);
        v[','] |= space;        // comma will be classified as whitespace
        v[' '] &= ~space;       // space will not be classified as whitespace
        return &v[0];
    }
    csv_whitespace(std::size_t refs = 0)
        : ctype(make_table(), false, refs)
    {}
};

static int row_end = std::ios_base::xalloc();

std::istream& record(std::istream& is) {
    while (std::isspace(is.peek(), is.getloc())) {
        int c(is.peek());
        is.ignore();
        if (c == '\n') {
            is.iword(row_end) = 1;
            is.setstate(std::ios_base::failbit);
        }
    }
    return is;
}

template<class Iter1, class Iter2, class Function>
void for_each_binary_range(Iter1 first1, Iter1 last1,
    Iter2 first2, Iter2 last2, Function f)
{
    assert(std::distance(first1, last1) <=
        std::distance(first2, last2));
    while (first1 != last1) {
        f(*first1++, *first2++);
    }
}

int main(int argc, char *argv[])
{
    std::stringstream ifs(
        "Symbol,Name,Price,High Today,Low Today\n"
        "GOOG,Googl Inc.,$570.25 ,$570.25 ,$560.35\n"
        "APPL,Apple Inc.,$123.25 ,$124.25 ,$125.35\n");
    //std::ifstream ifs("c:\\temp\\csvfile.csv", std::ios::in);
    std::vector<std::string> keys, values;
    ifs.imbue(std::locale(ifs.getloc(), new csv_whitespace));
    bool bHeaderProcessed = false;
    for (std::string item;;) {
        if (ifs >> record >> item) {
            if (!bHeaderProcessed) {
                keys.push_back(item);
            } else {
                values.push_back(item);
            }
        } else if (ifs.eof()) {
            // catch case where last line does not have trailing \n
            if (!values.empty()) {
                for_each_binary_range(std::begin(keys), std::end(keys),
                    std::begin(values), std::end(values),
                    [&](std::string const& key, std::string const& value) {
                        std::cout << key << ": " << value << std::endl;
                    });
                values.clear();
                std::cout << std::endl;
            }
            break;
        } else if (ifs.iword(row_end)) {
            // reset eol flag & clear stream state
            ifs.iword(row_end) = 0;
            // clear the fail-bit so we can stream more values
            ifs.clear();
            bHeaderProcessed = true;
            if (!values.empty()) {
                for_each_binary_range(std::begin(keys), std::end(keys),
                    std::begin(values), std::end(values),
                    [&](std::string const& key, std::string const& value) {
                        std::cout << key << ": " << value << std::endl;
                    });
                values.clear();
                std::cout << std::endl;
            }
        } else {
            break;
        }
    }
    return 0;
}

The original code which I based mine on is documented well here. Unfortunately, the answer to the question (with a live demo here) does not seem to handle the case where there are multiple rows and I cannot get it to handle the case where the tokens are empty.

My version prints out each of the rows as a series of name/values and it also handles the case where there are multiple rows or a row not ending on a new line.

The logic is described very well in the linked answer above.

Could someone point out how to handle adjacent delimiters in the data lines of the CSV?

johnco3
  • We need to see your code to help you with it. After a full day of answering questions our psychic powers are a bit weak right now. ;) – soulsabr Dec 01 '15 at 23:29
  • @johnco3 @user4581301 No, that is what you call a reading fail. I'm not used to bolded blue links and I guess I glossed them over. I offer my apologies for this error on my part. – soulsabr Dec 01 '15 at 23:34
  • No worries, @soulsabr. In a moment I'm going to port the important content of my comment to a new comment and delete the old one so we can start from a clean-ish slate. – user4581301 Dec 01 '15 at 23:44
  • Folks, I fixed the question by inlining the code to clarify. Apologies and good points all round; thanks for the constructive comments. – johnco3 Dec 01 '15 at 23:45
  • And the point has been rendered moot. Thank you, @johnco3 – user4581301 Dec 01 '15 at 23:45

1 Answer

Your problem lies in the difference between how you expect the data to be parsed and how it is actually parsed. The adjacent ",," is skipped entirely, so nothing is pushed onto the values vector for the empty field. As a result the values vector ends up one or more entries short. The assert fails because it requires the size of keys to be <= the size of values. If the ",," had appeared in the header row instead, the keys vector would be the shorter one and the assert would pass.

std::stringstream ifs(
    "Symbol,Name,,High Today,Low Today\n"
    "GOOG,Googl Inc.,$570.25 ,$570.25 ,$560.35\n"
    "APPL,Apple Inc.,$123.25 ,$124.25 ,$125.35\n");

Try using the above and observe the output.

An easy, though inelegant, solution would be to insert a space in between each ",," so the program will pick it up. There are much better solutions to be had but that should get you going.

EDIT: Thanks for understanding.

soulsabr
  • Actually, your answer pretty much paraphrases the last sentence of my second paragraph in the question. I was aware of the space hack; however, that dirty workaround would also require a space at the end of the line if the last value was empty, which is not visually obvious in an editor. – johnco3 Dec 01 '15 at 23:59
  • To be fair you did change the question after I answered. – soulsabr Dec 02 '15 at 15:06
  • Actually that comment was there all along, I changed the question to add the source code. – johnco3 Dec 02 '15 at 16:27