
Possible Duplicate:
Boost Spirit QI slow

I'm currently experimenting with Boost Spirit QI for parsing CSV data.

I'm running my tests with a 1GB CSV file. Each row looks like

123|123|\n

and the file has about 60 million rows.
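
For anyone who wants to reproduce the measurements without the original file, test data in the same layout can be generated in memory. This is only a sketch: `make_sample_csv` is a hypothetical helper that is not part of the original code, and it assumes every field is a three-digit number as in the example row above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original post): builds `rows` lines of the
// form "<n>|<n>|\n" in memory, mirroring the "123|123|\n" layout shown above.
std::string make_sample_csv(std::size_t rows)
{
    std::string out;
    out.reserve(rows * 9); // each generated row is exactly 9 bytes long
    for (std::size_t i = 0; i < rows; ++i) {
        // Values cycle through 100..999, so every field has three digits.
        const std::string n = std::to_string(100 + i % 900);
        out += n;
        out += '|';
        out += n;
        out += "|\n";
    }
    return out;
}
```

The resulting string can be handed to the parser directly, so disk I/O stays out of the benchmark.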

I later want to generate parsers for arbitrary CSV files for which I know the data types of the columns. So I first did a test to parse the rows as integer rows (vector of a struct that consists of two ints) using the grammar:

csv = *(int_ >> lit('|') >> int_ >> lit('|'));

The file is parsed and the vector is filled in about 2 seconds. For my benchmarks I first load the CSV file into a std::string in memory, so reading the file from disk does not influence the measurement.

Now I tried the same but interpreted the first column as a string column (parsing into a vector of a struct that consists of a std::string and an int) using the grammar:

csv = *(lexeme[*~char_('|')] >> lit('|') >> int_ >> lit('|'));

Parsing now takes 12 seconds and memory consumption skyrockets. I made sure that I am not swapping (swapoff), so that is not the bottleneck. Why is parsing delimited strings so inefficient in Spirit QI? I would expect converting text to a number to be more expensive than copying the parsed string. Is there a better way?
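
One way to narrow this down is to compare against a hand-written baseline that builds the same std::string and int per row without Spirit. The sketch below is an assumption-laden reference point, not a replacement for the grammar: `baseline_row` and `parse_rows_by_hand` are hypothetical names, whitespace between rows is skipped only roughly the way the space skipper would, and integer overflow is not checked.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for test::row, so the sketch has no Boost dependency.
struct baseline_row
{
    std::string a;
    int b;
};

// Parses rows of the form "<text>|<int>|" separated by whitespace.
// Returns false on the first malformed row. No overflow check on the int.
bool parse_rows_by_hand(const std::string& input, std::vector<baseline_row>& out)
{
    const char* p = input.data();
    const char* const end = p + input.size();
    while (p != end) {
        // Skip whitespace before each row.
        while (p != end && (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')) ++p;
        if (p == end) break;

        // First column: everything up to the next '|', copied in one assign().
        const char* bar = std::find(p, end, '|');
        if (bar == end) return false;
        baseline_row r;
        r.a.assign(p, bar);
        p = bar + 1;

        // Second column: optional sign followed by at least one digit.
        bool negative = false;
        if (p != end && (*p == '-' || *p == '+')) { negative = (*p == '-'); ++p; }
        if (p == end || *p < '0' || *p > '9') return false;
        int value = 0;
        while (p != end && *p >= '0' && *p <= '9') { value = value * 10 + (*p - '0'); ++p; }

        // Trailing delimiter.
        if (p == end || *p != '|') return false;
        ++p;

        r.b = negative ? -value : value;
        out.push_back(std::move(r));
    }
    return true;
}
```

If this baseline is also much slower than the all-integer case, the cost is likely in allocating tens of millions of strings rather than in Spirit itself. If it is fast, the way the grammar fills the string attribute is the more likely suspect.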

Update: The delimited strings clearly seem to be the performance bottleneck. Interpreting the row as a double and an int with the grammar csv = *(double_ >> lit('|') >> int_ >> lit('|')); also parses the file in about 2 seconds.

Update 2 (Code):

namespace test
{
    namespace qi = boost::spirit::qi;
    namespace ascii = boost::spirit::ascii;

    struct row
    {
        std::string a;
        int b;
    };
}

BOOST_FUSION_ADAPT_STRUCT(
    test::row,
    (std::string, a)
    (int, b)
)

namespace test
{
    template <typename Iterator>
    struct row_parser : qi::grammar<Iterator, std::vector<row>(), ascii::space_type>
    {
        row_parser() : row_parser::base_type(start)
        {
            using qi::int_;
            using qi::lit;
            using qi::double_;
            using qi::lexeme;
            using ascii::char_;

            start = *(lexeme[*~char_('|')] >> lit('|') >> int_ >> lit('|'));
        }

        qi::rule<Iterator, std::vector<row>(), ascii::space_type> start;
    };
}

using boost::spirit::ascii::space;
typedef std::string::const_iterator iterator_type;
typedef test::row_parser<iterator_type> row_parser;

std::vector<test::row> v;

row_parser g;
std::string str;

string dataString(buffer, fileSize); // buffer contains the CSV file contents, fileSize is its size

auto startBoostParseTime = chrono::high_resolution_clock::now();

string::const_iterator iter = dataString.begin();
string::const_iterator end = dataString.end();

phrase_parse(iter, end, g, space, v);

auto endBoostParseTime = chrono::high_resolution_clock::now();

auto boostParseTime = endBoostParseTime - startBoostParseTime;
cout << "Boost Parsing time: " << boostParseTime.count()<< " result size "<< v.size() <<endl;
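
Note that `count()` on the difference of two `high_resolution_clock` time points reports ticks in the clock's native period, which is implementation-defined. A minimal sketch of reporting milliseconds instead, assuming a hypothetical helper named `elapsed_ms` that is not part of the original code:

```cpp
#include <chrono>

// Hypothetical helper: runs `work` once and returns the elapsed wall-clock
// time in whole milliseconds, measured with a monotonic clock.
template <typename F>
long long elapsed_ms(F&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
```

The parse call could then be wrapped as `elapsed_ms([&] { phrase_parse(iter, end, g, space, v); })`, which makes the 2 second and 12 second figures directly comparable across platforms.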
muehlbau
  • What are you doing with the result? "Memory usage high" tells me you are accumulating the entire file contents into an attribute. The bottleneck is most likely _there_. – sehe Nov 13 '12 at 12:38
  • I'm taking a std::string and want to generate a std::vector<Row> where Row is a struct, e.g., struct Row { std::string a; int b; }; I know that I essentially duplicate the memory footprint of the string (or even more than that, depending on the data types). However, the memory consumption while in phrase_parse is much higher than that. The resulting vector is correct, though. – muehlbau Nov 13 '12 at 12:42
  • would you mind sharing the actual code? I'd like to do some heap profiling on that – sehe Nov 13 '12 at 15:02
  • Sure, I attached the code. Mind that I read the CSV data from a file. You could create sample data on the fly. – muehlbau Nov 13 '12 at 15:50
