I'm currently experimenting with Boost Spirit Qi for parsing CSV data.
I'm running my tests with a 1GB CSV file. Each row looks like
123|123|\n
and the file has about 60 million rows.
Later I want to generate parsers for arbitrary CSV files whose column types I know. As a first test, I parsed each row as two integers (into a vector of a struct that consists of two ints) using the grammar:
csv = *(int_ >> lit('|') >> int_ >> lit('|'));
The file is parsed and the vector filled in about 2 seconds. For the benchmarks I first load the whole CSV file into a std::string in memory, so reading the file from disk does not influence the timings.
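For comparison, here is a minimal sketch of a hand-written baseline for the same int|int| rows, timed on an in-memory string. It is not Spirit and not part of the benchmark above; the names int_row, parse_int_rows and time_parse_ms are made up for illustration, and it assumes C++17 for std::from_chars.

```cpp
#include <charconv>
#include <chrono>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical row type for the two-integer test.
struct int_row { int a; int b; };

// Parses rows of the form "123|123|\n" by hand.
// Returns false as soon as a row does not match that shape.
bool parse_int_rows(const std::string& data, std::vector<int_row>& out)
{
    const char* p = data.data();
    const char* const end = p + data.size();
    while (p != end) {
        if (*p == '\n') { ++p; continue; }  // skip row terminators
        int_row r{};
        auto res = std::from_chars(p, end, r.a);
        if (res.ec != std::errc() || res.ptr == end || *res.ptr != '|') return false;
        p = res.ptr + 1;                     // step over the first '|'
        res = std::from_chars(p, end, r.b);
        if (res.ec != std::errc() || res.ptr == end || *res.ptr != '|') return false;
        p = res.ptr + 1;                     // step over the second '|'
        out.push_back(r);
    }
    return true;
}

// Times a single parse of an already loaded buffer, in milliseconds.
long long time_parse_ms(const std::string& data, std::vector<int_row>& out)
{
    const auto t0 = std::chrono::steady_clock::now();
    parse_int_rows(data, out);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
}
```

Calling time_parse_ms on the same std::string that is handed to Spirit gives a rough lower bound to compare the 2 seconds against.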
Next I tried the same thing but treated the first column as a string (parsing into a vector of a struct that consists of a std::string and an int) using the grammar:
csv = *(lexeme[*~char_('|')] >> lit('|') >> int_ >> lit('|'));
Parsing now takes 12 seconds and memory consumption skyrockets. I made sure that I am not swapping (swapoff), so swapping is not the bottleneck. Why is parsing delimited strings so inefficient in Spirit Qi? I would expect converting text to a number to be more expensive than memcpy-ing the parsed string. Is there a better way?
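To make the "copying should be cheap" intuition concrete, here is a hand-written sketch of the string|int| case that finds the delimiter first and then copies the field in one bulk assign. It is not Spirit and not a fix for the grammar; str_row and parse_str_rows are hypothetical names, and whether the gap to Spirit comes from how the string attribute is filled is an assumption that would need measuring.

```cpp
#include <algorithm>
#include <charconv>
#include <string>
#include <system_error>
#include <utility>
#include <vector>

// Hypothetical row type mirroring the std::string + int struct.
struct str_row { std::string a; int b; };

// Parses rows of the form "text|123|\n" by hand.
// Returns false as soon as a row does not match that shape.
bool parse_str_rows(const std::string& data, std::vector<str_row>& out)
{
    const char* p = data.data();
    const char* const end = p + data.size();
    while (p != end) {
        if (*p == '\n') { ++p; continue; }        // skip row terminators
        const char* bar = std::find(p, end, '|');  // locate the end of the string field
        if (bar == end) return false;
        str_row r;
        r.a.assign(p, bar);                        // one bulk copy of the whole field
        p = bar + 1;
        const auto res = std::from_chars(p, end, r.b);
        if (res.ec != std::errc() || res.ptr == end || *res.ptr != '|') return false;
        p = res.ptr + 1;
        out.push_back(std::move(r));               // move the row, do not copy the string again
    }
    return true;
}
```

Note that this still allocates one std::string per row whenever the field is too long for the small-string buffer, so with 60 million rows the memory footprint of the result vector stays large either way.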
Update: The delimited strings clearly seem to be the performance bottleneck. Interpreting the row as a double and an int with the grammar
csv = *(double_ >> lit('|') >> int_ >> lit('|'));
also parses the file in about 2 seconds.
Update 2 (Code):
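#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/include/adapt_struct.hpp>
#include <chrono>
#include <iostream>
#include <string>
#include <vector>
namespace test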
{
namespace qi = boost::spirit::qi;
namespace ascii = boost::spirit::ascii;
struct row
{
std::string a;
int b;
};
}
BOOST_FUSION_ADAPT_STRUCT(
test::row,
(std::string, a)
(int, b)
)
namespace test
{
template <typename Iterator>
struct row_parser : qi::grammar<Iterator, std::vector<row>(), ascii::space_type>
{
row_parser() : row_parser::base_type(start)
{
using qi::int_;
using qi::lit;
using qi::double_;
using qi::lexeme;
using ascii::char_;
start = *(lexeme[*~char_('|')] >> lit('|') >> int_ >> lit('|'));
}
qi::rule<Iterator, std::vector<row>(), ascii::space_type> start;
};
}
using boost::spirit::ascii::space;
typedef std::string::const_iterator iterator_type;
typedef test::row_parser<iterator_type> row_parser;
std::vector<test::row> v;
row_parser g;
std::string str;
string dataString(buffer, fileSize); // buffer contains the CSV file contents, fileSize is its size
auto startBoostParseTime = chrono::high_resolution_clock::now();
string::const_iterator iter = dataString.begin();
string::const_iterator end = dataString.end();
phrase_parse(iter, end, g, space, v);
auto endBoostParseTime = chrono::high_resolution_clock::now();
auto boostParseTime = endBoostParseTime - startBoostParseTime;
cout << "Boost Parsing time: " << chrono::duration_cast<chrono::milliseconds>(boostParseTime).count() << " ms, result size " << v.size() << endl;