
I have thousands of lines of input. Each line consists of 3 ints followed by a comma, like this:

5 6 10,
8 9 45,
.....

How can I create a grammar that parses only a certain section of the input, for example the first 100 lines or lines 1000 to 1200, and ignores the rest?

My grammar currently looks like this:

qi::int_ >> qi::int_ >> qi::int_ >> qi::lit(",");

But obviously it parses the whole input.

Slava C

2 Answers


You could just seek up to the interesting point and parse 100 lines there.
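
A minimal sketch of that seek-then-parse idea using only the standard library, assuming the `a b c,` line format from the question. `read_line_range` and `Triple` are hypothetical names made up for illustration, not part of any library. Lines before `first` are only scanned for the newline and are never converted to integers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <istream>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

using Triple = std::array<int, 3>;

// Hypothetical helper: return the triples found on lines first..last
// (1-based, inclusive). Stops early at end of input or at a malformed line.
std::vector<Triple> read_line_range(std::istream& in, std::size_t first, std::size_t last) {
    assert(first >= 1 && first <= last);

    // Skip the lines before `first`: only the '\n' is searched for.
    for (std::size_t i = 1; i < first && in; ++i)
        in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');

    std::vector<Triple> out;
    std::string line;
    const std::size_t wanted = last - first + 1;
    while (out.size() < wanted && std::getline(in, line)) {
        std::istringstream fields(line);
        Triple t{};
        char comma = '\0';
        // Expect exactly: int int int ','
        if (!(fields >> t[0] >> t[1] >> t[2] >> comma) || comma != ',')
            break; // malformed line: stop rather than guess
        out.push_back(t);
    }
    return out;
}
```

With this shape, `read_line_range(std::cin, 1000, 1200)` would return at most 201 triples and stop reading after line 1200.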

A sketch of how to skip 100 lines using just Spirit:


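#define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted/std_tuple.hpp>
#include <boost/fusion/include/io.hpp> // operator<< for Fusion sequences
#include <iostream>
#include <tuple>
#include <vector>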

namespace qi = boost::spirit::qi;

int main() {
    using It  = boost::spirit::istream_iterator;
    using Tup = std::tuple<int, int, int>;

    It f(std::cin >> std::noskipws), l;
    std::vector<Tup> data;

    using namespace qi;

    if (phrase_parse(f, l,
            omit [ repeat(100) [ *(char_ - eol) >> eol ] ] >> // omit 100 lines
            repeat(10) [  int_ >> int_ >> int_ >> ',' >> eol ], // parse 10 3-tuples
            blank, data))
    {
        int line = 100;
        for(auto tup : data)
            std::cout << ++line << "\t" << boost::fusion::as_vector(tup) << "\n";
    }

}

When tested with some random input like

od -Anone -t d2 /dev/urandom -w6 | sed 's/$/,/g' | head -200 | tee log | ./test
echo ============== VERIFY WITH sed:
nl log | sed -n '101,110p'

It prints the expected lines, for example:

101 (15400 5215 -20219)
102 (26426 -17361 -6618)
103 (-15311 -6387 -5902)
104 (22737 14339 16074)
105 (-28136 21003 -11594)
106 (-11020 -32377 -4866)
107 (-24024 10995 22766)
108 (3438 -19758 -10931)
109 (28839 22032 -7204)
110 (-25237 23224 26189)
============== VERIFY WITH sed:
   101    15400   5215 -20219,
   102    26426 -17361  -6618,
   103   -15311  -6387  -5902,
   104    22737  14339  16074,
   105   -28136  21003 -11594,
   106   -11020 -32377  -4866,
   107   -24024  10995  22766,
   108     3438 -19758 -10931,
   109    28839  22032  -7204,
   110   -25237  23224  26189,
sehe
  • Does omitting the first 100 lines actually make parsing faster, meaning those lines are not parsed at all, or are they parsed and then ignored? I'm looking for ways to shorten parsing time when I only need part of the content. Another question: is using std::tuple to store the values more efficient than storing them in a struct with 2 ints? – Slava C Oct 12 '15 at 09:49
  • @SlavaC I dunno. Note the _first sentence_ of my answer! The rest is just indulging your question ("How can I create a grammar that parses only a certain section of an input"). I assume you thought about the question before asking. I don't think it will speed up in this case, because 100 is chump change. In practice it all depends (what grammar, what source iterators, IO latency, ...). – sehe Oct 12 '15 at 12:16
  • I see. I'm starting to get to something more practical with my tests and learning, and I'm running into some real-life requirements. I'm trying to read and parse a huge (1 GB and up) VRML file, but in many cases I need only some of the nodes (hundreds of nodes in the middle of the file) and not the whole content of the file. And the tuple question: is there a preference? Thanks! You helped me a lot, also with a previous question :) – Slava C Oct 12 '15 at 12:49
  • The tuple will obviously be less efficient than a struct with 2 ints, because it stores 1 int less. And no, there is no other reason (other than you didn't supply the code, so I had to make up a datatype). – sehe Oct 12 '15 at 12:56
  • If you post a "real" question (with sample VRML, actual requirements and what you have got) I would be happy to look at optimizations. Meanwhile also see [this answer](http://stackoverflow.com/a/28341791/85371) and specifically the [linked second approach](http://stackoverflow.com/questions/28217301/using-boostiostreamsmapped-file-source-with-stdmultimap/28220864#28220864). It does raw seeks on a text file and detects the nearest start-of-line. – sehe Oct 12 '15 at 12:59
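
The raw-seek idea from the last comment could look roughly like the sketch below. This is not the code from the linked answer. `seek_to_line_start` is a hypothetical helper, and it assumes a seekable stream with `'\n'` line endings. It moves to the next line start at or after a byte offset, which is one simple way to land on a line boundary after an arbitrary seek.

```cpp
#include <cassert>
#include <istream>
#include <limits>
#include <sstream> // std::istringstream, handy for trying this out
#include <string>

// Hypothetical helper: jump to a byte offset, then advance to the first
// position at or after it that begins a line. Offset 0 is treated as a line
// start. Returns false if no line begins at or after the offset.
bool seek_to_line_start(std::istream& in, std::streamoff offset) {
    assert(offset >= 0);
    in.clear(); // drop a stale eof/fail state before seeking
    if (offset == 0) {
        in.seekg(0);
        return in.good();
    }
    // Step back one byte: if that byte is '\n', `offset` itself begins a line.
    in.seekg(offset - 1);
    if (!in)
        return false;
    in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    // Fails if no '\n' was found, or if the '\n' was the last byte of the input.
    return in.good() && in.peek() != std::istream::traits_type::eof();
}
```

When used on a file, opening it with `std::ios::binary` is probably the safer choice, so that byte offsets are not affected by newline translation.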

Just because I want to learn more about Spirit X3, and because the world would like to know more about this upcoming version of the library, here's a more intricate version that shows a way to dynamically filter lines according to some expression.

In this case the lines are handled by this handler:

auto handle = [&](auto& ctx) mutable {
    using boost::fusion::at_c;

    if (++line_no % 10 == 0)
    {
        auto& attr = x3::_attr(ctx);
        data.push_back({ at_c<0>(attr), at_c<1>(attr), at_c<2>(attr) });
    }
};

As you'd expect, every 10th line is included.


#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>
#include <iostream>
#include <vector>

namespace x3 = boost::spirit::x3;

int main() {
    using It  = boost::spirit::istream_iterator;

    It f(std::cin >> std::noskipws), l;

    struct Tup { int a, b, c; };
    std::vector<Tup> data;

    size_t line_no = 0;

    auto handle = [&](auto& ctx) mutable {
        using boost::fusion::at_c;

        if (++line_no % 10 == 0)
        {
            auto& attr = x3::_attr(ctx);
            data.push_back({ at_c<0>(attr), at_c<1>(attr), at_c<2>(attr) });
        }
    };

    if (x3::phrase_parse(f, l, (x3::int_ >> x3::int_ >> x3::int_) [ handle ] % (',' >> x3::eol), x3::blank))
    {
        for(auto tup : data)
            std::cout << tup.a << " " << tup.b << " " << tup.c << "\n";
    }

}

Prints e.g.

g++ -std=c++1y -O2 -Wall -pedantic -pthread main.cpp -o test
od -Anone -t d2 /dev/urandom -w6 | sed 's/$/,/g' | head -200 | tee log | ./test
echo ============== VERIFY WITH perl:
nl log | perl -ne 'print if $. % 10 == 0'
-8834 -947 -8151
13789 -20056 -11874
6919 -27211 -19472
-7644 18021 13523
-20120 16923 -11419
27772 31149 14005
3540 4894 -24790
10698 10223 -30397
-22533 -32437 -13665
25813 3264 -16414
11453 11955 18268
5092 27052 17930
10915 6493 20432
-14380 -6085 -25430
18599 6710 17279
22049 22259 -32189
1048 14621 6452
-24996 10856 29429
3537 -26338 19623
-4117 6617 14009
============== VERIFY WITH perl:
    10    -8834   -947  -8151,
    20    13789 -20056 -11874,
    30     6919 -27211 -19472,
    40    -7644  18021  13523,
    50   -20120  16923 -11419,
    60    27772  31149  14005,
    70     3540   4894 -24790,
    80    10698  10223 -30397,
    90   -22533 -32437 -13665,
   100    25813   3264 -16414,
   110    11453  11955  18268,
   120     5092  27052  17930,
   130    10915   6493  20432,
   140   -14380  -6085 -25430,
   150    18599   6710  17279,
   160    22049  22259 -32189,
   170     1048  14621   6452,
   180   -24996  10856  29429,
   190     3537 -26338  19623,
   200    -4117   6617  14009,
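
The modulo test in the handler is only one possible predicate. As a rough sketch, the same filtering can be written with the predicate passed in, so that a range such as lines 1000 to 1200 from the question becomes a one-line lambda. This version uses only the standard library rather than X3, and `filter_lines` and `Triple` are hypothetical names. Like the handler above, it parses every line and decides afterwards whether to keep it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

using Triple = std::array<int, 3>;

// Hypothetical helper: parse every "a b c," line, but store only those whose
// 1-based line number satisfies `keep`. Stops at the first malformed line.
std::vector<Triple> filter_lines(std::istream& in,
                                 const std::function<bool(std::size_t)>& keep) {
    assert(static_cast<bool>(keep)); // an empty std::function would throw when called
    std::vector<Triple> out;
    std::string line;
    std::size_t line_no = 0;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        Triple t{};
        char comma = '\0';
        if (!(fields >> t[0] >> t[1] >> t[2] >> comma) || comma != ',')
            break; // malformed line: stop rather than guess
        if (keep(++line_no)) // the filter expression, evaluated per line
            out.push_back(t);
    }
    return out;
}
```

A call such as `filter_lines(std::cin, [](std::size_t n) { return n >= 1000 && n <= 1200; })` would select the range, and `[](std::size_t n) { return n % 10 == 0; }` would reproduce the every-10th-line behaviour.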
TemplateRex
sehe