I'm writing a function which is designed to take a text like this:

%%HEADER
foo bar baz
%%BODY
foo baz baz

And return an array like this:

{"foo bar baz", "foo baz baz"}

With that in mind, I wrote the following:

string[2] separate_header_body(string input) {
  string[2] separated;
  auto header = matchFirst(input, regex(r"%%HEADER\n((.|\n)*)\n%%BODY", "g"));
  if (header.empty()) {
    throw new ParserException("No %%HEADER found.");
  } else {
    separated[0] = header.front();
    auto bdy = matchFirst(input, regex(r"%%BODY\n((.|\n)*)", "g"));
    if (bdy.empty()) {
      throw new ParserException("No %%BODY found.");
    } else {
      separated[1] = bdy.front();
    }
  }
  return separated;
}

However, when I try to test it with the following input:

"%%HEADER\nfoo bar baz\n%%BODY\nfoo baz baz"

The first capture is "%%HEADER\nfoo bar baz\n%%BODY", which is clearly too much. Am I using std.regex incorrectly for what I want?

Koz Ross
When you use a regex, the first element of the captures is generally the entire match; index `1` is your actual first capture group. – progrenhard Jul 24 '14 at 23:52

2 Answers

You can use:

auto m = matchFirst(input, regex(r"%%HEADER\n(.*?)\n%%BODY", "s"));
string header = m.captures[1];

and

auto m = matchFirst(input, regex(r"%%BODY\n(.*)", "s"));
string body = m.captures[1];

The s modifier allows the dot to match the newline character.

The question mark ? makes the * quantifier lazy, so the match stops at the first occurrence of "\n%%BODY".
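To make the greedy/lazy difference visible, here is a small sketch. The input with two %%BODY markers and the helper name firstGroup are made up for illustration; they are not from the question.

```d
import std.regex : matchFirst, regex;
import std.stdio : writeln;

/// Hypothetical helper: returns capture group 1 of `pattern`
/// (compiled with the "s" flag), or null when nothing matches.
string firstGroup(string input, string pattern)
{
    auto m = matchFirst(input, regex(pattern, "s"));
    return m.empty ? null : m[1];
}

void main()
{
    // Made-up input containing two %%BODY markers.
    string input = "%%HEADER\nfoo\n%%BODY\nbar\n%%BODY\nbaz";

    // Greedy .* extends up to the LAST "\n%%BODY".
    writeln(firstGroup(input, r"%%HEADER\n(.*)\n%%BODY"));

    // Lazy .*? stops at the FIRST "\n%%BODY".
    writeln(firstGroup(input, r"%%HEADER\n(.*?)\n%%BODY"));
}
```

With a single %%BODY marker, as in the question's input, both patterns should capture the same text; the lazy form only matters once the marker can appear again further down.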

Note that it is possible to extract the two fields in one shot with a pattern like this:

auto m = matchFirst(input, regex(r"%%HEADER\n(.*?)\n%%BODY\n(.*)", "s"));

string header = m.captures[1];
string body = m.captures[2];

if (header.empty()) {
    throw new ParserException("No %%HEADER found.");
} else {
    separated[0] = header;
}

if (body.empty()) {
    throw new ParserException("No %%BODY found.");
} else {
    separated[1] = body;
}

You only need to extract the two capturing groups.
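Put together as a complete function, the one-shot approach might look like the sketch below. The ParserException class here is only a stand-in, since the question does not show its definition, and checking m.empty before reading the groups (with a single combined error message) is my own choice rather than something from the question.

```d
import std.regex : matchFirst, regex;

// Stand-in for the question's ParserException (its real definition is not shown).
class ParserException : Exception
{
    this(string msg, string file = __FILE__, size_t line = __LINE__)
    {
        super(msg, file, line);
    }
}

string[2] separate_header_body(string input)
{
    // "s" lets the dot match newlines; .*? stops at the first "\n%%BODY".
    auto m = matchFirst(input, regex(r"%%HEADER\n(.*?)\n%%BODY\n(.*)", "s"));

    // m.empty is true when the pattern as a whole did not match.
    if (m.empty)
        throw new ParserException("Expected %%HEADER and %%BODY sections.");

    string[2] separated;
    separated[0] = m[1]; // text between the two markers
    separated[1] = m[2]; // everything after "%%BODY\n"
    return separated;
}

void main()
{
    auto parts = separate_header_body("%%HEADER\nfoo bar baz\n%%BODY\nfoo baz baz");
    assert(parts[0] == "foo bar baz");
    assert(parts[1] == "foo baz baz");
}
```

Because both sections come from a single pattern, this version cannot tell which marker was missing. If you need the two separate messages from the question, keep the two separate matches.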

Casimir et Hippolyte

The first element (front) of the captures returned by matchFirst is the full match, from the first character matched by your pattern to the last. After that comes the first submatch, i.e. the first parenthesized section.

So, use [1] instead of front. Or popFront the full match away so that front is the first submatch.
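A short sketch of both options, using the question's test input and a pattern with a lazy group and the "s" flag (the pattern is an assumption, not the exact one from the question):

```d
import std.regex : matchFirst, regex;

void main()
{
    auto m = matchFirst("%%HEADER\nfoo bar baz\n%%BODY\nfoo baz baz",
                        regex(r"%%HEADER\n(.*?)\n%%BODY", "s"));
    assert(!m.empty);

    // front is the whole match, markers included.
    assert(m.front == "%%HEADER\nfoo bar baz\n%%BODY");

    // Index 1 is the first parenthesized group.
    assert(m[1] == "foo bar baz");

    // Dropping the full match makes the first group the new front.
    m.popFront();
    assert(m.front == "foo bar baz");
}
```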

user3874020