19

In JavaScript regular expressions, ^ and $ designate the start and end of the string. Only with the /m modifier (multiline mode) do they also match the start and end of a line, that is, the positions right after and right before a CR/LF.

But in std::regex with the ECMAScript grammar, ^ and $ always match the start and end of a line.

Is there any way in std::regex to define match points for the start and end of the string? In other words: to support JavaScript multiline mode ...

Enlico
c-smile
  • The point is that `^` and `$` match the start and end of string. See https://ideone.com/amatBf and https://ideone.com/0D7eS7 – Wiktor Stribiżew Sep 22 '16 at 18:06
  • @WiktorStribiżew Ok, how to modify your samples for `^` and `$` to match start/end of line? – c-smile Sep 22 '16 at 18:23
  • I already mentioned: for the end of line, it is `(?=\n|$)`, for the start of line it can only be a consuming pattern like `(^|\n)`. This is very uncomfortable, I know. Switching to Boost regex might turn out the best option if you really need that multiline behavior for `^` / `$`. – Wiktor Stribiżew Sep 22 '16 at 18:24

4 Answers

7

TL;DR

  • MSVC: ^ and $ already match the start and end of lines
  • C++17: use the std::regex_constants::multiline option
  • Other compilers: ^ matches only the start of the string and $ only the end of the string, with no possibility of redefining their behavior.

In all std::regex implementations other than MSVC, and before C++17, ^ and $ match the beginning and end of the string, not of a line. See this demo, which does not find any match in "1\n2\n3" with the ^\d+$ regex. When you add alternations (see below), there are 3 matches.

However, with MSVC, or in C++17 with the multiline option, ^ and $ can match the start/end of a line.

C++17

Use the std::regex_constants::multiline option.
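
As a rough illustration of that option, here is a minimal sketch. It assumes a compiler in C++17 mode and a standard library that actually implements std::regex_constants::multiline, which not every library version does. The helper name matches_per_line is made up for this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects every match of ^\d+$ in `text`, asking for ^ and $ to anchor at
// line boundaries through the C++17 `multiline` flag.
// `matches_per_line` is a hypothetical helper, not a library function.
std::vector<std::string> matches_per_line(const std::string& text) {
    // The flag is combined with the ECMAScript grammar explicitly, since
    // multiline is only specified for that grammar.
    const std::regex re("^\\d+$",
                        std::regex::ECMAScript | std::regex::multiline);
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

On a library that honors the flag, matches_per_line("1\n2\n3") should return the three strings 1, 2 and 3.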

MSVC compiler

In a C++ project in Visual Studio, the following

std::regex r("^\\d+$");
std::string st("1\n2\n3");
for (std::sregex_iterator i = std::sregex_iterator(st.begin(), st.end(), r);
    i != std::sregex_iterator();
    ++i)
{
    std::smatch m = *i;
    std::cout << "Match value: " << m.str() << " at Position " << m.position() << '\n';
}

will output

Match value: 1 at Position 0
Match value: 2 at Position 2
Match value: 3 at Position 4

Workarounds that work across C++ compilers

There is no universal option in std::regex to make the anchors match start/end of the line across all compilers. You need to emulate it with alternations:

^ -> (^|\n)
$ -> (?=\n|$)

Note that $ can be "emulated" fully with (?=\n|$) (where you may add more line terminator symbols or symbol sequences, like (?=\r?\n|\r|$)), but with ^, you cannot find a 100% workaround.

Since there is no lookbehind support, you might have to adjust other parts of your regex pattern because of (^|\n), for example by using capturing groups more often than you would need to with lookbehind support.
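
The two substitutions above can be sketched as follows. This is only an illustration under the assumption that "\n" is the sole line terminator of interest; digit_lines is a hypothetical helper name, and the capture group is there because (^|\n) consumes the newline.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns the lines of `text` that consist only of digits, without relying
// on how the implementation treats ^ and $ at line breaks.
// `digit_lines` is a hypothetical helper, not a library function.
std::vector<std::string> digit_lines(const std::string& text) {
    // (?:^|\n) stands in for a line-start ^; it consumes the newline, so the
    // interesting text is read from capture group 1 instead of the whole match.
    // (?=\n|$) stands in for a line-end $ and consumes nothing.
    const std::regex re("(?:^|\\n)(\\d+)(?=\\n|$)");
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
        out.push_back((*it)[1].str());
    }
    return out;
}
```

For example, digit_lines("1\n2\n3") yields 1, 2 and 3, and digit_lines("1a\n22\nb3") yields only 22, whether or not the implementation lets ^ and $ match at line breaks.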

Wiktor Stribiżew
  • I think my wording is a bit hectic, but what I am driving at is that your initial presumptions are wrong. The `^` only matches the beginning of a string, and `$` only matches the end of the string. – Wiktor Stribiżew Sep 22 '16 at 18:14
  • "The assertion ^ (beginning of line) matches the position that immediately follows a LineTerminator character...." http://en.cppreference.com/w/cpp/regex/ecmascript – c-smile Sep 22 '16 at 18:26
  • @c-smile: I know what you mean but my answer is based on practical experience. – Wiktor Stribiżew Sep 22 '16 at 18:28
6

By default, ECMAScript mode already treats ^ as both beginning-of-input and beginning-of-line, and $ as both end-of-input and end-of-line. There is no way to make them match only beginning or end-of-input, but it is possible to make them match only beginning or end-of-line:

When invoking std::regex_match, std::regex_search, or std::regex_replace, there is an argument of type std::regex_constants::match_flag_type that defaults to std::regex_constants::match_default.

  • To specify that ^ matches only beginning-of-line, specify std::regex_constants::match_not_bol
  • To specify that $ matches only end-of-line, specify std::regex_constants::match_not_eol
  • As these values are bitflags, to specify both, simply bitwise-or them together (std::regex_constants::match_not_bol | std::regex_constants::match_not_eol)
  • Note that beginning-of-input can be implied without using ^ and regardless of the presence of std::regex_constants::match_not_bol by specifying std::regex_constants::match_continuous
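
To make the flags above concrete, here is a small sketch. It uses only single-line inputs, so it does not depend on how a given implementation treats line terminators. The wrapper found and the alias rc are names invented for this example.

```cpp
#include <regex>
#include <string>

namespace rc = std::regex_constants;

// Hypothetical helper: true when `pattern` is found somewhere in `text`
// when searching with the given match flags.
bool found(const std::string& text, const std::string& pattern,
           rc::match_flag_type flags = rc::match_default) {
    return std::regex_search(text, std::regex(pattern), flags);
}
```

With this helper, found("abc", "^abc") is true, while found("abc", "^abc", rc::match_not_bol) is false because the first position is no longer treated as a beginning of line. Likewise found("abc", "abc$", rc::match_not_eol) is false. found("xabc", "abc", rc::match_continuous) is false because the match would have to start at the first character, whereas found("abcx", "abc", rc::match_continuous) is true.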

This is explained well in the ECMAScript grammar documentation on cppreference.com, which I highly recommend over cplusplus.com in general.

Caveat: I've tested with MSVC, Clang + libc++, and Clang + libstdc++, and only MSVC has the correct behavior at present.

ildjarn
  • From your link to cppreference.com: "The assertion ^ (beginning of line) matches 1) The position that immediately follows a LineTerminator character. (if supported, see LWG issue 2343) 2) The beginning of the input (unless std::regex_constants::match_not_bol (C++ only) is enabled)". That's quite different from what is needed. I need `^` to match just "the beginning of the input" and nothing else. – c-smile Sep 22 '16 at 18:07
  • @c-smile : Quite right, I grossly misread it. Answer updated. – ildjarn Sep 22 '16 at 18:40
  • My mental parser fails to parse: "To specify that $ matches only end-of-line, specify std::regex_constants::match_**not**_eol". To me, `match_not_eol` should mean quite the opposite: if that flag is set, then `$` shall not match EOL, only end of input, right? And that really makes sense. The way you interpret it, that flag is useless. – c-smile Sep 23 '16 at 19:06
  • @c-smile : It means "don't treat `first` as BOL or `last` as EOL", _not_ what you want. I linked to the documentation for a reason. ;-] – ildjarn Sep 23 '16 at 19:40
  • Not clear what "first" and "last" mean here. Anyway, the question is: what flags to use for `^` to match only beginning-of-input (same with `$` and end-of-input)? In Boost there are explicit `\A` and `\z` markers that explicitly match the head/tail of the input (http://www.boost.org/doc/libs/1_31_0/libs/regex/doc/syntax.html). Seems like std lost this feature. – c-smile Sep 23 '16 at 20:37
  • @c-smile : `first` and `last` are the iterator-range passed in to the regex algorithm (search, match, replace). I don't think `std::regex` supports what you want with ECMAScript syntax, but POSIX syntax may have what you want. I'm not exhaustively familiar with these, but cppreference.com has links to their grammars. – ildjarn Sep 23 '16 at 20:51
1

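The following code snippet matches email addresses that start with one or more letters a-z, followed by zero or more dots, then zero or more letters a-z, and end with "@gmail.com". The match is case-insensitive because of icase. I tested it.

#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main() {
    // ^ and $ anchor the pattern to the whole (single-line) input.
    regex reg1("^[a-z]+\\.*[a-z]*@gmail\\.com$", regex_constants::icase);
    string email;
    cin >> email;
    if (regex_search(email, reg1))
        cout << "match\n";  // the original omitted the body; printing is illustrative
}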
Charlie
0

You can emulate Perl/Python/PCRE \A, which matches at beginning of string but not after a newline, with the JavaScript regex ^(?<!(.|\n)), which translates to English as "match the beginning of a line which has no preceding character".

You can emulate Perl/Python/PCRE \z, which matches only at end-of-string, using (?!(.|\n))$. To get the effect of \Z, which matches only at end-of-string but allows a single newline just before that end-of-string, just add an optional newline: \n?(?!(.|\n))$.
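
Only the lookahead half of this carries over to std::regex, since it has no lookbehind. As a hedged sketch of the \z idea in C++, the negative lookahead is used on its own below, without $, so the result does not depend on how the implementation treats $ at line breaks. The helper name runs_at_end_of_input is invented for this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the digit runs that end exactly at the end
// of the input, using the negative lookahead (?!(.|\n)) in place of \z.
std::vector<std::string> runs_at_end_of_input(const std::string& text) {
    // (?!(.|\n)) succeeds only where neither an ordinary character nor a
    // newline follows, i.e. at the very end of the input.
    const std::regex re("\\d+(?!(.|\\n))");
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

For "1\n2\n3" this returns just 3, and for "1\n2\n" it returns nothing, because the last digit run is still followed by a newline.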

Thom Boyer