I am using the RE/flex lexer for my project, and I want to match the syntax corresponding to ('*)".*?"\1. For example, it should match "foo" and ''"bar"'', but it should not match ''"baz"'.
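To pin down the semantics I am after, here is a minimal sketch in plain C++, independent of RE/flex. The name matchQuoted is a hypothetical helper of my own, not a library function. Unlike the regex dot, the sketch also accepts newlines inside the string, and it works on raw bytes, so invalid UTF-8 in the content is accepted as well.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper (not part of RE/flex): returns the length of the
// string literal at the start of `in`, including both delimiters, or
// std::nullopt if `in` does not start with a complete literal.
std::optional<std::size_t> matchQuoted(std::string_view in) {
    // Count the opening run of single quotes (capture group 1 in the regex).
    std::size_t n = 0;
    while (n < in.size() && in[n] == '\'') ++n;

    // The run must be followed by the opening double quote.
    if (n == in.size() || in[n] != '"') return std::nullopt;

    // Scan raw bytes for the first '"' that is followed by n single quotes.
    for (std::size_t i = n + 1; i < in.size(); ++i) {
        if (in[i] != '"') continue;
        std::size_t q = 0;
        while (q < n && i + 1 + q < in.size() && in[i + 1 + q] == '\'') ++q;
        if (q == n) return i + 1 + n;  // closing '"' plus n quotes consumed
    }
    return std::nullopt;  // unterminated string
}
```

With this sketch, "foo" and ''"bar"'' are matched in full, while ''"baz"' yields no match because the closing run has only one quote.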
However, the RE/flex matcher does not support lookaheads, lookbehinds, or backreferences. Is there a correct way to match this using the RE/flex matcher? The closest I could get was the following lexer:
%x STRING
%%
'*\" {
    textLen = 0uz;
    quoteLen = size();
    start(STRING);
}
<STRING> {
\"'* {
    if (size() - textLen < quoteLen) goto MORE_TEXT;
    matcher().less(textLen + quoteLen);
    start(INITIAL);
    res = std::string{matcher().begin(), textLen};
    return TokenKind::STR;
}
[^"]* {
MORE_TEXT:
    textLen = size();
    matcher().more();
}
<<EOF>> {
    std::cerr << "Lexical error: Unterminated 'STRING' \n";
    return TokenKind::ERR;
}
}
%%
The dot meta-character (.) in RE/flex matches any character, whether it is part of a valid or an invalid UTF-8 sequence, whereas an inverted character class such as [^...] matches only valid UTF-8 sequences that are absent from the class.
So the problem with the lexer above is that it matches only valid UTF-8 sequences inside strings, whereas I want it to match anything inside the string up to the delimiter.
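To make "valid" versus "invalid" concrete, the sketch below checks whether a byte string is well-formed UTF-8 under the standard encoding rules (no overlong forms, no surrogates, nothing above U+10FFFF). The name isValidUtf8 is a hypothetical helper of my own, and I am assuming that RE/flex's notion of a valid sequence agrees with these rules. String content for which it returns false is the kind of input I still want the lexer to accept.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: true if `s` is well-formed UTF-8.
bool isValidUtf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) { ++i; continue; }    // ASCII byte

        std::size_t len = 0;                  // total bytes in this sequence
        unsigned char lo = 0x80, hi = 0xBF;   // allowed range of the 2nd byte
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;        // reject overlong 3-byte forms
            if (b0 == 0xED) hi = 0x9F;        // reject UTF-16 surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;        // reject overlong 4-byte forms
            if (b0 == 0xF4) hi = 0x8F;        // reject values above U+10FFFF
        } else {
            return false;                     // stray continuation or bad lead
        }

        if (i + len > s.size()) return false; // truncated sequence
        const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        if (b1 < lo || b1 > hi) return false;
        for (std::size_t k = 2; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < 0x80 || b > 0xBF) return false;
        }
        i += len;
    }
    return true;
}
```

For example, the bytes C3 A9 (the letter é) pass this check, while a lone FF byte or the overlong pair C0 80 do not.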
I considered three workarounds, but all three seem to have issues.
- Use skip(). This skips all characters until it reaches the delimiter, but in the process it consumes the string content, so I don't get to keep it.
- Use .*?/\" instead of [^"]*. This works for every properly terminated string, but it jams the lexer if the string is not terminated.
- Consume the string content one character at a time using the dot (.). Since the dot is synchronizing, it can match even invalid UTF-8 sequences, but this approach feels far too slow.
Is there a better approach to solving this?