How to match anything until a delimiter is encountered in a RE/flex lexer?

Question

I am using the RE/flex lexer for my project, and I want to match the syntax corresponding to ('*)".*?"\1. For example, it should match "foo" and ''"bar"'', but it should not match ''"baz"'.
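To pin down the intended semantics, here is a minimal sketch that uses std::regex, a backtracking engine that does support backreferences, rather than RE/flex. The helper name match_quoted is hypothetical, [\s\S] stands in for . so that newlines are accepted too, and match_continuous anchors the match at the start of the input the way a lexer rule would be.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper (not part of RE/flex): returns the content between the
// delimiters if `input` starts with a string of the form ('*)"..."\1.
std::optional<std::string> match_quoted(const std::string& input) {
    // Group 1: the leading run of single quotes; group 2: the string content.
    // \1 requires the same run of single quotes after the closing '"'.
    static const std::regex re(R"re(('*)"([\s\S]*?)"\1)re");
    std::smatch m;
    // match_continuous: the match must begin at the first character.
    if (std::regex_search(input, m, re, std::regex_constants::match_continuous))
        return m[2].str();
    return std::nullopt;
}
```

Under these assumptions, match_quoted accepts "foo" and ''"bar"'' and rejects ''"baz"', because the closing run of single quotes is shorter than the opening one.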

But the RE/flex matcher doesn't work with lookaheads, lookbehinds, or backreferences. So, is there a correct way to match this using the RE/flex matcher? The closest I could get was the following lexer:

%x STRING

%%

'*\" {
    textLen = 0uz;
    quoteLen = size();
    start(STRING);
}

<STRING> {

\"'* {
    if (size() - textLen < quoteLen) goto MORE_TEXT;
    matcher().less(textLen + quoteLen);
    start(INITIAL);
    res = std::string{matcher().begin(), textLen};
    return TokenKind::STR;
}

[^"]* {
    MORE_TEXT:
    textLen = size();
    matcher().more();
}

<<EOF>> {
    std::cerr << "Lexical error: Unterminated 'STRING' \n";
    return TokenKind::ERR;
}

}

%%

The meta-character . in RE/flex matches any character, whether or not it is part of a valid UTF-8 sequence, whereas an inverted character class [^...] matches only valid UTF-8 sequences that are absent from the class.

So the problem with the lexer above is that it matches only valid UTF-8 sequences inside strings, whereas I want it to match anything inside the string up to the delimiter.
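To make "invalid UTF-8 sequence" concrete, the sketch below checks whether a byte string is well-formed UTF-8 according to the Unicode standard's byte ranges. The function name is_valid_utf8 is hypothetical and this is not RE/flex code; how RE/flex treats each particular malformed sequence is not verified here.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: true if `s` consists only of well-formed UTF-8.
bool is_valid_utf8(std::string_view s) {
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) { ++i; continue; }    // ASCII
        std::size_t len = 0;
        unsigned lo = 0x80, hi = 0xBF;        // allowed range of the 2nd byte
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;        // reject overlong forms
            if (b0 == 0xED) hi = 0x9F;        // reject UTF-16 surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;        // reject overlong forms
            if (b0 == 0xF4) hi = 0x8F;        // reject values above U+10FFFF
        } else {
            return false;                     // stray continuation or bad lead
        }
        if (i + len > n) return false;        // truncated sequence
        const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        if (b1 < lo || b1 > hi) return false;
        for (std::size_t k = 2; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < 0x80 || b > 0xBF) return false;
        }
        i += len;
    }
    return true;
}
```

A string literal containing a byte such as 0xFF, or a lead byte 0xC3 followed by an ASCII character, fails this check; that is the kind of content the [^"]* rule would not be expected to match.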

I considered three workarounds, but all three seem to have issues.

1. Use skip(). This skips all characters until it reaches the delimiter, but in the process it consumes the string content, so I don't get to keep it.
2. Use .*?/\" instead of [^"]*. This works for every properly terminated string, but the lexer jams if the string is not terminated.
3. Consume the string content character by character using the . pattern. Since . is synchronizing, it can match even invalid UTF-8 sequences, but this approach feels far too slow.

So is there a better approach to solving this?

score 0 · Answer 1 · answered Feb 07 '23 at 16:14

I didn't find a proper way to solve the problem, but I got it working with a dirty hack based on the second workaround mentioned above.

Instead of relying on the RE/flex-generated scanner loop, I added a custom loop inside the rule that begins a string. There, instead of failing with a "scanner jammed" error, I flush the remaining text and display an unterminated-string error message.

%x STRING

%%

'*\" {
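    auto textLen = 0uz;                 // length of the string content so far
    const auto quoteLen = size();       // length of the opening delimiter
    matcher().pattern(PATTERN_STRING);  // scan with the <STRING> rules

    while (true) {
        switch (matcher().scan()) {

        case 1:  // \"'* : candidate closing delimiter
            // Too few quotes to close the string: treat it as content.
            if (size() - textLen < quoteLen) break;
            // Keep only the content plus the closing delimiter.
            matcher().less(textLen + quoteLen);
            res = std::string{matcher().begin(), textLen};
            return TokenKind::STR;

        case 0:  // no match (e.g. the string is unterminated)
            // Flush the remaining text rather than jamming the scanner.
            if (!matcher().at_end()) matcher().set_end(true);
            std::cerr << "Lexical error: Unterminated 'STRING' \n";
            return TokenKind::ERR;

        default:
            std::unreachable();

        case 2:;  // .*?/\" : content up to the next double quote
        }

        // Extend the content and append the next match to the current text.
        textLen = size();
        matcher().more();
    }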
}

<STRING>{
\"'* |
.*?/\" |
<<EOF>> std::unreachable();
}

%%
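The control flow of this hack can also be modeled outside RE/flex. The sketch below is plain C++ over raw bytes and uses no RE/flex API; Token and lex_string are hypothetical names, and TokenKind is redefined locally so that the block is self-contained. It jumps from one double quote to the next, treats a closer with too few single quotes as content, and on unterminated input consumes everything and reports an error.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

enum class TokenKind { STR, ERR };

struct Token {
    TokenKind kind;
    std::string text;      // string content without delimiters (empty on ERR)
    std::size_t consumed;  // number of input bytes consumed
};

// Hypothetical helper: lexes one string token at the start of `input`.
Token lex_string(std::string_view input) {
    // Opening delimiter '*" ; its length plays the role of quoteLen.
    std::size_t quoteLen = 0;
    while (quoteLen < input.size() && input[quoteLen] == '\'') ++quoteLen;
    if (quoteLen == input.size() || input[quoteLen] != '"')
        return {TokenKind::ERR, {}, 0};  // not positioned at a string opener
    ++quoteLen;                          // the '"' is part of the delimiter

    const std::size_t begin = quoteLen;  // first byte of the content
    std::size_t pos = begin;
    while (true) {
        // Jump to the next '"'; bytes in between are never inspected,
        // so invalid UTF-8 inside the string is accepted as-is.
        const std::size_t dq = input.find('"', pos);
        if (dq == std::string_view::npos)
            return {TokenKind::ERR, {}, input.size()};  // unterminated: flush

        // Measure the candidate closer: '"' plus the single quotes after it,
        // counting no further than the opening delimiter's length.
        std::size_t run = 1;
        while (run < quoteLen && dq + run < input.size() &&
               input[dq + run] == '\'')
            ++run;
        if (run == quoteLen)
            return {TokenKind::STR,
                    std::string(input.substr(begin, dq - begin)),
                    dq + quoteLen};
        pos = dq + 1;  // too few quotes: this '"' belongs to the content
    }
}
```

Because the search only ever looks for the double-quote byte, the content is kept regardless of its encoding, and the caller can resume lexing at the returned consumed offset.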
