Figuring out Flex (lexer) yy_push_state

Question

What would the Regex equivalent be of the following Flex structure? I'm trying to recreate Rust's grammar for a project, but right now I'm stuck on this piece. This is the grammar for an inner/outer documentation comment (Rust has six types of comments). It should match comments like /** */ and /*! */ but, for example, I don't understand why [^*] is needed on the first line and what the order of matching is in this case.

\/\*(\*|\!)[^*]       { yy_push_state(INITIAL); yy_push_state(doc_block); yymore(); }
<doc_block>\/\*       { yy_push_state(doc_block); yymore(); }
<doc_block>\*\/       {
    yy_pop_state();
    if (yy_top_state() == doc_block) {
        yymore();
    } else {
        return ((yytext[2] == '!') ? INNER_DOC_COMMENT : OUTER_DOC_COMMENT);
    }
}
<doc_block>(.|\n)     { yymore(); }

As far as I understand: line 1, matches the start /** or /*!; line 2, matches a block comment (for some reason?); line 3, matches the end */; line 11, matches any character or a newline (why?).

Two lines further it also matches for the normal block comment. Why is it also matching for it inside the doc comment?

\/\*                  { yy_push_state(blockcomment); }
<blockcomment>\/\*    { yy_push_state(blockcomment); }
<blockcomment>\*\/    { yy_pop_state(); }
<blockcomment>(.|\n)   { }

rici · Accepted Answer · 2017-04-06T06:52:14.537

The flex state stack allows lexical analysis of strings which cannot be described by a regular expression, so there is no regular expression equivalent to that flex specification. For documentation of the state stack, including the syntax for writing state-contingent rules, see the flex manual.

Rust is infamously badly documented, and the comment syntax(es) fall into that category. The Rust book mentions block comments in the syntax index but fails to document the precise syntax in the referenced comments section. I couldn't find any precise description of the syntax understood by rustdoc, either.

I've reverse-engineered the syntax from the flex excerpt you cite, but take it with a grain of salt; it may have only a passing resemblance to the actual syntax accepted by rustc and rustdoc:

Rust block comments, unlike C or C++ block comments, can be nested. That makes them parenthetic syntaxes, which are not regular; they require a pushdown automaton to parse. So no regular expression can describe Rust block comments, and it is necessary to resort to a flex state stack to recognize them.
Rust documentation block comments must start with a slash and precisely two stars (or a star and an exclamation point). A documentation box:
```
/*************************************
 *        START OF SECTION           *
 *************************************/
```
is not considered a documentation comment.

(I suspect that not recognizing inner block comments starting `/*!*` was an oversight, but who knows.)
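To illustrate why nesting cannot be handled by a plain regular expression, here is a minimal C sketch of the counting a recognizer has to do. The function name `nested_comment_length` is hypothetical, and the depth counter only stands in for the flex state stack; this is not how flex or rustc actually implement it.

```c
#include <stddef.h>

/* Returns the length of the nested block comment that starts at s
 * (including the outermost delimiters), or 0 if s does not start with
 * a comment opener or the comment is never closed. */
size_t nested_comment_length(const char *s)
{
    size_t i = 0;
    int depth = 0;              /* stands in for the flex state stack */

    if (s[0] != '/' || s[1] != '*')
        return 0;

    while (s[i] != '\0') {
        if (s[i] == '/' && s[i + 1] == '*') {
            ++depth;            /* an opener: comparable to yy_push_state() */
            i += 2;
        } else if (s[i] == '*' && s[i + 1] == '/') {
            --depth;            /* a closer: comparable to yy_pop_state() */
            i += 2;
            if (depth == 0)
                return i;       /* the outermost comment just ended */
        } else {
            ++i;                /* any other character, as in the (.|\n) rule */
        }
    }
    return 0;                   /* ran out of input: unterminated comment */
}
```

For example, `nested_comment_length("/* a /* b */ c */")` yields 17, the whole string, whereas a pattern that stops at the first `*/` would end the comment after `b`.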

If the above is correct, it is possible to answer your questions:

"I don't understand why [^*] is needed on the first line"

This is to avoid matching box comments, as noted above.

"what the order of matching is in this case."

In all cases, flex selects the longest possible match at any point in the input, and if more than one rule matches the same longest string, it selects the first rule in the file. This is the so-called "maximal munch" rule. So given the two rules (which I wrote without the forest of leaning timber because I find it unreadable):
```
"/*"[*!][^*]     {  DocComment(); }
"/*"             {  BlockComment(); }
```
the second rule will apply to the inputs /* Comment and /****, matching two characters, whereas the first rule will apply to /** Documentation comment, matching four characters. (It will also incorrectly apply to /**/, which IMHO should be analyzed as an empty block comment rather than the start of a documentation comment.)
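The choice between those two rules can be sketched in C as follows. This is only a hand-written imitation of the longest-match decision for these two patterns, not what flex generates; the names `classify_opener` and `enum opener` are made up for the illustration.

```c
#include <stddef.h>

enum opener { NOT_A_COMMENT, BLOCK_COMMENT, DOC_COMMENT };

// Imitates the choice flex makes between the rules "/*"[*!][^*] and "/*"
// at the start of s: the longer match wins. Stores the match length in
// *len and returns which rule fired.
enum opener classify_opener(const char *s, size_t *len)
{
    *len = 0;
    if (s[0] != '/' || s[1] != '*')
        return NOT_A_COMMENT;

    // Doc rule: a star or an exclamation point, then one character that
    // is not a star. [^*] must consume a real character, so the end of
    // the input does not satisfy it.
    if ((s[2] == '*' || s[2] == '!') && s[3] != '\0' && s[3] != '*') {
        *len = 4;
        return DOC_COMMENT;
    }

    // Only the two-character rule matches.
    *len = 2;
    return BLOCK_COMMENT;
}
```

With this sketch, `/* Comment` and `/****` are classified as `BLOCK_COMMENT` with a match length of 2, while `/** Documentation comment` and `/**/` are classified as `DOC_COMMENT` with a match length of 4.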
"line 11, matches any character or a newline (why?)"

Yes, it does. Without that rule, the characters between the comment delimiters would not be matched by any rule in the doc_block state, which would be incorrect.

"Two lines further it also matches for the normal block comment. Why is it also matching for it inside the doc comment?"

Because the match inside the doc comment only applies inside doc comments. A block comment not inside a doc comment also needs to be matched. However, it is certainly the case that some refactoring is possible here, which could simplify the lexical description.

First of all, thank you for your nicely written explanation. No, I did not use code from Bleibig's repo. I'm not even sure what he is trying to achieve. I got this from the original Rust repo https://github.com/rust-lang/rust/tree/master/src/grammar . I think that that would be the most representative of the code base? Your answer cleared a bit of my confusion up. — Adrian Z., Apr 06 '17 at 06:21
OK, I'll remove the reference to Bleibig's repo. I don't know what the status of the file you cite is, but it differs in comment handling from the Antlr grammar (https://github.com/rust-lang/rust/blob/master/src/grammar/RustLexer.g4) which the README file says is authoritative (but not actually used by rustc). — rici, Apr 06 '17 at 06:51
@AdrianZ.: what confusion was I not able to help you with? :) — rici, Apr 06 '17 at 06:53
I'm using the Bison parser https://github.com/rust-lang/rust/blob/master/src/grammar/parser-lalr.y . The remaining confusion lies within the language that I'm building this in, but that's a topic for another post. — Adrian Z., Apr 06 '17 at 06:56
@AdrianZ.: Afaics, those bison and flex files come from Bleibig's repo, and are not the official parser, which is in `libsyntax`. See https://github.com/rust-lang/rust/commit/4e4e8cff1697ec79bcd0a1e45e63fb2f54a7ea28#diff-196eabc37136edcb65cd83da9f720ade and https://github.com/rust-lang/rust/issues/2234 (But I don't track rust development since years ago, so I have no idea how all the pieces fit together, sorry.) — rici, Apr 06 '17 at 07:10
Interesting. Well, it seems to be a good reflection of the grammar, right? I'm not sure what to look at in the libsyntax. — Adrian Z., Apr 06 '17 at 07:28
I think the actual comment parser is in https://github.com/rust-lang/rust/blob/master/src/libsyntax/parse/lexer/comments.rs but that's really a poster child for why parser generators are better :-). Maybe I'll look at it tomorrow. — rici, Apr 06 '17 at 07:38
Looks cool. I actually forgot that Rust was busy with bootstrapping itself. I'm going to have to do with the bison file tho. — Adrian Z., Apr 06 '17 at 07:42
