
I'm trying to write a rule in yacc that recognizes comments. For this, I have defined one token for single-line comments and another for multi-line ones in the lex file:

comentario              [//]([^"])* [\n]
comentarioMulti         [/*]([^"])*[*/]

And also a rule for it in the yacc file:

comentario:                     COMENTARIO
                                COMENTARIMULTILINEA
                                ;

But it gives me this error:

 syntax error en la línea 26
 COMENTARIO

I have also tried putting the \n without the brackets, and some other variations, but I get the same error every time.

user207421
Péritas
  • Typically the lexer will just tell the parser that it's a single white space (because the C standard says a comment is equivalent to a space character). – Jerry Coffin Jun 11 '23 at 00:32
  • 1
    Err, no, typically the lexer will just *ignore* the comment as yet more whitespace. The parse isn't concerned with whitespace in any way. @JerryCoffin – user207421 Jun 11 '23 at 02:27
  • @user207421: I'm not going to get in an argument over terminology. Whitespace remains significant up through phase 6 of translation. And yes, phase 7 includes what you'd traditionally call parsing. But then again, the macro expansion you do in phase 4 involves parsing of a sort. I don't particularly care whether you ignore or use a different name for the parsing necessary for phase 4 though. – Jerry Coffin Jun 11 '23 at 07:24
  • @Jerry I don't know where you got your phase numbers from. Frank DeRemer taught me scanning, screening, parsing, ... He also taught that whitespace is only significant in that it separates tokens. The parser doesn't know about it. Whitespace does not appear in grammars: QED. And this is not about 'terminology' unless you are using the terms differently from everyone else, in which case all you have to do is stop. – user207421 Jun 11 '23 at 07:43
  • 1
    Does this answer your question? [Detecting and skipping line comments with Flex](https://stackoverflow.com/questions/25395251/detecting-and-skipping-line-comments-with-flex) and [Unix Flex Regex for Multi-Line Comments](https://stackoverflow.com/questions/4755956/unix-flex-regex-for-multi-line-comments) – Piotr Siupa Jun 11 '23 at 07:53
  • I got my phase numbers from the C++ standard (or technically, a draft--N4950). The `[lex.phases]` section, to be precise. Applying normal terminology to C++ is...somewhat fraught. At least to me, the choice between second-hand information from Frank DeRemer and information directly from the C++ standard about the proper terminology and/or procedures for parsing C++ seems obvious, but I suppose some may disagree about that as well. – Jerry Coffin Jun 11 '23 at 15:43
  • @JerryCoffin [lex.phases] says that comments are identified at phase 3, and #7 says that 'whitespace characters separating tokens are no longer significant' before syntax analysis, which agrees with what I said, and Frank, and not with what you stated above. Phases before 7 are lexical analysis, no two ways about it. – user207421 Jun 12 '23 at 04:37
  • @user207421: Nice job of proving you've never implemented phase 4, without saying so directly. Worse, you claim to have read the specification, and apparently failed to even get an inkling of what's involved. Function-like macros, in particular, require real parsing to recognize the macro name, arguments, expansion characters, execute operators in the expansion, etc. I've written interpreters for complete programming languages that had simpler parsers. – Jerry Coffin Jun 12 '23 at 05:27
  • @JerryCoffin Nice job of changing the subject. I quoted from *your citation* proving that it doesn't state what you claimed, i.e. that 'white space remains significant up through phase 6 of translation'. The actual wording says exactly the opposite. I didn't claim to have read the whole specification, and BTW what I am quoting from De Remer isn't 'third hand' either. Stick to the point please. – user207421 Jun 12 '23 at 07:18
  • @user207421: You didn't prove anything of the sort. In fact, the first sentence of phase 7 is: "Whitespace characters separating tokens are no longer significant." Pretty good clue that they're significant through phase 6, now isn't it? – Jerry Coffin Jun 12 '23 at 07:46
  • @JerryCoffin Oops, yes , I misquoted myself. It doesn't change the fact that the parser doesn't get whitespace. Considering the preprocessor to be another parser doesn't get us far unless there is some evidence the OP is implementing a preprocessor. – user207421 Jun 12 '23 at 08:06
  • @user207421: the fact that he's planning to read code that includes comments is pretty strong evidence that he's reading source code that hasn't been preprocesed, so his lexer/parser have to deal with preprocessing. – Jerry Coffin Jun 12 '23 at 08:25
  • @JerryCoffin We don't know. The OP needs to clarify. – user207421 Jun 12 '23 at 09:54

2 Answers


Normally you would have your lexer recognize comments and simply ignore them, so the parser will never see them at all. To do that you need lex patterns to match the comments with no actual action attached (so they do NOT return a token). Something like:

"//".*                                /* match C++ style comment */
"/*"[^*]*"*"("*"|[^*/][^*]*"*")*"/"   /* match C style comment */

If you want to actually return them to the parser, you can, but then they can only appear in the specific places you allow for them in the parser.
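If you do return them, note that the grammar rule also needs a `|` between the two alternatives; without it, yacc reads the rule as "a COMENTARIO followed by a COMENTARIMULTILINEA". A sketch, using the token names spelled as in the question:

```
/* lex file: return the comments as tokens instead of discarding them */
"//".*                                { return COMENTARIO; }
"/*"[^*]*"*"("*"|[^*/][^*]*"*")*"/"   { return COMENTARIMULTILINEA; }

/* yacc file: note the '|' separating the alternatives */
comentario: COMENTARIO
          | COMENTARIMULTILINEA
          ;
```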

Perhaps what you want to do is look for certain things in comments (certain keywords, perhaps) and do something with them, while at the same time parsing normally (ignoring the comments). In that case, a useful tool is lexer start states:

%x c1 c2

%%

"//"              { BEGIN(c1); /* start C++ comment */ }
"/*"              { BEGIN(c2); /* start C comment */ }
<c1>\n            { BEGIN(0);  /* end C++ comment */ }
<c2>"*/"          { BEGIN(0);  /* end C comment */ }
<c1,c2>.          ;
<c2>\n            ;
<c1,c2>"keyword"  { printf("found 'keyword' in comment"); }
Chris Dodd
  • It's also useful to use a lexer state for comments for another reason. Many lexer generators like lex and flex limit what they recognize with a single regex to what will fit in a single input buffer. Most keywords (and such) are short enough that this doesn't matter. But comments (and string literals) that exceed that size limit are pretty common, so trying to recognize either with a single regex can lead to problems. – Jerry Coffin Jun 11 '23 at 16:40

When a regex doesn't work as expected, I recommend testing it piece by piece to see whether each piece does what you expect. For example, if you had checked what [//] does, you would have found that it matches / instead of //. (There are also online regex visualizers that can help.)

Let's go over problems with your code:

  • [ and ] aren't meant to escape special characters. They do that as a side effect but their main purpose is to create a character class. [/*] doesn't match /*. It matches either / or *. To escape characters use either quotation marks ("/*") or backslashes (\/\*). If you really want to use square brackets, you need to put each character into a separate pair ([/][*]).
  • Your first regex contains a space. In (f)lex, an unquoted space ends the pattern, so everything after it is not treated as part of the regex. (Unless you use the special group (?x: ... ), which allows and ignores whitespace.)
  • I don't understand why you do not allow quotation marks in your comments. ([^"] matches any character except ".) If you have a reason, that's OK, but it seems suspicious to me.
  • The regex for a multi-line comment needs to be a lot more complicated if you don't want it to match things like /* this is a comment */ some code /* another comment */. You can check the links I posted in the comments under the question to see how to write it correctly.
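Putting those fixes together, corrected definitions might look something like this (a sketch; the multi-line pattern is the standard non-greedy one from the linked questions, and I've dropped the exclusion of quotation marks):

```
comentario         "//".*
comentarioMulti    "/*"([^*]|"*"+[^*/])*"*"+"/"
```

The multi-line pattern works by never letting the body consume a `*` that is immediately followed by `/`, so the match stops at the first `*/` instead of the last one.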
Piotr Siupa