
I'm trying to parse C-style multi-line comments in my flex (.l) file:

%s ML_COMMENT
%%

...

<INITIAL>"/*"                   BEGIN(ML_COMMENT);
<ML_COMMENT>"*/"                BEGIN(INITIAL);  
<ML_COMMENT>[.\n]+              { }

I'm not returning any token and my grammar (.y) doesn't address comments in any way.

When I run my executable, I get a parse error:

$ ./a.out
/*
abc 
def
Parse error: parse error
$ echo "/* foo */" | ./a.out
Parse error: parse error

(My yyerror function does a printf("Parse error: %s\n"), which is where the first half of the redundant error message comes from).

I can see why the second example fails since the entirety of the input is a comment, and since comments are ignored by the grammar, there are no statements. Thus the input isn't a valid program. But the first part throws a parse error before I even finish the comment.

Also confusing:

$ ./a.out
/* foo */
a = b;
Parse error: parse error

In this case, the comment is closed prior to actual valid input (which, without the comment, parses just fine). The failure actually occurs after parsing "a", not after attempting to parse the assignment "a = b;". If I enter "a" on its own line, it still throws an error.

Given that the error message is a parser error and not a scanner error, is there something crucial I'm missing in my .y file? Or am I doing something wrong in my scanner rules that propagates over to the parser side?

EDIT: Per @Rudi's suggestion, I turned on debugging and found:

$ ./a.out
Starting parse
Entering state 0
Reading a token: /*
foo
Next token is 44 (IDENTIFER)
Shifting token 44 (IDENTIFER), Entering state 4
Reducing via rule 5 (line 130), IDENTIFER  -> identifier
state stack now 0
Entering state 5

I turned off debugging and found that /* foo */ = bar; indeed parses the same as foo = bar;. I'm using flex 2.5.4; it doesn't give me any warnings about the stateful rules I'm attempting to use.

Lesmana
adelarge
    I retagged flex to gnu-flex. Your scanner rules look okay. The parse error indicates invalid token input to the parser. You might want to post some corresponding Bison rules. Additionally, it might be a good idea to put printf() statements inside your bison rules, this way you can see what rules the parser is trying during the scanning of the token. – Kizaru Nov 10 '10 at 14:40
    It would also be a good idea to create a separate test harness for your scanner. That way you can isolate scanner defects from parser defects. Any scanner-parser system is complex enough that you don't need to inject additional complexity by performing integration testing when what you really want is to be performing unit testing... – bstpierre Nov 10 '10 at 15:11
    When you add the `--debug` flag to your bison invocation and set `yydebug=1` before the `yyparse()` call, then the parser emits debug information for every token it sees from the lexer. – Rudi Nov 10 '10 at 15:14
  • I would suggest - pending reasons not to - just writing a Perl preprocessor to munch comments. – Paul Nathan Nov 10 '10 at 16:58

4 Answers


Parsing comments this way can lead to errors because:

  • you need to add conditions to all of your lex rules
  • it becomes even more complex if you also want to handle // comments
  • you still have the risk that yacc/bison merges two comments including everything in between

In my parser, I handle comments like this. First define lex rules for the start of the comment, like this:

\/\*     {
         if (!SkipComment())
            return(-1);
         }

\/\/     {
         if (!SkipLine())
            return(-1);
         }

then write the SkipComment and SkipLine functions. They need to consume all the input until the end of the comment is found (this is rather old code so forgive me the somewhat archaic constructions):

bool SkipComment (void)
{
int Key;

Key=!EOF;
while (true)
   {
   if (Key==EOF)
      {
      /* yyerror("Unexpected EOF within comment."); */
      break;
      }
   switch ((char)Key)
      {
      case '*' :
         Key=input();
         if ((char)Key=='/') return true;
         else               continue;
         break;
      case '\n' :
         ++LineNr;
         break;
      }
   Key=input();
   }

return false;
}

bool SkipLine (void)
{
int Key;

Key=!EOF;
while (true)
   {
   if (Key==EOF)
      return true;
   switch ((char)Key)
      {
      case '\n' :
         unput('\n');
         return true;
         break;
      }
   Key=input();
   }

return false;
}
Patrick
    Does this handle the comment start/end character sequence if it occurs within quoted text? (e.g. `foo = "this doesn't contain a /* comment */"`) – Dan Moulding Nov 10 '10 at 17:00
  • I didn't explicitly mention this, but you have to parse strings exactly the same way. You especially have to do this if you want to support escaping backslashes like in C/C++. – Patrick Nov 10 '10 at 22:39
    This is more complex, more error-prone, more verbose, and harder to do than just using flex start states properly. It's basically just hand-writing part of your lexer -- if you don't like flex, why not just hand-write the whole thing? – Chris Dodd Mar 18 '16 at 14:28

I think you need to declare ML_COMMENT as an exclusive start condition, so that only the ML_COMMENT rules are active while it is in effect: use %x ML_COMMENT instead of %s ML_COMMENT.

Otherwise rules with no start conditions are also active.
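
To pin down what the exclusive start condition is expected to do, independently of flex and bison, the two-state behaviour can be written out by hand. The following is only a sketch in plain C, not flex output: the names `scan_state`, `ST_INITIAL`, `ST_ML_COMMENT` and `strip_block_comments` are made up for illustration, and the function works on a string rather than on the scanner's input stream.

```c
#include <stddef.h>

/* Hypothetical names: the two states mirror flex's INITIAL state and an
   exclusive ML_COMMENT start condition. */
enum scan_state { ST_INITIAL, ST_ML_COMMENT };

/* Copy `in` to `out`, dropping C-style block comments.
   `out` must have room for strlen(in) + 1 bytes.
   Returns 0 on success, -1 if the input ends inside a comment. */
int strip_block_comments(const char *in, char *out)
{
    enum scan_state state = ST_INITIAL;
    size_t i = 0, o = 0;

    while (in[i] != '\0') {
        if (state == ST_INITIAL) {
            if (in[i] == '/' && in[i + 1] == '*') {
                state = ST_ML_COMMENT;   /* corresponds to BEGIN(ML_COMMENT) */
                i += 2;
            } else {
                out[o++] = in[i++];      /* ordinary input passes through */
            }
        } else {
            /* Exclusive state: only these two "rules" can match here. */
            if (in[i] == '*' && in[i + 1] == '/') {
                state = ST_INITIAL;      /* corresponds to BEGIN(INITIAL) */
                i += 2;
            } else {
                i++;                     /* any other character, newline included */
            }
        }
    }
    out[o] = '\0';
    return state == ST_INITIAL ? 0 : -1;
}
```

Calling `strip_block_comments("/* foo */ a = b;", out)` returns 0 and leaves ` a = b;` in `out`. With an inclusive (%s) condition, the rules of the first branch would stay active inside the comment as well, which is how `foo` ended up reaching the parser as an identifier.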

Craig
  • Ah! This seems to have done the trick. My only question now is: why are the contents of my multi-line comments echoed back? When I type `/* foo bar */` into STDIN, I get `foo bar` printed to STDOUT. – adelarge Nov 10 '10 at 17:24
    [.\n] isn't doing what you think it's doing. Replace it with two rules, one for . and one for \n. By default, flex echoes input that does not match any rule, which is why many lex rule sets end with a "." rule so that every input matches something. – Craig Nov 10 '10 at 18:07

I found this description of the C language grammar (actually just the lexer) very useful. I think it is mostly the same as Patrick's answer, but slightly different.

http://www.lysator.liu.se/c/ANSI-C-grammar-l.html

Mark Lakata

Besides the problem with %x vs %s, you also have the problem that the . in [.\n] matches (only) a literal . and not 'any character other than newline' like a bare . does. You want a rule like

<ML_COMMENT>.|"\n"     { /* do nothing */ }

instead
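
The literal-dot behaviour inside brackets is not specific to flex; POSIX bracket expressions treat `.` the same way, so it can be checked outside the scanner. The sketch below uses the POSIX `regcomp`/`regexec` API rather than flex itself, and `matches_whole` is a hypothetical helper name; the caller is assumed to anchor the pattern with `^` and `$`.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if `text` matches `pattern`, 0 if it does not,
   and -1 if the pattern fails to compile. */
int matches_whole(const char *pattern, const char *text)
{
    regex_t re;

    /* Extended syntax; REG_NOSUB because only match/no-match is needed. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this helper, `matches_whole("^[.\n]+$", " foo ")` returns 0, because the class only contains a literal dot and a newline, while `matches_whole("^(.|\n)+$", " foo ")` returns 1. One difference to keep in mind: a bare `.` in POSIX regexes (without `REG_NEWLINE`) also matches a newline, whereas in flex it does not, which is why the flex rule needs the explicit `\n` alternative.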

Chris Dodd