How to read an arbitrarily long sequence in lex/yacc with a fixed buffer limit?

Question

I would like to parse and store the following sequence: Ann {* some arbitrarily long sequence *}

My buffer length (YYLMAX) is set to 8K. Since the sequence can be arbitrarily long, I try to do it this way:

1. Search for the Ann token followed by {*. Once seen, switch to an ANN state.
2. While in the ANN state, read one character at a time, up to YYLMAX-1 characters. Return this token to YACC, which appends it to a dynamically growing C++ string buffer.
3. Repeat step 2 until *} is seen. At that point, set the state back to INITIAL (0).
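
To make the intended state handling concrete, the steps above can be sketched outside of lex as a small hand-written scanner. This is only an illustration under stated assumptions: `Scanner`, `Tok`, and `collect_annotation` are hypothetical names, not part of lex/flex or of my project. The point it shows is that returning a text chunk does not change the state; only consuming `*}` does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical token kinds that mirror the roles of the tokens in the question.
enum class Tok { AnnOpen, AnnText, AnnClose, End };

class Scanner {
public:
    Scanner(std::istream& in, std::size_t max_chunk) : in_(in), max_chunk_(max_chunk) {
        assert(max_chunk_ > 0);
    }

    // Returns the next token; for AnnText, `text` holds at most max_chunk characters.
    Tok next(std::string& text) {
        text.clear();
        if (!in_ann_) {
            // Initial state: skip input until the opening "{*" (step 1).
            int c;
            while ((c = in_.get()) != EOF) {
                if (c == '{' && in_.peek() == '*') {
                    in_.get();
                    in_ann_ = true;
                    return Tok::AnnOpen;
                }
            }
            return Tok::End;
        }
        // Annotation state: collect one bounded chunk (step 2).
        while (text.size() < max_chunk_) {
            int c = in_.get();
            if (c == EOF) break;
            if (c == '*' && in_.peek() == '}') {
                if (text.empty()) {
                    in_.get();
                    in_ann_ = false;  // step 3: only "*}" leaves the state
                    return Tok::AnnClose;
                }
                in_.unget();          // leave "*}" for the next call
                break;
            }
            text.push_back(static_cast<char>(c));
        }
        return text.empty() ? Tok::End : Tok::AnnText;
    }

private:
    std::istream& in_;
    std::size_t max_chunk_;
    bool in_ann_ = false;  // persists across calls, like a lexer start condition
};

// Usage sketch: gather the body of the first annotation found in `input`.
std::string collect_annotation(const std::string& input, std::size_t max_chunk) {
    std::istringstream in(input);
    Scanner sc(in, max_chunk);
    std::string chunk, all;
    if (sc.next(chunk) != Tok::AnnOpen) return all;
    while (sc.next(chunk) == Tok::AnnText) all += chunk;
    return all;
}
```

Because a `*` is never consumed on its own when it is followed by `}`, the terminator cannot be split across two chunks in this sketch.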

For small sequences (less than 8K), my code works. However, if the sequence exceeds 8K, the first grab works, but once the annotation_text rule has matched, the lexer seems to reset the state to INITIAL. As I continue reading the next characters in the sequence, I get a parse error because the lexer rule <ANN>[^*] no longer matches. I would like to avoid this and remain in the ANN state until *} has been seen and the full sequence has been read. What is the correct way to implement this behavior? An example would help.

I don't want to increase the length of YYLMAX because I do not know in advance the maximum length of the sequence inside Ann {* ... *} block.

Thank you for any ideas.

The relevant snippet of my YACC grammar looks like this:

string curr_annotation = "{*";

annotation            : _STIL_ANN _STIL_ANN_LCURLY annotation_text _STIL_ANN_RCURLY
                        { 
                          curr_annotation += "*}";
                          std::cout << "Going to add annotation '" << curr_annotation << "'" << std::endl;
                          p_block->addAnn(curr_annotation,stil_yylineno);
                          curr_annotation = "{*";
                        }
                      ;
                      
annotation_text       : _STIL_ANN_TEXT
                        {
                          std::cout << "appending to annotation" << std::endl;
                          curr_annotation += $1;
                        }
                      | annotation_text _STIL_ANN_TEXT
                      ;

The relevant snippet of my LEXER looks like this:

WHITESPACE      [ \t\n\r]
START_BLK_ANN   {WHITESPACE}*"{*"
END_BLK_ANN     "*}"
...

<ANN>{START_BLK_ANN}           { printf("match1\n");
                                 TRACE(("<ANN>START_BLK_ANN <%s> \n",yytext));
                                 return(token(_STIL_ANN_LCURLY));
                               }
<ANN>{END_BLK_ANN}             { printf("match2\n"); TRACE(("<ANN>END_BLK_ANN <%s> \n",yytext));
                                 BEGIN 0;
                                 return(token(_STIL_ANN_RCURLY));
                               } 
<ANN>[^*]                      { printf("match3\n"); yylval.string = stiltok_grab_annotation();
                                 TRACE(("ANN_TEXT <%s> \n",yylval.string));
                                 return(token(_STIL_ANN_TEXT));
                               } 

{IDENTIFIER}                   {
                                 printf("got here\n"); 
                                 temp_sym_name = yytext;
                                 tokVal = stiltoktbl.locateToken(temp_sym_name);
                                 yylval.string = yytext;
                                 if (tokVal != UNDEFINED_TOKEN) {
                                   if (tokVal == _STIL_USER_KEYWORD) {
                                     BEGIN USERKW;
                                     TRACE(("USER_KEYWORD <%s> \n",yytext));
                                   }
                                   else if (tokVal == _STIL_Include)  {
                                     TRACE(("INCLUDE \n"));
                                     BEGIN INCLUDE;
                                     return(token(_STIL_INCLUDE));
                                   }
                                   else if (tokVal == _STIL_Ann) {
                                     TRACE(("ANN \n"));
                                     BEGIN ANN;                   /* Ann token seen: switch state */
                                     return(token(_STIL_ANN));
                                   }
                                   else {
                                     TRACE(("KEYWORD <%s> \n",yytext));
                                   }
                                   return(tokVal);
                                 }
                                 else {
                                   TRACE(("IDENTIFIER <%s> \n",yytext));
                                   return(token(_STIL_IDENTIFIER));
                                 }

The stiltok_grab_annotation looks like this:

char * stiltok_grab_annotation() {

  TRACEID("stiltok_grab_annotation",SEV_4);

  char * p_ann_begin = yytext;  // point past "{*" prefix
  char * p_ann_end   = p_ann_begin + 1;
  char c1 = yyinput();
  char c2 = yyinput();
  *(p_ann_end++) = c1;
  *(p_ann_end++) = c2;
  while ( c1 != '*' || c2 != '}') {
    c1 = c2;
    c2 = yyinput();
    *(p_ann_end++) = c2;
    if ((p_ann_end - yytext) == (MAX_TOKEN_LENGTH-1)) {
       break;
    }
  }
  *(p_ann_end) = '\0';  // place the string terminator at end after "*}" suffix
  
  // Return '*}' token to the lexer  
  if ( c1 == '*' && c2 == '}' ) {
    *(p_ann_end-2) = '\0';
    yyunput(c2, yytext);
    yyunput(c1, yytext);
    
  }

  return p_ann_begin;
}
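
For comparison, here is a sketch of a bounded grab that checks for end of input and never writes past the buffer. It is written under assumptions: the callable `next` stands in for yyinput() and is assumed to return EOF at end of input, and `grab_chunk`, `Grab`, `Stop`, and `drain` are hypothetical names. Instead of pushing `*}` back, it holds a trailing `*` in a flag until the next character is known, so the terminator can straddle two grabs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <string>

// Why a grab stopped: terminator seen, buffer full, or end of input.
enum class Stop { Closed, Full, Eof };
struct Grab { std::size_t len; Stop why; };

// Fills buf (capacity cap, NUL-terminated) with annotation text read via next().
// The "*}" terminator is consumed but not stored. pending_star carries a held-back
// '*' from one call to the next.
Grab grab_chunk(const std::function<int()>& next, char* buf, std::size_t cap,
                bool& pending_star) {
    assert(cap >= 3);
    std::size_t n = 0;
    for (;;) {
        // One iteration may store a held '*' plus one character, then the NUL.
        if (n + 3 > cap) {
            buf[n] = '\0';
            return {n, Stop::Full};
        }
        int c = next();
        if (c == EOF) {
            if (pending_star) { buf[n++] = '*'; pending_star = false; }
            buf[n] = '\0';
            return {n, Stop::Eof};
        }
        if (pending_star) {
            pending_star = false;
            if (c == '}') { buf[n] = '\0'; return {n, Stop::Closed}; }
            buf[n++] = '*';  // the held '*' was ordinary text after all
        }
        if (c == '*') pending_star = true;
        else buf[n++] = static_cast<char>(c);
    }
}

// Usage sketch: read one annotation body from src starting at pos,
// using a deliberately tiny 4-byte buffer to exercise the chunking.
std::string drain(const std::string& src, std::size_t& pos) {
    auto next = [&]() -> int {
        return pos < src.size() ? static_cast<unsigned char>(src[pos++]) : EOF;
    };
    char buf[4];
    bool star = false;
    std::string all;
    Grab g;
    do {
        g = grab_chunk(next, buf, sizeof buf, star);
        all.append(buf, g.len);
    } while (g.why == Stop::Full);
    return all;
}
```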

Sample run (relevant parts only):

match1            <-- in ANN state, matched '{*'
Line:[58] {*
match3            <-- in ANN state, matched single character. run stiltok_grab_annotation()...
Line:[58] ......  <-- received 8K-1 characters.

appending to annotation     <-- successfully copied by yacc to string buffer
got here                    <-- lexer has gone back to IDENTIFIER state instead of remaining in ANN state
Line:[58]D_REG55_0_18_0     <-- trouble begins
-S- Parsing error detected in file: x [stilcomyacc_perror]
-T- Syntax Error [Line: 58]

If you're using flex, `YYLMAX` is only used if you use `%array`, and there is no good reason to do that (and lots of good reasons not to, including the fact that the `%array` lexer is significantly slower). Unless you use `REJECT` (again, something to be avoided), flex is very capable of managing its memory, and will probably do a better job than you will. So I'd suggest that you just let flex handle it. - rici, Aug 16 '20 at 03:45
Also, if you want to stay in `ANN` state, why do you call `BEGIN 0` (which really should be `BEGIN INITIAL`)? - rici, Aug 16 '20 at 03:47
