Input buffer overflow in spite of reading character by character

Question

To work around the input buffer overflow problem in lex, I wrote code that reads the incoming stream character by character whenever I expect a long string. However, I still get the error: input buffer overflow, can't enlarge buffer because scanner uses REJECT

Code snippet:

<STATE> {identifier} {
    string str = yytext; 
    if(str == "ExpectedStr")
       handleLongStr(str);
    copyString(yylval.str, str);
    return IDENTIFIER; 
}

void handleLongStr(string &str)
{
  str.clear();
  char ch;
  while((ch = yyinput()) != '\n')
    str.push_back(ch);
  unput(ch); 
}

If you really *must* use `REJECT`, you can try `#define YY_BUF_SIZE` to be larger. — Chris Dodd, Sep 30 '15 at 20:43

rici · Accepted Answer · 2015-09-30T20:43:43.113

yyinput consumes space in the scanner's input buffer, even though the characters it reads are not made available through yytext. About the only reason for this behaviour that I've ever come up with is that it allows you to unput() as many of the characters as you input() without destroying yytext, which is useful if you're using input() as a way of peeking at the next input.

For whatever the reason, that means that you cannot use yyinput to avoid buffer reallocation. So you need to do the next best thing: handle long tokens in smaller pieces. For example, you could do something like this:

%%
  /* Variable is local to a call to yylex */
  std::string longtoken;

<STATE>{identifier}  {
  /* Personally I'd prefer to use a regex pattern than an if here */
  if (is_long_prefix(yytext)) {
    longtoken.clear();
    BEGIN(STATE_LONG_IDENTIFIER);
  }
  else {
    yylval.str = strdup(yytext);
    return IDENTIFIER;
  }
  // ...
}

<STATE_LONG_IDENTIFIER>{
   /* Here we handle subtokens of up to 100 characters. The number
    * is arbitrary, but the nature of flex is that the resulting DFA
    * will have one state per repetition, and large repetitions create
    * a lot of states.
    */
   .{1,100} { longtoken.append(yytext, yyleng); }
   \n       { yylval.str = strdup(longtoken.c_str());
              BEGIN(STATE);
              return IDENTIFIER;
            }
   <<EOF>>  { error("Unterminated long identifier"); }
}

Thank you very much rici for getting back. I understood the action being taken in `<STATE>`, however I am not sure what is happening in the code in `<STATE_LONG_IDENTIFIER>`. Can you please elaborate? Where is the scanning of the long string happening? What is happening on the line `.{1,100} { longtoken.append(yytext, yyleng); }`? Thank you very much for your inputs. — mickeyj, Sep 30 '15 at 20:46
@mickeyj: That line recognizes from 1 to 100 repetitions of `.` (i.e. any character other than newline). The action appends the recognized characters to `longtoken`. Since the action does not return, the scanner continues its loop, and since the action doesn't change the start condition, it stays in the same state. So it will keep on doing that until it hits a newline or an EOF, at which point one of the other `STATE_LONG_IDENTIFIER` patterns will match. — rici, Sep 30 '15 at 20:53
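
To make that loop concrete, here is a rough sketch in plain C++ (not flex code) that imitates what the scanner does while it is in `STATE_LONG_IDENTIFIER`. The function `collect_long_token` and its parameters are hypothetical names invented for this illustration. It only assumes the behaviour described above: each match takes at most 100 non-newline characters, the action appends them, and a newline ends the token.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the scanner loop; this is not a flex API.
// Collects everything before the first '\n' of `input` into `out`, taking
// pieces of at most `max_chunk` characters, so no single "match" is long.
// Returns false if the input ends before a newline (the <<EOF>> case).
// If `chunks_seen` is non-null it receives the number of pieces appended.
bool collect_long_token(const std::string& input, std::string& out,
                        std::size_t max_chunk = 100,
                        std::size_t* chunks_seen = nullptr) {
    assert(max_chunk > 0);
    out.clear();                      // like longtoken.clear()
    std::size_t pos = 0;
    std::size_t chunks = 0;
    bool terminated = false;
    while (pos < input.size()) {
        if (input[pos] == '\n') {     // like the \n rule: token is complete
            terminated = true;
            break;
        }
        // Like .{1,100}: the longest run of non-newline characters,
        // capped at max_chunk.
        std::size_t len = 0;
        while (len < max_chunk && pos + len < input.size() &&
               input[pos + len] != '\n') {
            ++len;
        }
        out.append(input.data() + pos, len);  // like append(yytext, yyleng)
        pos += len;
        ++chunks;
    }
    if (chunks_seen) {
        *chunks_seen = chunks;
    }
    return terminated;
}
```

With a 250-character line, the sketch appends pieces of 100, 100 and 50 characters before it sees the newline, which mirrors how the real scanner would match the `.{1,100}` rule three times and then the `\n` rule once.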
