
I'm trying to read a number of characters that is known only at runtime in a Flex lexer. I know the sequence starts with a CRLF, so I match that, then read literal_length characters using yyinput.

<EXPECT_LITERAL>"\r\n"      {
    for(int i=0;i<literal_length;i++){
        int c= yyinput(yyg);
        if(c == EOF) break;
    }
    *yylval = val_new_s(yytext);
    return(LITERAL);
}

But yyinput does not append the new characters to yytext. Instead, it contains:

*yy_c_buf_p = '\0'; /* preserve yytext */
yy_hold_char = *++yy_c_buf_p;

which means that yytext doesn't get the extra literal_length characters. I'd rather not create a new buffer to store them if I can avoid it, because I know the character sequence is already in memory.

Aside from completely redefining yyinput(), is there any way to keep the extra characters in yytext?

splash
Roderick

2 Answers


You are matching the CRLF, so yytext contains CRLF.

If you want to match digits following CRLF, then you need to match the digits:

%x EXPECT_DIGITS

<EXPECT_LITERAL>\r\n    { BEGIN(EXPECT_DIGITS); /* ignore otherwise */ }
<EXPECT_DIGITS>[0-9]*   { BEGIN(INITIAL); /* parse yytext here */ return LITERAL; }

The fact that the characters may already have been read into the buffer is an implementation detail you cannot rely on.

You can probably simplify the match a bit more to get away without a special state (for example, you can match \r\n[0-9]*, then the digits are part of yytext already).
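
As a rough sketch of the "parse yytext here" step, assuming a rule such as \r\n[0-9]* has matched so that the token is the CRLF followed by the digits: the helper name parse_crlf_digits is hypothetical, and its text and len parameters stand in for yytext and yyleng. It relies on the text being NUL-terminated, as yytext is inside an action.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: text/len stand in for yytext/yyleng after a rule
 * such as \r\n[0-9]* has matched. Returns 0 on success, -1 on error. */
static int parse_crlf_digits(const char *text, size_t len, uint64_t *out)
{
    char *end;
    unsigned long long value;

    /* Need the two-byte CRLF prefix plus at least one digit. */
    if (len < 3 || memcmp(text, "\r\n", 2) != 0)
        return -1;

    /* strtoull would also accept leading whitespace or a sign; reject those. */
    if (text[2] < '0' || text[2] > '9')
        return -1;

    errno = 0;
    value = strtoull(text + 2, &end, 10);

    /* Fail on overflow, or if the digits do not run to the end of the token. */
    if (errno == ERANGE || end != text + len)
        return -1;

    *out = (uint64_t)value;
    return 0;
}
```

In an action this would be called as parse_crlf_digits(yytext, yyleng, &value). Note that it parses however many digits were matched, not a fixed count.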

Simon Richter
  • Thanks for the information Simon. But I cannot create a match for an exact number of characters that is not known until runtime. I know that the characters are read because calling yyinput() causes them to be read. We can detect if EOF occurs before the expected number of characters, and YYINPUT can be made to wait if they aren't ready yet. So it is known that the characters are there. I could rewrite yyinput() to NOT destroy the incoming chars, but as this is excluded by the question, I'll accept your answer as a "no". – Roderick Jan 24 '17 at 14:07
  • @Roderick, that's what the asterisk does. The `[0-9]` matches any ASCII digit, the asterisk repeats that match. `yyleng` then tells you how many characters matched. – Simon Richter Jan 24 '17 at 14:10
  • An asterisk gets all the characters it can. The question was to get "literal_length" characters, and only that many. – Roderick Jan 24 '17 at 14:11
  • Ah. That cannot be done with flex, because we leave the regex engine in order to execute user code. I'd enter a special state then in which I match single digits, and accumulate them. – Simon Richter Jan 24 '17 at 14:13

You can match the digits in a separate state, and terminate the state when you have all of them:

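%{
    #include <stdint.h>             /* for uint64_t */
    uint64_t accumulator;           /* value of the digits read so far */
    unsigned int remaining_digits;  /* digits still expected */
%}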

%x EXPECT_DIGITS

<EXPECT_LITERAL>\r\n    { BEGIN(EXPECT_DIGITS); remaining_digits = literal_length; accumulator = 0; }
<EXPECT_DIGITS>[0-9]    {
    accumulator = accumulator * 10 + (*yytext - '0');
    if (!--remaining_digits) {   /* last expected digit consumed */
        BEGIN(INITIAL);
        *yylval = accumulator;
        return LITERAL;
    }
}
<EXPECT_DIGITS>.        { /* handle non-digits */ }

This needs some more error handling, obviously.
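
To give an idea of what that error handling might cover, here is a stand-alone C model of the EXPECT_DIGITS state. It is not flex code. The function read_exact_digits and the digit_status enum are hypothetical names, and the cases shown (a zero count, a non-digit, input ending early, overflow) are assumptions about what a real scanner would want to check.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes for the sketch below. */
enum digit_status {
    DIGITS_OK = 0,
    DIGITS_EMPTY,      /* a count of zero was requested */
    DIGITS_BAD_CHAR,   /* a non-digit appeared before the count was reached */
    DIGITS_TOO_SHORT,  /* input ended before the count was reached */
    DIGITS_OVERFLOW    /* value does not fit in uint64_t */
};

/* Model of the EXPECT_DIGITS state: consume exactly `count` digits from
 * `input` (of length `len`) and accumulate their decimal value in *out. */
static enum digit_status read_exact_digits(const char *input, size_t len,
                                           unsigned int count, uint64_t *out)
{
    uint64_t accumulator = 0;
    unsigned int remaining = count;
    size_t i = 0;

    /* With a count of zero, a pre-decrement test would wrap around. */
    if (count == 0)
        return DIGITS_EMPTY;

    while (remaining > 0) {
        if (i >= len)
            return DIGITS_TOO_SHORT;

        unsigned char c = (unsigned char)input[i++];
        if (c < '0' || c > '9')
            return DIGITS_BAD_CHAR;

        unsigned int digit = (unsigned int)(c - '0');
        /* Check before multiplying so the accumulator never wraps. */
        if (accumulator > (UINT64_MAX - digit) / 10)
            return DIGITS_OVERFLOW;

        accumulator = accumulator * 10 + digit;
        remaining--;
    }

    *out = accumulator;
    return DIGITS_OK;
}
```

In the scanner itself, each check would map onto a rule or a test inside an action: the zero count belongs in the CRLF rule, the bad character in the catch-all rule (which would also need to cover newlines, since . does not match them), and the short input in an <<EOF>> rule for the state.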

Simon Richter