Flex, continuous scanning stream (from socket). Did I miss something using yywrap()?

Question

I'm working on a socket-based scanner (continuous stream) using Flex for pattern recognition. Flex doesn't find a match that overlaps 'array boundaries'. So I implemented yywrap() to set up new array content as soon as yylex() detects <<EOF>> (it then calls yywrap()). No success so far.

Basically (for pin-pointing my problem) this is my code:

%{

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define BUFFERSIZE 26
                     /*   0123456789012345678901234 */
char cbuf1[BUFFERSIZE] = "Hello everybody, lex is su";  // Warning, no '\0'
char cbuf2[BUFFERSIZE] = "per cool. Thanks!         ";
char recvBuffer[BUFFERSIZE];

int packetCnt = 0;

YY_BUFFER_STATE bufferState1, bufferState2;

%}

%option nounput
%option noinput

%%

"super"                 { ECHO; }
.                       { printf( "%c", yytext[0] );}

%%

int yywrap()
{

  int retval = 1;   

  printf(">> yywrap()\n");

  if( packetCnt <= 0 )    // Stop after 2
  {
    // Copy cbuf2 into recvBuffer
    memcpy(recvBuffer, cbuf2, BUFFERSIZE);

    //
    yyrestart(NULL); // ?? has no effect

    // Feed new data to flex
    bufferState2 = yy_scan_bytes(recvBuffer, BUFFERSIZE); 

    //
    packetCnt++;

    // Tell flex to resume scanning
    retval = 0;   
  }

  return(retval); 
}

int main(void)
{
  printf("Lenght: %d\n", (int)sizeof(recvBuffer)) ;

  // Copy cbuf1 into recvBuffer
  memcpy(recvBuffer, cbuf1, BUFFERSIZE);

  //
  packetCnt = 0;

  //
  bufferState1 = yy_scan_bytes(recvBuffer, BUFFERSIZE);

  //
  yylex();

  yy_delete_buffer(bufferState1);
  yy_delete_buffer(bufferState2);

  return 0;
}

This is my output:

dkmbpro:test dkroeske$ ./text 
Lenght: 26
Hello everybody, lex is su>> yywrap()
per cool. Thanks!         >> yywrap()

So there is no match on 'super'. According to the documentation, the lexer is not 'reset' between calls to yywrap(). What am I missing? Thanks.

rici · Accepted Answer · 2017-11-19T19:26:07.327

The mechanism for providing a stream of input to flex is to provide a definition of the YY_INPUT macro, which is called every time flex needs to refill its buffer [note 1]. The macro is called with three arguments, roughly like this:

YY_INPUT(buffer, bytes_read, max_bytes)

The macro is expected to read up to max_bytes bytes into buffer, and to set bytes_read to the number of bytes actually read. If there is no more input in this stream, YY_INPUT should set bytes_read to YY_NULL (which is 0). There is no way to flag an input error other than signalling the end-of-file condition. Do not set bytes_read to a negative value.

Note that YY_INPUT does not provide an indication of where to read the input from or any sort of userdata argument. The only provided mechanism is the global yyin, which is a FILE*. (You could create a FILE* from a file/socket descriptor with fdopen and get the descriptor back with fileno. Other workarounds are beyond the scope of this answer.)

When the scanner encounters the end of a stream, as indicated by YY_INPUT returning 0, it finishes the current token [note 2], and then calls yywrap to decide whether there is another stream to process. As the manual indicates, it does not reset the parser state (that is, which start condition it happens to be in; the current line number if line counting is enabled, etc.). However, it does not allow tokens to span two streams.

The yywrap mechanism is most commonly used when a parser/scanner is applied to a number of different files specified on the command line. In that use case, it would be a bit odd if a token could start in one file and continue into another one; most language implementations prefer their files to be somewhat self-contained. (Consider multi-line string literals, for example.) Normally, you actually want to reset more of the parser state as well (the line number, certainly, and sometimes the start condition), but that is the responsibility of yywrap. [note 3]

For lexing from a socket, you'll probably want to call recv from your YY_INPUT implementation. But for experimentation purposes, here's a simple YY_INPUT which just returns data from a memory buffer:

/* Globals which describe the input buffer. */
const char* my_in_buffer = NULL;
const char* my_in_pointer = NULL;
const char* my_in_limit = NULL;
void my_set_buffer(const char* buffer, size_t buflen) {
  my_in_buffer = my_in_pointer = buffer;
  my_in_limit = my_in_buffer + buflen;
}

/* For debugging, limit the number of bytes YY_INPUT will
 * return.
 */
#define MY_MAXREAD 26

/* This is technically incorrect because it returns 0
 * on EOF, assuming that YY_NULL is 0.
 */
#define YY_INPUT(buf, ret, maxlen) do {          \
   size_t avail = my_in_limit - my_in_pointer;   \
   size_t toread = (maxlen);                     \
   if (toread > avail) toread = avail;           \
   if (toread > MY_MAXREAD) toread = MY_MAXREAD; \
   (ret) = toread;                               \
   memcpy((buf), my_in_pointer, toread);         \
   my_in_pointer += toread;                      \
} while (0)
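To try the same idea outside a generated scanner, the logic can be written as an ordinary function and driven by hand. This is only a sketch: src_set_buffer and src_read are hypothetical names, and the chunk limit is passed as a parameter instead of a macro constant.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical function form of a memory-backed input source. */
static const char *src_pointer = NULL;
static const char *src_limit = NULL;

void src_set_buffer(const char *buffer, size_t buflen) {
    src_pointer = buffer;
    src_limit = buffer + buflen;
}

/* Copy at most `chunk` bytes (and never more than maxlen) into buf.
 * Returns the number of bytes copied. Once the data is exhausted it
 * returns 0, and it keeps returning 0 on every later call. */
size_t src_read(char *buf, size_t maxlen, size_t chunk) {
    assert(buf != NULL && src_pointer != NULL);
    size_t toread = (size_t)(src_limit - src_pointer);
    if (toread > maxlen) toread = maxlen;
    if (toread > chunk) toread = chunk;
    memcpy(buf, src_pointer, toread);
    src_pointer += toread;
    return toread;
}

/* Possible hook-up inside a flex file (an assumption, not tested here):
 *   #define YY_INPUT(buf, result, max_size) \
 *       ((result) = src_read((buf), (max_size), 26))
 */
```

If the two 26-byte packets from the question are placed back to back in one buffer and handed to src_set_buffer, the scanner sees them as a single stream delivered in pieces, so a token such as "super" can straddle the refill.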

Notes

1. This is not quite true; the buffer state includes a flag which indicates whether the buffer can be refilled. If you use yy_scan_bytes, the buffer state created is marked as non-refillable.
2. It's actually a bit more complicated than that, because flex scanners sometimes need to look ahead in order to decide which token has been matched, and the end-of-stream indication might occur during the lookahead. After the scanner backs up to the end of the recognized token, it still has to rescan the lookahead characters, which may contain several more tokens. To handle this, it sets a flag in the buffer state which indicates that end-of-stream has been reached, which prevents YY_INPUT from being called each time the scanner hits the end of the buffer. Despite this, it's probably a good idea to make sure that your YY_INPUT implementation will continue to return end-of-stream in case it is called again after an end-of-stream return.
3. For another concrete example, suppose you wanted to implement some kind of #include mechanism. flex provides the yypush_buffer_state/yypop_buffer_state mechanism, which allows you to implement an include stack. You'd call yypush_buffer_state once the include directive has been scanned, but yypop_buffer_state needs to be called from yywrap. Again, very few languages would allow a token to start in the included source file and continue following the include directive.

Thank you rici: I guess it's true what you point out: "it does not allow tokens to span two streams." This made me think of a solution where YY_INPUT never 'ends' and scans the incoming binary stream until the socket is closed. – Diederich Kroeske, Jun 04 '14 at 11:43

score 2 · Answer 2 · answered Jun 04 '14 at 11:54

Thanks to rici, the answer is to redefine the YY_INPUT macro. So I did:

#undef YY_INPUT
#define YY_INPUT(buf, result, max_size) inputToFlex(buf, &result, max_size)

....

void inputToFlex(char *buf, unsigned long int *result, size_t max_size)
{
  if( recv(psock, recvBuffer, RECVBUFFERSIZE, MSG_WAITALL) )
  {
      memcpy(buf, recvBuffer, RECVBUFFERSIZE );
      *result = RECVBUFFERSIZE;
  }
  else
  {
      *result = YY_NULL;
  }
}

This works perfectly: yywrap() is called when the socket is closed (by the client). Note the MSG_WAITALL flag I'm using instead of the more common 0.

Also see rici's note 2. If your scanner needs to look ahead, my solution is not sufficient and you need to implement '1 character overlapping buffer management'.

Thank you flex. (it also works very nice for binary streams)

Diederich Kroeske


You probably won't run into problems with this, but it's definitely not correct: you could copy too many bytes into `buf` if `max_size` happens to be less than `RECVBUFFERSIZE`. There's no need to use your own buffer; you can read directly into flex's buffer, and you don't need to wait until the buffer is full either. So the body of your function could be simply: `{ssize_t nread = recv(psock, buf, max_size, 0); if (nread > 0) *result = nread; else { if (nread < 0) { /* log error */ }; *result = YY_NULL; } }`. Also, you don't need to implement buffer management. Flex does it for you. – rici Jun 04 '14 at 14:00
Thanks, but I tried this with flex's buffers (in the recv(..)) and it didn't work. I did not investigate why, but I used my own buffer. The max_size issue is correct: it's 8192 on my CentOS machine and the code needs a check. I also ran some speed tests, and it runs faster using fixed buffer sizes (the MSG_WAITALL option). – Diederich Kroeske Jun 04 '14 at 21:22
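
For reference, the variant rici describes in the comment above might look roughly like the following on a POSIX system. This is a sketch, not the code either poster actually ran: recv_for_flex is a hypothetical name, and the hook-up in the trailing comment assumes the psock variable from the answer.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: receive directly into the buffer flex hands over,
 * never asking for more than max_size bytes. Returns the number of bytes
 * received, or 0 when the peer has closed the connection or recv failed. */
size_t recv_for_flex(int sock, char *buf, size_t max_size) {
    assert(buf != NULL);
    ssize_t nread = recv(sock, buf, max_size, 0);
    if (nread < 0) {
        perror("recv");   /* report the error; flex only sees end of input */
        return 0;
    }
    return (size_t)nread;
}

/* Possible hook-up inside the flex file (an assumption):
 *   #undef YY_INPUT
 *   #define YY_INPUT(buf, result, max_size) \
 *       ((result) = recv_for_flex(psock, (buf), (max_size)))
 */
```

Because the request is capped at max_size and the return value is the actual byte count, no intermediate buffer is needed and a short read is passed through as is.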
