To supply a flex scanner with a stream of input, define the YY_INPUT macro, which flex invokes every time it
needs to refill its buffer [note 1]. The macro is invoked with three arguments, roughly like this (bytes_read is a plain variable that the macro assigns to, not a pointer):
YY_INPUT(buffer, bytes_read, max_bytes)
The macro is expected to read up to max_bytes
into buffer
, and to set bytes_read
to the actual number of bytes read. If there is no more input in this stream, YY_INPUT
should set bytes_read
to YY_NULL
(which is 0). There is no way to flag an input error other than signalling end of file. Do not set bytes_read
to a negative value.
Note that YY_INPUT
does not provide an indication of where to read the input from or any sort of userdata
argument. The only provided mechanism is the global yyin
, which is a FILE*
. (You could create a FILE*
from a file/socket descriptor with fdopen
and get the descriptor back with fileno
. Other workarounds are beyond the scope of this answer.)
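As a rough illustration of the fdopen/fileno workaround, here is a minimal sketch of a socket-backed YY_INPUT. The names my_socket_input and my_socket_eof are hypothetical (they are not part of flex), and the sketch assumes a POSIX system and that yyin was created with fdopen on a connected stream socket. Errors are folded into end of stream, since that is the only condition YY_INPUT can report.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical state (not part of flex): set once the stream has ended,
 * so that later calls keep reporting end of stream without touching
 * the socket again. */
static int my_socket_eof = 0;

/* Hypothetical helper: read up to max bytes from socket fd into buf.
 * Returns the number of bytes read, or 0 on end of stream or error. */
static int my_socket_input(int fd, char *buf, size_t max) {
    ssize_t n;
    assert(buf != NULL);
    if (my_socket_eof) return 0;
    do {
        n = recv(fd, buf, max, 0);
    } while (n < 0 && errno == EINTR);   /* retry if interrupted */
    if (n <= 0) {                        /* peer closed, or an error */
        my_socket_eof = 1;
        return 0;
    }
    return (int)n;
}

/* Only meaningful inside a flex scanner, where yyin exists.
 * Assumption: yyin = fdopen(socket_fd, "r") was done before yylex(). */
#define YY_INPUT(buf, result, max_size) \
    ((result) = my_socket_input(fileno(yyin), (buf), (max_size)))
```

Putting the logic in a function rather than in the macro body keeps local variable names out of the scanner's scope; the macro itself only assigns the result.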
When the scanner encounters the end of a stream, as indicated by YY_INPUT
reporting 0 bytes read, it finishes the current token [note 2], and then calls yywrap
to decide whether there is another stream to process. As the manual indicates, flex does not reset the scanner state at that point (that is, the current start condition, the current line number if line counting is enabled, and so on). However, it does not allow tokens to span two streams.
The yywrap
mechanism is most commonly used when a parser/scanner is applied to a number of different files specified on the command line. In that use case, it would be a bit odd if a token could start in one file and continue into another one; most language implementations prefer their files to be somewhat self-contained. (Consider multi-line string literals, for example.) Normally, you actually want to reset more of the scanner state as well (the line number, certainly, and sometimes the start condition), but that is the responsibility of yywrap
. [note 3]
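The multi-file use case might look something like the following sketch. The globals my_files, my_file_count and my_file_index and the helper my_next_file are hypothetical names, assumed to be filled in from argv by main. The yywrap part is guarded by FLEX_SCANNER (which flex defines in generated scanners) and is assumed to live in the user code section after the second %%, where yyin, yylineno, BEGIN and INITIAL are all visible.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical globals (not part of flex): the list of input files,
 * typically set from argv before the first call to yylex(). */
static char **my_files = NULL;
static int my_file_count = 0;
static int my_file_index = 0;

/* Open the next file in the list, skipping any that cannot be opened.
 * Returns NULL once the list is exhausted. */
static FILE *my_next_file(void) {
    assert(my_files != NULL || my_file_count == 0);
    while (my_file_index < my_file_count) {
        FILE *f = fopen(my_files[my_file_index++], "r");
        if (f != NULL) return f;
    }
    return NULL;
}

#ifdef FLEX_SCANNER
/* Compiled only as part of a flex-generated scanner. */
int yywrap(void) {
    FILE *next = my_next_file();
    if (next == NULL) return 1;      /* no more streams: scanning ends */
    if (yyin != NULL && yyin != stdin) fclose(yyin);
    yyin = next;
    yylineno = 1;                    /* meaningful with %option yylineno */
    BEGIN(INITIAL);                  /* optional: reset the start condition */
    return 0;                        /* continue scanning from the new yyin */
}
#endif
```

Whether to reset the start condition is a per-language decision; the line number almost always needs resetting if error messages are to be useful.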
For lexing from a socket, you'll probably want to call recv
from your YY_INPUT
implementation. But for experimentation purposes, here's a simple YY_INPUT
which just returns data from a memory buffer:
#include <stddef.h>  /* size_t */
#include <string.h>  /* memcpy */

/* Globals which describe the input buffer. */
const char* my_in_buffer = NULL;
const char* my_in_pointer = NULL;
const char* my_in_limit = NULL;

void my_set_buffer(const char* buffer, size_t buflen) {
    my_in_buffer = my_in_pointer = buffer;
    my_in_limit = my_in_buffer + buflen;
}
/* For debugging, limit the number of bytes YY_INPUT will
* return.
*/
#define MY_MAXREAD 26
/* This is technically incorrect because it returns 0
 * on EOF, assuming that YY_NULL is 0.
 * Note that flex passes ret as a variable to assign to,
 * not as a pointer.
 */
#define YY_INPUT(buf, ret, maxlen) do {                     \
    size_t avail = (size_t)(my_in_limit - my_in_pointer);   \
    size_t toread = (size_t)(maxlen);                       \
    if (toread > avail) toread = avail;                     \
    if (toread > MY_MAXREAD) toread = MY_MAXREAD;           \
    (ret) = toread;                                         \
    memcpy((buf), my_in_pointer, toread);                   \
    my_in_pointer += toread;                                \
} while (0)
Notes
1. This is not quite true; the buffer state includes a flag which indicates whether the buffer can be refilled. If you use yy_scan_bytes, the buffer state created is marked as non-refillable.
2. It's actually a bit more complicated than that, because flex scanners sometimes need to look ahead in order to decide which token has been matched, and the end-of-stream indication might occur during the lookahead. After the scanner backs up to the end of the recognized token, it still has to rescan the lookahead characters, which may contain several more tokens. To handle this, it sets a flag in the buffer state which indicates that end-of-stream has been reached, which prevents YY_INPUT from being called each time the scanner hits the end of the buffer. Despite this, it's probably a good idea to make sure that your YY_INPUT implementation will continue to return end-of-stream in case it is called again after an end-of-stream return.
3. For another concrete example, suppose you wanted to implement some kind of #include mechanism. flex provides the yypush_buffer_state/yypop_buffer_state mechanism which allows you to implement an include stack. (These are distinct from yy_push_state/yy_pop_state, which manage the stack of start conditions.) You'd call yypush_buffer_state once the include directive has been scanned, but yypop_buffer_state needs to be called when the included file ends, from yywrap or from an <<EOF>> rule (the example in the flex manual uses the latter). Again, very few languages would allow a token to start in the included source file and continue following the include directive.