How to make lex/flex recognize tokens not separated by whitespace?

Question

I'm taking a course in compiler construction, and my current assignment is to write the lexer for the language we're implementing. I can't figure out how to satisfy the requirement that the lexer must recognize concatenated tokens. That is, tokens not separated by whitespace. E.g.: the string 39if is supposed to be recognized as the number 39 and the keyword if. Simultaneously, the lexer must also exit(1) when it encounters invalid input.

A simplified version of the code I have:

%{
#include <stdio.h>
%}

%option main warn debug

%%

if      |
then    |
else    printf("keyword: %s\n", yytext);

[[:digit:]]+    printf("number: %s\n", yytext);

[[:alpha:]][[:alnum:]]*     printf("identifier: %s\n", yytext);

[[:space:]]+    // skip whitespace
[[:^space:]]+   { printf("ERROR: %s\n", yytext); exit(1); }

%%

When I run this (or my complete version), and pass it the input 39if, the error rule is matched and the output is ERROR: 39if, when I'd like it to be:

number: 39
keyword: if

(I.e. the same as if I entered 39 if as the input.)

Going by the manual, I have a hunch that the cause is that the error rule matches a longer possible input than the number and keyword rules, and flex will prefer it. That said, I have no idea how to resolve this situation. It seems unfeasible to write an explicit regexp that will reject all non-error input, and I don't know how else to write a "catch-all" rule for the sake of handling lexer errors.

UPDATE: I suppose I could just make the catch-all rule be . { exit(1); } but I'd like to get some nicer debug output than "I got confused on line 1".

a) Have you run your simplified version? b) what does it do that is wrong? — Ira Baxter, Apr 15 '13 at 23:41
@IraBaxter Sorry, seems I forgot to be explicit about my test case while lost in the speculation in the last paragraph. The answers are **a)** yes; and **b)** reports the lexer error instead of two tokens. (I've also added them into the question.) — millimoose, Apr 15 '13 at 23:45
Ah. OK, yes, your "^space" rule will eat any sequence of non-space, and thus consume "39if". Secret: avoid rules whose regexes overlap, unless the longer rule comes safely first. In your case, I'd use (I'm not a lex-pert) something to replace :^space: that was "not a digit, not a letter, not a space". ... — Ira Baxter, Apr 15 '13 at 23:57
With really good lexical specifications, you can write things like [any]*-[digit]+-[alpha][alnum]+-[space]+ to easily specify anything that doesn't look like your legal tokens. I don't believe lex will let you write this. — Ira Baxter, Apr 15 '13 at 23:58
@IraBaxter Unfortunately, for the assignment, I'm stuck to lex and a requirement that I'd never practically have to bother with. (If I ever actually had to write a language I'd probably stick to ANTLR, based on the intuition that if what I'm doing can't be LL(1) I'm the wrong person to be doing it anyway.) — millimoose, Apr 16 '13 at 00:30
What _should_ happen to white space? What tokens should be returned for `39 if` (i.e. separated by white space), or is that illegal? — Bryan Olivier, Apr 16 '13 at 08:37
@Bryan The second-to-last rule is to ignore whitespace. It's legal and should return the same as if they were concatenated. (Unless of course the concatenated word has a longer match, e.g. is a valid identifier.) — millimoose, Apr 16 '13 at 10:51
Then you should follow Ira Baxter's advice and just lose the last rule altogether. Lex will always give the longest match and, among equally long matches, the first rule listed. If white space is just ignored, then the absence of white space is not going to hurt. — Bryan Olivier, Apr 16 '13 at 11:00
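
A minimal sketch of the narrowing suggested in the comments above, assuming (as in the simplified lexer) that letters, digits and whitespace are the only legal characters. The error rule is restricted to characters that cannot start any other token, so it no longer competes with the number and keyword rules for the longest match. This is one possible reading of the comments, not a tested solution to the assignment.

```lex
%{
#include <stdio.h>
#include <stdlib.h>   /* exit() */
%}

%option main warn

%%

if      |
then    |
else    printf("keyword: %s\n", yytext);

[[:digit:]]+                printf("number: %s\n", yytext);

[[:alpha:]][[:alnum:]]*     printf("identifier: %s\n", yytext);

[[:space:]]+                /* skip whitespace */

[^[:alnum:][:space:]]+      { /* only characters no other rule can match */
                              fprintf(stderr, "ERROR: %s\n", yytext);
                              exit(1);
                            }

%%
```

With this version, the input 39if should be scanned as the number 39 followed by the keyword if: at the first character only the number rule matches, and for "if" the keyword and identifier rules tie on length, so the rule listed first wins. An input such as iffy still comes out as a single identifier because that match is longer.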

rici · Accepted Answer · 2017-11-17T21:56:00.650

You're quite right that you should just match a single "any" character as a fallback. The "standard" way of getting information about where in the line the parsing is at is to use the --bison-bridge option, but that can be a bit of a pain, particularly if you're not using bison. There are a bunch of other ways (look in the manual for the ways to specify your own I/O functions, for example), but the all-around simplest IMHO is to use a start condition:

%option yylineno
%x LEXING_ERROR
%%
    /* all your rules; the following *must* be at the end */
.                 { BEGIN(LEXING_ERROR); yyless(0); }
<LEXING_ERROR>.+  { fprintf(stderr,
                            "Invalid character '%c' found at line %d,"
                            " just before '%s'\n",
                            *yytext, yylineno, yytext+1);
                    exit(1);
                  }

Note: Make sure that your rules already skip whitespace, otherwise whitespace falls through to the fallback rule. The pattern .+ matches one or more non-newline characters, or in other words everything up to the end of the current line (it will force flex to read that far, which shouldn't be a problem). yyless(n) keeps the first n characters of the current match and returns the rest to the input, so yyless(0) in the . rule pushes the offending character back; it is then rescanned in the LEXING_ERROR state together with the rest of the line, producing (hopefully) a semi-reasonable error message. Also note that flex only maintains yylineno when %option yylineno is set. (The message won't really be reasonable if your input is multibyte, or has weird control characters, so you could write more careful code. Up to you. It also might not be reasonable if the error is at the end of a line, so you might also want to write a more careful regex which gets more context, and maybe even limits the number of forward characters read. Lots of options here.)

Look up start conditions in the flex manual for more info about %x and BEGIN.
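
For reference, here is a sketch of how the fallback could be combined with the simplified lexer from the question. It assumes yyless(0), which returns the entire one-character match to the input so that the LEXING_ERROR rule sees the offending character first, and it assumes %option yylineno so that the line number is tracked. Treat it as an illustration of the approach rather than a finished lexer.

```lex
%{
#include <stdio.h>
#include <stdlib.h>   /* exit() */
%}

%option main warn yylineno
%x LEXING_ERROR

%%

if      |
then    |
else    printf("keyword: %s\n", yytext);

[[:digit:]]+                printf("number: %s\n", yytext);
[[:alpha:]][[:alnum:]]*     printf("identifier: %s\n", yytext);
[[:space:]]+                /* skip whitespace */

.                 { /* fallback: must stay after every other rule */
                    BEGIN(LEXING_ERROR);
                    yyless(0);   /* push the bad character back for rescanning */
                  }
<LEXING_ERROR>.+  { /* yytext now holds the bad character plus the rest of the line */
                    fprintf(stderr,
                            "Invalid character '%c' found at line %d,"
                            " just before '%s'\n",
                            *yytext, yylineno, yytext + 1);
                    exit(1);
                  }

%%
```

Because the fallback matches only a single character, it can never be longer than a legitimate token, so 39if is still split into a number and a keyword. For an input line such as 39$if, the number should be reported first and the scanner should then stop with a message naming '$' and the remaining text if.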

I read up on start conditions but couldn't really put the pieces together, thanks! — millimoose, Apr 16 '13 at 14:30
It's a lot simpler to just return yytext[0] to the parser in the . rule and let the parser's error recovery deal with it. No start states required. This also eliminates all the rules for single special characters. — user207421, Apr 17 '13 at 22:03
@EJP: The OP specifically states that one of the requirements is that the lexer must `exit(1)` when it encounters invalid input. There's no indication that there is a parser at all, with or without error recovery. — rici, Apr 18 '13 at 00:34
