How to recognize ID, Literals and Comments in Lex file

Question

I have to write a lex program that has these rules:

Identifiers: String of alphanumeric (and _), starting with an alphabetic character

Literals: Integers and strings

Comments: Start with the ! character and continue until the end of the line

Here is what I came up with

[a-zA-Z][a-zA-Z0-9]+    return(ID);
[+-]?[0-9]+     return(INTEGER); 
[a-zA-Z]+    return ( STRING);
!.*\n                  return ( COMMENT );

However, I still get a lot of errors when I compile this lex file.

What do you think the error is?

score 0 · Answer 1 · answered Nov 12 '16 at 12:45

It would have helped if you'd shown more clearly what the problem was with your code. For example, did you get an error message or did it not function as desired?

There are a couple of problems with your code, but it is mostly correct. The first issue I see is that you have not divided your lex program into the necessary sections with the %% divider. The first section of a lex program holds the definitions: named regular-expression fragments and other declarations. The second section holds the rules, each a pattern followed by the action to run when that pattern matches. The (optional) third section holds any additional C code, which is copied into the generated scanner. C code can also be placed in the definitions section when delimited by %{ and %} at the start of a line.

If we put your code through lex we would get this error:

"SoNov16.l", line 1: bad character: [
"SoNov16.l", line 1: unknown error processing section 1
"SoNov16.l", line 1: unknown error processing section 1
"SoNov16.l", line 1: bad character: ]
"SoNov16.l", line 1: bad character: +
"SoNov16.l", line 1: unknown error processing section 1
"SoNov16.l", line 1: bad character: (
"SoNov16.l", line 1: unknown error processing section 1
"SoNov16.l", line 1: bad character: )
"SoNov16.l", line 1: bad character: ;

Did you get something like that? In your example code you are specifying actions (the return(ID); is an example of an action), so your code belongs in the second section. You therefore need to put a %% line ahead of it. It will then be a valid lex program.

Your code depends on (probably) a parser, which consumes (and declares) the tokens. For testing purposes it is often easier to just print the tokens first. I solved this problem by writing a C macro which does the print and can be redefined to do the return at a later stage. Something like this:

%{
#define TOKEN(t) printf("String: %s Matched: " #t "\n",yytext)
%}
%%
[a-zA-Z][a-zA-Z0-9]+    TOKEN(ID);
[+-]?[0-9]+     TOKEN(INTEGER); 
[a-zA-Z]+    TOKEN (STRING);
!.*\n                  TOKEN (COMMENT);

If we build and test this, we get the following:

abc
String: abc Matched: ID

abc123
String: abc123 Matched: ID

! comment text
String: ! comment text
Matched: COMMENT

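Not quite correct. We can see that the ID rule is matching what should be a string. This is due to the ordering of the rules: when two patterns match the same amount of input, lex uses whichever rule appears first. We have to put the String rule first to ensure it wins that tie - unless of course you were supposed to match strings inside quotes? You also missed the underscore from the ID pattern, and the trailing + in that pattern requires at least two characters, so a one-letter name never matches ID; a * there would allow it. It's also a good idea to match and discard any whitespace characters: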

%{
#define TOKEN(t) printf("String: %s Matched: " #t "\n",yytext)
%}
%%
[a-zA-Z]+                TOKEN (STRING); 
[a-zA-Z][a-zA-Z0-9_]*    TOKEN(ID);
[+-]?[0-9]+              TOKEN(INTEGER);
!.*\n                    TOKEN (COMMENT);
[ \t\r\n]+               ;

Which when tested shows:

abc
String: abc Matched: STRING
abc123_
String: abc123_ Matched: ID
-1234
String: -1234 Matched: INTEGER
abc abc123 ! comment text
String: abc Matched: STRING
String: abc123 Matched: ID
String: ! comment text
Matched: COMMENT

Just in case you wanted strings in quotes, that is easy too:

%{
#define TOKEN(t) printf("String: %s Matched: " #t "\n",yytext)
%}
%%
\"[^"]+\"    TOKEN (STRING); 
[a-zA-Z][a-zA-Z0-9_]*    TOKEN(ID);
[+-]?[0-9]+     TOKEN(INTEGER);
!.*\n                  TOKEN (COMMENT ); 
[ \t\r\n]         ;

Which when tested shows:

"abc"
String: "abc" Matched: STRING
