Trying to understand the Lex syntax for Standard ML (ml-lex)

Question

I'm writing a compiler and am at the first phase: tokenizing the input. I've written the whole lexer specification, but I get an error. I've read the SML/NJ docs three or four times, and the error messages are not very informative.

I think I must be messing up the state-change part of the specification. It works fine for the rules that just create tokens, but when I switch to a state using YYBEGIN, it blows up.

Here is my lex file:

type pos = int;
type lexresult = Tokens.token;

val lineNum = ErrorMsg.lineNum;
val linePos = ErrorMsg.linePos;
val commentDepth = ref 0;

fun incCom(cmDepth) = cmDepth := !cmDepth + 1;
fun decCom(cmDepth) = cmDepth := !cmDepth - 1;

fun err(p1,p2) = ErrorMsg.error p1;

fun eof() = let val pos = hd(!linePos) in Tokens.EOF(pos,pos) end;



%% 
digits=[0-9]+;

%s COMMENT STRING;

%%

<INITIAL,COMMENT>\n         => (lineNum := !lineNum+1; linePos := yypos :: !linePos; continue());
<INITIAL>"type"             => (Tokens.TYPE(yypos, yypos+4));
<INITIAL>"var"              => (Tokens.VAR(yypos,yypos+3));
<INITIAL>"function"         => (Tokens.FUNCTION(yypos, yypos+8));
<INITIAL>"break"            => (Tokens.BREAK(yypos, yypos+5));
<INITIAL>"of"               => (Tokens.OF(yypos, yypos+2));
<INITIAL>"end"              => (Tokens.END(yypos, yypos+3));
<INITIAL>"in"               => (Tokens.IN(yypos, yypos+2));
<INITIAL>"nil"              => (Tokens.NIL(yypos, yypos+3));
<INITIAL>"let"              => (Tokens.LET(yypos, yypos+3));
<INITIAL>"do"               => (Tokens.DO(yypos, yypos+2));
<INITIAL>"to"               => (Tokens.TO(yypos, yypos+2));
<INITIAL>"for"              => (Tokens.FOR(yypos, yypos+3));
<INITIAL>"while"            => (Tokens.WHILE(yypos, yypos+5));
<INITIAL>"else"             => (Tokens.ELSE(yypos, yypos+4));
<INITIAL>"then"             => (Tokens.THEN(yypos, yypos+4));
<INITIAL>"if"               => (Tokens.IF(yypos, yypos+2));
<INITIAL>"array"            => (Tokens.ARRAY(yypos, yypos+5));
<INITIAL>":="               => (Tokens.ASSIGN(yypos, yypos+2));
<INITIAL>"|"                => (Tokens.OR(yypos, yypos+1));
<INITIAL>"&"                => (Tokens.AND(yypos, yypos+1));
<INITIAL>">="               => (Tokens.GE(yypos, yypos+2));
<INITIAL>">"                => (Tokens.GT(yypos, yypos+1));
<INITIAL>"<="               => (Tokens.LE(yypos, yypos+2));
<INITIAL>"<"                => (Tokens.LT(yypos, yypos+1));
<INITIAL>"<>"               => (Tokens.NEQ(yypos, yypos+2));
<INITIAL>"="                => (Tokens.EQ(yypos, yypos+1));
<INITIAL>"/"                => (Tokens.DIVIDE(yypos, yypos+1));
<INITIAL>"*"                => (Tokens.TIMES(yypos, yypos+1));
<INITIAL>"-"                => (Tokens.MINUS(yypos, yypos+1));
<INITIAL>"+"                => (Tokens.PLUS(yypos, yypos+1));
<INITIAL>"."                => (Tokens.DOT(yypos, yypos+1));
<INITIAL>"}"                => (Tokens.RBRACE(yypos, yypos+1));
<INITIAL>"{"                => (Tokens.LBRACE(yypos, yypos+1));
<INITIAL>"]"                => (Tokens.RBRACK(yypos, yypos+1));
<INITIAL>"["                => (Tokens.LBRACK(yypos, yypos+1));
<INITIAL>")"                => (Tokens.RPAREN(yypos, yypos+1));
<INITIAL>"("                => (Tokens.LPAREN(yypos, yypos+1));
<INITIAL>";"                => (Tokens.SEMICOLON(yypos, yypos+1));
<INITIAL>":"                => (Tokens.COLON(yypos, yypos+1));
<INITIAL>","                => (Tokens.COMMA(yypos,yypos+1));


<INITIAL>{digits}           => (Tokens.INT(valOf(Int.fromString(yytext)), yypos, yypos + (size yytext)));
<INITIAL>[a-z][a-z0-9_]*    => (Tokens.ID(yytext, yypos, yypos + (size yytext)));
<INITIAL>(").*(")           => (Tokens.STRING(yytext, yypos, yypos + (size yytext)));
<INITIAL>"\""               => (YYBEGIN STRING; continue());
<STRING>"\""                => (YYBEGIN INITIAL; continue());

<INITIAL>"/*"       => (incCom commentDepth; YYBEGIN COMMENT; continue());
<COMMENT>"/*"       => (incCom commentDepth; continue());
<COMMENT>"*/"       => (print "OTHER TRACE!\n"; decCom commentDepth; if !commentDepth <= 0 then YYBEGIN INITIAL else (); continue());

<INITIAL,COMMENT>[\ \t]+    => (print "TRACE 22222\n"; continue());
<INITIAL>.                  => (ErrorMsg.error yypos ("illegal character " ^ yytext); continue());

And here is the source file I'm tokenizing:

var , 123
/* some comment */
234 "d"

It doesn't like my comments and it doesn't like my strings. Thanks for the help.

EDIT: So here is my updated lex file. I have pinpointed where it breaks. It detects the start of the new comment just fine, it switches to the COMMENT state just fine, and it detects the space after the comment opener just fine. Then it breaks: it never gets to the point where it eats up the int.

rici · Accepted Answer · 2014-04-10T17:38:33.320

Comments are terminated by */, not *\ (your rule was <COMMENT>"*\\" =>). And surely you need a <COMMENT>. rule to deal with the body of the comment itself.

I don't see any lexical rule for state <STRING>; if there isn't one, then that will be the problem with strings. Otherwise, it's something to do with those rules, I think.

Edit based on edited question (not the best use of SO, IMHO):

I'm not an expert in SML lexing, but it seems to me that you need a rule to deal with the contents of comments and strings (as I said above in the first paragraph). In other words, no rule applies in state <COMMENT> or state <STRING> when the lexer encounters a character other than the terminating sequence (or, in the case of comments, other than whitespace or a newline).
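To make that concrete, here is a minimal, untested sketch of the kind of catch-all rules I mean. It assumes the rest of your specification stays as posted, and that these lines are added after your existing <COMMENT> and <STRING> rules. The character-skipping behaviour is my assumption about what you want, not something taken from your code.

```sml
<COMMENT>.      => ((* skip any other character in a comment body *) continue());
<STRING>\n      => (lineNum := !lineNum+1; linePos := yypos :: !linePos; continue());
<STRING>.       => ((* skip any other character in a string body *) continue());
```

Since `.` does not match a newline, the <STRING> state gets its own \n rule here; your existing <INITIAL,COMMENT>\n rule already covers comments. When two rules match the same length of input, the one listed first wins, so keeping <STRING>. after <STRING>"\"" lets the closing quote still end the string. Note that this only discards the string contents. To actually return a Tokens.STRING you would presumably collect yytext in a ref as you go and emit the token from the closing-quote rule.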
