I'm writing a compiler and I'm at the first phase, tokenizing the input. I wrote the whole lexer, but I get an error. I've read the SML/NJ docs three or four times, and the error messages are not very informative.
I think I must be messing up the state-change aspect of the program. It works fine for the rules that just create tokens, but when I switch to a state using YYBEGIN, it blows up.
Here is my lex file:
type pos = int;
type lexresult = Tokens.token;
val lineNum = ErrorMsg.lineNum;
val linePos = ErrorMsg.linePos;
val commentDepth = ref 0;
fun incCom(cmDepth) = cmDepth := !cmDepth + 1;
fun decCom(cmDepth) = cmDepth := !cmDepth - 1;
fun err(p1,p2) = ErrorMsg.error p1;
fun eof() = let val pos = hd(!linePos) in Tokens.EOF(pos,pos) end;
%%
digits=[0-9]+;
%s COMMENT STRING;
%%
<INITIAL,COMMENT>\n => (lineNum := !lineNum+1; linePos := yypos :: !linePos; continue());
<INITIAL>"type" => (Tokens.TYPE(yypos, yypos+4));
<INITIAL>"var" => (Tokens.VAR(yypos,yypos+3));
<INITIAL>"function" => (Tokens.FUNCTION(yypos, yypos+8));
<INITIAL>"break" => (Tokens.BREAK(yypos, yypos+5));
<INITIAL>"of" => (Tokens.OF(yypos, yypos+2));
<INITIAL>"end" => (Tokens.END(yypos, yypos+3));
<INITIAL>"in" => (Tokens.IN(yypos, yypos+2));
<INITIAL>"nil" => (Tokens.NIL(yypos, yypos+3));
<INITIAL>"let" => (Tokens.LET(yypos, yypos+3));
<INITIAL>"do" => (Tokens.DO(yypos, yypos+2));
<INITIAL>"to" => (Tokens.TO(yypos, yypos+2));
<INITIAL>"for" => (Tokens.FOR(yypos, yypos+3));
<INITIAL>"while" => (Tokens.WHILE(yypos, yypos+5));
<INITIAL>"else" => (Tokens.ELSE(yypos, yypos+4));
<INITIAL>"then" => (Tokens.THEN(yypos, yypos+4));
<INITIAL>"if" => (Tokens.IF(yypos, yypos+2));
<INITIAL>"array" => (Tokens.ARRAY(yypos, yypos+5));
<INITIAL>":=" => (Tokens.ASSIGN(yypos, yypos+2));
<INITIAL>"|" => (Tokens.OR(yypos, yypos+1));
<INITIAL>"&" => (Tokens.AND(yypos, yypos+1));
<INITIAL>">=" => (Tokens.GE(yypos, yypos+2));
<INITIAL>">" => (Tokens.GT(yypos, yypos+1));
<INITIAL>"<=" => (Tokens.LE(yypos, yypos+2));
<INITIAL>"<" => (Tokens.LT(yypos, yypos+1));
<INITIAL>"<>" => (Tokens.NEQ(yypos, yypos+2));
<INITIAL>"=" => (Tokens.EQ(yypos, yypos+1));
<INITIAL>"/" => (Tokens.DIVIDE(yypos, yypos+1));
<INITIAL>"*" => (Tokens.TIMES(yypos, yypos+1));
<INITIAL>"-" => (Tokens.MINUS(yypos, yypos+1));
<INITIAL>"+" => (Tokens.PLUS(yypos, yypos+1));
<INITIAL>"." => (Tokens.DOT(yypos, yypos+1));
<INITIAL>"}" => (Tokens.RBRACE(yypos, yypos+1));
<INITIAL>"{" => (Tokens.LBRACE(yypos, yypos+1));
<INITIAL>"]" => (Tokens.RBRACK(yypos, yypos+1));
<INITIAL>"[" => (Tokens.LBRACK(yypos, yypos+1));
<INITIAL>")" => (Tokens.RPAREN(yypos, yypos+1));
<INITIAL>"(" => (Tokens.LPAREN(yypos, yypos+1));
<INITIAL>";" => (Tokens.SEMICOLON(yypos, yypos+1));
<INITIAL>":" => (Tokens.COLON(yypos, yypos+1));
<INITIAL>"," => (Tokens.COMMA(yypos,yypos+1));
<INITIAL>{digits} => (Tokens.INT(valOf(Int.fromString(yytext)), yypos, yypos + (size yytext)));
<INITIAL>[a-z][a-z0-9_]* => (Tokens.ID(yytext, yypos, yypos + (size yytext)));
<INITIAL>(").*(") => (Tokens.STRING(yytext, yypos, yypos + (size yytext)));
<INITIAL>"\"" => (YYBEGIN STRING; continue());
<STRING>"\"" => (YYBEGIN INITIAL; continue());
<INITIAL>"/*" => (incCom commentDepth; YYBEGIN COMMENT; continue());
<COMMENT>"/*" => (incCom commentDepth; continue());
<COMMENT>"*/" => (print "OTHER TRACE!\n"; decCom commentDepth; if !commentDepth <= 0 then YYBEGIN INITIAL else (); continue());
<INITIAL,COMMENT>[\ \t]+ => (print "TRACE 22222\n"; continue());
<INITIAL>. => (ErrorMsg.error yypos ("illegal character " ^ yytext); continue());
And here is the source file I'm tokenizing:
var , 123
/* some comment */
234 "d"
It doesn't like my comments and it doesn't like my strings. Thanks for the help.
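One thing worth checking, offered as a guess rather than a confirmed diagnosis: the COMMENT state above only has rules for newlines, `/*`, `*/`, and spaces or tabs, and the STRING state only has a rule for the closing quote. Ordinary characters, such as the `some` in the sample input, match no rule in those states. Below is a minimal sketch of catch-all rules. Both rules are hypothetical additions, not part of the original file, and they assume ML-Lex's usual behavior of preferring the longest match and, on a tie, the rule listed first.

```sml
<COMMENT>. => ((* assumed fix: skip any other character inside a comment *) continue());
<STRING>. => ((* assumed fix: skip string contents until the closing quote *) continue());
```

If these are tried, they would need to come after the existing COMMENT and STRING rules, so that the one-character closing-quote rule still wins the tie against `.` in the STRING state. Note that `.` does not match a newline, and that this sketch discards string contents instead of building a STRING token.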
EDIT: So here is my updated lex file. I have pinpointed where it breaks. It detects the start of the new comment just fine, switches to the COMMENT state just fine, and detects the space after the comment start just fine, but then it breaks. It never gets to the point where it eats up the int.