Malcolm McLean told you how to do it in actual code, but I think you need a more theoretical approach with a finite state machine.
First, take inventory: what is needed, which symbols do we have, and so on. Here is an EBNF derived from the example code:
space = ? US-ASCII character 32 ?;
zero = '0';
digit = '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';
character = 'a' | 'A' | 'b' | 'B' ... 'z' | 'Z';
(* a single digit might be zero but a number must not start with a zero (no octals) *)
integer = (digit|zero) | ( digit,{(digit|zero)});
(* identifier must start with a character *)
identifier = character,{ (digit | character) };
(* the keywords from the example, feel free to add more *)
keywords = "if" | "else" | "return" | "int" | "void";
(* TODO: line-end, tabs, etc. *)
delimiter = space, {space};
braceleft = '{';
braceright = '}';
parenleft = '(';
parenright = ')';
equal = '=';
greater = '>';
smaller = '<';
minus = '-';
product = '*';
semicolon = ';';
end = ? byte denoting EOF (end of file) ?;
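Before any state machine can run, each input character has to be mapped to one of the symbol classes from the inventory. A minimal Python sketch of that step; the function name `classify` and the use of an empty string for EOF are assumptions of mine, not part of the example code:

```python
import string

# One-character symbols from the EBNF inventory.
SINGLE_CHAR_SYMBOLS = {
    " ": "space",
    "0": "zero",
    "{": "braceleft",
    "}": "braceright",
    "(": "parenleft",
    ")": "parenright",
    "=": "equal",
    ">": "greater",
    "<": "smaller",
    "-": "minus",
    "*": "product",
    ";": "semicolon",
}


def classify(ch):
    """Map a single character (or "" for EOF, an assumption) to its symbol class."""
    if ch == "":
        return "end"
    if ch in SINGLE_CHAR_SYMBOLS:
        return SINGLE_CHAR_SYMBOLS[ch]
    if ch in "123456789":
        return "digit"
    if ch in string.ascii_letters:
        return "character"
    return None  # not in the inventory (tabs, line ends, ... are still TODO)
```

Anything that returns `None` here is a character the grammar does not know yet, which is where the TODO for line ends and tabs would go.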
Now make a transition table. Start with the state START. START is just the start state: nothing special, nothing to do, but we need to start somewhere. From there we can get any of the symbols above. Actually, that is the case after every state, so we can copy and paste the list:
START
zero -> ZERO
digit -> INTEGER
character -> IDENTIFIER
space -> START
braceleft -> BRACES
braceright -> BRACES
parenleft -> PARENTHESES
parenright -> PARENTHESES
equal -> COMPARING
greater -> COMPARING
smaller -> COMPARING
minus -> ARITHMETIC
product -> ARITHMETIC
semicolon -> START
end -> END
ZERO
zero -> ERROR (well...)
digit -> ERROR
character -> ERROR
space -> START
braceleft -> BRACES
braceright -> BRACES
parenleft -> PARENTHESES
parenright -> PARENTHESES
equal -> COMPARING
greater -> COMPARING
smaller -> COMPARING
minus -> ARITHMETIC
product -> ARITHMETIC
semicolon -> START
end -> END
INTEGER
zero -> INTEGER
digit -> INTEGER
character -> ERROR
space -> START
braceleft -> BRACES
braceright -> BRACES
parenleft -> PARENTHESES
parenright -> PARENTHESES
equal -> COMPARING
greater -> COMPARING
smaller -> COMPARING
minus -> ARITHMETIC
product -> ARITHMETIC
semicolon -> START
end -> END
The state IDENTIFIER means that we already have a character, so
IDENTIFIER
zero -> IDENTIFIER
digit -> IDENTIFIER
character -> IDENTIFIER
space -> START
braceleft -> BRACES
braceright -> BRACES
parenleft -> PARENTHESES
parenright -> PARENTHESES
equal -> COMPARING
greater -> COMPARING
smaller -> COMPARING
minus -> ARITHMETIC
product -> ARITHMETIC
semicolon -> START
end -> END
Nothing follows the state ERROR except the state ERROR:
ERROR -> ERROR
Nothing follows the state END except the state ERROR:
END -> ERROR
ARITHMETIC
zero -> ZERO
digit -> INTEGER
character -> IDENTIFIER
space -> START
braceleft -> BRACES
braceright -> BRACES
parenleft -> PARENTHESES
parenright -> PARENTHESES
equal -> COMPARING
greater -> COMPARING
smaller -> COMPARING
minus -> ARITHMETIC
product -> ARITHMETIC
semicolon -> START
end -> END
Leave counting and balance checking to the parser:
BRACES -> START
PARENTHESES -> START
COMPARING
zero -> ZERO
digit -> INTEGER
character -> IDENTIFIER
space -> START
braceleft -> BRACES
braceright -> BRACES
parenleft -> PARENTHESES
parenright -> PARENTHESES
equal -> ERROR (only check for single characters here, no ">=" or similar)
greater -> ERROR
smaller -> ERROR
minus -> ERROR
product -> ERROR
semicolon -> ERROR
end -> ERROR
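The lists above translate fairly directly into nested dictionaries. What follows is a minimal Python sketch, not a finished lexer. The names `COMMON`, `FRESH`, `step` and `run` are mine, and treating BRACES -> START and PARENTHESES -> START as a move that consumes no input is my reading of the lists, not something they spell out:

```python
# Transitions that are copied and pasted into almost every state.
COMMON = {
    "space": "START", "braceleft": "BRACES", "braceright": "BRACES",
    "parenleft": "PARENTHESES", "parenright": "PARENTHESES",
    "equal": "COMPARING", "greater": "COMPARING", "smaller": "COMPARING",
    "minus": "ARITHMETIC", "product": "ARITHMETIC",
    "semicolon": "START", "end": "END",
}
# Symbols that begin a new number or identifier.
FRESH = {"zero": "ZERO", "digit": "INTEGER", "character": "IDENTIFIER"}
# COMPARING rejects a second operator, a semicolon and EOF.
AFTER_COMPARING = dict.fromkeys(
    ["equal", "greater", "smaller", "minus", "product", "semicolon", "end"],
    "ERROR",
)

TRANSITIONS = {
    "START": {**COMMON, **FRESH},
    "ZERO": {**COMMON, "zero": "ERROR", "digit": "ERROR", "character": "ERROR"},
    "INTEGER": {**COMMON, "zero": "INTEGER", "digit": "INTEGER",
                "character": "ERROR"},
    "IDENTIFIER": {**COMMON, "zero": "IDENTIFIER", "digit": "IDENTIFIER",
                   "character": "IDENTIFIER"},
    "ARITHMETIC": {**COMMON, **FRESH},
    "COMPARING": {**COMMON, **FRESH, **AFTER_COMPARING},
}


def step(state, symbol):
    """Return the state that follows `state` on input `symbol`."""
    if state in ("ERROR", "END"):
        return "ERROR"  # nothing but ERROR follows ERROR or END
    if state in ("BRACES", "PARENTHESES"):
        state = "START"  # assumption: unconditional move back to START
    return TRANSITIONS[state].get(symbol, "ERROR")


def run(symbols):
    """Feed a sequence of symbol names through the machine from START."""
    state = "START"
    for symbol in symbols:
        state = step(state, symbol)
    return state
```

For example, the symbols of `x=10;` followed by EOF, `run(["character", "equal", "digit", "zero", "semicolon", "end"])`, end in END, while `run(["zero", "digit"])` (something like `01`) ends in ERROR.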
Hoping that I did not introduce any grave error, the only problems left are the spaces and the keywords.
With the example "if":
At the first occurrence of a character
character -> KEYWORDS
KEYWORDS
'i' -> IF
'r' -> RETURN
...
any other character (exc. parens etc.) -> IDENTIFIER
IF
'f' -> IT_IS_IF
...
any other character (exc. parens etc.) -> IDENTIFIER
IT_IS_IF
'(' -> START
')' -> ERROR
'=' -> ERROR
...
digit or character -> IDENTIFIER
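To see how the character-by-character version behaves, here is a small Python sketch that handles only the keyword "if". The function name `scan_word` is made up for illustration, and it assumes the input is a single run of letters and digits; the elided transitions for other keywords are left out:

```python
def scan_word(word):
    """Walk a run of letters/digits through KEYWORDS -> IF -> IT_IS_IF."""
    state = "KEYWORDS"
    for ch in word:
        if state == "KEYWORDS" and ch == "i":
            state = "IF"
        elif state == "IF" and ch == "f":
            state = "IT_IS_IF"
        else:
            # Any other letter or digit turns the word into an identifier,
            # and IDENTIFIER stays IDENTIFIER for the rest of the word.
            state = "IDENTIFIER"
    return state
```

`scan_word("if")` ends in IT_IS_IF, while `scan_word("ifx")` and `scan_word("it")` both end in IDENTIFIER. Every further keyword needs its own chain of states, which is why this gets tedious quickly.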
You can take a shortcut, of course, and make every keyword a single symbol; it would be quite tedious otherwise. A bit of cheating is allowed, I guess?
Again, at the first occurrence of a character
character -> KEYWORDS
KEYWORDS
if_symbol -> IF
else_symbol -> ELSE
return_symbol -> RETURN
...
digit or character -> IDENTIFIER
IF
'(' -> PARENTHESES
')' -> ERROR
'=' -> ERROR
...
So, can you just skip all white-space? A construct like
return x;
is just as legitimate as
returnx;
So, once you have a keyword in full, it is either followed by a space (or a semicolon, braces, or whatever symbol is allowed after a certain reserved word), or followed by a character/digit, which makes it an identifier, or followed by something that is not allowed. The rest can, and should, be left to the parser.
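The rule above amounts to longest match: read the whole run of letters and digits first, and only then decide whether it is a keyword. A rough Python sketch of that idea; note that it uses the `re` module instead of explicit states, the name `word_tokens` is hypothetical, and it only looks at words and ignores everything else:

```python
import re

KEYWORDS = {"if", "else", "return", "int", "void"}
# A word starts with a letter and continues with letters or digits.
WORD = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def word_tokens(text):
    """Return (kind, lexeme) for every maximal word found in `text`."""
    result = []
    for match in WORD.finditer(text):
        lexeme = match.group()
        # Decide only after the whole word has been read.
        kind = lexeme.upper() if lexeme in KEYWORDS else "IDENTIFIER"
        result.append((kind, lexeme))
    return result
```

With this, `word_tokens("return x;")` gives RETURN followed by IDENTIFIER, while `word_tokens("returnx;")` gives a single IDENTIFIER.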
Or you take the first-hit approach: once you have a keyword you go back to start, so returnx; would be seen as RETURN IDENTIFIER SEMICOLON. But that would reduce the number of possible identifiers, e.g. ifitsone would be IF ERROR, and that would most probably result in a lot of angry entries in your bug list.
With all of the information above you can build the table. If we set the rows to the states and the columns to the symbols, we get:
state       zero        digit       character   space  braceleft  braceright  parenleft    ...
START       ZERO        INTEGER     IDENTIFIER  START  BRACES     BRACES      PARENTHESES  ...
ZERO        ERROR       ERROR       ERROR       START  BRACES     BRACES      PARENTHESES  ...
INTEGER     INTEGER     INTEGER     ERROR       START  BRACES     BRACES      PARENTHESES  ...
IDENTIFIER  IDENTIFIER  IDENTIFIER  IDENTIFIER  START  BRACES     BRACES      PARENTHESES  ...
...
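In code, such a table is just a two-dimensional array indexed by state and symbol. A minimal Python sketch that contains only the rows and columns shown above (the elided ones are left out); the names `STATES`, `SYMBOLS`, `TABLE` and `next_state` are my own:

```python
STATES = ["START", "ZERO", "INTEGER", "IDENTIFIER"]
SYMBOLS = ["zero", "digit", "character", "space",
           "braceleft", "braceright", "parenleft"]

# Rows follow STATES, columns follow SYMBOLS.
TABLE = [
    ["ZERO", "INTEGER", "IDENTIFIER", "START",
     "BRACES", "BRACES", "PARENTHESES"],
    ["ERROR", "ERROR", "ERROR", "START",
     "BRACES", "BRACES", "PARENTHESES"],
    ["INTEGER", "INTEGER", "ERROR", "START",
     "BRACES", "BRACES", "PARENTHESES"],
    ["IDENTIFIER", "IDENTIFIER", "IDENTIFIER", "START",
     "BRACES", "BRACES", "PARENTHESES"],
]


def next_state(state, symbol):
    """Look up the follow-up state by row (state) and column (symbol)."""
    return TABLE[STATES.index(state)][SYMBOLS.index(symbol)]
```

A real implementation would probably number the states and symbols once and index the array directly instead of calling `index` on every step, but the lookup itself stays the same.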
Beware: all of the above is quite simplified and may contain errors! But that's basically how it works. It's not that complicated; it just has some fancy names you have to learn.
Just saw that Malcolm McLean's answer was deemed acceptable, so...