This question stems from another question I have asked over on the CS site. Reference
I have tried searching through online course notes from various Universities so that I can find the answer to a problem I am facing.
I have to implement a compiler for a custom language for an assignment. This language contains some atomic symbols like letters from the English alphabet, and digits. And I managed to find examples for these, and they're quite straightforward. For example: Go to Page 25
However, this language also contains reserved words such as if and for.
This is where I'm having problems. Suppose the lexer is trying to read the string "if (expression) statement". If I use an implementation like Page 4 it will incorrectly categorise if as an identifier.
So my idea is to implement a "lookahead" mechanism so that before the lexer categorises and sends to the DFA what is being read, it can make an informed, correct decision.
For example: The lexer encounters i. Since i can belong to a reserved word (if), the lexer should check for the next character. If it is f then the lexer should make sure it's not actually a normal string that happens to start with if, like ifxyz.
I like this idea, except I haven't managed to find anything similar from looking at course notes online, which makes me think that perhaps I'm doing something wrong.
UPDATE!! This is for those who got here through a search trying to find a solution. It's been a while, I've actually solved the issue and the answer linked in the comments is very helpful. I suggest you go read it.
Here's how I ended up solving this:
START(f) -> F
F(o) -> FO
FO(r) -> FOR
FOR(_) -> IDENTIFIER
Furthermore, all states have a "Lex As" property. Reason for this: Consider you arrive at state F with no further input. Therefore, you should assume it's an identifier (in most languages). Hence, F.lexAs would return the correct interpretation of the state, in this case, IDENTIFIER.