I'm trying to understand how lexers in programming languages work.
Take a language like Java, for example. I imagine the lexer works by first splitting the source code into a stream of tokens using delimiters and then classifying each token with some regular expressions.
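For instance, I picture something roughly like this (just my own sketch of the idea; the regexes and class names are made up, not how Java actually does it):

    import java.util.regex.Pattern;

    public class NaiveLexer {
        // My imagined approach: split on whitespace, then classify each piece with a regex.
        static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");
        static final Pattern NUMBER     = Pattern.compile("[0-9]+");

        public static void main(String[] args) {
            String source = "int x = 2 ;";   // only works if everything is space-separated
            for (String piece : source.split("\\s+")) {
                if (IDENTIFIER.matcher(piece).matches()) {
                    // (keywords like "int" would presumably need a separate check)
                    System.out.println("identifier: " + piece);
                } else if (NUMBER.matcher(piece).matches()) {
                    System.out.println("number: " + piece);
                } else {
                    System.out.println("other: " + piece);
                }
            }
        }
    }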
This approach seems plausible to me at first, but then I realize that Java can recognize statements such as:
int x=2;
which that approach should not be able to handle, because there is no whitespace delimiter between x, =, and 2; yet Java correctly tokenizes it into [identifier][operator][number][;].
So what does the lexer actually do in this case? If it is not using delimiters, it seems it must use rules like: "if the token starts with a letter, look ahead until you hit a =, ;, or whitespace", "if it starts with a digit, look ahead until...". But this approach sounds very clumsy, and if that is how it works, I don't see how regular expressions come into play here.
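To make my question concrete, here is roughly the character-by-character lookahead I have in mind (again, just my guess at the technique, not what Java's real lexer does; the token names are invented):

    import java.util.ArrayList;
    import java.util.List;

    public class LookaheadLexer {
        // My guess at the rules: keep consuming characters as long as they can still
        // belong to the current token, then emit it and start the next one.
        public static List<String> tokenize(String src) {
            List<String> tokens = new ArrayList<>();
            int i = 0;
            while (i < src.length()) {
                char c = src.charAt(i);
                if (Character.isWhitespace(c)) {
                    i++;                                   // whitespace only separates tokens
                } else if (Character.isLetter(c)) {
                    int start = i;
                    while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                    tokens.add("word(" + src.substring(start, i) + ")");
                } else if (Character.isDigit(c)) {
                    int start = i;
                    while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                    tokens.add("number(" + src.substring(start, i) + ")");
                } else {
                    tokens.add("symbol(" + c + ")");       // =, ; and so on, one character each
                    i++;
                }
            }
            return tokens;
        }

        public static void main(String[] args) {
            // Prints [word(int), word(x), symbol(=), number(2), symbol(;)]
            System.out.println(tokenize("int x=2;"));
        }
    }

This happens to split "int x=2;" into the right pieces, but hand-writing rules like this for a whole language seems fragile, which is why I suspect real lexers do something smarter.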
Can someone explain to me roughly how tokenization in a lexical analyzer like Java's actually works? The materials I find on the internet only give me a vague idea.