I'm trying to understand how lexers in programming languages work.

Take a language like Java, for example. I imagine the lexer works by first splitting the source into a stream of tokens using delimiters and then classifying each token with regular expressions.
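
Concretely, the model I had in mind looks roughly like this (a hypothetical sketch with made-up names, just to illustrate my mental model):

```java
import java.util.regex.Pattern;

// Hypothetical "split, then classify" lexer -- my (flawed) mental model.
public class NaiveLexer {
    public static void main(String[] args) {
        String[] chunks = "int x=2;".split("\\s+");   // split on whitespace delimiters
        Pattern ident  = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");
        Pattern number = Pattern.compile("[0-9]+");
        for (String chunk : chunks) {
            if (ident.matcher(chunk).matches())       System.out.println("[identifier] " + chunk);
            else if (number.matcher(chunk).matches()) System.out.println("[number] " + chunk);
            else                                      System.out.println("[?] " + chunk); // "x=2;" lands here
        }
    }
}
```

On `int x=2;` this prints `[identifier] int` followed by `[?] x=2;`, which is exactly the problem described next.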

This approach seems plausible to me at first, but then I realized Java can recognize statements such as:

int x=2;

which should not be possible because there is no whitespace delimiter between x, =, and 2, yet Java correctly tokenizes it into [identifier][operator][number][;].

So what does the lexer actually do in this case? If it doesn't use delimiters, it seems that it applies rules like "if it starts with a letter, look ahead until you encounter a =, a ;, or whitespace"; "if it starts with a digit, look ahead until...". But this approach sounds very clumsy, and if that is how it works, I don't see how regular expressions come into play here.
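
To make the question concrete, here is roughly the character-by-character, look-ahead loop I am imagining (a hypothetical sketch; the names and token categories are mine, not real javac code):

```java
// Hypothetical hand-written lexer: at each position, consume the longest
// run of characters that still forms a valid token ("look ahead until...").
public class TinyLexer {
    public static void main(String[] args) {
        String src = "int x=2;";
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                   // whitespace separates tokens but is optional
            } else if (Character.isLetter(c)) {        // identifier (or keyword):
                int start = i;                         // consume letters/digits
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                // a real lexer would now check a keyword table; "int" would become [keyword]
                System.out.println("[identifier] " + src.substring(start, i));
            } else if (Character.isDigit(c)) {         // number: consume digits
                int start = i;
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                System.out.println("[number] " + src.substring(start, i));
            } else if (c == '=') {                     // a non-letter/digit character ends
                System.out.println("[operator] =");    // the previous token by itself, so
                i++;                                   // no whitespace is needed
            } else if (c == ';') {
                System.out.println("[;] ;");
                i++;
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
    }
}
```

This does tokenize `int x=2;` correctly, but the dispatching rules are written by hand for each token type, which is what feels clumsy to me.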

Can someone explain roughly how tokenization in a lexical analyzer for a language like Java actually works? The materials on the internet only give me a vague idea.

Tung Nguyen
  • I am wondering if this question should not be moved to a more appropriate forum: [Computer Science](https://cs.stackexchange.com) – sophros Jun 07 '18 at 17:47
  • A typical (usually machine generated) lexer runs a big Deterministic Finite State Automaton, which recognizes items using a form of regular expression matching. See https://stackoverflow.com/q/14419614/1256452 for more. – torek Jun 07 '18 at 18:27
  • 2
    https://stackoverflow.com/a/46811729/1566221 – rici Jun 07 '18 at 18:31
  • 2
    Logically speaking a scanner is an FSA, although it may be implemented by hand, which processes one character at a time. It doesn't look ahead at the entire line and split it up first. – user207421 Jun 07 '18 at 18:40
  • @sophros it actually should be. I'm sorry about that. It didn't cross my mind that I should have posted there. – Tung Nguyen Jun 07 '18 at 19:50
  • I think you guys might misunderstand my question. I'm not asking how a DFA for a regex could be implemented in reality. I'm asking how the compiler actually splits the text into tokens to be matched against the regex. Because surely it doesn't look at the whole text and then match it against a regex, since the grammar of Java is context-free (or at least something like it). – Tung Nguyen Jun 07 '18 at 19:59
  • 1
    @tung: did you look at the answer I linked in a comment above? It has a description of the typical lexical analyzer. – rici Jun 07 '18 at 20:58
  • @rici: I'm sorry, I was in a hurry and didn't look at your answer carefully. Now I find that it is exactly what I'm looking for. Your keyword "maximal munch" has led me to this paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.62.161&rep=rep1&type=pdf, which indeed affirms that most textbooks do not treat this subject carefully. Thanks again; I will close this question. – Tung Nguyen Jun 07 '18 at 22:04
  • 1
    No misunderstanding. I'm *telling* you how the compiler works in reality. It doesn't read a line and split, it uses a DFA, real or hand-written. – user207421 Jun 07 '18 at 23:45
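
Putting the commenters' description into code: the look-ahead rules in the question are exactly what a DFA encodes. A minimal illustrative sketch follows; the state set and names here are assumptions for the example, not the generated tables a real tool such as JFlex would emit:

```java
// Sketch of a table-driven DFA scanner: read one character at a time,
// follow transitions as long as possible, emit a token when the machine
// gets stuck (maximal munch). Every state below is accepting, so the
// backtracking to a "last accepting position" that real scanners need
// (e.g. for float literals) is omitted.
public class DfaLexer {
    enum State { START, IDENT, NUMBER }

    // Transition function; null means "no transition from this state on c".
    static State step(State s, char c) {
        switch (s) {
            case START:  if (Character.isLetter(c)) return State.IDENT;
                         if (Character.isDigit(c))  return State.NUMBER;
                         return null;
            case IDENT:  return Character.isLetterOrDigit(c) ? State.IDENT : null;
            case NUMBER: return Character.isDigit(c) ? State.NUMBER : null;
        }
        return null;
    }

    public static void main(String[] args) {
        String src = "x=42;";
        int i = 0;
        while (i < src.length()) {
            State s = State.START;
            State next;
            int start = i;
            // Run the DFA as far as it will go from the current position.
            while (i < src.length() && (next = step(s, src.charAt(i))) != null) {
                s = next;
                i++;
            }
            if (i == start) {                      // no IDENT/NUMBER transition:
                System.out.println("[other] " + src.charAt(i));
                i++;                               // treat as a single-char token
            } else {
                System.out.println("[" + s + "] " + src.substring(start, i));
            }
        }
    }
}
```

Generators like lex, flex, and JFlex compile the token regular expressions into exactly this kind of transition table, so the per-token rules never have to be written by hand.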

0 Answers