5

I want to build a lexer in C and I am following the dragon book, I can understand the state transitions but how to implement them?

Is there a better book?

The fact that I have to parse a string through a number of states so that I can tell whether the string is acceptable or not!

David Schmitt
  • 58,259
  • 26
  • 121
  • 165
Hick
  • 35,524
  • 46
  • 151
  • 243

5 Answers5

7

You can implement simple state transitions with a single state variable, for example if you want to cycle through the states start->part1->part2->end then you can use an enum to keep track of the current state and use a switch statement for the code you want to run in each state.

enum state { start=1, part1, part2, end} mystate;

// ...
mystate = start;
do {
  switch (mystate) {
    case start:
      // ...
    case part1:
      // ...
    case part2:
      // ...
      if (part2_end_condition) mystate = end; // state++ will also work
      // Note you could also set the state back to part1 on some condition here
      // which creates a loop
      break;
  }
} while (mystate != end);

For more complex state transitions that depend on several variables, you should use tables/arrays like this:

var1    var2    var_end    next_state
0       0       0          state1
0       1       0          state2
1       0       0          state3
1       1       0          state4
-1      -1      1          state_end // -1 represents "doesn't matter" here
schnaader
  • 49,103
  • 10
  • 104
  • 136
4

G'day,

Assuming you mean The Dragon book on compiler design, I'd recommend having a look around this page on compiler tools.

The page itself is quite small but has links through to various excellent resources on lexical analysers.

HTH

cheers,

Rob Wells
  • 36,220
  • 13
  • 81
  • 146
4

There's more than one way to do it. Every regular expression corresponds directly to a simple structured program. For example, an expression for numbers could be this:

// regular expression
digit* [.digit*]

and the corresponding C code would be:

// corresponding code
while(DIGIT(*pc)) pc++;
if (*pc=='.'){
    pc++;
    while(DIGIT(*pc)) pc++;
}

The transition-table way of building lexers is, in my opinion, needlessly complicated, and obviously runs slower.

Mike Dunlavey
  • 40,059
  • 14
  • 91
  • 135
  • Any particular reason why you use *pc then pc[0] then *pc again? – John Machin Jun 15 '09 at 12:15
  • @John. Fixed. I guess that's an accidental leftover from the case where I wanted to exclude the case of a stand-alone '.' by looking ahead. In other words, I should really wrap the whole thing in if(DIGIT(pc[0]) || (pc[0]=='.' && DIGIT(pc[1]))) – Mike Dunlavey Jun 15 '09 at 15:36
2

If you're looking for a more modern treatment than the dragon book(s) : Andrew W. Appel and Maia Ginsburg, Modern Compiler Implementation in C, Cambridge University Press, 2008.

Chapter 2 is focused on Lexical Analysis : Lexical tokens, Regular expressions, Finite automata; Nondeterministic Finite Automata; Lexical analyzer generators

Look at the Table of Contents

anno
  • 5,970
  • 4
  • 28
  • 37
0

The program flex (a clone of lex) will create a lexer for you.

Given an input file with the lexer rules, it will produce a C file with an implementation of a lexer for those rules.

You can thus check the output of flex for how to write a lexer in C. That is, if you don't just want to use flex's lexer...

Daren Thomas
  • 67,947
  • 40
  • 154
  • 200
  • Also, Bison has a disclaimer that says that Bison-generated code can be used in non-GPL code. – Chris Lutz Jun 15 '09 at 16:33
  • updated (removed comments about GPL, my bad, sorry). Please don't call people silly. It was a little offended, but only at first. Bison did have issues with the generated code, glad to see they have added a disclaimer. – Daren Thomas Jun 16 '09 at 07:25