I have to create transition diagrams for a lexical analyzer for the identifiers and numbers.

The source code that the lexer has to handle is included below:

/* recursive factorial function */
int fact (int x )
{ 
   if (x>1)
      return x * fact (x-1); 
   else 
      return 1; 
} 

void main (void)
{
    int x;    
    x = read(); 
    if (x > 0) write (fact (x)); 
} 

I am feeling a little lost on how to create this diagram. Can anyone point me in the right direction or include resources that may help me with this task?

Ben Wainwright
John Doe

2 Answers

Malcolm McLean told you how to do it in actual code, but I think you need a more theoretical approach with a finite state machine.

First take inventory: what is needed, which symbols do we have, and so on. Here is an EBNF derived from the example code:

space = ? US-ASCII character 32 ?;
zero = '0';
digit = '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';
character = 'a' | 'A' | 'b' | 'B' ... 'z' | 'Z';

(* a single digit might be zero but a longer number must not start with a zero (no octals) *)
integer = zero | ( digit, {(digit|zero)} );
(* identifier must start with a character *)
identifier = character,{ (digit | character) };
(* the keywords from the example, feel free to add more *)
keywords = "if" | "else" | "return" | "int" | "void";

(* TODO: line-end, tabs, etc. *)
delimiter = space, {space};

braceleft = '{';
braceright = '}';
parenleft = '(';
parenright = ')';

equal = '=';
greater = '>';
smaller = '<';

minus = '-';
product = '*';

semicolon = ';';

end = ? byte denoting EOF (end of file) ?;
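
Before any state transitions can happen, each input byte has to be mapped to one of the symbol classes above. A minimal sketch of that step in C follows. The names `enum symbol`, `classify` and `symbol_name` are made up for illustration, `SYM_OTHER` is an added catch-all that the EBNF does not define, and the letter test assumes an ASCII-compatible character set.

```c
#include <assert.h>
#include <stdio.h>   /* EOF */

/* One enumerator per terminal class of the EBNF.
   SYM_OTHER is an assumption: a catch-all for bytes the grammar
   does not mention (tabs, newlines, '/', '+', ...). */
enum symbol {
    SYM_ZERO, SYM_DIGIT, SYM_CHARACTER, SYM_SPACE,
    SYM_BRACELEFT, SYM_BRACERIGHT, SYM_PARENLEFT, SYM_PARENRIGHT,
    SYM_EQUAL, SYM_GREATER, SYM_SMALLER, SYM_MINUS, SYM_PRODUCT,
    SYM_SEMICOLON, SYM_END, SYM_OTHER
};

/* Map one value as returned by getc() to its symbol class. */
enum symbol classify(int c)
{
    if (c == EOF) return SYM_END;
    if (c == '0') return SYM_ZERO;
    if (c >= '1' && c <= '9') return SYM_DIGIT;
    /* Letters are contiguous in ASCII; this is not portable to EBCDIC. */
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
        return SYM_CHARACTER;

    switch (c) {
    case ' ': return SYM_SPACE;
    case '{': return SYM_BRACELEFT;
    case '}': return SYM_BRACERIGHT;
    case '(': return SYM_PARENLEFT;
    case ')': return SYM_PARENRIGHT;
    case '=': return SYM_EQUAL;
    case '>': return SYM_GREATER;
    case '<': return SYM_SMALLER;
    case '-': return SYM_MINUS;
    case '*': return SYM_PRODUCT;
    case ';': return SYM_SEMICOLON;
    default:  return SYM_OTHER;
    }
}

/* Name of a symbol class as spelled in the EBNF, handy for debugging. */
const char *symbol_name(enum symbol s)
{
    static const char *const names[] = {
        "zero", "digit", "character", "space",
        "braceleft", "braceright", "parenleft", "parenright",
        "equal", "greater", "smaller", "minus", "product",
        "semicolon", "end", "other"
    };
    assert((unsigned)s <= SYM_OTHER);
    return names[s];
}
```

With this in place the state machine only ever sees symbol classes, never raw characters, which is what keeps the transition lists below short.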

Now make a transition table. Begin with the state START. It is nothing special and has nothing to do, but we need to start somewhere. From there we can get any of the symbols above. Actually, that is the case after every state, so most of each list can be copied and pasted:

START
      zero        ->  ZERO
      digit       ->  INTEGER
      character   ->  IDENTIFIER
      space       ->  START
      braceleft   ->  BRACES
      braceright  ->  BRACES
      parenleft   ->  PARENTHESES
      parenright  ->  PARENTHESES
      equal       ->  COMPARING
      greater     ->  COMPARING
      smaller     ->  COMPARING
      minus       ->  ARITHMETIC
      product     ->  ARITHMETIC
      semicolon   ->  START
      end         ->  END

ZERO
      zero        ->  ERROR (well...)
      digit       ->  ERROR
      character   ->  ERROR
      space       ->  START
      braceleft   ->  BRACES
      braceright  ->  BRACES
      parenleft   ->  PARENTHESES
      parenright  ->  PARENTHESES
      equal       ->  COMPARING
      greater     ->  COMPARING
      smaller     ->  COMPARING
      minus       ->  ARITHMETIC
      product     ->  ARITHMETIC
      semicolon   ->  START
      end         ->  END

INTEGER
      zero        ->  INTEGER
      digit       ->  INTEGER
      character   ->  ERROR
      space       ->  START
      braceleft   ->  BRACES
      braceright  ->  BRACES
      parenleft   ->  PARENTHESES
      parenright  ->  PARENTHESES
      equal       ->  COMPARING
      greater     ->  COMPARING
      smaller     ->  COMPARING
      minus       ->  ARITHMETIC
      product     ->  ARITHMETIC
      semicolon   ->  START
      end         ->  END

The state IDENTIFIER means that we already have a character, so digits are allowed from here on:

IDENTIFIER
      zero        ->  IDENTIFIER
      digit       ->  IDENTIFIER
      character   ->  IDENTIFIER
      space       ->  START
      braceleft   ->  BRACES
      braceright  ->  BRACES
      parenleft   ->  PARENTHESES
      parenright  ->  PARENTHESES
      equal       ->  COMPARING
      greater     ->  COMPARING
      smaller     ->  COMPARING
      minus       ->  ARITHMETIC
      product     ->  ARITHMETIC
      semicolon   ->  START
      end         ->  END

There is nothing that follows the state ERROR except the state ERROR

ERROR -> ERROR

There is nothing that follows the state END except the state ERROR

END -> ERROR



ARITHMETIC
      zero        ->  ZERO
      digit       ->  INTEGER
      character   ->  IDENTIFIER
      space       ->  START
      braceleft   ->  BRACES
      braceright  ->  BRACES
      parenleft   ->  PARENTHESES
      parenright  ->  PARENTHESES
      equal       ->  COMPARING
      greater     ->  COMPARING
      smaller     ->  COMPARING
      minus       ->  ARITHMETIC
      product     ->  ARITHMETIC
      semicolon   ->  START
      end         ->  END

Leave counting and balance checking of braces and parentheses to the parser:

BRACES -> START
PARENTHESES -> START

COMPARING
      zero        ->  ZERO
      digit       ->  INTEGER
      character   ->  IDENTIFIER
      space       ->  START
      braceleft   ->  BRACES
      braceright  ->  BRACES
      parenleft   ->  PARENTHESES
      parenright  ->  PARENTHESES
      equal       ->  ERROR (only check for single characters here, no ">=" or similar)
      greater     ->  ERROR
      smaller     ->  ERROR
      minus       ->  ERROR
      product     ->  ERROR
      semicolon   ->  ERROR
      end         ->  ERROR

Hoping that I have not made any grave errors, the only problems left are spaces and keywords. Take the example "if".

At the first occurrence of a character:

      character   ->  KEYWORDS

KEYWORDS
      'i' -> IF
      'r' -> RETURN
      ...
      any other character (exc. parens etc.) -> IDENTIFIER

IF
      'f' -> IT_IS_IF
      ...
      any other character (exc. parens etc.) -> IDENTIFIER

IT_IS_IF
      '(' -> START
      ')' -> ERROR
      '=' -> ERROR
      ...
      digit or character -> IDENTIFIER 

You can take a shortcut, of course, and make every keyword a single symbol; it would be quite tedious otherwise. A bit of cheating is allowed, I guess?

Again, at the first occurrence of a character:

      character   ->  KEYWORDS

KEYWORDS
      if_symbol -> IF
      else_symbol -> ELSE
      return_symbol -> RETURN
      ...
      digit or character -> IDENTIFIER 

IF
      '(' -> PARENTHESES
      ')' -> ERROR
      '=' -> ERROR
      ...

So, can you just skip all whitespace? Not quite. A construct like

return x;

is lexically just as legitimate as

returnx;

but the second one is a single identifier. So, once you have a keyword in full, it is followed either by a space (or a semicolon, a brace, or whatever symbol is allowed after that particular reserved word), or by a character or digit, which makes it an identifier, or by something that is not allowed. The rest can, and should, be left to the parser.

Or you take the first-hit approach: once you have a keyword you go back to START, so returnx; would be seen as RETURN IDENTIFIER SEMICOLON. But that would reduce the number of possible identifiers, e.g. ifitsone would be IF ERROR, and that would most probably result in a lot of angry entries in your bug list.
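
A commonly used alternative, which is neither of the two variants above, is to scan the longest possible identifier first and only afterwards look the finished lexeme up in a list of reserved words. The following C sketch shows the idea. The function names `scan_word` and `classify_word` and the `token_kind` enum are hypothetical, and the keyword list is just the one from the EBNF.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum token_kind { TOK_IDENTIFIER, TOK_KEYWORD };

/* The reserved words from the EBNF; extend as needed. */
static const char *const keywords[] = { "if", "else", "return", "int", "void" };

static int is_letter(int c)
{
    /* Assumes an ASCII-compatible character set. */
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static int is_digit(int c)
{
    return c >= '0' && c <= '9';
}

/* Length of the longest identifier-shaped prefix of s:
   one letter followed by any number of letters or digits.
   Returns 0 if s does not start with a letter. */
size_t scan_word(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    if (!is_letter((unsigned char)s[0]))
        return 0;
    while (is_letter((unsigned char)s[n]) || is_digit((unsigned char)s[n]))
        n++;
    return n;
}

/* Decide whether a completed lexeme of length len is a reserved word. */
enum token_kind classify_word(const char *lexeme, size_t len)
{
    size_t count = sizeof keywords / sizeof keywords[0];
    assert(lexeme != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strlen(keywords[i]) == len && memcmp(keywords[i], lexeme, len) == 0)
            return TOK_KEYWORD;
    }
    return TOK_IDENTIFIER;
}
```

Because the whole word is consumed before the lookup, returnx and ifitsone both come out as plain identifiers, while return followed by a space is reported as a keyword. The price is that the keyword check lives outside the pure state machine.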

With all of the information above you can build the table. With the states as rows and the symbols as columns, it starts like this:

             zero        digit       character   space  braceleft  braceright  parenleft    ...
START        ZERO        INTEGER     IDENTIFIER  START  BRACES     BRACES      PARENTHESES  ...
ZERO         ERROR       ERROR       ERROR       START  BRACES     BRACES      PARENTHESES  ...
INTEGER      INTEGER     INTEGER     ERROR       START  BRACES     BRACES      PARENTHESES  ...
IDENTIFIER   IDENTIFIER  IDENTIFIER  IDENTIFIER  START  BRACES     BRACES      PARENTHESES  ...
  ...
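
Such a table maps almost directly onto a two-dimensional array in C. The sketch below is deliberately cut down to the first four columns (zero, digit, character, space) and the states that can be reached through them, plus the ERROR sink. All identifiers (`ST_*`, `SY_*`, `transition`, `run`) are invented for this illustration, and treating any other input character as an immediate error is an assumption, not part of the table above.

```c
#include <assert.h>
#include <stddef.h>

enum state  { ST_START, ST_ZERO, ST_INTEGER, ST_IDENTIFIER, ST_ERROR, N_STATES };
enum symcol { SY_ZERO, SY_DIGIT, SY_CHARACTER, SY_SPACE, N_SYMBOLS };

/* Rows are states, columns are symbols, exactly as in the text table. */
static const enum state transition[N_STATES][N_SYMBOLS] = {
    /*                  zero           digit          character      space     */
    [ST_START]      = { ST_ZERO,       ST_INTEGER,    ST_IDENTIFIER, ST_START },
    [ST_ZERO]       = { ST_ERROR,      ST_ERROR,      ST_ERROR,      ST_START },
    [ST_INTEGER]    = { ST_INTEGER,    ST_INTEGER,    ST_ERROR,      ST_START },
    [ST_IDENTIFIER] = { ST_IDENTIFIER, ST_IDENTIFIER, ST_IDENTIFIER, ST_START },
    [ST_ERROR]      = { ST_ERROR,      ST_ERROR,      ST_ERROR,      ST_ERROR },
};

/* Column index for a character, or -1 if it is outside this reduced table. */
static int column_of(int c)
{
    if (c == '0') return SY_ZERO;
    if (c >= '1' && c <= '9') return SY_DIGIT;
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) return SY_CHARACTER;
    if (c == ' ') return SY_SPACE;
    return -1;
}

/* Feed a whole string through the machine and return the final state. */
enum state run(const char *input)
{
    enum state s = ST_START;
    assert(input != NULL);
    for (; *input != '\0'; input++) {
        int col = column_of((unsigned char)*input);
        if (col < 0)
            return ST_ERROR;   /* assumption: unknown input is an error */
        assert(s < N_STATES);
        s = transition[s][col];
    }
    return s;
}
```

For example, run("x1") ends in ST_IDENTIFIER and run("42") in ST_INTEGER, whereas run("07") and run("4x") both end in ST_ERROR, matching the ZERO and INTEGER rows. Adding the remaining symbols means adding columns, and adding BRACES, COMPARING and so on means adding rows.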

Beware: all of the above is quite simplified and may contain errors! But that's basically how it works, it's not that complicated, it just has some fancy names you have to learn.

Just saw that Malcolm McLean's answer was deemed acceptable, so...

deamentiaemundi

The lexer starts off in a null or initial state. It hits the 'i', so it knows it must have either a keyword or an identifier. It hits the 'n' and the 't' and adds them to the token. It hits the space, so it knows that's the end of the token, which is "int", a keyword. Now it hits the 'f'. Same story, but the token is "fact", which is not a keyword, so it's an identifier. Now the '(': that's an open parenthesis token. And so it goes on.

When it hits the '/', that could be either a division token or the start of a comment; here it is the start of a comment. So it now goes into the comment state until it hits the */.

There's nothing else significantly different, except that you have a few integer literal tokens in there. To make it easy for you, there are no strings. main is a bit of a special case: depending on how the lexer is written, it could be regarded as a keyword or as a plain identifier.
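
The comment state can be sketched as a small C helper that works on an in-memory string. The function name `skip_comment` and its three-way return convention are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Handle a possible comment at the start of s.
   - s does not start with the two characters '/' '*': return s unchanged
     (a lone '/' would then be a division token).
   - s starts a comment that is closed: return a pointer just past the
     closing star-slash pair.
   - the comment is never closed: return NULL. */
const char *skip_comment(const char *s)
{
    assert(s != NULL);
    if (s[0] != '/' || s[1] != '*')
        return s;

    s += 2;                          /* enter the comment state */
    while (*s != '\0') {
        if (s[0] == '*' && s[1] == '/')
            return s + 2;            /* leave the comment state */
        s++;
    }
    return NULL;                     /* hit end of input inside the comment */
}
```

Applied to the first line of the example program, the helper returns a pointer to the newline that follows the comment, and the lexer carries on from there in its initial state.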

Malcolm McLean