
I'm working on a small bash-like shell without scripting constructs (if, while, ...), and I have to write the lexer and the LL parser by hand.

The lexer will transform the command (char *cmd) into a linked list of tokens (t_list *list). The LL parser will then transform that linked list (t_list *list) into an AST (a binary tree, t_btree *root) according to a grammar.

So, I know how to make the LL parser but I don't know how to tokenize my command.

For example: ps | grep ls >> file ; make && ./a.out

=> 'ps' '|' 'grep' 'ls' '>>' 'file' ';' 'make' '&&' './a.out'

Thanks.

(I don't wanna use any generator)

mathieug
  • What do you have so far? Where are you stuck? – Mat Mar 30 '11 at 20:16
  • Since you tagged this as 'plain C', it sounds like you should be aiming for a loop over the command string using repeated calls to `strchr(cmd, ' ')` or something of that nature. – phooji Mar 30 '11 at 20:20
  • Yes, I have a loop on my command string. But I don't know how to define my token and how to tokenize them in the string. [http://pastebin.com/K1YZchMK](http://pastebin.com/K1YZchMK) – mathieug Mar 30 '11 at 21:25
  • @Spudd86 No, I can't because there are some strings in my command string so how can I tokenize `ls -la` in one token? – mathieug Mar 30 '11 at 22:10
  • Use strtok to do what you indicate you want. – xcramps Mar 30 '11 at 22:12
  • @Math why in the world aren't you using flex/bison for this? It would get rid of most of the tedium of building your engine, be just about as fast as anything you could create manually, and be easier to debug and extend. – Spencer Rathbun May 04 '11 at 17:35

1 Answer


(This explains the idea hinted at by Spudd86).

You need to implement a finite state machine. There are the following states:

  • General state
  • Inside a file name
  • Inside the && token
  • Inside the || token

For each combination of current state and next input character, you have to decide what the next state is and whether to output a token. For example:

  • Current state: General; character: x => next state: inside-file-name
  • Current state: inside-file-name; character: space => next state: General; output the token
  • Current state: inside-file-name; character: & => next state: inside-&&; output the token
  • Current state: inside-&&; character: & => next state: General; output the token
  • Current state: inside-&&; character: x => next state: General; syntax error
  • ... (ad nauseam)
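The rules above can be written down as a lookup table indexed by state and character class. This is only a sketch: the names (`lex_state`, `char_class`, `struct transition`, `step`) are made up for illustration, it covers just the file-name and `&&` rules listed above, and two entries not in the list are assumptions (a `&` seen in the general state starts a `&&` token, and whitespace after a single `&` is reported as a syntax error).

```c
#include <assert.h>

/* Hypothetical names; only the states needed for the rules listed above. */
enum lex_state { ST_GENERAL, ST_IN_FILENAME, ST_IN_AND };
enum char_class { CC_OTHER, CC_SPACE, CC_AMP };

struct transition {
    enum lex_state next; /* state after consuming the character */
    int emit;            /* non-zero: the token collected so far is complete */
    int error;           /* non-zero: syntax error */
};

static enum char_class classify(int c)
{
    if (c == ' ' || c == '\t' || c == '\n')
        return CC_SPACE;
    if (c == '&')
        return CC_AMP;
    return CC_OTHER; /* assumption: everything else may appear in a file name */
}

/* One row per state; columns are CC_OTHER, CC_SPACE, CC_AMP. */
static const struct transition table[3][3] = {
    /* ST_GENERAL     */ { {ST_IN_FILENAME, 0, 0}, {ST_GENERAL, 0, 0}, {ST_IN_AND, 0, 0} },
    /* ST_IN_FILENAME */ { {ST_IN_FILENAME, 0, 0}, {ST_GENERAL, 1, 0}, {ST_IN_AND, 1, 0} },
    /* ST_IN_AND      */ { {ST_GENERAL, 0, 1},     {ST_GENERAL, 0, 1}, {ST_GENERAL, 1, 0} },
};

/* Looks up what to do for character c while in state s. */
static struct transition step(enum lex_state s, int c)
{
    assert(s == ST_GENERAL || s == ST_IN_FILENAME || s == ST_IN_AND);
    return table[s][classify(c)];
}
```

A driver loop would call `step` once per input character, append the character to the current token where appropriate, and output the token whenever `emit` is set. Keeping the rules in a table makes it easier to see which state/character combinations have not been decided yet.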

Working out all the rules is tedious (the fun starts when you have to debug the resulting code), so most people use code generators to do that.


Edit: some code (sorry if the syntax is messed up; I usually program in C++)

#include <stdio.h>

enum state {
    STATE_GENERAL,
    STATE_IN_FILENAME,
    STATE_IN_OR_TOKEN, // used below for '|' and '||'
    // ...
};

// Many characters are treated the same (e.g. 'x' and 'y') - so use categories
enum character_category
{
    CHAR_GENERAL, // can appear in filenames
    CHAR_WHITESPACE = ' ',
    CHAR_AMPERSAND = '&',
    CHAR_PIPE = '|',
    CHAR_EOF = EOF,
    // ...
};

// In C (unlike C++) the 'enum' keyword is part of the type name
enum character_category translate(int c)
{
    switch (c) {
    case '&': return CHAR_AMPERSAND;
    case '|': return CHAR_PIPE;
    case ' ': case '\t': case '\n': return CHAR_WHITESPACE;
    case EOF: return CHAR_EOF; // needed so the loop below terminates
    // ...
    default: return CHAR_GENERAL;
    }
}

void do_stuff(void)
{
    enum character_category cat;
    enum state current_state = STATE_GENERAL;
    enum state next_state;
    char token[100];
    int token_length = 0;
    do {
        int c = getchar();
        cat = translate(c);
        // The following implements a switch on 2 variables
        int selector = 1000 * current_state + cat;

        switch (selector)
        {
        case 1000 * STATE_GENERAL + CHAR_GENERAL:
            next_state = STATE_IN_FILENAME;
            token[token_length++] = c; // append a character to a filename token
            break;

        case 1000 * STATE_GENERAL + CHAR_WHITESPACE:
            next_state = STATE_GENERAL; // do nothing
            break;

        case 1000 * STATE_GENERAL + CHAR_PIPE:
            next_state = STATE_IN_OR_TOKEN; // the first char in '||' or just '|'
            break;

        // Much repetitive code already; define a macro for the case constants?
        // Have to cover all states and all character categories; good luck...

        case 1000 * STATE_IN_FILENAME + CHAR_EOF:
        case 1000 * STATE_IN_FILENAME + CHAR_WHITESPACE:
            next_state = STATE_GENERAL;
            token[token_length] = '\0'; // terminate the string before printing
            printf("Filename token: %s\n", token);
            token_length = 0; // start the next token from scratch
            break;

        default:
            printf("Bug\n"); // forgot one of the cases?
            next_state = STATE_GENERAL; // keep next_state defined
        }

        current_state = next_state;

    } while (cat != CHAR_EOF);
}
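
For the example command from the question, the same idea can be sketched as a function that works on a string instead of stdin. This is a minimal sketch under stated assumptions, not a full shell lexer: the names (`tokenize`, `tok_state`, `MAX_TOKENS`, `MAX_TOKEN_LEN`) and the fixed-size token array are hypothetical stand-ins for the asker's linked list, the operator set is assumed to be `|`, `&`, `;`, `<`, `>` plus their doubled forms, and quoting is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKENS 32     /* arbitrary limits chosen for this sketch */
#define MAX_TOKEN_LEN 64

enum tok_state { TS_GENERAL, TS_IN_WORD, TS_IN_OPERATOR };

static int is_space(char c) { return c == ' ' || c == '\t' || c == '\n'; }
static int is_operator(char c) { return c != '\0' && strchr("|&;<>", c) != NULL; }

/*
 * Splits cmd into tokens[]; returns the token count, or -1 if a token or
 * the token array would overflow.
 */
static int tokenize(const char *cmd, char tokens[][MAX_TOKEN_LEN], int max_tokens)
{
    enum tok_state state = TS_GENERAL;
    int count = 0;   /* completed tokens */
    size_t len = 0;  /* length of the token being built */
    size_t i = 0;    /* index of the next character to examine */

    assert(cmd != NULL);
    for (;;) {
        char c = cmd[i];
        switch (state) {
        case TS_GENERAL:
            if (c == '\0')
                return count;
            if (is_space(c)) { i++; break; }
            if (count == max_tokens)
                return -1;
            tokens[count][0] = c;  /* first character of a new token */
            len = 1;
            state = is_operator(c) ? TS_IN_OPERATOR : TS_IN_WORD;
            i++;
            break;
        case TS_IN_WORD:
            if (c == '\0' || is_space(c) || is_operator(c)) {
                /* end of word: emit it; c is re-examined in TS_GENERAL */
                tokens[count++][len] = '\0';
                state = TS_GENERAL;
            } else {
                if (len + 1 >= MAX_TOKEN_LEN)
                    return -1;
                tokens[count][len++] = c;
                i++;
            }
            break;
        case TS_IN_OPERATOR:
            /* "||", "&&", ">>", "<<": a repeated character extends the operator */
            if (c != ';' && c == tokens[count][0]) {
                tokens[count][len++] = c;
                i++;
            }
            tokens[count++][len] = '\0';
            state = TS_GENERAL;
            break;
        }
    }
}
```

Calling `tokenize("ps | grep ls >> file ; make && ./a.out", toks, MAX_TOKENS)` with `char toks[MAX_TOKENS][MAX_TOKEN_LEN]` fills `toks` with the ten tokens listed in the question. To build the linked list instead, replace the two places that terminate a token with a call that appends a node.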
anatolyg
  • Thank you. I saw some things about "finite state machine" but would you mind giving me a concrete example? I mean a little "C" example. I don't know what to do with tabs, spaces, etc. – mathieug Mar 30 '11 at 22:32