
I'm writing a script interpreter, and first I need to tokenize a string containing the source code. For that I've identified three kinds of tokens:

  • Identifiers (variable names);
  • Symbols (+, -, etc., including "alphabetic" operators such as "return");
  • Literal values (true, false, 1, 3.14, "foo").

To represent this I considered two different approaches. The first is a class hierarchy:

class Token
{
public:
    enum type_e { E_IDENTIFIER, E_SYMBOL, E_LITERAL };
    const type_e type;
    virtual ~Token() {} // virtual function needed so dynamic_cast compiles
};
class Identifier : public Token
{
public:
    const std::string name;
};
class Symbol : public Token
{
public:
    const symbol_e symbol;
};
class Literal : public Token
{
public:
    const Value value;
};

which I would use with downcasts like this:

bla bla parseStatement( bla bla )
{
    // ...
    Token * curTok = tokens[ curPos ];
    if( curTok->type == E_SYMBOL && dynamic_cast< Symbol * >( curTok )->symbol == E_PLUS )
    {
       // ...
    }
    // ...
}

But I was told that downcasting usually means the design is flawed, and that it goes against the principle of polymorphism.

Then I thought about a second approach: a kind of variant class containing everything:

class Token
{
private:
    type_e _type;
public:
    type_e getType();
    bool isIdentifier();
    bool isSymbol();
    bool isLiteral();
    std::string getName(); // if this is an identifier, otherwise throws
    symbol_e getSymbol();  // if this is a symbol, otherwise throws
    Value getValue();      // if this is a literal, otherwise throws
};

which I would use like this:

bla bla parseStatement( bla bla )
{
    // ...
    Token curTok = tokens[ curPos ];
    if( curTok.isSymbol() && curTok.getSymbol() == E_PLUS )
    {
       // ...
    }
    // ...
}

But that doesn't seem any cleaner to me. It's basically the same thing, just a little shorter to write.

Someone suggested the Visitor design pattern, but I can't figure out how to apply it to my problem. I want to keep the syntax-analysis logic outside of the tokens; the logic will live in a syntax-analyzer class that manipulates them. I also need the tokens to either be of the same type or share a common base class so that I can store them in a single array.

Do you have an idea how I could design this? Thank you :)

Virus721
  • I like your solution better. For readability you could add `is*` and `as*` methods so your if check could read `curTok.isSymbol() && curTok.asSymbol()->symbol == E_PLUS` then `asSymbol` would do the `dynamic_cast` – AlexanderBrevig Dec 12 '14 at 10:01
  • Thanks for your answer! It would make the code easier to read and write, but it still relies on downcasting, so it only shifts the problem. – Virus721 Dec 12 '14 at 10:06
  • Downcasting is usually considered bad because you're trying to reach code that is more specific than what you asked for, but - you are essentially making a [run time type information](http://en.wikipedia.org/wiki/Run-time_type_information) system. – AlexanderBrevig Dec 12 '14 at 10:35
  • I think a huge variant class is harder to maintain and test, but I guess it's about preference. – AlexanderBrevig Dec 12 '14 at 10:36
  • Not every design should be object-oriented. A `pair` is all that's needed. – n. m. could be an AI Dec 12 '14 at 10:50

1 Answer


Here is a version using the Visitor pattern:

#include <cstdio>
#include <string>
using namespace std;

class Value { };
class Identifier;
class Symbol;
class Literal;
class ParseVisitor {
public:
    virtual void VisitAndParseFor(Identifier& identifier) = 0;
    virtual void VisitAndParseFor(Symbol& symbol) = 0;
    virtual void VisitAndParseFor(Literal& literal) = 0;
};

class Token {
public:
    virtual ~Token() {} // safe deletion through a Token*
    virtual void ParseWith(ParseVisitor& v) = 0;
};

class Identifier : public Token {
public:
    virtual void ParseWith(ParseVisitor& v) {
        v.VisitAndParseFor(*this);
    }
    const string name;
};

class Symbol : public Token {
public:
    enum symbol_e {
        E_PLUS, E_MINUS
    };
    virtual void ParseWith(ParseVisitor& v) {
        v.VisitAndParseFor(*this);
    }
    symbol_e symbol;
};

class Literal : public Token {
public:
    virtual void ParseWith(ParseVisitor& v) {
        v.VisitAndParseFor(*this); 
    }
    Value value;
};


// Implementing custom ParseVisitor
class Parser : public ParseVisitor {
    virtual void VisitAndParseFor(Identifier& identifier) { 
        std::printf("Parsing Identifier\n"); 
    }
    virtual void VisitAndParseFor(Symbol& symbol) { 
        std::printf("Parsing Symbol\n"); 
        switch (symbol.symbol) {
            case Symbol::symbol_e::E_PLUS: std::printf("Found plus symbol\n"); break;
            case Symbol::symbol_e::E_MINUS: std::printf("Found minus symbol\n"); break;
        }
    }
    virtual void VisitAndParseFor(Literal& literal) { 
        std::printf("Parsing Literal\n"); 
    }
};

int main() {
    Parser p;

    Identifier identifier;
    Symbol symbol;
    symbol.symbol = Symbol::symbol_e::E_PLUS;
    Literal literal;

    identifier.ParseWith(p);
    symbol.ParseWith(p);
    literal.ParseWith(p);

    return 0;
}

If you need context data when parsing, add an argument to Token::ParseWith and ParseVisitor::VisitAndParseFor. Likewise, you can change the return type if you need to pass state/data back.

AlexanderBrevig
  • Thanks for taking the time to answer. I had something close to that in mind when the visitor pattern was suggested to me, but it seems to make things a lot more complicated than necessary. For example, when I'm passing an expression to build a tree out of it, it feels comfortable to be able to access tokens by their position in the given token array. This approach would make that a lot more complicated, I think, unless I can find a way to modify it. Thanks anyway :) – Virus721 Dec 12 '14 at 14:08
  • you could extend the aforementioned signatures to take the list of tokens and the current index - if that is context you need (I am the one that argues for your original solution in the comments on your post - it depends on the size of the compiler. This version is easier to test and to extend if you expand a lot on the Token variants) – AlexanderBrevig Dec 12 '14 at 14:32
  • No, the types of tokens are very unlikely to ever change. I may add new symbols (e.g. new keywords) or new literal types (e.g. characters, booleans), but the three categories will probably never change. And if they ever change in a way that is too significant, I'd rewrite the whole thing. I'll try to find a way to adapt your idea. – Virus721 Dec 12 '14 at 14:48