4

I've been facing a design issue for a while :

  • I am parsing a source code string into a 1-dimensional array of token objects.

  • Depending on the type of a token (litteral, symbol, identifier), it has some token-type-specific data. Litterals have a value, symbols have a symbol type, and identifiers have a name.

  • I'm then building an abstract representation of the script defined in that source code string by analyzing this 1 dimensionnal array of tokens. The syntax analyzing logic is done outside of these token objects.

My problem is that I need all my tokens, no matter their type, to be stored into a single array, because it seems easier to analyze and because i don't see any other way to do it. This involves having a common type for all different token types, by either creating a class hierarchy :

class token { token_type type; };
class identifier : public token { string name; };
class litteral : public token { value val; };
class symbol : public token( symbol_type sym; };

... or by creating a variant :

class token
{
   token_type type;
   string name; // Only used when it is an identifier
   value val; // Only used when it is a litteral
   symbol_type sym; // Only used when it is a symbol
};

The class hierarchy would be used as follows :

// Iterator over token array
for( auto cur_tok : tokens )
{
    // Do token-type-specific things depending on its type
    if( cur_token->type == E_SYMBOL )
    {
        switch( ((symbol *) cur_token)->symbol_type )
        {
            // etc
        }
    }
}

But it has several problems :

  • The base token class has to know about it's subclasses, which seems wrong.

  • It involves down casting to access specific data depending on the type of a token, which i was told is wrong too.

The variant solution would be used in a similar way, without down-casting :

for( auto cur_token: tokens )
{
    if( cur_token->type == E_SYMBOL )
    {
        switch( cur_token->symbol_type )
        {
            // etc
        }
    }
}

The problem of that second solution is that it mixes everything into a single class, which doesn't seem very clean to me, as there are unused variables depending on the type of the token, and because a class should represent a single "thing" type.

Would you have another possibility to suggest to design this ? I was told about the visitor pattern, but i can't imagine how i would use it in my case.

I would like to keep the possibility of iterating over an array because i might have to iterate in both directions, from a random position and maybe multiple times.

Thank you.

Virus721
  • 8,061
  • 12
  • 67
  • 123
  • How about using template data types. Even your class could be templated – Spanky Jun 09 '15 at 09:28
  • I'm not sure to understand what you have in mind. – Virus721 Jun 09 '15 at 09:29
  • 6
    You can just use a discriminated `union`. If you want to get fancy then check out existing 3rd party library "variant" types. The alternative for clean type safety is inefficient dynamic allocation, or difficult-to-get-right placement new-ing (where each array item is essentially just a buffer large enough to hold the largest possible instance). – Cheers and hth. - Alf Jun 09 '15 at 09:36
  • AFAIK use can use templates to store data of various. If I understood your problem correctly templates could be your solution. http://www.tutorialspoint.com/cplusplus/cpp_templates.htm – Spanky Jun 09 '15 at 09:36
  • In your class hierarchy you can use a common interface like handleToken this method would be implemented in each class of the class hierarchy. Your loop could then just call token->handleToken() without needing to know anything about the exact type of the token. – Hauke S Jun 09 '15 at 09:43
  • Thanks i will have a look to those "discriminated unions". @HaukeS I would like to keep the analyzing logic outside of token objects. – Virus721 Jun 09 '15 at 09:45
  • @Virus721 - handleToken could just call the correct method in the logic implementing class. – Hauke S Jun 09 '15 at 09:51
  • @HaukeS Yeah that's a possibility, but it looks much more simple to iterate over an array than receiving a method class for each token. I think it might make things more complicated. – Virus721 Jun 09 '15 at 10:02
  • Looks like you could use [boost::variant](http://www.boost.org/doc/libs/1_58_0/doc/html/variant.html) – tsuki Jun 09 '15 at 10:04

1 Answers1

4

Option 1: "fat" type with some shared / some dedicated fields

Pick a set of data members that can be repurposed in a token-type specific way for your "some token-type-specific data. Literals have a value, symbols have a symbol type, and identifiers have a name."

 struct Token
 {
    enum Type { Literal, Symbol, Identifier } type_;

    // fields repurposed per Token-Type
    std::string s_;  // text of literal or symbol or identifier

    // fields specific to one Token-Type
    enum Symbol_Id { A, B, C } symbol_id_;
 };

A problem with this is that the names of shared fields may be overly vague so that they're not actively misleading for any given Token Type, while the "specific" fields are still accessible and subject to abuse when the Token is of another type.

Option 2: discriminated union - preferably nicely packaged for you ala boost::variant<>:

struct Symbol { ... };
struct Identifier { ... };
struct Literal { ... };
typedef boost::variant<Symbol, Identifier, Literal> Token;
std::list<Token> tokens;

See the tutorial for data retrieval options.

Option 3: OOD - classic Object Oriented approach:

Almost what you had, but crucially the Token type needs a virtual destructor.

struct Token { virtual ~Token(); };
struct Identifier : Token { string name; };
struct Literal : Token { value val; };
struct Symbol : Token { symbol_type sym; };

std::vector<std::unique_ptr<Token>> tokens_;
tokens_.emplace_back(new Identifier { the_name });

You don't need a "type" field as you can use C++'s RTTI to check whether a specific Token* addresses a specific derived type:

if (Literal* p = dynamic_cast<Literal>(tokens_[0].get()))
    ...it's a literal, can use p->val; ...

You're concerns were:

•The base token class has to know about it's subclasses, which seems wrong.

Not necessary given RTTI.

•It involves down casting to access specific data depending on the type of a token, which i was told is wrong too.

It often is, as often in OO it's practical and desirable to create a base class API expressing a set of logical operations that the entire hierarchy can implement, but in your case there might be a need for a "fat" interface (which means - lots of operations that - if they were in the API - would be confusing no-ops (i.e. do nothing) or somehow (e.g. return value, exceptions) report that many operations were not supported. For example, getting the symbol type classification isn't meaningful for non-symbols. Making it accessible only after a dynamic_cast is a little better than having it always accessible but only sometimes meaningful, as in "option 1", because after the cast there's compile-time checking of the usage.

Tony Delroy
  • 102,968
  • 15
  • 177
  • 252
  • 1
    +1 for `Option 2` with `boost::variant` which would be even more awesome with more details like other options :) – Drax Jun 09 '15 at 10:13
  • @Drax: true, but the linked docs leads to the Tutorial with 1 extra click, and I can't compete with that. – Tony Delroy Jun 09 '15 at 10:14
  • seems fair enough :) – Drax Jun 09 '15 at 10:15
  • Thanks for taking the time to make such a detailed answer. Concerning option 1, it's a bit against the purpose of tokenizing the string, because if i store tokens into strings without parsing them a first time, the result is just the input string cut into pieces, without the parsing done on numeric values, without symbols converted to enum values, etc... For option 2 it's maybe a good idea, but i'm doing this in order to write things myself, so i do not want to use libraries doing the work for me. – Virus721 Jun 09 '15 at 12:24
  • ...And what's troubling me in using a variant is not wasting a few octets but having everything put together in a single object. Option 3 is the one that looks the more natural to me, and if you say that, in this case, down casting is fine, then I should probably stick with that. However people say that dynamic_cast is very slow, so shouldn't I use a type field and static casts instead ? Thank you. – Virus721 Jun 09 '15 at 12:24
  • @Virus721: re option 1 - the numeric or `enum` value could be two (or one shared) field(s), and you can mix 1 with 2 by using a `union` for mutually exclusive fields (which is what a discriminated union is: your `Type` is the type-discriminating value, the rest can be fields, unions of fields, or unions of structs etc.. Anyway - three's not a bad option if you're more comfortable with that. Re `dynamic_cast` performance - I wouldn't let people put you off so easily... write better code with it, if it's too slow you can easily benchmark the type + static cast alternative. – Tony Delroy Jun 09 '15 at 16:42