I am working on a lexer. I have a Token struct, which looks like this:

struct Token {
    enum class Type { ... };
    
    Type type;
    std::string_view lexeme;
};

The Token's lexeme is just a view to a small piece of the full source code (which, by the way, is also std::string_view).

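The problem is that I need to re-map special characters (for instance, an escape sequence such as '\n' has to become an actual newline), so the lexeme can no longer always be a plain view into the source. Storing them as-is isn't a nice solution.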

I've tried replacing lexeme's type with std::variant<std::string, std::string_view>, but it quickly turned into spaghetti code: every time I want to read the lexeme (for example, to check that the type is Bool and the lexeme is "true"), it's a big pain.

Storing lexeme as an owning string won't solve the problem.

By the way, I use C++20; maybe there is a nice solution for it?

Jan Schultke
kteperin
  • Unfortunately, C++ does not have a reputation for being nice. There can be various solutions to this, but they would be highly context-dependent, and tailored to the rest of the code. – Sam Varshavchik Aug 14 '23 at 19:03
  • As `std::string` and `std::string_view` have a similar interface, it seems that `std::visit` code would be simple... – Jarod42 Aug 14 '23 at 19:08
  • Please show an example where `std::variant` is causing you trouble. Assuming the owning/non-owning combination works out for your use case, I don't see why this should result in particularly convoluted code. Of course working with `std::variant` is far from being as nice as in languages with proper sum types. – user17732522 Aug 14 '23 at 19:35
  • Is using `std::string_view` here worth the pain, in terms of performance (and maybe memory overhead, as well as, effectively, risk of dangling pointers)? Maybe SSO (short string optimisation) will come to your aid. – Paul Sanders Aug 14 '23 at 19:55
  • I would solve this in another way: *always* have both a `std::string_view` *and* an `std::string`. The string_view can point to the original source text or the Token's own `std::string`; the `std::string` in the Token is empty if unnecessary. – MSalters Aug 15 '23 at 07:53
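
MSalters' suggestion (keep both a view and a string in every token) could be sketched roughly as below. This is only an illustration under a few assumptions: the name `LexemeText` and its `borrow`/`own` factories are made up for this example, the view is computed on demand instead of being stored (a stored view pointing at the token's own string would dangle after a copy or a move), and `std::optional` replaces the "empty string means unused" convention so that an empty decoded literal stays representable.

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical holder for a token's text (name invented for this sketch).
// It always hands out a std::string_view and carries its own std::string
// only when the text had to be rewritten, e.g. after decoding escapes.
class LexemeText {
public:
    // Borrow a slice of the source; the source must outlive this object.
    static LexemeText borrow(std::string_view source_slice) {
        LexemeText t;
        t.borrowed_ = source_slice;
        return t;
    }

    // Take ownership of text that does not exist verbatim in the source.
    static LexemeText own(std::string decoded) {
        LexemeText t;
        t.owned_ = std::move(decoded);
        return t;
    }

    // Computed on every call, so a copied or moved object never returns
    // a view into another object's buffer.
    std::string_view view() const noexcept {
        return owned_ ? std::string_view(*owned_) : borrowed_;
    }

    bool owns() const noexcept { return owned_.has_value(); }

private:
    LexemeText() = default;

    std::string_view borrowed_;          // used when owned_ is empty
    std::optional<std::string> owned_;   // engaged only for rewritten text
};
```

Reading the lexeme is then always `token_text.view()`, regardless of where the characters live. The cost is that every token carries the footprint of both members, even when only the view is used.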

2 Answers

It seems to me that all you need is to encapsulate the variant behind a uniform interface for both alternatives. Converting a std::string to a std::string_view is dirt-cheap, and copying a std::string_view is equally cheap, so you can add a member function that returns a view and read the content through it.

#include <iostream>
#include <string>
#include <string_view>
#include <variant>

struct OptOwnString
{
    using variant_t = std::variant<std::string, std::string_view>;
    variant_t value;

    std::string_view view() const noexcept
    {
        /**
         * Note: noexcept since it is effectively impossible to
         * make this particular variant valueless_by_exception
         */
        return std::visit([](auto const& v) {
              return std::string_view(v); }, value);
    }
};

int main()
{
    OptOwnString owning { std::string("foo") };
    std::cout << owning.view() << '\n';
    OptOwnString borrowed { owning.view() };
    std::cout << borrowed.view() << '\n';
}
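
To connect this back to the lexer in the question, here is a rough sketch of how such a wrapper might be filled in: borrow the source slice when it contains no escape sequence, and allocate only when something actually has to be rewritten. The helper name `decode_escapes` and the handful of escapes it understands are assumptions made for this example, not part of the question's code.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <variant>

// Same shape as the wrapper above, repeated so the sketch compiles alone.
struct OptOwnString {
    std::variant<std::string, std::string_view> value;

    std::string_view view() const noexcept {
        return std::visit([](auto const& v) { return std::string_view(v); },
                          value);
    }
};

// Hypothetical helper: rewrites \n and \t; for any other escaped character
// (such as \\ or \") it drops the backslash and keeps the character.
// A lone trailing backslash is copied through unchanged.
OptOwnString decode_escapes(std::string_view raw) {
    if (raw.find('\\') == std::string_view::npos) {
        return OptOwnString{raw};   // nothing to rewrite: keep borrowing
    }
    std::string out;
    out.reserve(raw.size());        // decoded text is never longer than raw
    for (std::size_t i = 0; i < raw.size(); ++i) {
        char c = raw[i];
        if (c == '\\' && i + 1 < raw.size()) {
            c = raw[++i];           // look at the escaped character
            switch (c) {
                case 'n': c = '\n'; break;
                case 't': c = '\t'; break;
                default: break;     // keep the character itself
            }
        }
        out.push_back(c);
    }
    return OptOwnString{std::move(out)};
}
```

Callers never need to know which alternative is active; they call `view()` either way. The borrowed case still depends on the source buffer outliving the token.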
Homer512
  • `view()` could be `const noexcept`, and to reduce emitted code size, you could use `*std::get_if(value)`. – Jan Schultke Aug 14 '23 at 21:02
  • Or, much shorter implementation of `view` (which is also more resilient to future changes): `return std::visit([](auto const& v) { return std::string_view(v); }, value);` Which, if you happen to have a function object lying around that does `static_cast`, can read much nicer: `std::visit(static_cast_, value)` – Barry Aug 15 '23 at 02:23

You could just use std::string

Firstly, a std::string could be used in a Token just as well as a std::string_view. This might not be as costly as you think, because all major standard library implementations (libstdc++, libc++, and the MSVC STL) implement the small string optimization (SSO) for std::string, although the standard does not require it.

This means that short tokens like "const" wouldn't be allocated on the heap; the characters would be stored directly inside the string object. Before bothering with std::string_view and std::variant, you might want to measure whether allocations are even a performance issue. Otherwise, this is a case of premature optimization.
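
As a quick sanity check before profiling, you can ask your standard library how many characters it is likely to keep inline. The sketch below relies on an assumption that the standard does not guarantee: on implementations that use SSO, a default-constructed std::string reports the size of its inline buffer as its capacity. The function names are invented for this example, and this is no replacement for measuring real allocations.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Heuristic only: on SSO implementations, an empty std::string's capacity
// is the number of characters it can hold without a heap allocation.
inline std::size_t inline_capacity() {
    return std::string{}.capacity();
}

// Rough guess at whether storing `lexeme` in a std::string would allocate.
inline bool likely_fits_inline(std::string_view lexeme) {
    return lexeme.size() <= inline_capacity();
}
```

Running the lexer over a representative source file and counting how many lexemes fail `likely_fits_inline` gives a first impression of how many tokens would actually hit the heap.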

If you insist on std::variant ...

User @Homer512 has provided a solid solution already. Rather than using the std::variant directly, you could create a wrapper around it which provides a string-like interface for both std::string and std::string_view.

This is easy to do, because the names and meanings of most member functions are identical for both classes. That also makes them easy to use through std::visit.

#include <string>
#include <string_view>
#include <variant>

struct MaybeOwningString
{
    using variant_type = std::variant<std::string, std::string_view>;
    using size_type = std::string_view::size_type;

    variant_type v;

    // main member function which grants access to either alternative as a view
    std::string_view view() const noexcept {
        return std::visit([](const auto& str) -> std::string_view {
            return str;
        }, v);
    }

    // various helper functions which expose commonly used member functions
    bool empty() const noexcept {
        // helper functions can be implemented with std::visit, but this is verbose
        return std::visit([](const auto& str) {
            return str.empty();
        }, v);
    }

    size_type size() const noexcept {
        // helper functions can also be implemented by using view()
        return view().size();
    }

    // ...
};
Jan Schultke