
My goal is to do string-interning. For this I am looking for a hashed container class that can do the following:

  • allocate only one block of memory per node
  • different userdata size per node

The value type looks like this:

struct String
{
    size_t refcnt;
    size_t len;
    char data[];
};

Every String object will have a different size. This will be accomplished with operator new + placement new. So basically I want to allocate the node myself and push it into the container later.
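A minimal sketch of what "operator new + placement new" could look like for such a node. It assumes the characters are stored directly behind a fixed-size header and reached through a `data()` accessor instead of the `char data[]` member. The helpers `make_string` and `free_string` are hypothetical names, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <new>

// Fixed-size header; the characters live directly behind it in the same block.
struct String
{
    std::size_t refcnt;
    std::size_t len;

    char* data() { return reinterpret_cast<char*>(this + 1); }
    const char* data() const { return reinterpret_cast<const char*>(this + 1); }
};

// Hypothetical helper: one allocation holds the header, the text and a '\0'.
String* make_string(const char* text, std::size_t len)
{
    assert(text != nullptr);
    void* block = ::operator new(sizeof(String) + len + 1);
    String* s = new (block) String{1, len};  // placement new builds the header
    std::memcpy(s->data(), text, len);
    s->data()[len] = '\0';
    return s;
}

// Hypothetical helper: String is trivially destructible, so releasing
// the raw block is all that is needed.
void free_string(String* s)
{
    ::operator delete(s);
}
```

Each call to `make_string` requests `sizeof(String) + len + 1` bytes, so every node has its own size while still occupying a single block.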

The following containers are not suitable:

  • std::unordered_set
  • boost::multi_index::*

    Cannot allocate different sized nodes

  • boost::intrusive::unordered_set

    Seems to work at first, but it has some drawbacks. First of all, you have to allocate the bucket array and maintain the load factor yourself. This is unnecessary and error-prone.

    Another problem is harder to solve: you can only search for objects of type String. It is inefficient to allocate a String every time you look up an entry when the input is, e.g., a std::string.
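To illustrate both requirements together, here is a rough sketch of a hand-written chained hash table. It is not an existing library container. `Node`, `InternTable` and `hash_bytes` are made-up names, the bucket count is fixed for brevity, and FNV-1a is just one possible hash. Lookup hashes and compares the caller's raw characters, so no temporary String has to be built.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <new>
#include <string>
#include <vector>

// Node and string share one block: header first, characters right behind it.
struct Node
{
    Node* next;          // intrusive chain link inside a bucket
    std::size_t hash;    // cached hash of the characters
    std::size_t refcnt;
    std::size_t len;

    char* data() { return reinterpret_cast<char*>(this + 1); }
};

// FNV-1a over a raw character range (any reasonable hash would do).
inline std::size_t hash_bytes(const char* p, std::size_t n)
{
    unsigned long long h = 14695981039346656037ULL;
    for (std::size_t i = 0; i < n; ++i) {
        h ^= static_cast<unsigned char>(p[i]);
        h *= 1099511628211ULL;
    }
    return static_cast<std::size_t>(h);
}

class InternTable
{
public:
    explicit InternTable(std::size_t bucket_count = 64) : buckets_(bucket_count)
    {
        assert(bucket_count > 0);
    }
    InternTable(const InternTable&) = delete;
    InternTable& operator=(const InternTable&) = delete;

    ~InternTable()
    {
        for (Node* head : buckets_)
            while (head) { Node* next = head->next; ::operator delete(head); head = next; }
    }

    // Lookup works directly on the caller's characters; nothing is allocated.
    Node* find(const char* p, std::size_t n) const
    {
        const std::size_t h = hash_bytes(p, n);
        for (Node* cur = buckets_[h % buckets_.size()]; cur; cur = cur->next)
            if (cur->hash == h && cur->len == n && std::memcmp(cur->data(), p, n) == 0)
                return cur;
        return nullptr;
    }

    // Returns the existing node or creates one with a single allocation.
    Node* intern(const std::string& s)
    {
        if (Node* hit = find(s.data(), s.size())) { ++hit->refcnt; return hit; }
        const std::size_t h = hash_bytes(s.data(), s.size());
        void* block = ::operator new(sizeof(Node) + s.size() + 1);
        Node*& head = buckets_[h % buckets_.size()];
        Node* node = new (block) Node{head, h, 1, s.size()};
        std::memcpy(node->data(), s.data(), s.size() + 1);  // copies the '\0' too
        head = node;
        return node;
    }

private:
    std::vector<Node*> buckets_;  // fixed size here; a real table would grow
};
```

A production version would still need rehashing when the load factor grows and a way to release nodes whose `refcnt` drops to zero; both are left out to keep the sketch short.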

Are there any other hashed containers that can be used for this task?

  • Does C++ even support open structures? I thought that this was a C only thing. – Šimon Tóth Dec 04 '12 at 10:52
  • You could keep the string data in a separate store and just have a pointer or reference to it. – Bo Persson Dec 04 '12 at 10:55
  • possible duplicate of [How can I do string interning in C or C++?](http://stackoverflow.com/questions/10634918/how-can-i-do-string-interning-in-c-or-c) – Christian.K Dec 04 '12 at 10:56
  • The idea to allocate the string inline with the container would throw a monkey wrench into an attempt to bucket your content. Besides, you will have to deal with fragmentation of your container, unless I misunderstood the intent behind your `refcnt` field. To that end, modern memory allocators will outdo almost any seemingly clever trick that programmers can try, so I would go with multiple allocations and `std::unordered_set`. – Sergey Kalinichenko Dec 04 '12 at 11:03
  • The idea to allocate only once for each node is not new and is heavily used by multi_index_container. My use case just expands this with a variable-sized extra store. – Jörg Richter Dec 04 '12 at 13:53

2 Answers


I don't think you can do that with any of the standard containers.

What you can do is store pointers to String and provide custom hash and comparison functors:

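#include <cstring>
#include <functional>
#include <string_view>   // C++17
#include <unordered_set>

struct StringHash
{
    size_t operator()(const String* str) const
    {
        // Hash the characters, not the pointer value.
        return std::hash<std::string_view>()(std::string_view(str->data, str->len));
    }
};

struct StringCmp
{
    bool operator()(const String* lhs, const String* rhs) const
    {
        // Two entries are equal when their contents match.
        return lhs->len == rhs->len && std::memcmp(lhs->data, rhs->data, lhs->len) == 0;
    }
};

std::unordered_set<String*, StringHash, StringCmp> my_set;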
  • I know, but this does two allocations per node. I don't want that. – Jörg Richter Dec 04 '12 at 10:55
  • @JörgRichter, what do you mean by `two allocations per node`? –  Dec 04 '12 at 10:59
  • unordered_set has an internal node for every element, holding the pointer to String together with some internal variables. The String itself has its own memory block. There is always a 1:1 relationship. I want the internal node and the String to share one memory block. This is what the intrusive containers in Boost do. But see above for the problems I have with them. – Jörg Richter Dec 05 '12 at 07:42

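Your definition of String is not valid standard C++: a flexible array member such as char data[] is a C99 feature, although several compilers accept it as an extension. The obvious solution is to replace the data field with a pointer (in which case you can put the structures themselves in std::unordered_set).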

It's possible to create an open ended struct in C++ with something like the following:

struct String
{
    int refcnt;
    int len;
    char* data()
    {
        return reinterpret_cast<char*>(this + 1);
    }
};

You're skating on thin ice if you do, however; for types other than char, there is a risk that this + 1 won't be appropriately aligned.

If you do this, then your std::unordered_set will have to contain pointers, rather than the elements, so I doubt you'll gain anything for the effort.
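The alignment risk mentioned above can be handled by rounding the header size up to the alignment of the trailing type before computing the payload address. The following is only a sketch. `Header`, `align_up`, `block_size`, `make_block` and `trailing` are hypothetical names, and the extra `int` field exists solely to give the header a size that is typically not a multiple of 8.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

struct Header
{
    int refcnt;
    int len;
    int extra;  // hypothetical third field, so the header size is awkward
};

// Round `offset` up to the next multiple of `align` (a power of two).
constexpr std::size_t align_up(std::size_t offset, std::size_t align)
{
    return (offset + align - 1) & ~(align - 1);
}

// Bytes to request for a header followed by `count` elements of T.
template <typename T>
constexpr std::size_t block_size(std::size_t count)
{
    return align_up(sizeof(Header), alignof(T)) + count * sizeof(T);
}

// One allocation for the header plus room for `count` elements of T.
template <typename T>
Header* make_block(int count)
{
    void* block = ::operator new(block_size<T>(static_cast<std::size_t>(count)));
    return new (block) Header{1, count, 0};
}

// Start of the trailing array, placed at a correctly aligned offset.
template <typename T>
T* trailing(Header* h)
{
    static_assert(alignof(T) <= alignof(std::max_align_t),
                  "over-aligned T would need an aligned allocation");
    assert(h != nullptr);
    char* base = reinterpret_cast<char*>(h);
    return reinterpret_cast<T*>(base + align_up(sizeof(Header), alignof(T)));
}
```

For `char` the rounding is a no-op, which is why the plain `this + 1` form works for character data.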

James Kanze
  • The solution with a pointer does not 'allocate only one block of memory per node' – Jörg Richter Dec 04 '12 at 14:15
  • @JörgRichter How many blocks per node are allocated depends on the implementation. What is certain is that no generic container can contain variable length elements. And that the runtime needed to allocate a fixed size node is very, very close to 0 if the implementation is done correctly. – James Kanze Dec 05 '12 at 22:43
  • With the Boost.Intrusive containers it is easy to create variable-length elements, but they have other drawbacks that prevent me from using them. At least I have not overlooked something before implementing my own container. BTW, it was easier than I thought and it now works like a charm. – Jörg Richter Dec 06 '12 at 09:33