I need a data structure that satisfies the following:

  • stores an arbitrary number of elements, where each element is described by 10 numeric metrics
  • allows fast (log n) search of elements by any of the metrics
  • allows fast (log n) insertion of new elements
  • allows fast (log n) removal of elements

And let's assume that the elements are expensive to construct.

I came up with the following plan:

  • store all elements in a vector called DATA.
  • use 10 std::sets, one for each of the 10 metrics. Each std::set is lightweight: it contains only integers, which are indexes into the vector DATA. The comparison operators 'look up' the appropriate element in DATA and then select the appropriate metric:
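template< int C >
struct Cmp
{
    // Orders indexes into the global DATA vector by metric C;
    // ties are broken by the index itself, so every key is unique.
    bool operator() (int const a, int const b) const
    {
        return ( DATA[a].coords[C] != DATA[b].coords[C] )
           ? ( DATA[a].coords[C] < DATA[b].coords[C] )
           : ( a < b );
    }
};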

Elements are never modified or removed from the vector. A new element is pushed back to DATA and then its index (DATA.size()-1) is inserted into the sets (set<int, Cmp<..> >). To remove an element, I set a flag in the element saying that it is deleted (without actually removing it from the DATA vector) and then remove the element's index from all ten std::sets.

This works fine as long as DATA is a global variable. (It also somewhat abuses the type system by making the templated struct Cmp dependent on a global variable.)

However, I was not able to enclose the DATA vector and the std::sets (set<int, Cmp<...> >) inside a class and then 'index' DATA with those std::sets. For starters, the comparison operator Cmp defined inside an outer class has no access to the outer class's fields (so it cannot access DATA). I also cannot pass the vector to the Cmp constructor, because Cmp is being constructed by std::set and std::set expects a comparison operator with a constructor that has no arguments.

I have a feeling I'm working against the C++ type system and trying to achieve something that the type system is purposely preventing me from doing. (I'm trying to make std::set depend on a variable that is going to be constructed only at runtime.) And while I understand why the type system might not like what I do, I think this is a legitimate use case.

Is there a way to implement the data structure/class I described above without providing a re-implementation of std::set/red-black tree? I hope there may be a trick I have not thought of yet. (And yes, I know that boost has something, but I'd like to stick to the standard library.)

prajmus
  • What's approximately the amount of elements `DATA` is going to hold? – 101010 Jun 25 '14 at 19:37
  • Do the metrics yield unique values for each data object? It appears to me you want to implement some sort of indexed (database) table. – moooeeeep Jun 25 '14 at 19:44
  • *"And yes, I know that boost has something, but I'd like to stick to the standard library."* [boost.MultiIndex](http://www.boost.org/doc/libs/1_55_0b1/libs/multi_index/doc/index.html) seems to do exactly what you want.. You'll either have to use a library or effectively write one on your own. – dyp Jun 25 '14 at 19:48
  • (Obviously, you could store pointers instead of integers in the `set`, then you can easily access any properties of the objects in the comparator.) – dyp Jun 25 '14 at 19:50
  • *"because Cmp is being constructed by std::set and std::set expects a comparison operator with a constructor that has no arguments"* Huh? `std::set` has a ctor that takes a comparator (and an allocator). That is, you *can* pass a stateful comparator to this container.. – dyp Jun 25 '14 at 19:55
  • 1
    I know you mentioned the constraints on terms of big O complexity, but have you considered just a vector? If the data type is not to big and the number of elements not to many, there is a lot of very interesting material coming out that shows the contiguous memory layout to be superior in terms of actual time, you get to take advantage of the cache available. Your mileage may vary, measure before got commit. – Niall Jun 25 '14 at 19:58
  • Afaik, a set is meant to store unique keys; what you would need at least are multisets for each dimension (or are you certain that each element is uniquely projected on each dimension? Then you only need *one* set). Secondly, if you use multisets, you will retrieve lists of elements on each dimension, and you will need to cross them in order to find the element you are interested in. That will put you further away from that log(n) goal. – didierc Jun 25 '14 at 20:14
  • This is what you need: http://en.m.wikipedia.org/wiki/R-tree (just for search) and it's not trivial to implement. – didierc Jun 25 '14 at 20:16
  • **@Niall** - it starts with approximately 100,000 elements and goes up to twice that many. **@didierc** about set/multiset - look at the Cmp operator - all elements are unique – user3776658 Jun 25 '14 at 22:36
  • **@dyp** 1) `std::set` constructs the comparator and it won't pass anything to the comparator constructor. 2) I could store pointers but then I cannot keep crap in a vector (because vectors re-allocate). – user3776658 Jun 25 '14 at 22:45
  • @user3776658 There is a video (part of the keynote from Going Native 2012) in which Bjarne shows and discusses this. https://www.youtube.com/watch?v=YQs6IC-vgmo and http://bulldozer00.com/2012/02/09/vectors-and-lists/. The takeaway here is that `vector` can be more time-efficient than people think, but you'll need to make some measurements to check in your case. – Niall Jun 26 '14 at 06:58
  • @user3776658 The *default constructor* of `std::set` default-constructs the comparator. Look at the list of constructors in the Standard, the documentation of your Std Lib implementation or one of the (unofficial) [reference pages](http://en.cppreference.com/w/cpp/container/set/set) -- there is a constructor that takes a comparison function object. – dyp Jun 26 '14 at 09:31
  • @user3776658 I found another video I was looking for but couldn't find earlier; http://channel9.msdn.com/Events/Build/2014/2-661 Herb Sutter goes into some detail about this with very nice graphs, diagrams, explanations etc. from around the 23:30 mark, and he picks up Bjarne's material at around the 46:00 mark. They are talking about linear search being much faster almost always, for several hundreds of thousands of elements. – Niall Jun 26 '14 at 10:03
  • **@Niall** I saw both videos a while back; actually, that's one of the reasons why I assumed that all my elements would live in a vector. – user3776658 Jun 26 '14 at 19:56
  • **@dyp** _Obviously, you could store pointers instead of integers in the set_ Actually, this might be a good idea. Part of the problem is that I am assuming that elements live in a vector, and then the `Cmp` operator has to somehow know about this vector (so that it can find the n-th element in the vector). And apparently the type system will not allow that. What my enclosing class could do instead, is allocate a chunk of memory and store the elements in that chunk. `std::set`s could then store pointers to elements (rather than indexes). – user3776658 Jun 26 '14 at 20:04
  • **@dyp** With the pointer technique 1) I still have a nice memory layout (all elements are in a contiguous memory region) and 2) the `Cmp` operator does not have to know about the vector. On the negative side 1) pointers are somewhat uglier to debug than integers and 2) I have to do memory management myself. – user3776658 Jun 26 '14 at 20:08
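
The point dyp raises in the comments (std::set has a constructor that accepts a comparator object) can be sketched roughly as follows. This is only a minimal illustration under assumptions: the names Element, CmpByMetric and Table are hypothetical, the metric is selected by a runtime member rather than the question's template parameter, and the class is made non-copyable so that the pointer stored in each comparator stays valid.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical element type: ten metrics plus the "deleted" flag from the question.
struct Element {
    std::array<double, 10> coords;
    bool deleted = false;
};

// Stateful comparator: it carries a pointer to the vector it indexes into,
// so no global DATA is needed. Ties are broken by index, as in the question.
struct CmpByMetric {
    const std::vector<Element>* data;
    std::size_t metric;

    bool operator()(std::size_t a, std::size_t b) const {
        const double x = (*data)[a].coords[metric];
        const double y = (*data)[b].coords[metric];
        return x != y ? x < y : a < b;
    }
};

class Table {
public:
    Table() {
        // Each set is constructed with its own comparator instance.
        for (std::size_t m = 0; m < 10; ++m)
            index_.emplace_back(CmpByMetric{&data_, m});
    }
    // Copying would leave the comparators pointing at the source's vector.
    Table(const Table&) = delete;
    Table& operator=(const Table&) = delete;

    std::size_t insert(const Element& e) {
        data_.push_back(e);                    // indexes survive reallocation
        const std::size_t id = data_.size() - 1;
        for (auto& s : index_) s.insert(id);
        return id;
    }

    void erase(std::size_t id) {
        assert(id < data_.size());
        for (auto& s : index_) s.erase(id);    // O(log n) per set
        data_[id].deleted = true;              // the slot itself is kept
    }

    // Index of the live element with the smallest value of the given metric.
    std::size_t smallest(std::size_t metric) const {
        assert(!index_[metric].empty());
        return *index_[metric].begin();
    }

    std::size_t size() const { return index_[0].size(); }

private:
    std::vector<Element> data_;
    std::vector<std::set<std::size_t, CmpByMetric>> index_;
};
```

Because the sets store indexes rather than pointers, a reallocation of the vector does not invalidate them; the comparator only needs the address of the vector object itself, which is stable for the lifetime of the enclosing Table.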

1 Answer

When I read something like "look up foo by a value bar", my initial reaction is to use a map<> or something similar. There are some implications to this though:

  1. Keys in an std::map (or values in an std::set) are unique, so no two elements can share the same key and accordingly no two data objects would be able to have the same metric. If multiple data objects can have the same metric (this isn't clear from your question), using an std::multimap (or std::multiset) would work though.
  2. If the keys are constant and stored within the elements themselves, using a set<data*,cmp> is a common approach. The comparator then just retrieves the corresponding field from the objects and compares them. Lookup then requires creating a temporary object and using find() with it. Some implementations also have an extension that allows searching with a different type (C++14 standardizes this as heterogeneous lookup for comparators that define is_transparent), which would make this much easier but, as long as it is only an extension, also makes porting require actual work. It is important that the fields used as keys remain constant though, because if you modify them, you implicitly change the order of the set<>. This is the reason that a set<>'s elements are effectively constant, i.e. even a plain iterator has a constant as value type. If you store pointers though, you can easily get around that, because a constant pointer is something different than a pointer to a constant. Don't shoot yourself in the foot with that!
  3. If the metrics are not so much a property of the objects themselves (or you don't mind redundantly storing them), using an std::map would be a natural choice. Storing the same object under multiple keys, depending on the metric, can be done in separate containers (map<int,data*> c[10];). However, you can do that in a single map using e.g. a pair<metric,value> as key (map<pair<int,int>,data*> c;).
  4. Using a vector<> to store the actual elements and only referencing them as either pointers or indices in a map surely works. I'd take the pointers though, as this is what allows the above approaches using a set or map to work. Without that, the comparator would have to store a reference to the container, where at the moment it just uses the global DATA container. Getting pointers to work with a vector is tricky though, since it reallocates its elements when growing, as you correctly pointed out. I'd consider a different container type, like std::list or std::deque. The former would allow erasing elements, too, but it has a higher per-element overhead. The latter has a relatively low per-element overhead, only slightly above std::vector. You could then even go so far as to store iterators instead of pointers, which helps debugging provided you use a "checked STL" for that. Still, you will have to do some manual bookkeeping of which objects are still referenced somewhere and which ones aren't.
  5. Instead of using a separate container, you could also allocate the elements dynamically, although that itself has some overhead. If the overhead per element is not an issue, you could then use reference-counted smart pointers. If the application is a one-shot process, you could also use raw pointers and let the OS reclaim the memory on exit.
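
Points 1, 2 and 4 combined might look roughly like the sketch below. It is an illustration under assumptions, not a definitive implementation: Data, ByMetric, Index, Store, find_by_metric and erase_ptr are hypothetical names, only two of the ten metrics are indexed to keep it short, and std::multiset is used so that several objects may share a metric value. Elements live in a std::deque, whose push_back does not invalidate pointers to existing elements.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <deque>
#include <set>

// Hypothetical element type with ten integer metrics.
struct Data {
    std::array<int, 10> metric;
};

// Stateless comparator: orders pointers by metric C of the pointed-to object.
template <std::size_t C>
struct ByMetric {
    bool operator()(const Data* a, const Data* b) const {
        return a->metric[C] < b->metric[C];
    }
};

template <std::size_t C>
using Index = std::multiset<const Data*, ByMetric<C>>;

// Lookup via a temporary probe object, as described in point 2.
template <std::size_t C>
const Data* find_by_metric(const Index<C>& index, int value) {
    Data probe{};
    probe.metric[C] = value;
    auto it = index.find(&probe);
    return it == index.end() ? nullptr : *it;
}

// Removes exactly this pointer, not other objects with an equal metric.
template <std::size_t C>
void erase_ptr(Index<C>& index, const Data* p) {
    auto range = index.equal_range(p);
    for (auto it = range.first; it != range.second; ++it) {
        if (*it == p) { index.erase(it); return; }
    }
    assert(false && "pointer was not indexed");
}

// Note: copying a Store would leave the copy's indexes pointing into the
// original's storage; a real class would forbid or reimplement copying.
struct Store {
    std::deque<Data> storage;   // push_back keeps existing elements in place
    Index<0> by0;
    Index<1> by1;

    const Data* add(const Data& d) {
        storage.push_back(d);
        const Data* p = &storage.back();
        by0.insert(p);
        by1.insert(p);
        return p;
    }

    void remove(const Data* p) {   // the object itself stays in the deque
        erase_ptr<0>(by0, p);
        erase_ptr<1>(by1, p);
    }
};
```

The bookkeeping caveat from point 4 still applies: remove() only drops the index entries, so the deque keeps growing unless freed slots are tracked and reused separately.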

Note that I assume that storing multiple copies of the data objects is not an option. If it were, you could just as well have a map<int,data> m[10];, where each map stores its own copy of the data objects. All the bookkeeping issues would then be resolved, but at the price of a 10x memory overhead.
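
The single-container variant from point 3, keyed on a (metric index, metric value) pair, could be sketched like this. The names Data, Key, Index and in_range are hypothetical, and a std::multimap is assumed instead of a plain map so that two objects may share a value for the same metric.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Hypothetical payload; the metrics are stored redundantly in the index keys.
struct Data {
    int id;
};

using Key = std::pair<int, int>;                 // (metric index, metric value)
using Index = std::multimap<Key, const Data*>;

// All objects whose given metric lies in [lo, hi], in ascending metric order.
// Pairs compare lexicographically, so entries of one metric are contiguous.
std::vector<const Data*> in_range(const Index& idx, int metric, int lo, int hi) {
    assert(lo <= hi);
    std::vector<const Data*> out;
    auto first = idx.lower_bound(Key(metric, lo));
    auto last = idx.upper_bound(Key(metric, hi));
    for (auto it = first; it != last; ++it) out.push_back(it->second);
    return out;
}
```

Each object is inserted once per metric, for example idx.emplace(Key(0, 10), &a) for metric 0 with value 10, so removal has to erase one entry per metric as well.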

Ulrich Eckhardt
  • #1 is [false](http://www.cplusplus.com/reference/map/multimap/). As for #2, avoid raw pointers; use smart pointers. #3 is also [not completely accurate](http://www.cplusplus.com/reference/map/multimap/). As for #4, that is an X-Y solution. Also, erasing elements "works" in other containers, but not as quickly. – Qix - MONICA WAS MISTREATED Jun 25 '14 at 19:52
  • Thank you for your comments, @Qix, indeed the wording in #3 was missing. The rest of the points you raised are themselves a bit unclear though, in particular the first one looks as if you hadn't taken the time to comprehend all of what I wrote. – Ulrich Eckhardt Jun 25 '14 at 19:57
  • I'd re-word some of it then. Some of your points seem to be tacked on. – Qix - MONICA WAS MISTREATED Jun 25 '14 at 20:00
  • Well, not so tacked-on that you can't clearly refute them as "false". Seriously, can you be any more specific than that? – Ulrich Eckhardt Jun 25 '14 at 20:06
  • None of your points make any sense. 1) all elements are guaranteed to be unique, check the `Cmp` operator, I posted it for a reason 2) true but not relevant to the question, you could just as well argue I should not free the memory twice 3) ??? 4) You would take pointers or iterators to elements in a vector? So that it's harder to debug? And so that the whole thing blows up as soon as the vector reallocates? And you finish off by saying that I should encapsulate the whole thing properly. I thought the whole question is about how to encapsulate this (and if it is at all doable) – user3776658 Jun 25 '14 at 22:56
  • I'll try to clarify these points. That said, do you have any idea how rude a reply like "None of your points make any sense." comes across? – Ulrich Eckhardt Jun 27 '14 at 17:39