Why multiset keeps separate instances of repeated elements instead of their count?

Question

I recently found out that multiset<T> implementation in STL actually keeps different copies of the same repeated elements in the tree. My expectation before was for it to internally use a map<T, int> and just keep the count of the repeated elements.

What is the scenario where this implementation can be beneficial compared to just keeping the count? Is there any use case for multiset where the code will break if the internal implementation changes? Or is there any operation which will increase in complexity if it changes?

I was wondering what is the thought process behind that choice?

Two objects might be considered equivalent for the purpose of storage in a set, but still distinguishable from each other. You could, for example, define the comparator for the set to only compare certain members of a class. — ph3rin, Jul 14 '21 at 05:14
Because otherwise it would be much more complicated to iterate over such a container. — Igor R., Jul 14 '21 at 05:24
@ÖöTiib That is what I did after I saw how it acts, just was curious what is the logic behind it and the answer helped me with that. — Scarlet, Jul 14 '21 at 14:15

463035818_is_not_an_ai · Accepted Answer · 2021-07-14T07:18:02.043

By default std::multiset uses operator< to decide if two elements are equivalent (note: equiavent not equal!). Now consider this element type:

struct foo { 
    int x;
    int y;
    bool operator<(const foo& other) { return x < other.x; }
    bool operator==(const foo& other) { return (x==other.x) && (y==other.y);
};

Two elements are considered equivalent when !(a < b) && !(b < a), that does in general not imply that a == b. Hence, storing only the count is not sufficient.

Moreoever, even equality does not imply identity (a==b does not imply &a==&b). As the container owns its elements, two elements in the container cannot possibily be identical, they are always two different objects. Also, in the above foo we could add a member z that is neither considered for operator< nor for operator== but can be accessed via a getter.

The reason sorted containers typically check equivalence not equality is that they need a way to compare elements anyhow (they are sorted). Being able to compare two elements via < is sufficient to check equivalence, while requiring == in addition would make the container less generic without apparent gain. If you like you can always use a comparator that is such that equivalence does imply equality (in the above example: compare also y in operator<).

As already mentioned in a comment, if you want to have the behavior you describe you could use a std::map<foo,size_t> (uses operator< as default comparator for the keys as well). Instead of only storing the foos you can store a pair of foo and the count. Now, because two foos with same x are considered equivalent, they do map to the same key:

 foo f,g;
 f.x = 42; f.y = 0;
 g.x = 42; g.y = 100;
 std::map<foo,size_t> counter;
 ++counter[f];
 ++counter[g];

As a result, counter will hold 1 element: a copy of f and the count for that key will be 2.

Thank you for the through answer, I hadn't thought about the equality vs identity. But do you think CPP can have exception to the template for primitive types the same way it does for ```vector```? In primitive types equality does mean identity, right? — Scarlet, Jul 14 '21 at 14:12
@Scarlet what exception would you like to have? The actual reason `multiset` stores multiple copies of equal elements, is because thats what it is designed for. Also `vector` turned out to be a very bad idea. I hope there won't be more mistakes like `vector`. — 463035818_is_not_an_ai, Jul 14 '21 at 14:17
My point is for a primitive type (as far as I can tell), there is no way to tell from the outside if the container is actually keeping multiple copies or not. So the ```mulitset``` can do exactly what it does with strictly better performance in those cases. (Not criticizing CPP, just trying to learn the logic and thought process) — Scarlet, Jul 14 '21 at 14:23
@Scarlet your reasoning is almost right, you just miss that when you are using a `multiset` then you are most likely using the wrong container. As you realized the way a mutliset works this would be rather inefficient, but you can use a `std::map` to count integers (see answer). If you look carefully at the interface of `multiset` you will notice that it isnt possible to implement it the way you want, as an example https://en.cppreference.com/w/cpp/container/multiset/equal_range. — 463035818_is_not_an_ai, Jul 14 '21 at 14:28

score 3 · Answer 2 · answered Jul 14 '21 at 06:52

A handful of situations where this could be an issue:

Your comparator doesn't distinguish between objects that are not equal.

Let's say I have a bunch of requests that need to be filled in order of their priority. I might write a comparator that just compares the priority of the requests. I would be unhappy if the result was just a structure which counted how many requests of each priority there were, with no way of accessing the actual requests.

Your multiset needs to own the objects

If the multiset holds actual objects, rather than numbers or pointers, what happens if you insert a new object that compares equal to an existing one? Should we just destruct the object that's already in there? The one we're adding? Should it be an expected consequence of inserting something into a collection that it might destruct immediately, or that it might destruct at some point in the future as a consequence of something being inserted?

(To be fair this can actually happen with containers when they reallocate, but it won't happen to objects that can be moved. And of course, there are objects that can be moved but not copied, e.g. objects that own a buffer.)

You want your multiset to be efficient

Comparators tell us whether one object is less than another. In order to determine whether two objects are equal, we'd have to do a second comparison, checking whether a < b and then whether b < a. The existing implementation avoids having to do this second check. Might not make a big performance difference in many cases, but if your data involves a lot of duplicates of the same few equivalence classes, this could essentially slow your code down by a factor of 2.

Why multiset keeps separate instances of repeated elements instead of their count?

2 Answers2