
I have the following program. I built it with gcc-4.9.2 under linux. My questions are:

1) Why does the hashtable seem to be sorted the first time around, but loses the sort after the items are deleted from value?

2) How do I walk the hashtable by key myself and say std::cout each item that hashes to a bucket, e.g., the code in the #if 0 #endif section?

#include <vector>
#include <iostream>
#include <string>
#include <algorithm>
#include <functional>

#include <boost/intrusive/unordered_set.hpp>

namespace bic = boost::intrusive;

std::hash<std::string> hash_fn;

struct MyClass : bic::unordered_set_base_hook<bic::link_mode<bic::auto_unlink>>
{
    std::string name;
    int anInt1;
    mutable bool bIsMarkedToDelete;

    MyClass(std::string name, int i) : name(name), anInt1(i), bIsMarkedToDelete(false) {}

    bool operator==(MyClass const& o) const
    {
        //return anInt1 == o.anInt1 && name == o.name;
        return name == o.name;
    }

    struct hasher
    {
        size_t operator()(MyClass const& o) const
        {
            return o.anInt1;
            //return hash_fn(o.name);
        }
    };
};

std::ostream& operator << (std::ostream& out, const MyClass& ac)
{
    std::cout << ac.name << " " << ac.anInt1;

    return out;
}

typedef bic::unordered_set<MyClass, bic::hash<MyClass::hasher>, bic::constant_time_size<false> > HashTable;

int main()
{
    std::vector<MyClass> values
    {
        MyClass { "John",     0 },
        MyClass { "Mike",     0 },
        MyClass { "Dagobart", 25 },
        MyClass { "John",     5 },
        MyClass { "Mike",     25 },
        MyClass { "Dagobart", 26 },
        MyClass { "John",     10 },
        MyClass { "Mike",     25 },
        MyClass { "Dagobart", 27 },
        MyClass { "John",     15 },
        MyClass { "Mike",     27 }
    };

    HashTable::bucket_type buckets[100];
    HashTable hashtable(values.begin(), values.end(), HashTable::bucket_traits(buckets, 100));

    std::cout << "\nContents of std::vector<MyClass> values\n";

    for(auto& e: values)
        std::cout << e << " ";

    std::cout << "\nContents of HashTable hashtable\n";

    for(auto& b : hashtable)
        std::cout << b << '\n';

#if 0 // This code won't compile since there is no operator [] for hashtable
    for(int bucket = 0; bucket < 27; bucket++)
    {
        auto hit(hashtable[bucket].rbegin());
        auto hite(hashtable[bucket].rend());

        while (hit != hite)
        {
            MyClass mc = *hit;

            std::cout << mc << " ";

            hit++;
        }

        std::cout << '\n';
    }
#endif // 0

    std::cout << '\n';
    std::cout << "values size first " << values.size() << '\n';
    std::cout << "hash size fist " << hashtable.size() << '\n';

    for(auto& e: values)
        e.bIsMarkedToDelete |= ("Mike" == e.name);

    std::cout << "removing all bIsMarkedToDelete";
    for(auto& e: values)
        if(e.bIsMarkedToDelete)
            std::cout << e << " ";

    std::cout << '\n';

    values.erase(
        std::remove_if(std::begin(values), std::end(values), std::mem_fn(&MyClass::bIsMarkedToDelete)),
                       std::end(values));

    std::cout << "values size now " << values.size() << '\n';
    std::cout << "hash size now " << hashtable.size() << '\n';

    std::cout << "Contents of value after removing elements " << '\n';
    for(auto& e: values)
        std::cout << e << " ";

    std::cout << "\nContents of HashTable hashtable after delete Mike\n";

    for(auto& b : hashtable)
        std::cout << b << '\n';

    std::cout << '\n';

    values.clear();

    std::cout << values.size() << '\n';
    std::cout << hashtable.size() << '\n';

    std::cout << "Done\n";

    int j;
    std::cin >> j;
}
Ivan
  • Please make your question titles more informative. Also, please try to restrict yourself to one question per SO question. – Pradhan Mar 23 '15 at 02:10
  • Will do. The really important question is #2. Question #1 is mostly out of curiosity – Ivan Mar 23 '15 at 02:16
  • You're using `std::cout` instead of `out` in `operator<<` – sehe Mar 23 '15 at 02:21

1 Answer


Your hash and equality are inconsistent, and as such you violate the container invariants:

bool operator==(MyClass const& o) const
{
    //return anInt1 == o.anInt1 && name == o.name;
    return name == o.name;
}

struct hasher
{
    size_t operator()(MyClass const& o) const
    {
        return o.anInt1;
        //return hash_fn(o.name);
    }
};

This would be fine if and only if elements that compare equal always produced the same hash. Alas, they don't: equality looks only at name, the hash looks only at anInt1, so e.g. "Mike" hashes to 3 different values:

    MyClass { "Mike",     0  },
    MyClass { "Mike",     25 },
    MyClass { "Mike",     25 },
    MyClass { "Mike",     27 }
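
To make the requirement concrete, here is a minimal sketch that checks "a == b implies hash(a) == hash(b)" for one pair of elements. It does not use Boost.Intrusive; `Person`, `HashByInt`, `HashByName` and `consistent` are hypothetical names made up for illustration, standing in for `MyClass` and its hasher.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for MyClass; not the Boost.Intrusive type.
struct Person {
    std::string name;
    int anInt1;
};

// Equality on name only, as in the question.
inline bool operator==(Person const& a, Person const& b) { return a.name == b.name; }

// Inconsistent with the equality above: equal Persons can hash differently.
struct HashByInt {
    std::size_t operator()(Person const& p) const {
        return static_cast<std::size_t>(p.anInt1);
    }
};

// Consistent: the hash uses exactly the field that equality uses.
struct HashByName {
    std::size_t operator()(Person const& p) const {
        return std::hash<std::string>()(p.name);
    }
};

// Checks "a == b implies hash(a) == hash(b)" for a single pair.
template <class Hash>
bool consistent(Person const& a, Person const& b, Hash h = Hash()) {
    return !(a == b) || h(a) == h(b);
}
```

With `Person{"Mike", 0}` and `Person{"Mike", 25}`, `consistent<HashByInt>` reports a violation while `consistent<HashByName>` does not. The reverse combination (equality on both fields, hash on one of them) is also fine, because equal elements still hash equally.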

1) Why does the hashtable seem to be sorted the first time around, but loses the sort after the items are deleted from value?

I'm trying to see what you mean, but the output of the program is:

Contents of std::vector<MyClass> values
John Mike Dagobart John Mike Dagobart John Mike Dagobart John Mike 
Contents of HashTable hashtable
Mike 0
John 0
John 5
John 10
John 15
Mike 25
Dagobart 25
Dagobart 26
Mike 27
Dagobart 27

values size first 11
hash size fist 10
removing all bIsMarkedToDeleteMike Mike Mike Mike 
values size now 7
hash size now 7
Contents of value after removing elements 
John Dagobart John Dagobart John Dagobart John 
Contents of HashTable hashtable after delete Mike
Dagobart 25
John 0
Dagobart 26
John 15
John 10
John 5
Dagobart 27

0
0
Done

I have to assume the "first time around" refers to the part under "Contents of HashTable hashtable". Indeed, if you look closely, that output appears to be "sorted by bucket". That makes sense if the container is iterated bucket by bucket.

The fact that it no longer looks sorted after removal might have an awful lot to do with the fact that your hash/equality implementations don't match (see above).

2) How do I walk the hashtable by key myself and say std::cout each item that hashes to a bucket, e.g., the code in the #if 0 #endif section?

There's no direct (public API) way. You can build a map for debug purposes by using hashtable.bucket(key) though:


#include <vector>
#include <iostream>
#include <string>
#include <algorithm>
#include <iterator>
#include <map>
#include <functional>

#include <boost/intrusive/unordered_set.hpp>

namespace bic = boost::intrusive;

std::hash<std::string> hash_fn;

struct MyClass : bic::unordered_set_base_hook<bic::link_mode<bic::auto_unlink>>
{
    std::string name;
    int anInt1;
    mutable bool bIsMarkedToDelete;

    MyClass(std::string name, int i) : name(name), anInt1(i), bIsMarkedToDelete(false) {}

    bool operator==(MyClass const& o) const
    {
        return anInt1 == o.anInt1 && name == o.name;
    }

    struct hasher
    {
        size_t operator()(MyClass const& o) const
        {
            return hash_fn(o.name);
        }
    };
};

std::ostream& operator << (std::ostream& out, const MyClass& ac) {
    return out << ac.name << " " << ac.anInt1;
}

typedef bic::unordered_set<MyClass, bic::hash<MyClass::hasher>, bic::constant_time_size<false> > HashTable;

int main()
{
    std::vector<MyClass> values {
        MyClass { "Dagobart", 25 },
        MyClass { "Dagobart", 26 },
        MyClass { "Dagobart", 27 },
        MyClass { "John",     0  },
        MyClass { "John",     10 },
        MyClass { "John",     15 },
        MyClass { "John",     5  },
        MyClass { "Mike",     0  },
        MyClass { "Mike",     25 },
        MyClass { "Mike",     25 },
        MyClass { "Mike",     27 }
    };

    HashTable::bucket_type buckets[100];
    HashTable hashtable(values.begin(), values.end(), HashTable::bucket_traits(buckets, 100));

    std::cout << "\nDebugging buckets of hashtable\n";

    std::multimap<size_t, MyClass const*> debug_map;
    std::transform(hashtable.begin(), hashtable.end(), 
            std::inserter(debug_map, debug_map.end()), 
            [&](MyClass const& mc) { return std::make_pair(hashtable.bucket(mc), &mc); }
        );

    for (auto& entry : debug_map)
        std::cout << "Debug bucket: " << entry.first << " -> " << *entry.second << "\n";
}

Prints

Debugging buckets of hashtable
Debug bucket: 16 -> Mike 27
Debug bucket: 16 -> Mike 25
Debug bucket: 16 -> Mike 0
Debug bucket: 21 -> Dagobart 27
Debug bucket: 21 -> Dagobart 26
Debug bucket: 21 -> Dagobart 25
Debug bucket: 59 -> John 5
Debug bucket: 59 -> John 15
Debug bucket: 59 -> John 10
Debug bucket: 59 -> John 0

Of course the output depends on the actual implementation of std::hash<std::string> and the tuning of the hash-table.
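
For comparison, the standard unordered containers do expose a bucket interface (`bucket_count()`, `bucket_size(n)`, `begin(n)`/`end(n)`). The sketch below walks the buckets of a `std::unordered_multiset`, not of the Boost.Intrusive container from the question; `IdentityHash`, `IntTable` and `dump_buckets` are hypothetical names, and whether the same approach carries over to `bic::unordered_set` should be checked against its documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <unordered_set>

// Identity hash, so the hash value of an int is the int itself.
struct IdentityHash {
    std::size_t operator()(int v) const { return static_cast<std::size_t>(v); }
};

typedef std::unordered_multiset<int, IdentityHash> IntTable;

// Prints every non-empty bucket and returns how many elements were visited.
std::size_t dump_buckets(IntTable const& table, std::ostream& out) {
    std::size_t visited = 0;
    for (std::size_t b = 0; b < table.bucket_count(); ++b) {
        if (table.bucket_size(b) == 0) continue;  // skip empty buckets
        out << "bucket " << b << ":";
        for (auto it = table.begin(b); it != table.end(b); ++it) {
            out << ' ' << *it;
            ++visited;
        }
        out << '\n';
    }
    return visited;
}
```

Calling `dump_buckets(table, std::cout)` on a table that contains a large value such as 2147483647 shows that the bucket index is always below `bucket_count()`, even though the hash is not. The bucket numbers themselves vary between standard library implementations.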

sehe
  • sehe, yep, thanks. Uncommenting the code in operator== seems to do better: `return anInt1 == o.anInt1 && name == o.name;`. Leaving the code in hasher seems to be OK: `return o.anInt1;` – Ivan Mar 23 '15 at 02:27
  • BTW, do you know how I can do question #2? In other words, how can I get the "spirit" of the commented out code between #if 0 and #endif to work? – Ivan Mar 23 '15 at 02:27
  • I've been giving that some time. There's no direct (public API) way. You can build a map for debug purposes by using `hashtable.bucket(key)` though. – sehe Mar 23 '15 at 02:28
  • Hmm, gotcha. In your suggestion http://stackoverflow.com/questions/26857832/trying-to-learn-boostintrusive-q2 how do you propose I walk that hashtable? The items hanging off your arrows sorted by bucket? Sorry I may have misunderstood that there is no way? – Ivan Mar 23 '15 at 02:35
  • @Ivan I've just added a demo implementation of that (also **[Live On Coliru](http://coliru.stacked-crooked.com/a/f96954cf72b79bdd)**). Note that all of this is implementation detail (it's not usually meaningful to traverse buckets of a hash table, although even the `std::unordered_set/map<>` seem to have fallen into the trap of exposing implementation detail like this (meaning that no standard implementation could e.g. employ _open addressing_ instead of bucket lists)) – sehe Mar 23 '15 at 02:36
  • Ok this seems to work. I had to change the hasher to return o.anInt1; so that the bucket represents the actual value in anInt – Ivan Mar 23 '15 at 02:42
  • The problem is that buckets != hashes. Furthermore, that change introduced undefined behaviour (unless you _also_ make equality compare _only_ `anInt1`). By the way, I've fixed a little cosmetic issue so that the debug-map doesn't show false duplicates (because they're in `values`) – sehe Mar 23 '15 at 02:44
  • How the hash corresponds to the bucket is implementation defined. It might be 1:1, but it most likely will not be. (Imagine what bucket would be chosen if the hash is 2^31-1) – sehe Mar 23 '15 at 02:45
  • sehe, if I call std::transform(values.begin(), values.end(), std::inserter(debug_map, debug_map.end()), [&](MyClass const& mc) { return std::make_pair(hashtable.bucket(mc), &mc); } ); before deleting "Mike" then again after deleting "Mike" I get nonsense? – Ivan Mar 23 '15 at 02:50
  • What is the question? Does **[this demo](http://coliru.stacked-crooked.com/a/84450d97104f5d8b)** help? – sehe Mar 23 '15 at 02:54
  • sehe, in regards to buckets != hashes, in the link above where you gave the diagram, you say "anInt1 is the hash (the bucket identifier) for an element" However, if I just use anInt as the return value from bool operator== (to coincide with hasher), it hides all the "Mikes" – Ivan Mar 23 '15 at 02:58
  • I'd read up on hashtables once more. Really the short summary is: you want to worry about logical equality only in your code. Let the container implementation deal with buckets (if it uses them at all, indeed). You just provide a hash function that has a good spread and don't worry about buckets. **PS** Note I _never_ said "(the bucket identifier)" because the hash is not the bucket identifier. It's just one bit of information that the implementation can use to locate the matching element quickly, before resorting to the `==` comparison. – sehe Mar 23 '15 at 03:02
  • In order for this to work, I absolutely need that anything that is hanging off of a bucket in a hash represent anInt in MyClass. Otherwise the suggestion on the previous link with your diagram won't work. – Ivan Mar 23 '15 at 03:02
  • It's unclear to me **a.** what _"anything that is hanging off of a bucket in a hash represent anInt in MyClass"_ means (please reword more precisely, preferably shorter) **b.** what suggestion you are referring to. – sehe Mar 23 '15 at 03:05
  • If you just want a data structure like that, why don't you code it yourself? Have a `std::map >` and make it obey your requirements. The problem seems to be that you want control/access to the exact implementation detail. The usual way is to write it. – sehe Mar 23 '15 at 03:07
  • That said, if you can make sure that `hash` returns `[0..nrbuckets)` then you might be safe. It requires testing though, and you might even want to _read_ the code of `intrusive_set` to make sure this assumption holds – sehe Mar 23 '15 at 03:09
  • sehe, the whole idea in my question a bit back that you gave the diagram for, was that the intrusive container would sort the contents of a [very likely] unsorted std::vector. So the std::vector allowed me to find all the items of MyClass that had the same name fast, and the intrusive container would sort all the contents of MyClass by anInt. Two containers performing two very different duties on the same underlying items. Also, if I deleted the items from the std::vector they would also automatically disappear from the intrusive container. Those are the design requirements. – Ivan Mar 23 '15 at 03:14
  • LOL! I am confused. :-))) – Ivan Mar 23 '15 at 03:16
  • Okay. It's probably better to focus on something else here: hash-tables are known as _unordered_ containers for a reason. You don't use them to _sort_ by a hash. Because that's not what a hash is. (A hash is an unpredictable, but repeatable, value with some good distribution characteristics that make it possible to "partially content-address" any item in a set.). In that answer I just noticed the structural similarity of the data structure you were after, and I'm not sure it was clear to me you intended to access the data bucket-by-bucket. In this case I'd write the datastructure ... – sehe Mar 23 '15 at 03:19
  • ... like I just mentioned before or I'd rethink the problem. If it's about indexing, is Boost MultiIndex not the better match? MultiIndex can work nicely if the element type is a (smart) pointer too, so you can still own the element data in the vector like with Boost Intrusive. – sehe Mar 23 '15 at 03:20
  • What is funny is, I wrote it the way you said I should not write it because it was the way I sort of understood it [very naive] and I won the lottery [it is incredible that it works but I am scared because I am not sure it __should__ work] because that way "works" [appearances]. I tried this way because it seemed sooo much simpler and cleaner and I wanted to understand what you were thinking... – Ivan Mar 23 '15 at 03:23
  • All that said. it's probably good to go back to the actual goal (strip the [XY problem](http://meta.stackoverflow.com/questions/66377/what-is-the-xy-problem)). Perhaps you can ask a question about that when you implemented the simplest thing that could possibly work. That way you'll learn more organically, instead of running into surprise brick walls when trying to take advanced "shortcuts" that may not actually fit the whole goal well :( I'm sorry if my answers caused any confusion. Let's go back to the drawing board and _start simple_ – sehe Mar 23 '15 at 03:26
  • Ok I will start a new question by really simplifying the question. Although I have to be clever because there are _three_ _design_ _requirements_ as stated in six comments above, and the THOUGHT POLICE around here wants me to keep my questions to one per post. – Ivan Mar 23 '15 at 03:31
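
As an illustration of the "code it yourself" suggestion from the comments, here is a minimal sketch of an index that groups elements by `anInt1` in sorted key order. It is one possible shape, not the structure sehe had in mind: `Item`, `IndexByInt` and `build_index` are hypothetical names, the record has no intrusive hook, and the index stores positions into the vector, so it must be rebuilt after erasing from the vector (there is no auto-unlink behaviour here).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical plain record standing in for MyClass (no intrusive hook).
struct Item {
    std::string name;
    int anInt1;
};

// Maps each anInt1 to the positions in `values` that carry it.
// std::map keeps its keys sorted, so iteration goes in anInt1 order.
typedef std::map<int, std::vector<std::size_t> > IndexByInt;

IndexByInt build_index(std::vector<Item> const& values) {
    IndexByInt index;
    for (std::size_t i = 0; i < values.size(); ++i)
        index[values[i].anInt1].push_back(i);
    return index;
}
```

Iterating the returned map visits the groups in ascending `anInt1` order, which is the ordering a hash table cannot promise.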