5

One of consequences of the IEEE 754 standard is the non-intuitive behavior of std::unordered_set<double>, when not-a-number elements (NANs) are inserted.

Due to the fact that NAN!=NAN, after the following sequence:

#include <iostream>
#include <cmath>
#include <unordered_set>

int main(){
    std::unordered_set<double> set;
    set.insert(NAN);
    set.insert(NAN);
    std::cout<<"Number of elements "<<set.size()<<"\n";  //there are 2 elements!
}

there are two elements in the set(see it live): NAN and NAN!

Mine main issue with this is, that when N NANs are inserted into the hash-set, they all hit the same hash-bucket and the performance of N inserts into the hash-set degenerates to the worst-case running time - O(N^2).

For an example, see the listing at the end of the question or here live - inserting NAN takes some order of magnitude more time than a "normal" floating number.

My question: is it possible (and if yes - how) to tweak std::unordered_set<double> in such a way, that there is at most one NAN-element in the set, no matter the flavor of inserted NANs (NAN, -NAN and so on)?


Listing:

#include <iostream>
#include <cmath>
#include <unordered_set>
#include <chrono>

constexpr int N=5000;
void test_insert(double value)
{
    std::unordered_set<double> s;
    auto begin = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; i++) {
        s.insert(value);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "Duration: " << (std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / 1e9) << "\n";
    std::cout << "Number of elements: "<<s.size()<<"\n";
}

int main(){
    std::cout << "Not NAN\n";
    test_insert(1.0);           //takes 0.0001 s
    std::cout << "NAN\n";
    test_insert(NAN);           //takes 0.2 s
}
ead
  • 32,758
  • 6
  • 90
  • 153
  • 2
    Third template parameter of unordered_set is the equal predicate, you can probably get desired behaviour by supplying your own. – Eelke Jun 26 '18 at 12:11
  • 3
    Note that generally using floating point numbers as keys might be risky. For instance, inserting `1.0e50` and `1.0e50+1.0` results in a single elements in the `std::set`. One can find many such examples. – Daniel Langr Jun 26 '18 at 12:27
  • @Eelke I don't think this will be enough: -NAN will have a different hash-value than NAN but for hash-function h must hold: if x==y then also h(x)==h(y) – ead Jun 26 '18 at 12:28
  • Indeed you need to supply custom hashing to, it is the second parameter. – Eelke Jun 26 '18 at 12:59
  • 1
    BTW, using `std::unordered_set` (without a custom equality function) invokes undefined behavior because `std::equal_to` is not an equivalence relation for the precise reason you mention. Similar reasoning applies to all standard associative containers. – Arne Vogel Jun 26 '18 at 14:23

2 Answers2

7

You can define your own predicate to compare the keys, and ensure NaNs compare equal for this purpose. This can be supplied as the third parameter to the std::unordered_set class template.

See the definition of EqualPred below (code copied from question and modified), and its use in declaring the unordered_set variable. I took the default for the second parameter from the documentation at https://en.cppreference.com/w/cpp/container/unordered_set

Live demo: http://coliru.stacked-crooked.com/a/7085936431e6698f

#include <iostream>
#include <cmath>
#include <unordered_set>
#include <chrono>

struct EqualPred
{
    constexpr bool operator()(const double& lhs, const double& rhs) const
    {
        if (std::isnan(lhs) && std::isnan(rhs)) return true;
        return lhs == rhs;
    }
};

constexpr int N=5000;
void test_insert(double value)
{
    std::unordered_set<double, std::hash<double>, EqualPred> s;
    auto begin = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; i++) {
        s.insert(value);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "Duration: " << (std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / 1e9) << "\n";
    std::cout << "Number of elements: "<<s.size()<<"\n";
}

int main(){
    std::cout << "Not NAN\n";
    test_insert(1.0);           //takes 0.0001 s
    std::cout << "NAN\n";
    test_insert(NAN);           //takes 0.2 s
}

It is worth noting (thanks to @ead's comment) that -NaN and +NaN may hash to different values. If you want to handle these as identical, you'll need to provide a different implementation of the second template parameter, the hash function. This should detect any NaNs and hash the same NaN each time.

Andrew
  • 5,212
  • 1
  • 22
  • 40
  • I think the problem with this solution: `-NAN` will have a different hash-value then `NAN` but for hash-function `h` must hold: if `x==y` then also `h(x)==h(y)`. – ead Jun 26 '18 at 12:22
  • @ead: In that case, you also need to specialise the second parameter, the hash, to hash the same NaN for any input NaNs. Note that `x == y` doesn't hold true if x, y or both are NaN, anyway. – Andrew Jun 26 '18 at 13:44
  • But you redefine the relation `x==y` with your predicate, i.e. `x==y` means `EqualPred(x,y)` for the hash-set. – ead Jun 26 '18 at 13:56
  • 1
    @ead, when inserting equality is only checked when the hash are the same value (which can happen even for different values). It then searches the bucket of same hash values using the equality comparator to see if the value already exists. If there is no other hash with the same value, there is no need to check equality, as its implicitly not equal to anything already in the set due to its unique hash value. – rmawatson Jun 26 '18 at 14:43
3

From your comment in Andrews answer,

I think the problem with this solution: -NAN will have a different hash-value then NAN but for hash-function h must hold: if x==y then also h(x)==h(y)

This does hash differently, so you need to also define your own hash function if you want h(-NAN) == h(NAN) ...

(augmented from @Andrew answer)

#include <iostream>
#include <cmath>
#include <unordered_set>
#include <chrono>

struct EqualPred
{
    constexpr bool operator()(const double& lhs, const double& rhs) const
    {
        if (std::isnan(lhs) && std::isnan(rhs)) return true;
        return lhs == rhs;
    }
};

template <typename T>
struct Hash
{
    size_t operator()(const T& value) const
    {
        return  std::hash<T>()( std::isnan(value) ? NAN : value);
    }


};

std::unordered_set<double, Hash<double>, EqualPred> s;

constexpr int N=5000;
void test_insert(double value)
{

    auto begin = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; i++) {
        s.insert(value);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "Duration: " << (std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / 1e9) << "\n";
    std::cout << "Number of elements: "<<s.size()<<"\n";
}

int main(){
    std::cout << "Not NAN\n";
    test_insert(1.0);           //takes 0.0001 s
    std::cout << "NAN\n";
    test_insert(NAN);  
    test_insert(-NAN);

    std::cout << s.size() << std::endl;
    //takes 0.2 s
}

Demo

rmawatson
  • 1,909
  • 12
  • 20
  • This is what I have being looking for! – ead Jun 26 '18 at 13:39
  • Wouldn't it be better (in that it makes fewer assumptions) to let it store `-NAN` distinct from `NAN`? In that case it could use `std::hash` but then a custom `KeyEqual` that ensures `0 == -0` but otherwise does bitwise equality? – Ben Apr 23 '19 at 17:21