implementing a hash table-like data structure with floating point keys where values within a tolerance are binned together

Question

I need an associative data structure with floating point keys in which keys with nearly equal values are binned together. I'm working in C++, but the language doesn't really matter.

My current strategy is to:

1. Handle only single-precision floating point numbers.
2. Use an unordered_map with a custom key type.
3. Define the hash function on the key type as follows:

   a. Given a float v, divide v by some tolerance, such as 0.0005, at double precision, yielding k.

   b. Cast k to a 64-bit integer, yielding ki.

   c. Return std::hash of ki.

First of all, is there a standard named data structure that does something like this? If not is there a better way to do this than my general approach?

The main thing I do not like about the following implementation is that it is unintuitive which floating point values will be binned together. I cope with this by having a general sense of which values in my input I want to count as the same value and testing various tolerances. It would be nice if adding 12.0453 to the container meant that values within 12.0453 +/- 0.0005 were considered equal when the tolerance parameter is 0.0005, but this is not the case. I don't think such behavior would even be possible on top of unordered_map, because the hash function would then depend on the values already in the table.

Basically, my implementation divides the number line into a 1D grid in which each grid cell is epsilon units wide, and then assigns floating point values to the zero-based index of the grid cell they fall into. My question is: is there a better way to implement an associative container of floating point values with tolerance that is also O(1)? And are there problems with the implementation below?

    #include <functional>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    template<typename V, int P=4>
    class float_map
    {
    private:
        struct key {
        public:
            long long val;

            static constexpr double epsilon(int digits_of_precision)
            {
                return (digits_of_precision == 1) ? 0.5 : 0.1 * epsilon(digits_of_precision - 1);
            }

            static constexpr double eps = epsilon(P);

            key(float fval) : val(static_cast<long long>( fval / eps))
            {}

            bool operator==(key k) const {
                return val == k.val;
            }
        };

        struct key_hash
        {
            std::size_t operator()(key k) const {
                return std::hash<long long>{}(k.val);
            }
        };

        std::unordered_map<key, V, key_hash> impl_;

    public:
        V& operator[](float f) {
            return impl_[key(f)];
        }

        const V& at(float f) const {
            return impl_.at(key(f));
        }

        bool contains(float f) const {
            return impl_.find(key(f)) != impl_.end();
        }

        double epsilon() const {
            return key::eps;
        }
    };

    int main()
    {
        float_map<std::string> test;

        test[12.0453f] = "yes";

        std::cout << "epsilon = " << test.epsilon() << std::endl;                             // 0.0005

        std::cout << "12.0446f => " << (test.contains(12.0446f) ? "yes" : "no") << std::endl; // no
        std::cout << "12.0447f => " << (test.contains(12.0447f) ? "yes" : "no") << std::endl; // no
        std::cout << "12.0448f => " << (test.contains(12.0448f) ? "yes" : "no") << std::endl; // no
        std::cout << "12.0449f => " << (test.contains(12.0449f) ? "yes" : "no") << std::endl; // no
        std::cout << "12.0450f => " << (test.contains(12.0450f) ? "yes" : "no") << std::endl; // yes
        std::cout << "12.0451f => " << (test.contains(12.0451f) ? "yes" : "no") << std::endl; // yes
        std::cout << "12.0452f => " << (test.contains(12.0452f) ? "yes" : "no") << std::endl; // yes
        std::cout << "12.0453f => " << (test.contains(12.0453f) ? "yes" : "no") << std::endl; // yes
        std::cout << "12.0454f => " << (test.contains(12.0454f) ? "yes" : "no") << std::endl; // yes
        std::cout << "12.0455f => " << (test.contains(12.0455f) ? "yes" : "no") << std::endl; // yes
        std::cout << "12.0456f => " << (test.contains(12.0456f) ? "yes" : "no") << std::endl; // no
        std::cout << "12.0457f => " << (test.contains(12.0457f) ? "yes" : "no") << std::endl; // no
        std::cout << "12.0458f => " << (test.contains(12.0458f) ? "yes" : "no") << std::endl; // no
        std::cout << "12.0459f => " << (test.contains(12.0459f) ? "yes" : "no") << std::endl; // no
        std::cout << "12.0460f => " << (test.contains(12.0460f) ? "yes" : "no") << std::endl; // no

    }

May I ask why the keys need to be floating point? I believe that will help me answer the question — Paul Renton, Nov 07 '19 at 23:16
Because floating point data is what I have. I'm working with data I did not create. I need to be able to associate data structures with 2D points with single precision x and y, and I need to be able to look up points in O(1) time. However, I do not want to assume that if I see (10.3333, 4.0) in the data, the same point will never be referred to as (10.3333, 3.9999999), etc. — jwezorek, Nov 07 '19 at 23:22
Okay thanks, let me think about this one. So, your key is basically a Vec2 of floats that is associated with another data structure? — Paul Renton, Nov 07 '19 at 23:24
You need to draw the line somewhere, right? If epsilon is 0.0004 and you first add 12.0453, then 12.0448, where would 12.0450 go? To 12.0453 because it was first, or to 12.0448 because it is closer? — Fusho, Nov 07 '19 at 23:30
Aaaah, you are looking at 2d points... Take a look at: https://en.wikipedia.org/wiki/Geohash — Dav3xor, Nov 07 '19 at 23:31

jwezorek · Accepted Answer · 2020-11-02T17:56:00.850

The best way to do this is to use fixed point arithmetic.

The implementation in the question works, but it is more obfuscated than it needs to be. What it treats as an epsilon or a tolerance is actually a "bin width" -- a one-dimensional spacing between grid lines partitioning the real number line. If you expect the epsilon value to act like a tolerance, you will notice counter-intuitive behavior around the edges of bins, near grid lines.
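
To make the bin-edge effect concrete, here is a minimal sketch. The helper names bin_index and same_bin are hypothetical and not part of the question's code, and floor is used where the question casts to an integer (the two agree for positive values).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: index of the grid cell of width bin_width that contains v.
inline long long bin_index(double v, double bin_width)
{
    assert(bin_width > 0.0);
    return static_cast<long long>(std::floor(v / bin_width));
}

// Two values are treated as "equal" only when they land in the same cell,
// regardless of how close together they actually are.
inline bool same_bin(double a, double b, double bin_width)
{
    return bin_index(a, bin_width) == bin_index(b, bin_width);
}
```

With a bin width of 0.0005, same_bin(12.0449, 12.0451, 0.0005) is false even though the values are only 0.0002 apart, while same_bin(12.0451, 12.0454, 0.0005) is true although they are 0.0003 apart. A grid line sits between the first pair and not the second.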

In any case, a clearer way to think about this problem is to drop the notion of "tolerance" and use the notion of "significant digits" instead. Treat only n base-10 digits right of the decimal point as mattering, and parametrize on that n. This essentially amounts to using fixed point values as keys rather than floating point values; in the above implementation it is akin to using an epsilon of 0.0001 instead of 0.0005 (and rounding to the nearest value rather than truncating).

But rather than just modifying the epsilon in the original code, there is now no reason not to make the fixed point values a public type and use that type as the key of an unordered_map exposed to the user. Previously we wanted to hide the key type by wrapping the implementation's unordered_map in a custom data structure, because in that case the keys were opaque and had no intuitive meaning. Using fixed point keys in a normal unordered_map has the side benefit that we do not have to implement wrapper methods for all the standard std::unordered_map calls, since the user is now given an actual unordered_map.

Code below:

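#include <cmath>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>

template<int P=4>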
class fixed_point_value
{
    static constexpr double calc_scaling_factor(int digits_of_precision)
    {
        return (digits_of_precision == 1) ? 10.0 : 10.0 * calc_scaling_factor(digits_of_precision - 1);
    }

    static constexpr double scaling_factor = calc_scaling_factor(P);

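    // Named Q rather than P: a nested template parameter may not reuse
    // the name of the enclosing class template's parameter.
    template<int Q>
    friend struct fixed_point_hash;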

public:
    fixed_point_value(float val) : 
        impl_(static_cast<long long>(std::llround(scaling_factor * val)))
    {}

    bool operator==(fixed_point_value<P> fpv) const 
    {
        return impl_ == fpv.impl_;
    }

    float to_float() const
    {
        return static_cast<float>(impl_ / scaling_factor);
    }

private:
    long long impl_;
};

template<int P = 4>
struct fixed_point_hash
{
    std::size_t operator()(fixed_point_value<P> key) const {
        return std::hash<long long>{}(key.impl_);
    }
};

template<typename V, int P = 4>
using fixed_point_table = std::unordered_map<fixed_point_value<P>, V, fixed_point_hash<P>>;

int main()
{
    fixed_point_table<std::string, 4> t4;

    t4[12.0453f] = "yes";

    // these will all be "no" except 12.0453f because we have 4 base-10 digits of precision i.e.
    // 4 digits right of the decimal must be an exact match
    std::cout << "precision = 4" << std::endl;
    std::cout << "12.0446f => " << (t4.find(12.0446f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0447f => " << (t4.find(12.0447f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0448f => " << (t4.find(12.0448f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0449f => " << (t4.find(12.0449f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0450f => " << (t4.find(12.0450f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0451f => " << (t4.find(12.0451f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0452f => " << (t4.find(12.0452f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0453f => " << (t4.find(12.0453f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0454f => " << (t4.find(12.0454f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0455f => " << (t4.find(12.0455f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0456f => " << (t4.find(12.0456f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0457f => " << (t4.find(12.0457f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0458f => " << (t4.find(12.0458f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0459f => " << (t4.find(12.0459f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "12.0460f => " << (t4.find(12.0460f) != t4.end() ? "yes" : "no") << std::endl;
    std::cout << "\n";

    fixed_point_table<std::string, 3> t3;
    t3[12.0453f] = "yes"; // 12.0453 will round to the fixed point value 12.045.
    std::cout << "precision = 3" << std::endl;
    std::cout << "12.0446f => " << (t3.find(12.0446f) != t3.end() ? "yes" : "no") << std::endl; // rounds to 12.045 so yes;
    std::cout << "12.0447f => " << (t3.find(12.0447f) != t3.end() ? "yes" : "no") << std::endl; // rounds to 12.045 so yes;
    std::cout << "12.0448f => " << (t3.find(12.0448f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0449f => " << (t3.find(12.0449f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0450f => " << (t3.find(12.0450f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0451f => " << (t3.find(12.0451f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0452f => " << (t3.find(12.0452f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0453f => " << (t3.find(12.0453f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0454f => " << (t3.find(12.0454f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0455f => " << (t3.find(12.0455f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0456f => " << (t3.find(12.0456f) != t3.end() ? "yes" : "no") << std::endl; // 12.0456f rounds to the 3-digit fixed point value 12.046, so no
    std::cout << "12.0457f => " << (t3.find(12.0457f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0458f => " << (t3.find(12.0458f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0459f => " << (t3.find(12.0459f) != t3.end() ? "yes" : "no") << std::endl; // '
    std::cout << "12.0460f => " << (t3.find(12.0460f) != t3.end() ? "yes" : "no") << std::endl; // '

}
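
Since the use case mentioned in the comments is 2D points, the same idea could plausibly be extended to a pair of fixed point coordinates. The following is only a sketch under assumptions: the names fixed_point_2d, fixed_point_2d_hash and point_table are hypothetical, the precision is hard-coded to 4 digits for brevity, and the hash mixing step is a common hash-combine idiom rather than anything taken from the code above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical 2D key: each coordinate is rounded to 4 digits right of the decimal point.
struct fixed_point_2d
{
    long long x;
    long long y;

    fixed_point_2d(float fx, float fy)
        : x(std::llround(10000.0 * fx)), y(std::llround(10000.0 * fy))
    {
        // llround is only meaningful for finite inputs
        assert(std::isfinite(fx) && std::isfinite(fy));
    }

    bool operator==(const fixed_point_2d& other) const
    {
        return x == other.x && y == other.y;
    }
};

struct fixed_point_2d_hash
{
    std::size_t operator()(const fixed_point_2d& p) const
    {
        const std::size_t hx = std::hash<long long>{}(p.x);
        const std::size_t hy = std::hash<long long>{}(p.y);
        // Mix the two coordinate hashes so that (a, b) and (b, a) differ.
        return hx ^ (hy + 0x9e3779b9 + (hx << 6) + (hx >> 2));
    }
};

template<typename V>
using point_table = std::unordered_map<fixed_point_2d, V, fixed_point_2d_hash>;
```

With this sketch, a value stored under fixed_point_2d(10.3333f, 4.0f) is also found under fixed_point_2d(10.3333f, 3.9999999f), because both y coordinates round to the same fixed point value. Points that straddle a rounding boundary still end up under different keys.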

score 1 · Answer 2 · answered Nov 07 '19 at 23:11

Hmmm, maybe you could use an unordered_map keyed with an integer, and determine the key with something like:

key = floor(val/precision);

This is reasonably transparent, and key 0 would contain values from 0.0 to 0.0005 (or whatever your precision is). Also, negative numbers would work logically in this as well.
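
As a rough illustration of the point about negative numbers, the sketch below compares floor with a plain integer cast. The function names bin_key and truncated_key are hypothetical, chosen only for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: floor-based key, so bin k covers
// the half-open range [k * precision, (k + 1) * precision).
inline std::int64_t bin_key(double val, double precision)
{
    assert(precision > 0.0);
    return static_cast<std::int64_t>(std::floor(val / precision));
}

// For comparison: a plain cast truncates toward zero, so values just
// below and just above zero share key 0 and that bin is twice as wide.
inline std::int64_t truncated_key(double val, double precision)
{
    assert(precision > 0.0);
    return static_cast<std::int64_t>(val / precision);
}
```

With a precision of 0.0005, bin_key maps 0.0002 to key 0 and -0.0002 to key -1, whereas truncated_key maps both to 0.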

If you want to do this with 2 dimensional values, you might want to look into geohashes.

answered Nov 07 '19 at 23:11

Dav3xor


that is pretty much what I'm doing. – jwezorek Nov 07 '19 at 23:12
Fair enough, I was led off track by calling the recursive function for every key. Figured you were doing something more clever. Hmmm, you are still calling std::hash on your key value. Would unordered_map call this anyway on an atomic type? – Dav3xor Nov 07 '19 at 23:25
The recursive function is constexpr. I had to do it that way because you cannot parametrize a C++ template on a floating point value. The key type of the internal unordered_map is the custom type float_map::key, so it will just use the hash that I provide. I call std::hash myself in key_hash because otherwise the result of the hash would literally be floor(val/precision), which would then be used directly in the unordered_map implementation rather than the hash of the integer, as std::unordered_map would use. – jwezorek Nov 07 '19 at 23:30
Argh, templates. condolences. – Dav3xor Nov 07 '19 at 23:41
lol, yeah. It would be a lot clearer if you could parametrize on a float. Anyway, I updated the code. It should be clearer that epsilon is a compile time constant now. I'm not actually super familiar with constexpr stuff either – jwezorek Nov 07 '19 at 23:48
@jwezorek: For the future: C++20 does allow floating-point template parameters. – Davis Herring Nov 01 '20 at 19:32

score 1 · Answer 3 · answered Nov 13 '19 at 18:14

Simply binning data points together can't possibly give you what you want, because there will always be points very close together on either side of a bin boundary. You need to use some other method.

For instance:

Let's say you divide your domain into squares of side epsilon. Then you can build an std::map that assigns each data point to a square; and given an arbitrary point P=(x,y), you can find the square S(P) that contains P. Now what you have to do is look at all nine squares in a 3x3 grid containing S(P) as the central square. Then you can scan those nine bins for the closest data point to P.

This method is guaranteed to find a point within a distance epsilon from (x,y), if one exists.
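
A minimal sketch of this 3x3 neighborhood search might look like the following. It is an illustration under assumptions, not a definitive implementation: the names point, point_grid and find_near are hypothetical, and it swaps the std::map mentioned above for a std::unordered_map so that cell lookup stays O(1) on average.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

struct point { float x; float y; };

// Hypothetical grid of square cells with side epsilon; each cell stores its points.
class point_grid
{
    using cell = std::pair<std::int64_t, std::int64_t>;

    struct cell_hash
    {
        std::size_t operator()(const cell& c) const
        {
            const std::size_t hx = std::hash<std::int64_t>{}(c.first);
            const std::size_t hy = std::hash<std::int64_t>{}(c.second);
            return hx ^ (hy + 0x9e3779b9 + (hx << 6) + (hx >> 2));
        }
    };

    // floor keeps all cells the same size on both sides of zero
    cell cell_of(point p) const
    {
        return cell(static_cast<std::int64_t>(std::floor(p.x / eps_)),
                    static_cast<std::int64_t>(std::floor(p.y / eps_)));
    }

    double eps_;
    std::unordered_map<cell, std::vector<point>, cell_hash> cells_;

public:
    explicit point_grid(double epsilon) : eps_(epsilon) { assert(epsilon > 0.0); }

    void insert(point p) { cells_[cell_of(p)].push_back(p); }

    // Nearest stored point within eps_ of q, or nullptr if there is none.
    // The returned pointer may be invalidated by a later insert.
    const point* find_near(point q) const
    {
        const cell c = cell_of(q);
        const point* best = nullptr;
        double best_d2 = eps_ * eps_;
        for (std::int64_t dx = -1; dx <= 1; ++dx) {
            for (std::int64_t dy = -1; dy <= 1; ++dy) {
                const auto it = cells_.find(cell(c.first + dx, c.second + dy));
                if (it == cells_.end()) continue;
                for (const point& p : it->second) {
                    const double ex = static_cast<double>(p.x) - q.x;
                    const double ey = static_cast<double>(p.y) - q.y;
                    const double d2 = ex * ex + ey * ey;
                    if (d2 <= best_d2) { best_d2 = d2; best = &p; }
                }
            }
        }
        return best;
    }
};
```

With epsilon set to 0.0005, a point stored at (12.0449, 4.0) is found by a query at (12.0451, 4.0) even though the two fall into different cells, because the neighboring cell is scanned as well. The trade-off is that each lookup visits up to nine cells instead of one.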
