I'm writing this function

double long CosineDistance(const vector<unsigned long>& a, const vector<unsigned long>& b) {
    double long num = 0.0, den1 = 0.0, den2 = 0.0;
    for (int i = 0; i < a.size(); ++i) {
        num += a[i] * b[i];
        den1 += a[i] * a[i];
        den2 += b[i] * b[i];
    }
    return num / (sqrt(den1) * sqrt(den2));
}

And it works as I expect with small numbers:

e.g. passing {1,3,8} and {5,4,9} returns 0.936686 (which is right)

The project I'm building uses big numbers (they are hashed strings), and with inputs like

{3337682107,92015386,2479056,2478761,4153082938}

and

{104667454,92015386,150359366,2225484100,2479056}

it returns 1, which I think is an approximation of 0.968597, the value WolframAlpha gives.

I already checked for overflow, and it's not happening.

Is there a way to fix this?

Thanks

Ghesio
  • Are you sure that `3337682107 * 104667454` isn't overflowing? – Rakete1111 May 19 '16 at 18:32
  • You do know that floating point math is not exact - right? Suggested reading: https://en.wikipedia.org/wiki/IEEE_floating_point – Jesper Juhl May 19 '16 at 18:36
  • Could it be that you are losing precision in the loop? Why not use a long for num, den1, and den2? I don't see how those will ever be anything other than whole numbers. – JimmyJames May 19 '16 at 18:37
  • @Rakete1111 I'm sure it isn't overflowing, I checked using ``, and yes, I know floating point math is not exact, but 0.04 is a big error. I simply don't know where I'm losing precision – Ghesio May 19 '16 at 19:42
  • what is `sizeof(unsigned long)` on your system? – kmdreko May 19 '16 at 22:30
  • @vu1p3n0x It says 8. So I suppose that's 32 bits? If so, it's surely overflowing. – Ghesio May 20 '16 at 15:42
  • No, that means 8 bytes, so it's a 64-bit type. No problems there. – kmdreko May 20 '16 at 15:44

4 Answers

When you calculate the cosine similarity between two vectors a and b, the following holds:

CosineDistance(a*x, b*x) == CosineDistance(a, b);

for any nonzero number x. Thus you could simply use doubles and an appropriate scaling factor x to avoid overflow.
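
A minimal sketch of that idea, assuming the inputs are converted to double and divided by a common scale factor before any multiplication takes place. The name `ScaledCosine` and the choice of the largest element as the scale are illustrative assumptions, not part of the question's code; any nonzero constant would work equally well.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: cosine similarity computed on inputs that share a
// common scale factor. Assumes a and b have the same length and that
// neither vector is all zeros.
double ScaledCosine(const std::vector<unsigned long>& a,
                    const std::vector<unsigned long>& b) {
    // Assumption: scale by the largest element of either vector, so that
    // every scaled value lies in [0, 1].
    unsigned long maxv = 1;
    for (unsigned long v : a) maxv = std::max(maxv, v);
    for (unsigned long v : b) maxv = std::max(maxv, v);
    const double scale = static_cast<double>(maxv);

    double num = 0.0, den1 = 0.0, den2 = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Convert to double before multiplying, so no integer product is formed.
        const double x = static_cast<double>(a[i]) / scale;
        const double y = static_cast<double>(b[i]) / scale;
        num += x * y;
        den1 += x * x;
        den2 += y * y;
    }
    return num / (std::sqrt(den1) * std::sqrt(den2));
}
```

Because the scale cancels between numerator and denominator, the result should match the unscaled value up to floating-point rounding.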

463035818_is_not_an_ai

There are several places that you could be losing precision.

  • When multiplying two very large unsigned longs, the product can overflow.
  • When converting unsigned long to long double, the low-order bits can be lost to rounding if the value has more significant bits than the mantissa holds.
  • When adding two long doubles, one of which is many orders of magnitude larger than the other, the smaller one is essentially ignored. If it's only several orders of magnitude larger, the low-order bits of the smaller one are lost.
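
The three effects in the list can each be shown in isolation. The sketch below is illustrative only: it uses `double` (53-bit mantissa) and `std::uint32_t` instead of `long double` and `unsigned long`, on the assumption that fixed-size types make the behavior the same across platforms. The function names are hypothetical.

```cpp
#include <cstdint>

// Overflow: a 32-bit unsigned product wraps modulo 2^32.
bool ProductWraps32() {
    const std::uint32_t x = 3337682107u, y = 104667454u;
    const std::uint32_t wrapped = x * y;                            // reduced modulo 2^32
    const std::uint64_t exact = static_cast<std::uint64_t>(x) * y;  // fits in 64 bits
    return wrapped != exact;
}

// Conversion: 2^53 + 1 has more significant bits than a double's mantissa,
// so converting it rounds the low bit away.
bool ConversionLosesLowBit() {
    const std::uint64_t big = (std::uint64_t{1} << 53) + 1;
    const double d = static_cast<double>(big);
    return static_cast<std::uint64_t>(d) != big;
}

// Addition: adding 1 to 2^53 leaves the larger operand unchanged.
bool SmallAddendIsAbsorbed() {
    const double big = 9007199254740992.0;  // 2^53
    volatile double sum = big + 1.0;        // volatile forces rounding to double
    return sum == big;
}
```

With an x87-style `long double` the same conversion and addition effects would be expected to appear near 2^64 rather than 2^53.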

In your example, the calculation didn't lose much precision; 1 vs .95 is pretty close, relatively speaking. If you need the computation not to lose precision at all, one way to do it here would be to use a bignum library like boost::multiprecision. Instead of using long double in your code, you could use an arbitrary-precision rational number type like cpp_rational from that library, then convert to long double when it's time to take square roots.

If these numbers are hashes of strings, as you say, and the values don't have much significance in themselves (presumably you just want to cluster them or something?), then you could choose a hash function that outputs smaller numbers, or reduce those numbers with a modulus so they are, say, only 6 digits long. That will greatly reduce the likelihood of losing precision.
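
For integer inputs, a lighter-weight stand-in for a bignum library is to accumulate the sums exactly in a 128-bit integer and round only once, just before the square roots. This is a sketch and not the Boost approach itself: it assumes the `unsigned __int128` extension available in GCC and Clang (it is not standard C++ and MSVC does not provide it), and it assumes the inputs are small enough that the sums fit in 128 bits (for example, values below 2^60 and fewer than 256 elements). The name `ExactSumCosine` is hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: GCC/Clang extension type, not part of standard C++.
using u128 = unsigned __int128;

// Hypothetical helper: the three sums are accumulated exactly as integers;
// rounding to long double happens once per sum, not once per term.
long double ExactSumCosine(const std::vector<std::uint64_t>& a,
                           const std::vector<std::uint64_t>& b) {
    u128 num = 0, den1 = 0, den2 = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Widen to 128 bits before multiplying so the product cannot wrap.
        num  += static_cast<u128>(a[i]) * b[i];
        den1 += static_cast<u128>(a[i]) * a[i];
        den2 += static_cast<u128>(b[i]) * b[i];
    }
    const long double n  = static_cast<long double>(num);
    const long double d1 = static_cast<long double>(den1);
    const long double d2 = static_cast<long double>(den2);
    return n / (std::sqrt(d1) * std::sqrt(d2));
}
```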

Chris Beck

The sum of the squares of {3337682107,92015386,2479056,2478761,4153082938} is larger than 2^64, which appears to be the typical limit of the mantissa of double long (64 bits). Assuming that's the case, you are getting the same precision as you would with an unsigned long, which would overflow.
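
The claim about the size of the sum can be checked directly. The sketch below uses hypothetical helper names; it reports how many mantissa bits `long double` has on the current platform (which varies by compiler) and tests whether a sum of squares exceeds 2^64.

```cpp
#include <limits>
#include <vector>

// Number of mantissa bits in long double on this platform. Typically 64 with
// x87 extended precision, 53 where long double is the same as double.
constexpr int LongDoubleMantissaBits() {
    return std::numeric_limits<long double>::digits;
}

// Sum of squares, converting each element before multiplying.
long double SumOfSquares(const std::vector<unsigned long long>& v) {
    long double s = 0.0L;
    for (unsigned long long x : v) {
        const long double xd = static_cast<long double>(x);
        s += xd * xd;
    }
    return s;
}

// True when s is larger than 2^64.
bool ExceedsTwoTo64(long double s) {
    return s > 18446744073709551616.0L;  // 2^64
}
```

Note that a floating-point sum past that point does not wrap the way an integer would; it is rounded to the nearest representable value, so the relative error stays small even though the low-order digits are no longer exact.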

JimmyJames

I checked this using Matlab and C++ (x64 VC2013). For your "big numbers" case, I got an answer of 0.0314034 instead of 0.968597. I used the raw numbers as double instead of converting from int to double.

Here is how I checked things.

#include <cmath>
#include <vector>
#include <iostream>
using namespace std;

double CosineDistance(const vector<double> &a, const vector<double> &b);
long double CosineDistance2(const vector<long double> &a, const vector<long double> &b);
long double Cos2(const vector<unsigned long> &a, const vector<unsigned long> &b);
long double Cos3(const vector<unsigned long> &a, const vector<unsigned long> &b);

int main(int argc, char * argv[]){

    vector<double> a = { 1, 3, 8 };
    vector<double> b = { 5, 4, 9 };

    double v1 = CosineDistance(a, b);

    vector<double> a2 = { 3337.682107, 92.015386, 2.479056, 2.478761, 4153.082938 };
    vector<double> b2 = { 104.667454, 92.015386, 150.359366, 2225.484100, 2.479056 };

    double v2 = CosineDistance(a2, b2);

    vector<double> a3 = { 333.7682107, 9.2015386, .2479056, .2478761, 415.3082938 };
    vector<double> b3 = { 10.4667454, 9.2015386, 15.0359366, 222.5484100, .2479056 };

    double v3 = CosineDistance(a3, b3);

    vector<double> a4 = { .1, .3, .8 };
    vector<double> b4 = { .5, .4, .9 };

    double v4 = CosineDistance(a4, b4);

    vector<long double> a5 = { 3337682107, 92015386, 2479056, 2478761, 4153082938 };
    vector<long double> b5 = { 104667454, 92015386, 150359366, 2225484100, 2479056 };

    long double v5 = CosineDistance2(a5, b5);

    vector<unsigned long> a6 = { 3337682107, 92015386, 2479056, 2478761, 4153082938 };
    vector<unsigned long> b6 = { 104667454, 92015386, 150359366, 2225484100, 2479056 };

    long double v6 = Cos2(a6, b6);
    long double v7 = Cos3(a6, b6);

    cout << v1 << endl;
    cout << v2 << endl;
    cout << v3 << endl;
    cout << v4 << endl;
    cout << v5 << endl;
    cout << v6 << endl;
    cout << v7 << endl;

    return 0;
}

double CosineDistance(const vector<double> &a, const vector<double> &b){

    double num(0.0), den1(0.0), den2(0.0);

    for (unsigned int i = 0; i < a.size(); ++i){
        num += a[i] * b[i];
        den1 += a[i] * a[i];
        den2 += b[i] * b[i];
    }

    double res = num / (sqrt(den1) * sqrt(den2));

    return res;
}

long double CosineDistance2(const vector<long double> &a, const vector<long double> &b){

    long double num(0.0), den1(0.0), den2(0.0);

    for (unsigned int i = 0; i < a.size(); ++i){
        num += a[i] * b[i];
        den1 += a[i] * a[i];
        den2 += b[i] * b[i];
    }

    long double res = num / (sqrt(den1) * sqrt(den2));

    return res;
}

long double Cos2(const vector<unsigned long> &a, const vector<unsigned long> &b){

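    // Convert every element to long double first, so that all products
    // below are formed in floating point rather than in unsigned long.
    vector<long double> ad(a.size());
    vector<long double> bd(b.size());
    for (unsigned int i = 0; i < a.size(); ++i){
        ad[i] = static_cast<long double>(a[i]);
        bd[i] = static_cast<long double>(b[i]);
    }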

    long double num(0.0), den1(0.0), den2(0.0);

    for (unsigned int i = 0; i < a.size(); ++i){
        num += ad[i] * bd[i];
        den1 += ad[i] * ad[i];
        den2 += bd[i] * bd[i];
    }

    long double res = num / (sqrt(den1) * sqrt(den2));

    return res;
}

long double Cos3(const vector<unsigned long> &a, const vector<unsigned long> &b){

    long double num(0.0), den1(0.0), den2(0.0);

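    // Each product here is computed in unsigned long and only then converted
    // to long double. Where unsigned long is 32 bits (as in Visual C++), the
    // products wrap around, which is why this version prints a different value.
    for (unsigned int i = 0; i < a.size(); ++i){
        num += a[i] * b[i];
        den1 += a[i] * a[i];
        den2 += b[i] * b[i];
    }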

    long double res = num / (sqrt(den1) * sqrt(den2));

    return res;
}

The output is:

0.936686
0.0314034
0.0314034
0.936686
0.0314034
0.0314034
0.581537

Notice that when I specifically convert from unsigned long to long double my answer agrees with both Matlab and my other C++ numbers.
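
One possible explanation for the gap between 0.0314034 and 0.968597, offered as an observation and not as something established in this thread: the Wolfram Language's CosineDistance is defined as 1 minus the cosine similarity, and 1 - 0.0314034 = 0.9685966. The sketch below computes both quantities; the function names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity: dot(a, b) / (|a| * |b|). Assumes equal lengths and
// that neither vector is all zeros.
double CosineSimilarity(const std::vector<double>& a, const std::vector<double>& b) {
    double num = 0.0, den1 = 0.0, den2 = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        num += a[i] * b[i];
        den1 += a[i] * a[i];
        den2 += b[i] * b[i];
    }
    return num / (std::sqrt(den1) * std::sqrt(den2));
}

// Cosine distance under the "1 minus similarity" convention.
double CosineDistanceFromSimilarity(const std::vector<double>& a,
                                    const std::vector<double>& b) {
    return 1.0 - CosineSimilarity(a, b);
}
```

Under that convention, the functions in this answer and the value from WolframAlpha would be reporting the same result in two different forms.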

Matt
  • Your code works well, but inserting it in my project causes some problems (my other numbers are too big). I need to find a way to make hashed numbers smaller. Thank you anyway! – Ghesio May 20 '16 at 15:49
  • Fixed, there was a mismatch in the set of data I was using. Working perfect. – Ghesio May 20 '16 at 18:05