
I have a project where we are to solve the matrix equation AX = B for X, given that A is a tridiagonal matrix. I did this project in C++ and got the program to produce the right matrix X, but when trying to report the error back to the user, A*X - B, I get an erroneous error! This is because I am subtracting the entries of A*X and B, which are arbitrarily close to each other. I had two ideas on how to handle this, element by element:

  1. According to this article, http://en.wikipedia.org/wiki/Loss_of_significance, as many as -log2(1 - y/x) bits can be lost in the straight subtraction x - y. So: scale both x and y by pow(2, bitsLost), subtract the two, and then scale the result back down by dividing by pow(2, bitsLost).
  2. As stressed so much in the numerical methods course this is for: take the arithmetic conjugate! Instead of double difference = x-y; use double difference = (x*x-y*y)/(x+y);

OK, so why haven't you chosen a method and moved on?

I tried all three methods (including straight subtraction) here: http://ideone.com/wfkEUp . I would like to know two things:

  1. Between the "scaling and descaling" method (for which I intentionally chose a power of two) and the arithmetic conjugate method, which one produces less error (in terms of subtracting the large numbers)?
  2. Which method is computationally more efficient? /*For this, I was going to say the scaling method would be more efficient, with linear complexity versus the seemingly quadratic complexity of the conjugate method, but I don't know the complexity of log2()*/

Any and all help would be welcomed!!

P.S.: All three methods seem to return the same double in the sample code...


Let's see some of your code.

No problem; here is my Matrix.cpp code:

#include "ExceptionType.h"
#include "Matrix.h"
#include "MatrixArithmeticException.h"
#include <iomanip>
#include <iostream>
#include <vector>

Matrix::Matrix()
{
    //default size for Matrix is 1 row and 1 column, whose entry is 0
    std::vector<long double> rowVector(1,0);
    this->matrixData.assign(1, rowVector);
}

Matrix::Matrix(const std::vector<std::vector<long double> >& data)
{
    this->matrixData = data;
    //validate matrixData
    validateData();
}

//getter functions
//Recall that matrixData is a vector of a vector, whose elements should be accessed like matrixData[row][column].
//Each rowVector should have the same size.
unsigned Matrix::getRowCount() const { return matrixData.size(); }

unsigned Matrix::getColumnCount() const { return matrixData[0].size(); }

//matrix validator should just append zeroes into row vectors that are of smaller dimension than they should be...
void Matrix::validateData()
{
    //fetch the size of the largest-dimension rowVector
    unsigned largestSize = 0;
    for (unsigned i = 0; i < getRowCount(); i++)
    {
        if (largestSize < matrixData[i].size())
            largestSize = matrixData[i].size();
    }
    //make sure that all rowVectors are of that dimension
    for (unsigned i = 0; i < getRowCount(); i++)
    {
        //if we find a rowVector where this isn't the case
        if (matrixData[i].size() < largestSize)
        {
            //add zeroes to it so that it becomes the case
            matrixData[i].insert(matrixData[i].end(), largestSize-matrixData[i].size(), 0);
        }
    }

}
//operators
//+ and - operators should check to see if the size of the first matrix is exactly the same size as that of the second matrix
Matrix Matrix::operator+(const Matrix& B)
{
    //if the sizes coincide
    if ((getRowCount() == B.getRowCount()) && (getColumnCount() == B.getColumnCount()))
    {
        //declare the matrixData
        std::vector<std::vector<long double> > summedData = B.matrixData;    //since we are in the scope of the Matrix, we can access private data members
        for (unsigned i = 0; i < getRowCount(); i++)
        {
            for (unsigned j = 0; j < getColumnCount(); j++)
            {
                summedData[i][j] += matrixData[i][j];   //add the elements together
            }
        }
        //return result Matrix
        return Matrix(summedData);
    }
    else
        throw MatrixArithmeticException(DIFFERENT_DIMENSIONS);
}

Matrix Matrix::operator-(const Matrix& B)
{
    //declare negativeB
    Matrix negativeB = B;
    //negate all entries
    for (unsigned i = 0; i < negativeB.getRowCount(); i++)
    {
        for (unsigned j = 0; j < negativeB.getColumnCount(); j++)
        {
            negativeB.matrixData[i][j] = 0-negativeB.matrixData[i][j];
        }
    }
    //simply add the negativeB
    try
    {
        return ((*this)+negativeB);
    }
    catch (MatrixArithmeticException& mistake)
    {
        std::cout << mistake.what() << std::endl;
        //rethrow: otherwise control falls off the end of a non-void function, which is undefined behavior
        throw;
    }
}

Matrix Matrix::operator*(const Matrix& B)
{
    //the columnCount of the left operand must be equal to the rowCount of the right operand
    if (getColumnCount() == B.getRowCount())
    {
        //if it is, declare data with getRowCount() rows and B.getColumnCount() columns
        std::vector<long double> zeroVector(B.getColumnCount(), 0);
        std::vector<std::vector<long double> > data(getRowCount(), zeroVector);
        for (unsigned i = 0; i < getRowCount(); i++)
        {
            for (unsigned j = 0; j < B.getColumnCount(); j++)
            {
                long double sum = 0; //set sum to zero
                for (unsigned k = 0; k < getColumnCount(); k++)
                {
                    //add the product of matrixData[i][k] and B.matrixData[k][j] to sum
                    sum += (matrixData[i][k]*B.matrixData[k][j]);
                }
                data[i][j] = sum;   //assign the sum to data
            }
        }
        return Matrix(data);
    }
    else
    {
        throw MatrixArithmeticException(ROW_COLUMN_MISMATCH); //dimension mismatch
    }
}

std::ostream& operator<<(std::ostream& outputStream, const Matrix& theMatrix)
{
    //Here, you should use the << again, just like you would for ANYTHING ELSE.
    //first, print a newline
    outputStream << "\n";
    //setting precision (optional)
    outputStream.precision(11);
    for (unsigned i = 0; i < theMatrix.getRowCount(); i++)
    {
        //print '['
        outputStream << "[";
        //format stream(optional)
        for (unsigned j = 0; j < theMatrix.getColumnCount(); j++)
        {
            //print numbers
            outputStream << std::setw(17) << theMatrix.matrixData[i][j];
            //print ", "
            if (j < theMatrix.getColumnCount() - 1)
                outputStream << ", ";
        }
        //print ']'
        outputStream << "]\n";
    }
    return outputStream;
}
Mike Warren

6 Answers


You computed two numbers x and y which are of a limited precision floating point type. This means that they are already rounded somehow, meaning loss of precision while computing the result. If you subtract those numbers afterwards, you compute the difference between those two already rounded numbers.

The formula you wrote gives you the maximum error for computing the difference, but this error is with regard to the stored intermediate results x and y (again: rounded). No method other than x-y will give you a "better" result (in terms of the complete computation, not only the difference). To put it in a nutshell: the difference can't be more accurate using any formula other than x-y.

I'd suggest taking a look at arbitrary-precision arithmetic libraries such as GMP (or Eigen instantiated with a multiprecision scalar type). Use such a library for computing your equation system; don't use double for the matrix computations. This way you can make sure that the intermediate results x and y (or the matrices AX and B) are as precise as you want them to be, for example 512 bits, which should definitely be enough for most cases.

leemes
  • Arbitrary precision is almost certainly not the answer here. – David Heffernan Dec 01 '13 at 20:21
  • @DavidHeffernan This was meant for the matrix computation, not for the difference. Computing the difference of two `double`s can't reverse the error which was done in the computation *before* subtracting them; that's the basic point of my answer. – leemes Dec 01 '13 at 20:22
  • My point is that it is likely that the best approach is to embrace the imprecision – David Heffernan Dec 01 '13 at 20:26
  • I think most of the imprecision might be coming from `A*X`, where I simply return the matrix whose entries are the Euclidean inner product, http://en.wikipedia.org/wiki/Dot_product , of the respective `rowVector` of A and X. (There would accumulate the maximum possible error...) – Mike Warren Dec 01 '13 at 23:19

Finite precision floating point data types cannot represent all possible real values. There are an infinite number of different values, and so it is easy to see that not all values are representable in a type of finite size.

So it's perfectly plausible that your true solution will be a non-representable value. No amount of trickery can get you an exact solution in the finite data type.

You need to re-calibrate your expectations to match the reality of finite precision floating point data types. The starting point is What Every Computer Scientist Should Know About Floating-Point Arithmetic.

David Heffernan
  • +1 The second sentence is deeper than people think if they've never studied any discrete math. Most never consider the infinitude of *real* values between *any* two distinct real values. Nicely worded. – WhozCraig Dec 01 '13 at 20:35

To all the people answering the question: I knew (and originally figured out by accident) that the cardinality of the set of all possible doubles is finite. I suppose I have no choice but to either try a higher-precision type or create my own class that represents a HugeDecimal.

Mike Warren
  • It's terrible that such knowledge may be gained by accident. How the heck did you use floating point numbers before? – V-X Dec 02 '13 at 05:08
  • Higher precision won't help. The issue is that the computer is finite. Why would decimal help? What makes you think the problem is down to the use of binary representation. Why do you feel the need to represent numbers exactly? I think perhaps you meant for this to be a comment rather than an answer. – David Heffernan Dec 02 '13 at 07:34
  • I am not looking for exact; I am just looking for numbers representing the error (my main concern) to be as close to what, for example, wolframalpha.com gets; //even though I now know that WolframAlpha was written with millions of lines of Mathematica code and uses over 10,000 CPUs: http://en.wikipedia.org/wiki/Wolfram_Alpha . Maybe I should post what I get from there as well as the results of running my program. – Mike Warren Dec 02 '13 at 13:40
  • Also, the error matrix `A*X-B` returned by Matlab is not the same as the one I am getting (maybe because I decided to upgrade to long double in C++ program)? – Mike Warren Dec 03 '13 at 09:01

Replace the equality check with a check for whether the difference is bigger than some given epsilon (a constant whose meaning is the minimal distinguishable difference).
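A sketch of what such a comparison could look like (the helper name and the default tolerances are my own choices, not prescribed by the answer):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical helper: a and b count as "equal" when their difference is
// below an absolute floor or a tolerance scaled to their magnitude.
bool nearlyEqual(double a, double b,
                 double relTol = 1e-9,
                 double absTol = std::numeric_limits<double>::epsilon()) {
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= std::max(absTol, relTol * scale);
}
```

Scaling the tolerance by the operands' magnitude matters here: a fixed epsilon that works for entries near 1 is far too tight for entries near 1e12.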

V-X
  • basically, machine epsilon? – Mike Warren Dec 01 '13 at 19:58
  • 1
    @MikeWarren: you might want to use `std::numeric_limits::epsilon()`, possibly suitably scaled to a suitable order of magnitude to be useful for the values of your problem. For the use you describe, I would just use something which is sufficiently small, e.g., `1e-6`. – Dietmar Kühl Dec 01 '13 at 20:11

You cannot expect to get infinite precision with floating point numbers. You should consider what precision is needed, and then choose the simplest method that satisfies your needs. So if you get the same result then stick with normal subtraction and use an epsilon as suggested in V-X's answer.

How do you end up with O(n^2) complexity for the conjugate method? You have a fixed set of operations per element: two multiplications, one addition, one subtraction, and one division. Assuming each of those is O(1), you get O(n) for applying it to n numbers.

Silas
  • @MikeWarren I am not sure about that. On a 64-bit architecture I would not expect 32-bit multiplication to be faster than 64-bit multiplication. On the other hand, I would expect 128-bit multiplication to be a lot slower. But it all depends on the hardware. – Silas Dec 02 '13 at 08:20

While this may not help you choose a method, a while ago I wrote a tool that may help you choose a precision based on the sorts of values you're expecting:

http://riot.so/floatprecision.html

As other answers have said, you can't expect to get infinite precision with floating point, but you can use tools such as this to obtain the minimum increment and decrement size of a given number, and work out what is the optimal precision to use to get the accuracy you need.

Riot
  • [Your tool behaves unexpectedly for input `1.0000001`](http://riot.so/cgi-bin/floatprecision_htmlout?float=1.0000001). It shows `1.000000` for both float and double because you output in fixed-point precision (not floating). – leemes Dec 01 '13 at 20:52
  • That was intentional, to make it easier to see the changes (as it was built for larger numbers) - if I can find the source I'll tweak it to show both fixed and floating representations of the same numbers. – Riot Dec 01 '13 at 20:53