
I wrote two programs to calculate the inverse of a matrix using Gaussian elimination; the first program is in C# and the second in CUDA C++. The two programs follow exactly the same procedure and give the same final results. However, when I checked the values in the intermediate steps, I found slightly different values, with a relative error below 1e-5.

Here is the relevant part of each program.

C#

int i, j, i1, n, y, z;
double[,] M = new double[n, n];
double[,] inv = new double[n, n];
for (i = 0; i < n; i++)
    inv[i, i] = 1;
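// Forward elimination: scale each pivot row, then eliminate the entries below the pivot in both M and inv.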
for (i = 0; i < n; i++)
{
    for (j = i + 1; j < n; j++)
        M[i, j] /= M[i, i];
    for (j = 0; j < n; j++)
        inv[i, j] /= M[i, i];
    if (i != n - 1)
    {
        for (i1 = i + 1; i1 < n; i1++)
            if (Math.Abs(M[i1, i]) >= 1e-9)
            {
                for (j = i + 1; j < n; j++)
                    M[i1, j] -= M[i1, i] * M[i, j];
                for (j = 0; j < n; j++)
                    inv[i1, j] -= M[i1, i] * inv[i, j];
            }
        StreamWriter f = new StreamWriter("CPU.csv");
        for (y = 0; y < n; y++)
        {
            for (z = 0; z < n; z++)
                f.Write(M[y, z].ToString() + ",");
            for (z = 0; z < n; z++)
                f.Write(inv[y, z].ToString() + ",");
            f.WriteLine();
        }
        f.Close();
    }
}
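// Back substitution: eliminate the entries above the diagonal; inv now holds the inverse.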
for (i = n - 1; i > 0; i--)
{
    for (i1 = 0; i1 < i; i1++)
        if (Math.Abs(M[i1, i]) >= 1e-9)
            for (j = 0; j < n; j++)
                inv[i1, j] -= M[i1, i] * inv[i, j];
}

CUDA C++

int i, x, y;
double v;
double* d_A, * d_B, * d_v, * Z;
std::ofstream myfile;
size_t size = n * n * sizeof(double);
cudaMalloc(&d_A, size);
cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMalloc(&d_B, size);
cudaMalloc(&d_v, sizeof(double));
Z = new double[n * n];
Unity <<<1, n>>> (d_B, n);
cudaDeviceSynchronize();
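// Unity appears to initialize d_B to the identity matrix; the loop below
// performs the forward elimination on the device, one pivot per iteration,
// mirroring the CPU code above.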
for (i = 0; i < n; i++)
{
    GetVal <<<1, 1>>> (d_A, i * (n + 1), d_v);
    cudaMemcpy(&v, d_v, sizeof(double), cudaMemcpyDeviceToHost);
    if (i != n - 1)
        DivideRow <<<1, n - i - 1>>> (d_A, i * (n + 1) + 1, n - i - 1, v);
    DivideRow <<<1, n>>> (d_B, i * n, n, v);
    cudaDeviceSynchronize();
    cudaMemcpy(Z, d_A, size, cudaMemcpyDeviceToHost);
    cudaMemcpy(B, d_B, size, cudaMemcpyDeviceToHost);
    if (i != n - 1)
    {
        dim3 GridA(1, 1);
        dim3 BlockA(n - i - 1, n - i - 1);
        dim3 GridB(1, 1);
        dim3 BlockB(n - i - 1, n);
        ModifyRow <<<GridA, BlockA>>> (d_A, i, i, i + 1, n - i - 1, n - i - 1);
        ModifyRow <<<GridB, BlockB>>> (d_A, n, i, i, d_B, i + 1, 0, n - i - 1, n);
        cudaDeviceSynchronize();
        cudaMemcpy(Z, d_A, size, cudaMemcpyDeviceToHost);
        cudaMemcpy(B, d_B, size, cudaMemcpyDeviceToHost);
        myfile.open("GPU.csv");
        for (x = 0; x < n; x++)
        {
            for (y = 0; y < n; y++)
                myfile << Z[x * n + y] << ",";
            for (y = 0; y < n; y++)
                myfile << B[x * n + y] << ",";
            myfile << "\n";
        }
        myfile.close();
    }
}
cudaFree(d_v);
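// Back substitution on the device: eliminate the entries above the diagonal in d_B.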
for (i = n - 1; i > 0; i--)
{
    dim3 GridB(1, 1);
    dim3 BlockB(i, n);
    ModifyRow <<<GridB, BlockB>>> (d_A, n, i, i, d_B, 0, 0, i, n);
    cudaDeviceSynchronize();
    cudaMemcpy(Z, d_A, size, cudaMemcpyDeviceToHost);
    cudaMemcpy(B, d_B, size, cudaMemcpyDeviceToHost);
}
cudaMemcpy(B, d_B, size, cudaMemcpyDeviceToHost);
cudaFree(d_A);
cudaFree(d_B);

I compared the values in the CPU.csv and GPU.csv files and found these differences.

What could be the reason for this? Do the calculations in CUDA C++ have less precision than C#?

AbdelAziz AbdelLatef
  • For future reference, the only really valid way to compare floating point results from different platforms or libraries is to write out the data in binary form (so the whole IEEE 754 binary64 format in this case) and then load those and compare them number for number. Relying on csv files leaves you at the mercy of different float to string conversion routines (and then possibly the string to float conversion if you use something like Matlab to load and compare results). Never do that – talonmies Sep 25 '20 at 01:47
  • @talonmies Thank you, I appreciate this, and I think I will use it in the future. Can you give me a reference on how to write out the data in binary form? – AbdelAziz AbdelLatef Sep 25 '20 at 01:49
  • How about this as a starting point? https://stackoverflow.com/q/30923685/681865 – talonmies Sep 25 '20 at 01:51
  • Please don't make more work for other people by vandalizing your posts. By posting on the Stack Exchange network, you've granted a non-revocable right, under the [CC BY-SA 4.0 license](//creativecommons.org/licenses/by-sa/4.0/), for Stack Exchange to distribute that content (i.e. regardless of your future choices). By Stack Exchange policy, the non-vandalized version of the post is the one which is distributed. Thus, any vandalism will be reverted. If you want to know more about deleting a post please see: [How does deleting work?](//meta.stackexchange.com/q/5221) – Sabito stands with Ukraine Jan 18 '21 at 18:59
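A minimal sketch of the binary dump suggested above, assuming the matrices live in plain row-major double arrays; the helper names are made up for illustration, and comparing two runs then reduces to an element-by-element equality check on the reloaded buffers (or a memcmp of the raw bytes):

#include <cstddef>
#include <fstream>
#include <vector>

// Hypothetical helpers: write and read back the raw IEEE 754 bytes of a
// matrix, so two runs can be compared bit for bit with no text conversion.
void dump_binary(const char* path, const double* m, std::size_t count)
{
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(m), count * sizeof(double));
}

std::vector<double> load_binary(const char* path, std::size_t count)
{
    std::vector<double> m(count);
    std::ifstream in(path, std::ios::binary);
    in.read(reinterpret_cast<char*>(m.data()), count * sizeof(double));
    return m;
}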

1 Answer


From the NVIDIA documentation (about 2/3 of the way down):

The consequence [of rounding] is that different math libraries cannot be expected to compute exactly the same result for a given input. This applies to GPU programming as well. Functions compiled for the GPU will use the NVIDIA CUDA math library implementation while functions compiled for the CPU will use the host compiler math library implementation (e.g., glibc on Linux). Because these implementations are independent and neither is guaranteed to be correctly rounded, the results will often differ slightly.

Tells you all you need to know, really.
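One concrete illustration of such a rounding difference on the arithmetic side: by default the CUDA compiler contracts a*b + c into a single fused multiply-add (one rounding), while a host build may round the product and the sum separately. A minimal host-side sketch, with constants chosen only to make the extra rounding visible:

#include <cmath>
#include <cstdio>

int main()
{
    // Squaring 1 + 2^-27 produces a 2^-54 term that is lost when the product
    // is rounded to double on its own, but kept by a fused multiply-add.
    double a = 1.0 + std::ldexp(1.0, -27);
    double c = -(1.0 + std::ldexp(1.0, -26));

    volatile double p = a * a;           // rounded product (volatile blocks contraction)
    double separate = p + c;             // two roundings: prints 0
    double fused = std::fma(a, a, c);    // one rounding: prints 2^-54 (~5.55e-17)

    std::printf("separate = %.17g\nfused    = %.17g\n", separate, fused);
    return 0;
}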

Paul Sanders
  • I would have accepted this if I had found differences in the final results, but the final results were exactly the same, not a single difference in digits, while the intermediate values had slight differences. – AbdelAziz AbdelLatef Sep 24 '20 at 23:41
  • To what precision did you output the final results? – Paul Sanders Sep 24 '20 at 23:41
  • I used `WriteLine` in C# for both programs; please note that the CUDA C++ code is a DLL called from a C# program. – AbdelAziz AbdelLatef Sep 24 '20 at 23:43
  • Sorry, I'm not familiar with that, but I would suggest you find a way to output as many digits of precision as possible. In C++ you can do that with [`output_stream << std::setprecision`](https://en.cppreference.com/w/cpp/io/manip/setprecision). – Paul Sanders Sep 24 '20 at 23:45
  • For the final results, you are writing both data sets using C#. For the intermediate results, you are writing one data set using C#, and one data set using C++. – Robert Crovella Sep 24 '20 at 23:49
  • @RobertCrovella I think this may be the problem; I shall try this. – AbdelAziz AbdelLatef Sep 25 '20 at 00:06
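Following up on the last two comments, here is a minimal sketch of writing the intermediate matrices with enough digits to round-trip a double exactly; on the C# side, double.ToString("G17") (or the "R" round-trip format) does the same job. The helper name and the row-major layout are assumptions:

#include <fstream>
#include <iomanip>
#include <limits>

// Hypothetical helper: write an n x n row-major matrix to CSV with
// max_digits10 (17) significant digits, enough to round-trip a double.
void write_csv_full_precision(const char* path, const double* m, int n)
{
    std::ofstream out(path);
    out << std::setprecision(std::numeric_limits<double>::max_digits10);
    for (int r = 0; r < n; ++r)
    {
        for (int c = 0; c < n; ++c)
            out << m[r * n + c] << ',';
        out << '\n';
    }
}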