While learning SYCL/DPC++, I wrote a GPU-enabled dot product program in SYCL (full code on GitHub).
#include <iostream>
#include <sstream>
#include <cmath>
#include <CL/sycl.hpp>

int main(int argc, char* argv[]) {
    unsigned long N;
    /* snip */

    // Select GPU device and create queue
    sycl::device mygpu { sycl::gpu_selector_v };
    sycl::queue q(mygpu);
    /* snip */
    // Allocate vectors in device memory (USM) on the GPU
    double* vecA = sycl::malloc_device<double>(N, q);
    double* vecB = sycl::malloc_device<double>(N, q);

    // Initialize vectors on GPU using SYCL lambda kernels
    q.submit([&](sycl::handler& h) {
        h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
            vecA[i] = (double)i;
        });
    });
    q.submit([&](sycl::handler& h) {
        h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
            vecB[i] = 2.0 * (double)i;
        });
    });

    // Allocate result as shared memory and attach buffer
    double* result = sycl::malloc_shared<double>(1, q);
    sycl::buffer bufres = sycl::buffer<double>(result, 1);

    // Initialize result to 0
    q.submit([&](sycl::handler& h) {
        sycl::accessor res(bufres, h, sycl::write_only);
        h.single_task([=]() {
            res[0] = 0.0;
        });
    });

    // Manual synchronization
    q.wait();
    // Perform dot product using a SYCL lambda kernel and the oneAPI built-in reduction
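    // The sycl::reduction object is bound to bufres: partial sums from all
    // work-items are combined with sycl::plus and folded into the existing
    // value in the buffer once the kernel finishes (hence the zeroing above)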
    q.submit([&](sycl::handler& h) {
        auto red = sycl::reduction(bufres, h, sycl::plus<>());
        h.parallel_for(sycl::range<1>(N), red, [=](sycl::id<1> i, auto& tmp) {
            double prod = vecA[i] * vecB[i];
            tmp += prod;
        });
    });

    // Transfer result to host and synchronize
    // Note: host_accessor is blocking (thus can be used to synchronize)
    sycl::host_accessor res(bufres, sycl::read_only);

    // Check value (using relative error) and print to stdout
    double tol = 1.0e-10;
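    // Analytic value: sum_{i=0}^{N-1} i * (2i) = 2 * (N-1) N (2N-1) / 6
    //                                          = N (N-1) (2N-1) / 3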
    double check = (double)N * ((double)N - 1.0) * (2.0*(double)N - 1.0) / 3.0;
    if (std::fabs(res[0]/check - 1.0) > tol) {
        std::cout << "Error! Result = " << res[0]
                  << " when it should be " << check << std::endl;
    } else {
        std::cout << "Success! Result = " << res[0] << std::endl;
    }

    // Clean up
    sycl::free(vecA, q);
    sycl::free(vecB, q);
    sycl::free(result, q);
    return 0;
}
Now, for some reason, when I pass in a vector length of N = 1 billion, the result is wrong:
$ ./ddot.icpx.x 1000000000
Using device Intel(R) Iris(R) Pro Graphics P580 [0x193a]
Error! Result = 5.63507e+26 when it should be 6.66667e+26
The GPU in question is the Iris Pro Graphics P580 integrated into an Intel Xeon E3-1585L v5 (Skylake) processor.
If it helps, here is the compiler version string:
$ icpx --version
Intel(R) oneAPI DPC++/C++ Compiler 2023.0.0 (2023.0.0.20221201)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2023.0.0/linux/bin-llvm
Configuration file: /opt/intel/oneapi/compiler/2023.0.0/linux/bin-llvm/../bin/icpx.cfg
Am I hitting an "invisible" hardware limit on the maximum size of sycl::range<1>(N)
for this device? If so, which sycl::info::device descriptor should I query
so that I can add a check in my code and cap N at that value?
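For reference, the kind of guard I have in mind is sketched below. It is only a guess at which descriptors matter here (max_work_group_size, max_work_item_sizes, max_mem_alloc_size, global_mem_size); figuring out the right one is exactly what I am asking:

#include <iostream>
#include <CL/sycl.hpp>

int main() {
    sycl::queue q{ sycl::gpu_selector_v };
    sycl::device dev = q.get_device();

    // Candidate limits reported by the device
    // (not sure which of these, if any, actually bounds sycl::range<1>(N))
    auto max_wg    = dev.get_info<sycl::info::device::max_work_group_size>();
    auto max_wi    = dev.get_info<sycl::info::device::max_work_item_sizes<3>>();
    auto max_alloc = dev.get_info<sycl::info::device::max_mem_alloc_size>();
    auto glob_mem  = dev.get_info<sycl::info::device::global_mem_size>();

    std::cout << "max_work_group_size : " << max_wg    << "\n"
              << "max_work_item_sizes : " << max_wi[0] << "\n"
              << "max_mem_alloc_size  : " << max_alloc << " bytes\n"
              << "global_mem_size     : " << glob_mem  << " bytes\n";

    // e.g. cap N so that one vector of doubles fits in a single allocation
    unsigned long N_cap = max_alloc / sizeof(double);
    std::cout << "cap on N (per vector): " << N_cap << "\n";
    return 0;
}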