
What is the contiguous dimension in an N-dimensional SYCL kernel, i.e. the dimension along which threads of a work-group are expected to belong to the same warp/wavefront? I would have expected it to be dimension 0, but some resources out there suggest otherwise (https://enccs.github.io/sycl-workshop/expressing-parallelism-nd-range/). I can't find an answer in the SYCL specification.

The contiguous dimension in a buffer seems to be the last one. The following code stores contiguous pixels along the last dimension (at least using Intel DPC++).

sycl::buffer<float, 2> buffer1(image.data(), range);
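
A quick host-side check seems to confirm that. Here is a sketch (height and width are hypothetical extents, with range == sycl::range<2>{height, width}; it needs <numeric> and <cassert> in addition to the SYCL header):

std::vector<float> host_data(height * width);
std::iota(host_data.begin(), host_data.end(), 0.0f); // host_data[i] == i
{
    sycl::buffer<float, 2> buf(host_data.data(), sycl::range<2>{height, width});
    sycl::host_accessor acc{buf, sycl::read_only};
    for (size_t y = 0; y < height; ++y)
        for (size_t x = 0; x < width; ++x)
            // Passes if the last dimension is the contiguous one, i.e. if
            // element {y, x} maps to linear index y * width + x.
            assert(acc[sycl::id<2>{y, x}] == float(y * width + x));
}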

That may seem unrelated, but I would have expected something like this to be a common pattern:

queue.submit([&](sycl::handler &handler) {
    auto access = buffer.get_access<sycl::access_mode::read>(handler);
    handler.parallel_for(buffer.get_range(), [=](sycl::id<2> id) {
        auto v = access[id[0]][id[1]];
        // ...
    });
});

However, the contiguous dimension of an image seems to be the first one (dimension 0). The following code stores contiguous pixels along dimension 0.

sycl::image<2> image1(vector1.data(), sycl::image_channel_order::r, sycl::image_channel_type::fp32, image_range);

This creates some confusion.

Thanks.

Limmershin

1 Answer


The contiguous dimension (also known as the fast index, vectorized index, etc.) for buffers is indeed always the "last" one, i.e. index 2 for a 3D object.

This is specified in section 3.11.1 "Linearization" of the SYCL 2020 specification: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:multi-dim-linearization
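
Spelled out (a paraphrase of the equation from that section, not a verbatim quote):

// Linear index of a 3D id {id0, id1, id2} over a range {r0, r1, r2}:
//     linear = id0 * r1 * r2 + id1 * r2 + id2
// (for 2D: id0 * r1 + id1), so the last index varies fastest.
size_t linearize(sycl::id<3> id, sycl::range<3> r) {
    return id[0] * r[1] * r[2] + id[1] * r[2] + id[2];
}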

That particular choice is motivated by C++, which behaves in the same way for its arrays, and SYCL aims to align with C++.

Note that your sample code can also be written in a more idiomatic version that hides this detail:

queue.submit([&](sycl::handler &handler) {
    auto access = buffer.get_access<sycl::access_mode::read>(handler);
    handler.parallel_for(buffer.get_range(), [=](sycl::id<2> id) {
        auto v = access[id];
        // ...
    });
});
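
For kernels launched with an explicit nd_range, the same convention suggests keeping the last dimension as the "wide" one. A sketch (the 16x16 work-group size is an arbitrary choice and assumes the global range divides evenly):

queue.submit([&](sycl::handler &handler) {
    auto access = buffer.get_access<sycl::access_mode::read>(handler);
    handler.parallel_for(
        sycl::nd_range<2>{buffer.get_range(), sycl::range<2>{16, 16}},
        [=](sycl::nd_item<2> item) {
            sycl::id<2> id = item.get_global_id();
            // Work-items that are adjacent in dimension 1 (the last one)
            // touch adjacent elements of the buffer.
            auto v = access[id];
            // ...
        });
});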

For images, I am not sure it is that well-defined, since images are backend-specific opaque objects. I agree that the specification may be unclear here as to what this means for host pointers that are used as input to the image class. Given that the linearization equation from 3.11.1 is not explicitly restricted to buffers, it could be argued that the behavior you are seeing is a bug in your SYCL implementation.

illuhad
  • The question remains. Are we sure that using such a pattern (as demonstrated in the sample code) would result in coalesced memory access? That would imply that threads of a work-group are expected to belong to the same warp/wavefront along the last dimension. This is definitely different from CUDA and OpenCL. – Limmershin Mar 10 '23 at 13:16
  • Yes. SYCL implementations will automatically remap dimensions under the hood such that the SYCL fast index corresponds to the "vectorized index" of the backend, i.e. the index that would need to be contiguous for coalesced memory accesses. – illuhad Mar 11 '23 at 14:27