I am trying to run my slice_matrix function on a GPU. The current function is:

// Returns the r x c sub-matrix of mat whose top-left corner is at (i, j).
template<class T>
std::vector<std::vector<T>> slice_matrix(const std::vector<std::vector<T>> &mat,
        int i, int j, int r, int c) {

    std::vector<std::vector<T>> out(r, std::vector<T>(c, 0));

    for (int k = 0; k < r; k++) {
        // Copy columns [j, j + c) of row i + k into row k of the result.
        out[k].assign(mat[i + k].begin() + j, mat[i + k].begin() + j + c);
    }

    return out;
}

and the SYCL part of my code is:

auto event = gpuQueue.submit(
                [&](sycl::handler &h) {
                    //local copy of fun
                    auto f = fun;
                    sycl::accessor img_accessor(img_buffer, h,
                            sycl::read_only);
                    sycl::accessor ker_accessor(ker_buffer, h,
                            sycl::read_only);
                    sycl::accessor out_accessor(out_buffer, h,
                            sycl::write_only);
                    h.parallel_for(sycl::range<2>(out_row, out_col),
                            [=](sycl::id<2> index) {
                                int row = index[0];
                                int col = index[1];
                                out_accessor[index] = f(slice_matrix_gpu(img_accessor, row, col, filt_row, filt_col), ker_accessor);

                            });

                });

I know that vector<vector> does not store its data in one contiguous block of memory, so I switched to a single vector and try to interpret it as a two-dimensional block of data. This is what I defined:

/*change 2D Matrices to the 1D linear arrays,
         *
         *and operate on them as contiguous blocks */
        int M = img_row * img_col;
        int N = filt_row * filt_col;
        int H = out_row * out_col;


        //Define buffers for the image, the kernel and the output
        sycl::buffer<Tin, 1> img_buffer(&img[0], sycl::range<1>(M));
        sycl::buffer<Tin, 1> ker_buffer(&ker[0], sycl::range<1>(N));
        sycl::buffer<Tin, 2> out_buffer(&out[0], sycl::range<2>(out_row, out_col));

But I do not know what to do next. Should I pass my accessor as a 2D accessor, or should I change slice_matrix so that it treats the flat data as a 2D matrix? I should point out that slice_matrix may also be called by other functions, in which case it executes on the CPU. In other words, this function is not only for the GPU; it also runs on the CPU, like this:

if (use_tbb) {
        uTimer *timer = new uTimer("Executing Code On CPU");
        tbb::parallel_for(
                tbb::blocked_range2d<int, int>(0, out_row, 0, out_col),
                [&](tbb::blocked_range2d<int, int> &t) {
                    for (int n = t.rows().begin(); n < t.rows().end();
                            ++n) {
                        for (int m = t.cols().begin(); m < t.cols().end();
                                ++m) {
                            out[n][m] = fun(
                                    slice_matrix_cpu(img, n, m, filt_row,
                                            filt_col), ker);
                        }
                    }
                });
        delete timer;  // runs ~uTimer() and frees the allocation
        return out;
saharsa

1 Answer

I'm not confident I understand your question, but perhaps this will help and you can let me know if you have other questions.

Your approach does not look like one that will work well for offloading. That makes my mind leap immediately to "refactoring code" - in other words, taking a different approach that will get better performance.

The difficult part is that I really don't know why you selected your approach. So, for now I'll assume refactoring is an option (because if it is not, I'm not sure what advice to give you).

In general, it's a really good idea to lay out your data in contiguous memory if it will be shared with an accelerator. It leads to code that is easier to understand and to more efficient data transfers. So, I would advise you to do that. Having lots of smaller data elements (like short vectors) will not in general give interesting speed-ups with offload devices.
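As a rough illustration of what "contiguous" means here, the sketch below copies a vector of vectors into a single row-major std::vector. The name flatten is hypothetical (it is not from your code), and I'm assuming every row has the same length.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy a vector-of-vectors into one contiguous,
// row-major std::vector, so element (r, c) ends up at index r * cols + c.
// Assumption: all rows have the same length.
template <class T>
std::vector<T> flatten(const std::vector<std::vector<T>>& mat) {
    const std::size_t rows = mat.size();
    const std::size_t cols = rows ? mat[0].size() : 0;
    std::vector<T> flat;
    flat.reserve(rows * cols);
    for (const auto& row : mat) {
        assert(row.size() == cols);  // reject ragged input
        flat.insert(flat.end(), row.begin(), row.end());
    }
    return flat;
}
```

The result is a single allocation, so its data() pointer and size could be handed to a buffer in one piece instead of row by row.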

Once you have done that, SYCL is happy to let you claim it is a 1-D, 2-D, or 3-D array for the accessors. They all assume a linear collection of data; all that changes is how many indices you use to pick a data element. Do whatever feels the most natural.
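To make the "same linear data, different number of indices" point concrete, here is a plain C++ sketch (not SYCL) that reads a window of a row-major image directly by index arithmetic instead of copying it out with slice_matrix. The names at2d and window_dot are hypothetical, and I'm assuming your fun is something like an element-wise multiply and sum, which I can't tell from the question.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: element (r, c) of a row-major block with `cols` columns.
// A 2-D index is just this arithmetic applied to the 1-D data.
template <class T>
T at2d(const T* data, std::size_t cols, std::size_t r, std::size_t c) {
    return data[r * cols + c];
}

// Hypothetical stand-in for fun(slice_matrix(...), ker): multiply the
// filt_rows x filt_cols window at (row, col) element-wise with ker and sum.
// No temporary slice is allocated; the window is read in place.
template <class T>
T window_dot(const T* img, std::size_t img_rows, std::size_t img_cols,
             std::size_t row, std::size_t col,
             const T* ker, std::size_t filt_rows, std::size_t filt_cols) {
    // The window must lie entirely inside the image.
    assert(row + filt_rows <= img_rows && col + filt_cols <= img_cols);
    T acc = T(0);
    for (std::size_t k = 0; k < filt_rows; ++k) {
        for (std::size_t l = 0; l < filt_cols; ++l) {
            acc += at2d(img, img_cols, row + k, col + l) *
                   at2d(ker, filt_cols, k, l);
        }
    }
    return acc;
}
```

Because this only uses pointers and integer arithmetic, with no std::vector created per element, the same loop could in principle serve both the CPU path and a kernel body, though I have not tried that against your code.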

That is my thinking. If this approach cannot work for you, I think you won't find a GPU to be a good solution. If you really want to stick with your current design, it is possible that USM (unified shared memory) would be a cleaner way to code it, but I do not think you'll get good performance. I'm guessing, though, because I don't know your code well enough.

Good luck. I hope this helps... if not, let me know.