Let's say we have two vectors, V and W, of length n. I launch a SYCL kernel that performs 3 iterations of a for loop on each element of V. The loop works as follows:
First, it computes the value of W at an index (W[idx]) from 4 values of V taken at the current iteration, i.e. W[idx] = V[a] + V[b] + V[c] + V[d], where a, b, c, and d are not contiguous but are fixed for each idx.
Then it updates V[idx] based on W[idx]. However, V[idx] must only be updated after its current value has been used in step 1 to compute W.
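The two steps above can be sketched as a serial reference version (the function name, sizes, and gather pattern below are made up purely for illustration). Snapshotting V before each sweep makes the read-before-write ordering explicit:

```cpp
#include <array>
#include <cstddef>

// Serial sketch of the dependency described above (illustrative names).
// W[idx] is gathered from four indices into the *old* V; V[idx] is then
// updated from W[idx]. Copying V into V_old before each sweep guarantees
// every read sees iteration-it values, no matter the update order.
std::array<double, 4> jacobi_demo(int iters) {
  constexpr std::size_t n = 4;
  std::array<double, n> V{1.0, 2.0, 3.0, 4.0};
  std::array<double, n> W{};
  // gather[idx] lists the four V-indices feeding W[idx] (made up for the demo)
  const std::array<std::array<std::size_t, 4>, n> gather{{
      {0, 1, 2, 3}, {1, 2, 3, 0}, {2, 3, 0, 1}, {3, 0, 1, 2}}};

  for (int it = 0; it < iters; ++it) {
    const std::array<double, n> V_old = V;  // snapshot of iteration-it values
    for (std::size_t idx = 0; idx < n; ++idx) {
      const auto& g = gather[idx];
      W[idx] = V_old[g[0]] + V_old[g[1]] + V_old[g[2]] + V_old[g[3]]; // step 1
      V[idx] = 0.5 * W[idx];               // step 2 reads only the snapshot
    }
  }
  return V;
}
```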
Say the kernel runs 3 iterations of the loop. Suppose one thread is in iteration 1 and needs V[2] from iteration 1 to compute W[18]; another thread is already in iteration 2, computes W[2] there, and then updates V[2] for iteration 2. If the second thread runs ahead, it overwrites V[2] with the iteration-2 value before the first thread has read the iteration-1 value. How do I make sure this stays synchronized in SYCL? Would atomic_ref help here, given that the second thread should write V[2] only after it has been consumed? Note also that the iteration-1 value of V[2] is needed by other threads computing other W's in iteration 1. How do I ensure that V[2] is updated in iteration 2 only once its iteration-1 value has been used in all the required places? Here is the source code:
void jacobi_relaxation(cl::sycl::queue& q, ProblemVar& obj, int current_level) {
  for (int iterations = 1; iterations <= mu1; iterations++) {
    // v(k+1) = [(1 - omega) x I + omega x D^-1 x (-L-U)] x v(k) + omega x D^-1 x f
    //
    // step 1 => v* = (-L-U) x v
    // step 2 => v* = D^-1 x (v* + f)
    // step 3 => v  = (1 - omega) x v + omega x v*
    q.submit([&](cl::sycl::handler& h) {
      // Accessors for the current-level CSR matrix
      auto row = obj.A_sp_dict[current_level].row.get_access<cl::sycl::access::mode::read>(h);
      auto col = obj.A_sp_dict[current_level].col.get_access<cl::sycl::access::mode::read>(h);
      auto val = obj.A_sp_dict[current_level].values.get_access<cl::sycl::access::mode::read>(h);
      auto diag_indices
          = obj.A_sp_dict[current_level].diag_index.get_access<cl::sycl::access::mode::read>(h);
      auto vec = obj.vecs_dict[current_level].get_access<cl::sycl::access::mode::read>(h);
      auto f = obj.b_dict[current_level].get_access<cl::sycl::access::mode::read>(h);
      cl::sycl::accessor<double, 1, cl::sycl::access::mode::write> vec_star{
          obj.temp_dict[current_level], h, cl::sycl::noinit};
      // 2 kernels are required per Jacobi relaxation sweep
      h.parallel_for(
          cl::sycl::range<1>{obj.num_dofs_per_level[current_level]}, [=](cl::sycl::id<1> idx) {
            // Accumulate in a local variable: vec_star is a write-only
            // accessor, so it must not be read back with +=
            double sum = 0.0;
            for (std::int32_t i = row[idx[0]]; i < row[idx[0] + 1]; i++) {
              sum += -1.0 * val[i] * vec[col[i]]; // step 1: -(A v), diagonal included
            }
            // steps 1 and 2 combined; the "+ vec[idx]" term folds the
            // diagonal contribution back in
            vec_star[idx[0]] = (1.0 / val[diag_indices[idx[0]]]) * (sum + f[idx[0]]) + vec[idx[0]];
          });
    });
    q.wait();
    q.submit([&](cl::sycl::handler& h) {
      // Accessor for the current-level vector
      auto vec = obj.vecs_dict[current_level].get_access<cl::sycl::access::mode::read_write>(h);
      auto vec_star
          = obj.temp_dict[current_level].get_access<cl::sycl::access::mode::read_write>(h);
      h.parallel_for(cl::sycl::range<1>{obj.num_dofs_per_level[current_level]},
                     [=](cl::sycl::id<1> idx) {
                       vec[idx[0]] = (1.0 - omega) * vec[idx[0]] + omega * vec_star[idx[0]]; // step 3
                       vec_star[idx[0]] = 0.0;
                     });
    });
    q.wait();
  }
}
As you can see, for each iteration I am forced to launch 2 kernels so that I get a synchronization point between the two calculations, and another at the end of the second one. I want to find a way to create only one kernel and perform the iterations inside it, with the synchronization built in.