Recently I started working on numerical computation, solving mathematical problems numerically, programming in C++ with OpenMP. But now my problem is too big and takes days to solve even parallelized. So I'm thinking of learning CUDA to reduce the time, but I have some doubts.
The heart of my code is the following function. It takes two pointers to vectors. `N_mesh_points_x`, `N_mesh_points_y` and `N_mesh_points_z` are pre-defined integers; `weights_x`, `weights_y` and `weights_z` are column matrices; `kern_1` is an exponential function; and `table_kernel` is a function that accesses a pre-calculated 50 GB matrix stored in RAM.
void Kernel::paralel_iterate(std::vector<double>* K1, std::vector<double>* K2)
{
    double sum_1 = 0.0, sum_2 = 0.0;
    double phir;

    // One output value per mesh point (l, m, p).
    for (int l = 0; l < N_mesh_points_x; l++){
        for (int m = 0; m < N_mesh_points_y; m++){
            for (int p = 0; p < N_mesh_points_z; p++){
                sum_1 = 0;
                sum_2 = 0;
                // Weighted sum over every mesh point except (l, m, p) itself.
                #pragma omp parallel for schedule(dynamic) private(phir) reduction(+: sum_1, sum_2)
                for (int i = 0; i < N_mesh_points_x; i++){
                    for (int j = 0; j < N_mesh_points_y; j++){
                        for (int k = 0; k < N_mesh_points_z; k++){
                            if (i != l || j != m || k != p){
                                phir = weights_x[i] * weights_y[j] * weights_z[k] * kern_1(i, j, k, l, m, p);
                                sum_1 += phir * (*K1)[position(i, j, k)];
                                sum_2 += phir;
                            }
                        }
                    }
                }
                (*K2)[position(l, m, p)] = sum_1 + (table_kernel[position(l, m, p)] - sum_2) * (*K1)[position(l, m, p)];
            }
        }
    }
}
My questions are:
- Can I program, at least the central part of this function, in CUDA? With OpenMP I only parallelized the inner loops, because I was getting wrong answers when I parallelized all the loops (I suspect because `sum_1`, `sum_2` and `phir` were shared between threads once the outer loops were split). A sketch of the mapping I have in mind is shown after this list.
- The function `table_kernel` accesses a big matrix, and the matrix is too large to fit in the memory of my video card, so it will stay in RAM. Is this a problem? Can CUDA easily access data in RAM, or does everything needed have to be stored inside the video card? (See the mapped-memory sketch below.)
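
For the first question, here is a minimal sketch of how I imagine mapping the function to CUDA: one thread per output point `(l, m, p)`, with each thread running the inner triple loop serially, so no cross-thread reduction is needed. Everything here is my own assumption: `kern_1_dev` is a placeholder for a device port of my `kern_1`, `flat_index` is a hypothetical row-major version of my `position`, and the `d_*` pointers are device copies of the host arrays made with `cudaMalloc`/`cudaMemcpy`.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical row-major flattening, standing in for my position().
__device__ int flat_index(int i, int j, int k, int ny, int nz)
{
    return (i * ny + j) * nz + k;
}

// Placeholder for a device port of kern_1 (an exponential); the real
// body would be copied here. The expression below is just a stand-in.
__device__ double kern_1_dev(int i, int j, int k, int l, int m, int p)
{
    double dx = i - l, dy = j - m, dz = k - p;
    return exp(-(dx * dx + dy * dy + dz * dz));
}

// One thread computes one output point (l, m, p).
__global__ void iterate_kernel(const double* d_K1, double* d_K2,
                               const double* d_wx, const double* d_wy,
                               const double* d_wz, const double* d_table,
                               int nx, int ny, int nz)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= nx * ny * nz) return;

    // Recover (l, m, p) from the flat thread index.
    int l = idx / (ny * nz);
    int m = (idx / nz) % ny;
    int p = idx % nz;

    double sum_1 = 0.0, sum_2 = 0.0;
    for (int i = 0; i < nx; i++)
        for (int j = 0; j < ny; j++)
            for (int k = 0; k < nz; k++)
                if (i != l || j != m || k != p) {
                    double phir = d_wx[i] * d_wy[j] * d_wz[k]
                                * kern_1_dev(i, j, k, l, m, p);
                    sum_1 += phir * d_K1[flat_index(i, j, k, ny, nz)];
                    sum_2 += phir;
                }

    d_K2[idx] = sum_1 + (d_table[idx] - sum_2) * d_K1[idx];
}
```

A launch would then look like `iterate_kernel<<<(n_total + 255) / 256, 256>>>(...)` with `n_total = nx * ny * nz`. Since every thread does the full inner sum, this would remove the outer serial loops entirely, which is where I hope the speedup comes from. Is this mapping correct?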
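
For the second question, here is what I found so far and would like confirmed. The GPU cannot read ordinary pageable host memory on its own, but there seem to be ways around copying everything. I also noticed that this function only ever reads the `table_kernel[position(l, m, p)]` entries, so if that is really all I need per call, that slice is only `nx * ny * nz` doubles and could simply be copied with `cudaMemcpy`. If the whole 50 GB table is needed, the options I know of are `cudaMallocManaged` (unified memory, which can oversubscribe device memory on Linux with Pascal or newer cards) or mapped pinned memory, sketched below. Both make host RAM GPU-addressable at PCIe speeds rather than device-memory speeds.

```cuda
#include <cuda_runtime.h>

// Sketch: expose an existing large host allocation to the GPU without
// copying it to device memory. cudaHostRegister pins the pages;
// cudaHostGetDevicePointer returns a device-side alias for them.
// Kernel reads through this pointer travel over the PCIe bus on demand,
// which is much slower than device memory, so this presumably only
// makes sense when accesses are sparse. Error checking omitted.
double* map_host_table(double* host_table, size_t n_entries)
{
    double* dev_alias = nullptr;
    cudaHostRegister(host_table, n_entries * sizeof(double),
                     cudaHostRegisterMapped);
    cudaHostGetDevicePointer((void**)&dev_alias, host_table, 0);
    return dev_alias;  // pass this to kernels instead of a cudaMalloc'd pointer
}
```

Whether pinning tens of GB of RAM is feasible probably depends on the machine; streaming the table in chunks with `cudaMemcpyAsync` would be my fallback. Is one of these the right approach?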