
When I have a single kernels region around the top-level loop, why can't I use these two directives?

#pragma acc update device(hbias[0:n_hidden],W[0:n_hidden][0:n_visible])
#pragma acc update device(vbias[0:n_visible])

I need to update the variables hbias, vbias, and W in the code below, but it won't work:

void RBM::contrastive_divergence(int train_X[6][6], double learning_rate, int k) {
        double r = rand() / (RAND_MAX + 1.0);

        int * input = new int[n_visible];
        double *ph_mean = new double[n_hidden];
        int *ph_sample = new int[n_hidden];
        double *nv_means = new double[n_visible];
        int *nv_samples = new int[n_visible];
        double *nh_means = new double[n_hidden];
        int *nh_samples = new int[n_hidden];

        #pragma acc kernels
        for (int i = 0; i<train_N; i++) {


            for (int j = 0; j< n_visible; j++){
                input[j] = train_X[i][j];
            }


            sample_h_given_v(input, ph_mean, ph_sample,r);

            for (int step = 0; step<k; step++) {
                if (step == 0) {
                    gibbs_hvh(ph_sample, nv_means, nv_samples, nh_means, nh_samples,r);
                }
                else {
                    gibbs_hvh(nh_samples, nv_means, nv_samples, nh_means, nh_samples,r);
                }
            }


            for (int i = 0; i<n_hidden; i++) {
                for (int j = 0; j<n_visible; j++) {

                 W[i][j] += learning_rate * (ph_mean[i] * input[j] - nh_means[i] * nv_samples[j]) / N;

                }
                hbias[i] += learning_rate * (ph_sample[i] - nh_means[i]) / N;

            }
    //this directive
       #pragma acc update device(hbias[0:n_hidden],W[0:n_hidden][0:n_visible])


            for (int i = 0; i<n_visible; i++) {
                vbias[i] += learning_rate * (input[i] - nv_samples[i]) / N;
            }
    //and this directive
       #pragma acc update device(vbias[0:n_visible])
     }

        delete[] input;
        delete[] ph_mean;
        delete[] ph_sample;
        delete[] nv_means;
        delete[] nv_samples;
        delete[] nh_means;
        delete[] nh_samples;
    }

But when I have several separate kernels regions, one around each nested loop, I can update the variables:

void RBM::contrastive_divergence(int train_X[6][6], double learning_rate, int k) {
    double r = rand() / (RAND_MAX + 1.0);

        int * input = new int[n_visible];
        double *ph_mean = new double[n_hidden];
        int *ph_sample = new int[n_hidden];
        double *nv_means = new double[n_visible];
        int *nv_samples = new int[n_visible];
        double *nh_means = new double[n_hidden];
        int *nh_samples = new int[n_hidden];


    for (int i = 0; i<train_N; i++) {

            #pragma acc kernels
                for (int j = 0; j< n_visible; j++){
                    input[j] = train_X[i][j];
                }


                sample_h_given_v(input, ph_mean, ph_sample,r);
            #pragma acc kernels
                for (int step = 0; step<k; step++) {
                    if (step == 0) {
                        gibbs_hvh(ph_sample, nv_means, nv_samples, nh_means, nh_samples,r);
                    }
                    else {
                        gibbs_hvh(nh_samples, nv_means, nv_samples, nh_means, nh_samples,r);
                    }
                }

            #pragma acc kernels
            {  
                for (int i = 0; i<n_hidden; i++) {
                    for (int j = 0; j<n_visible; j++) {

                        W[i][j] += learning_rate * (ph_mean[i] * input[j] - nh_means[i] * nv_samples[j]) / N;

                    }
                hbias[i] += learning_rate * (ph_sample[i] - nh_means[i]) / N;

                }
        //this directive
            #pragma acc update device(hbias[0:n_hidden],W[0:n_hidden][0:n_visible])
            }


            #pragma acc kernels
            {
                for (int i = 0; i<n_visible; i++) {
                    vbias[i] += learning_rate * (input[i] - nv_samples[i]) / N;
                }

            //and this directive
                #pragma acc update device(vbias[0:n_visible])
            }
     }

        delete[] input;
        delete[] ph_mean;
        delete[] ph_sample;
        delete[] nv_means;
        delete[] nv_samples;
        delete[] nh_means;
        delete[] nh_samples;
    }
  • What compiler are you using? If PGI, can you please post the output of -Minfo=accel? It looks like this should work. What about if you add a data region immediately outside of the kernels? This shouldn't be necessary, but might help. – jefflarkin Jan 16 '17 at 14:58
  • Yes, I use the PGI compiler. Basically, I need to do a reduction operation on some variables, but that also was not accepted by the compiler. I need to synchronize some variables' values after each iteration is done; otherwise, the result will be wrong. I'll try adding a data region directive and see what I get. Thanks – Alwaleed A. Hamam Jan 16 '17 at 17:37
  • I used the command $ pgc++ -fast -acc -ta=tesla:managed -Minfo=accel -o task2 ./RBM.cpp && echo "Compiled Successfully!" on the kernels without any additional directives, and the output was the following: – Alwaleed A. Hamam Jan 16 '17 at 18:07
  • If you're using -ta=tesla:managed then the update directives are going to be ignored and the data movement will be triggered as data migrations from the CUDA driver. – jefflarkin Jan 18 '17 at 14:32

1 Answer


"Update" directives can only be used in host code since data movement must be initiated from the host. You can't have them within a compute region.

There are many issues with this code. First, it's poor practice to use the same index variable, "i" in this case, for nested loops. Although the scoping rules allow it, it makes it difficult to tell which "i" the code is supposed to be using.

The outer "i" loop is probably not safe to parallelize, so you shouldn't be putting the "kernels" directive outside this loop. If you privatized the "input" array and then used atomics when updating the vbias, hbias, and W arrays, it might work, but your performance would be poor. (You'd also need to determine whether the other arrays should be privatized, or are global and therefore need atomic operations.)
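That variant might look roughly like the sketch below (illustrative only; the sampling steps are elided, and those arrays would likely need private copies or atomics as well):

#pragma acc parallel loop private(input[0:n_visible])
for (int i = 0; i < train_N; i++) {
    for (int j = 0; j < n_visible; j++)
        input[j] = train_X[i][j];

    // ... sampling and Gibbs steps, with ph_mean, ph_sample,
    //     nv_* and nh_* privatized or handled atomically ...

    for (int h = 0; h < n_hidden; h++) {
        // Atomic update because every iteration of "i" writes hbias.
        #pragma acc atomic update
        hbias[h] += learning_rate * (ph_sample[h] - nh_means[h]) / N;
    }
}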

What I'd suggest is to start by putting "#pragma acc parallel loop" around the inner loops, one at a time. Make sure each works before moving on to the next one. Also, I highly doubt the "step" loop is parallelizable, so you'll most likely need to parallelize the loops inside the "gibbs_hvh" subroutine instead.
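For example, the weight-update nest is a natural first candidate. A sketch, with the "hbias" update split into its own loop so the two weight loops can be collapsed:

#pragma acc parallel loop collapse(2)
for (int h = 0; h < n_hidden; h++) {
    for (int j = 0; j < n_visible; j++) {
        W[h][j] += learning_rate
                 * (ph_mean[h] * input[j] - nh_means[h] * nv_samples[j]) / N;
    }
}

#pragma acc parallel loop
for (int h = 0; h < n_hidden; h++) {
    hbias[h] += learning_rate * (ph_sample[h] - nh_means[h]) / N;
}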

Since you're using CUDA Unified Memory (-ta=tesla:managed), adding data regions probably isn't necessary. However, if you plan to stop using managed memory in the future, the next step would be to add data directives around the outer "i" loop (or at a higher point in the program), and then use an update directive to synchronize the data after the outer "i" loop.
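In that case the structure might look roughly like this (a sketch; the 2-D shape notation for "W" assumes it is declared as in the question's directives):

#pragma acc data copy(W[0:n_hidden][0:n_visible], \
                      hbias[0:n_hidden], vbias[0:n_visible])
{
    for (int i = 0; i < train_N; i++) {
        // ... parallelized inner loops ...
    }
}
// The arrays are copied back to the host when the data region ends.
// If host code needs the values earlier, an explicit
// "#pragma acc update self(...)" can be issued from host code first.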

Mat Colgrove
  • Note that Alwaleed sent me his code offline to see what we could do. Parallelizing and offloading the inner loops worked fine but was quite slow due to the small workload size (6). The outer "train_N" loop is not parallelizable due to a dependency on the "W" array: it's used in the sample_h_given_v routine to initialize the ph_mean array, but then updated later in the main body of the loop. While we can use atomics to solve the problem of updating the shared "W", "hbias", and "vbias" arrays, the dependency on "W" will prevent parallelization (or at least prevent getting correct answers). – Mat Colgrove Jan 18 '17 at 16:38
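Schematically, the loop-carried dependency described in that comment looks like this (routine bodies elided; the full code is in the question):

for (int i = 0; i < train_N; i++) {
    sample_h_given_v(input, ph_mean, ph_sample, r);  // reads W
    // ... Gibbs steps ...
    // W[h][j] += ...                                // writes W
}
// Iteration i+1 reads the W that iteration i just wrote, so the
// "train_N" iterations cannot safely execute in parallel.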