
General Information

NOTE: I am fairly new to both C and OpenACC.

Hi, I am trying to develop an image blurring program, but first I wanted to see whether I could parallelize the for loops and move my values with copyin/copyout.

The problem I am currently facing occurs when I try to copyin and copyout my data and output variables. The error looks like a buffer overflow (I googled it, and that is what people have said), but I am not sure how to go about fixing it. I think I am doing something wrong with the pointers, but I am not sure.

Thanks so much in advance. If you think I missed some information, please let me know and I can provide it.

Question

  1. What does the error actually mean?
  2. How should I go about fixing the issue?
  3. Is there anything I should look into so that I can fix this kind of issue myself in the future?

Error

FATAL ERROR: variable in data clause is partially present on the device: name=output
file:/nfs/u50/singhn8/4F03/A3/main.c ProcessImageACC line:48
output lives at 0x7ffca75f6288 size 16 not present
Present table dump for device[1]: NVIDIA Tesla GPU 1, compute capability 3.5
host:0x7fe98eaf9010 device:0xb05dc0000 size:2073600 presentcount:1 line:47 name:(null)
host:0x7fe98f0e8010 device:0xb05bc0000 size:2073600 presentcount:1 line:47 name:(null)
host:0x7ffca75f6158 device:0xb05ac0400 size:4 presentcount:1 line:47 name:filterRad
host:0x7ffca75f615c device:0xb05ac0000 size:4 presentcount:1 line:47 name:row
host:0x7ffca75f6208 device:0xb05ac0200 size:4 presentcount:1 line:47 name:col
host:0x7ffca75f6280 device:0xb05ac0600 size:16 presentcount:1 line:48 name:data

Program Definition

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

#include <openacc.h>

// ================================================
// ppmFile.h
// ================================================
#include <sys/types.h>
typedef struct Image
{
  int            width;
  int            height;
  unsigned char *data;
} Image;
Image* ImageCreate(int width,
                   int height);
Image* ImageRead(char *filename);
void   ImageWrite(Image *image,
                  char  *filename);
int    ImageWidth(Image *image);
int    ImageHeight(Image *image);
void   ImageClear(Image        *image,
                  unsigned char red,
                  unsigned char green,
                  unsigned char blue);
void ImageSetPixel(Image        *image,
                   int           x,
                   int           y,
                   int           chan,
                   unsigned char val);
unsigned char ImageGetPixel(Image *image,
                            int    x,
                            int    y,
                            int    chan);

Blur Filter Function

// ================================================
// The Blur Filter
// ================================================

void ProcessImageACC(Image **data, int filterRad, Image **output) {
  int row = (*data)->height;
  int col = (*data)->width;

  #pragma acc data copyin(row, col, filterRad, (*data)->data[0:row * col]) copyout((*output)->data[0:row * col])
  #pragma acc kernels
  {
    #pragma acc loop independent
    for (int j = 0; j < row; j++) {
      #pragma acc loop independent
      for (int i = 0; i < col; i++) {
        (*output)->data[j * row + i] = (*data)->data[j * row + i];
      }
    }
  }
}

Main Function

// ================================================
// Main Program
// ================================================
int main(int argc, char *argv[]) {
  // vars used for processing:
  Image *data, *result;
  int    dataSize;
  int    filterRadius = atoi(argv[1]);

  // ===read the data===
  data = ImageRead(argv[2]);

  // ===send data to nodes===
  // send data size in bytes
  dataSize = sizeof(unsigned char) * data->width * data->height * 3;

  // ===process the image===
  // allocate space to store result
  result         = (Image *)malloc(sizeof(Image));
  result->data   = (unsigned char *)malloc(dataSize);
  result->width  = data->width;
  result->height = data->height;

  // initialize all to 0
  for (int i = 0; i < (result->width * result->height * 3); i++) {
    result->data[i] = 0;
  }

  // apply the filter
  ProcessImageACC(&data, filterRadius, &result);

  // ===save the data back===
  ImageWrite(result, argv[3]);

  return 0;
}
Navleen Singh

1 Answer


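The problem here is that, in addition to the data arrays, the output and data pointers need to be copied over as well. In the present table from your error, data occupies 16 bytes starting at 0x7ffca75f6280, while output starts 8 bytes later at 0x7ffca75f6288, so the 16-byte range for output overlaps an existing entry without being fully contained in it. That overlap is what "partially present" reports; it is not a buffer overflow in your own code. From the compiler feedback messages, you can see the compiler implicitly copying the pointers over.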

% pgcc -c image.c -ta=tesla:cc70 -Minfo=accel
ProcessImageACC:
     46, Generating copyout(output->->data[:col*row])
         Generating copyin(data->->data[:col*row],col,filterRad,row)
     47, Generating implicit copyout(output[:1])
         Generating implicit copyin(data[:1])
     50, Loop is parallelizable
     52, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         50, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
         52, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */

Now you might be able to get this to work by using unstructured data regions to create both the data and pointers, and then "attach" the pointers to the arrays (i.e. fill in the value of the device pointers to the address of the device data array).
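
The unstructured-data approach described above might look roughly like the following. This is a minimal sketch, not a verified fix for the original program: it assumes a compiler implementing OpenACC 2.6 or later, where copying a member array whose parent struct is already present also attaches the device pointer. The function name copy_image is hypothetical, the Image struct mirrors the one in the question, and the single-channel size n = width * height follows the question's data clauses. Built without OpenACC support, the pragmas are ignored and the loop simply runs on the host.

```c
#include <stdlib.h>

/* Same layout as the Image struct in the question. */
typedef struct Image {
  int            width;
  int            height;
  unsigned char *data;
} Image;

/* Hypothetical helper: copies in->data to out->data through the structs
 * themselves, using unstructured data regions. */
void copy_image(Image *in, Image *out) {
  int n = in->width * in->height;

  /* Step 1: create device copies of the structs (shallow copy; the data
   * members still hold host addresses at this point). */
  #pragma acc enter data copyin(in[0:1], out[0:1])

  /* Step 2: create the arrays on the device. Because the parent structs are
   * already present, the runtime is assumed to attach them, i.e. to make the
   * device copies of in->data and out->data point at the device arrays. */
  #pragma acc enter data copyin(in->data[0:n]) create(out->data[0:n])

  #pragma acc kernels present(in[0:1], out[0:1])
  {
    #pragma acc loop independent
    for (int k = 0; k < n; k++) {
      out->data[k] = in->data[k];
    }
  }

  /* Step 3: bring the result back and release device memory, arrays first,
   * then the structs that point at them. */
  #pragma acc exit data copyout(out->data[0:n]) delete(in->data[0:n])
  #pragma acc exit data delete(in[0:1], out[0:1])
}
```

The order matters in this sketch: parents are created before children and deleted after them, so the attach and detach steps always have a device struct to act on.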

An easier option, though, is to create temporary pointers to the data arrays and then copy the data to the device through those pointers. This should also improve the performance of your code (both on the GPU and CPU), since it eliminates the extra levels of indirection.

void ProcessImageACC(Image **data, int filterRad, Image **output) {
  int row = (*data)->height;
  int col = (*data)->width;
  unsigned char * ddata, * odata;
  odata = (*output)->data;
  ddata = (*data)->data;

  #pragma acc data copyin(ddata[0:row * col]) copyout(odata[0:row * col])
  #pragma acc kernels
  {
    #pragma acc loop independent
    for (int j = 0; j < row; j++) {
      #pragma acc loop independent
      for (int i = 0; i < col; i++) {
        // row-major index: the stride is the row length (col), not the row count
        odata[j * col + i] = ddata[j * col + i];
      }
    }
  }
}
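
To check the temporary-pointer version on a small non-square image, a standalone variant along these lines may help. It is only a sketch under stated assumptions: copy_pixels and its parameters are hypothetical names, the layout is assumed to be row-major, and the channels argument is an assumption based on the `* 3` in the dataSize computation in main (the data clauses in the question transfer only row * col bytes). Compiled without OpenACC support, the pragmas are ignored and the loops run on the host, which makes the indexing easy to verify.

```c
#include <stddef.h>

/* Hypothetical helper: copies a height x width image that stores
 * `channels` bytes per pixel in row-major order. */
void copy_pixels(const unsigned char *src, unsigned char *dst,
                 int height, int width, int channels) {
  int stride = width * channels;  /* bytes in one image row */
  int n      = height * stride;   /* total bytes to transfer */

  /* Plain pointers in the data clauses: no struct or pointer-to-pointer
   * has to be present on the device. */
  #pragma acc data copyin(src[0:n]) copyout(dst[0:n])
  #pragma acc kernels
  {
    #pragma acc loop independent
    for (int j = 0; j < height; j++) {
      #pragma acc loop independent
      for (int i = 0; i < stride; i++) {
        /* Row j starts at j * stride, so every index stays below n. */
        dst[j * stride + i] = src[j * stride + i];
      }
    }
  }
}
```

Called as copy_pixels((*data)->data, (*output)->data, row, col, 3), this would cover the whole buffer allocated in main, assuming the image really holds 3 bytes per pixel.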

Note that scalars are firstprivate by default so there's no need to add the row, col, and filterRad variables in the data clause.

Mat Colgrove