
Here is the code which reproduces the unexplained behavior:

main.cpp

#include <iostream>
#include <openacc.h>

extern "C" int findme(float *ARRAY);

int main(){
    float *ARRAY = new float [10];
    int position;

    ARRAY[0] = 97.7302;
    ARRAY[1] = 108.154;
    ARRAY[2] = 99.8558;
    ARRAY[3] = 88.9383;
    ARRAY[4] = 98.4755;
    ARRAY[5] = 109.186;
    ARRAY[6] = 107.205;
    ARRAY[7] = 110.886;
    ARRAY[8] = 86.0737;
    ARRAY[9] = 94.9976;

    #pragma acc enter data copyin(ARRAY[0:10])

    #pragma acc host_data use_device(ARRAY)
        position = findme(ARRAY);

    #pragma acc exit data delete(ARRAY[0:10])

    std::cout << position << std::endl;

    delete[] ARRAY;  // release the host allocation
}

findme.cu

#include <thrust/binary_search.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/sort.h>

extern "C" int findme(float *ARRAY){
    thrust::device_ptr<float> ARRAY_ptr(ARRAY);
    thrust::device_vector<float> ARRAY_SORTED(10);
    thrust::copy(ARRAY_ptr, ARRAY_ptr+10, ARRAY_SORTED.begin());
    thrust::sort(ARRAY_SORTED.begin(), ARRAY_SORTED.end(), thrust::less<float>());
    thrust::device_vector<float>::iterator iter = thrust::lower_bound(ARRAY_SORTED.begin(), ARRAY_SORTED.end(), 100);

    // Index of the first element >= 100 in the sorted copy.
    return static_cast<int>(iter - ARRAY_SORTED.begin());
}
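As a cross-check on the index that `findme` should return, here is a minimal host-only sketch of the same sort-then-`lower_bound` computation using the standard library instead of Thrust. The name `findme_host` and the `n` and `threshold` parameters are assumptions made for illustration; they are not part of the code above, and this version involves no device transfers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only counterpart of findme(): sorts a copy of the input
// in ascending order and returns the index of the first element that is not
// less than `threshold` (or n if every element is smaller).
int findme_host(const float *array, std::size_t n, float threshold) {
    assert(array != nullptr);

    // Work on a copy so the caller's data is left untouched, mirroring the
    // separate ARRAY_SORTED vector in the Thrust version.
    std::vector<float> sorted(array, array + n);
    std::sort(sorted.begin(), sorted.end());

    // Binary search on the sorted copy.
    std::vector<float>::iterator iter =
        std::lower_bound(sorted.begin(), sorted.end(), threshold);

    return static_cast<int>(iter - sorted.begin());
}
```

With the ten values from main.cpp and a threshold of 100, six values are below 100, so `findme_host(ARRAY, 10, 100.0f)` returns 6. The device version should print the same number.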

There is one device-to-host transfer of 1 byte (there is also an 8-byte device-to-host transfer, which I guess is the `position` int, but why 8 bytes instead of 4?)...

... and one host-to-device transfer of 4 bytes...

... which I cannot explain. Note that the copy-in of `ARRAY` does not appear in the screenshots, but it is listed in the Details tab as a host-to-device transfer (40 bytes). Any clues as to exactly what data is being transferred? Are these transfers inherent to the Thrust algorithms and therefore unavoidable?

lodhb
  • I have absolutely no idea how this works, but it seems like the first thing is a pointer to the `ARRAY`, and the second is the transfer back of the `position`. But this is not OpenCL, so I don't quite know how it works. – Mats Petersson Oct 11 '14 at 13:42
  • Hi Mats, are you saying that you think the 1 byte device-to-host transfer is a pointer to `ARRAY`, while the 4 byte host-to-device transfer is the value of the integer `position`? If so, the data sizes and direction of transfers don't make sense. – lodhb Oct 11 '14 at 13:58
  • Sorry, I was convinced your pictures were in the order that they executed, but I now see that the time is different. The first one shows something that happens AFTER the 40 byte transfer, and the second one shows a 40 byte transfer 11ms before and a 4B transfer about 30 ms before the 8B transfer. The 1B transfer is after the 40B transfer, but before the 8 byte transfer. So, don't really know what is happening here. My guess would be that the 1B transfer is some sort of admin thing, and the same for the 8B transfer (for example "I'm starting this work" and "I took X ns to complete", perhaps) – Mats Petersson Oct 11 '14 at 14:14
  • This is my suspicion too, but I was hoping that a Thrust expert (or even contributor) might be able to confirm this. Or maybe there's some explicit data transfer in my code that I'm not seeing? – lodhb Oct 11 '14 at 14:46
  • Try summoning [Jared Hoberock](http://stackoverflow.com/users/722294) :) – user703016 Oct 11 '14 at 15:39
  • Hard to say what's going on without isolating the individual calls to `cudaMemcpy`. You could try to step through the code in a debugger -- grep the Thrust source for `cudaMemcpy` and set break points there. I don't really know what's going on in the implementation of `sort` as it is implemented with a call to CUB, but it's possible CUB is deciding how many passes radix sort needs by communicating single bytes of data between host & device. – Jared Hoberock Oct 13 '14 at 00:05

0 Answers