2

I would like to use Thrust's stream compaction functionality (copy_if) for distilling indices of elements from a vector if the elements adhere to a number of constraints. One of these constraints depends on the values of neighboring elements (8 in 2D and 26 in 3D). My question is: how can I obtain the neighbors of an element in Thrust?

The function call operator of the functor for the 'copy_if' basically looks like:

__host__ __device__ bool operator()(float x) {
    bool mark = x < 0.0f;
    if (mark) {
        if (left neighbor of x > 1.0f) return false;
        if (right neighbor of x > 1.0f) return false;
        if (top neighbor of x > 1.0f) return false;
        //etc.
    }
    return mark;
}

Currently I use a work-around by first launching a CUDA kernel (in which it is easy to access neighbors) to appropriately mark the elements. After that, I pass the marked elements to Thrust's copy_if to distill the indices of the marked elements.


I came across counting_iterator as a sort of substitute for directly using threadIdx and blockIdx to acquire the index of the processed element. I tried the solution below, but when compiling it, it gives me a "/usr/include/cuda/thrust/detail/device/cuda/copy_if.inl(151): Error: Unaligned memory accesses not supported". As far as I know I'm not trying to access memory in an unaligned fashion. Anybody knows what's going on and/or how to fix this?

struct IsEmpty2 {
    float* xi;

    IsEmpty2(float* pXi) { xi = pXi; }

    __host__ __device__ bool operator()(thrust::tuple<float, int> t) {
        bool mark = thrust::get<0>(t) < -0.01f;
        if (mark) {
            int countindex = thrust::get<1>(t);
            if (xi[countindex] > 1.01f) return false;
            //etc.
        }
        return mark;
    }
};


thrust::copy_if(indices.begin(),
                indices.end(),
                thrust::make_zip_iterator(thrust::make_tuple(xi, thrust::counting_iterator<int>())),
                indicesEmptied.begin(),
                IsEmpty2(rawXi));
casperOne
  • 73,706
  • 19
  • 184
  • 253
Bart
  • 166
  • 5
  • You may provide the address of the global array in the constructor of the functor class. You may also access threadIdx.x, and also can use shared memory in the __device__ method above (you should use shared memory in your case I think). – phoad Oct 04 '12 at 18:19
  • Passing a pointer to the global array to the constructor of the functor is possible and then store it as member variable. But that won't solve my problem. Which elements of the global array should I access then? I only have the float x to my disposal, not a pointer to x. As for using shared memory, that is not necessary in my case. I only load the data once and check it for a certain value. – Bart Oct 05 '12 at 09:53
  • Zip iterator in Thrust may solve the problem of finding the location of the float value in the global array. You may either use the threadIdx.x values, but need to approach little differently. About using shared memory, because you need the top, left, right, bottom etc, there will be overlap in the required data per value and you may make the loading of these data from global memory to registers just once using locality, namely shared memory (global mem cache may be ok though). – phoad Oct 05 '12 at 14:29

1 Answers1

1

@phoad: you're right about the shared mem, it struck me after I already posted my reply, subsequently thinking that the cache probably will help me. But you beat me with your quick response. The if-statement however is executed in less than 5% of all cases, so either using shared mem or relying on the cache will probably have negligible impact on performance.

Tuples only support 10 values, so that would mean I would require tuples of tuples for the 26 values in the 3D case. Working with tuples and zip_iterator was already quite cumbersome, so I'll pass for this option (also from a code readability stand point). I tried your suggestion by directly using threadIdx.x etc. in the device function, but Thrust doesn't like that. I seem to be getting some unexplainable results and sometimes I end up with an Thrust error. The following program for example generates a 'thrust::system::system_error' with an 'unspecified launch failure', although it first correctly prints "Processing 10" to "Processing 41":

struct printf_functor {
    __host__ __device__ void operator()(int e) {
        printf("Processing %d\n", threadIdx.x);
    }
};

int main() {
    thrust::device_vector<int> dVec(32);
    for (int i = 0; i < 32; ++i)
        dVec[i] = i + 10;

    thrust::for_each(dVec.begin(), dVec.end(), printf_functor());

    return 0;
}

Same applies to printing blockIdx.x Printing blockDim.x however generates no error. I was hoping for a clean solution, but I guess I am stuck with my current work-around solution.

Bart
  • 166
  • 5