My program has lots of 4-byte strings, like "aaaa", "bbbb", "cccc"... I need to collect the particular strings that pass a CRC check.

Because there's very little chance that a string passes the CRC check, I don't want to use a very BIG buffer to hold all the results. I'd rather have the results concatenated one after another, just like the input. For example, if the input is "aaaabbbbcccc" and "bbbb" doesn't pass the CRC check, the output string should be "aaaacccc" and output_count should be 2.

The code looks like:

__device__
bool is_crc_correct(char* str, int len) {
    return true; // for simplicity, just return 'true';
}

// arguments:
// input: a sequence of 4-bytes-string, eg: aaaabbbbccccdddd....
__global__
void func(char* input, int* output, int* output_count) {
    unsigned int index = blockDim.x*blockIdx.x + threadIdx.x;

    if(is_crc_correct(input + 4*index, 4)) {
        // copy the string
        memcpy(output + (*output_count)*4,
               input + 4*index,
               4);
        // increase the counter
        (*output_count)++;
    }
}

Obviously the memory copy is not thread-safe. I know the atomicAdd function can be used for the ++ operation, but how do I make both output and output_count thread-safe?

I believe you are trying to reinvent *stream compaction*, and in particular the *gather* operation, in a very inefficient way. Parallel programming often requires different thinking. For example, you avoid races rather than trying to solve them with atomics and locks (serialization rather defeats the purpose of parallelization). You could probably use [thrust::copy_if](https://thrust.github.io/doc/group__stream__compaction.html). – Ivan Aksamentov - Drop Nov 10 '15 at 10:48

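For reference, a minimal sketch of what the suggested thrust::copy_if approach could look like, assuming the packed 4-byte strings can be reinterpreted as 32-bit words and with passes_crc as a placeholder for the real check (these names are illustrative, not from the original post):

#include <thrust/device_vector.h>
#include <thrust/copy.h>

// Predicate: placeholder for the real CRC check on one 4-byte word.
struct passes_crc {
    __host__ __device__
    bool operator()(unsigned int word) const {
        return true; // for simplicity, accept everything
    }
};

// d_input holds the packed 4-byte strings; d_output must be at least as large.
// copy_if compacts the passing words into d_output, preserving their order,
// and the returned iterator gives output_count.
int compact(const thrust::device_vector<unsigned int>& d_input,
            thrust::device_vector<unsigned int>& d_output) {
    auto end = thrust::copy_if(d_input.begin(), d_input.end(),
                               d_output.begin(), passes_crc());
    return static_cast<int>(end - d_output.begin());
}
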
2 Answers

What you are looking for is a lock-free linear allocator. The usual way of doing this is to keep an atomically incremented counter that is used to index into a buffer. For example, in your case, the following should work:

__device__
char* allocate(char* buffer, int* elements) {
    // Here, the size of the allocated segment is always 4.
    // In a more general use case you would atomicAdd the requested size.
    // atomicAdd returns the previous value, i.e. this thread's slot index.
    return buffer + atomicAdd(elements, 1) * 4;
}

It can then be used like this:

__global__
void func(char* input, int* output, int* output_count) {
    unsigned int index = blockDim.x*blockIdx.x + threadIdx.x;

    if(is_crc_correct(input + 4*index, 4)) {
        // Reserve the output buffer.
        char* dst = allocate(output, output_count);
        memcpy(dst, input + 4 * index, 4);
    }
}

While this is perfectly thread-safe, it is not guaranteed to preserve the input order. For example, "ccccaaaa" would be a valid output.


As Drop has mentioned in their comment, what you are trying to do is effectively a stream compaction (and Thrust likely already provides what you need).

The code I posted above could be further optimized by first aggregating the output strings per warp rather than allocating directly from the global buffer. This would reduce global atomic contention and likely lead to better performance. For an explanation of how to do this, I invite you to read the following article: CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics.
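
As a rough illustration of the warp-aggregated idea described in that article (a sketch only, assuming CUDA 9+ warp intrinsics and a 32-thread warp; it is not the article's code), the allocator above could be replaced by something like:

// Sketch: one atomicAdd per warp instead of one per thread.
__device__
char* allocate_warp_aggregated(char* buffer, int* elements) {
    unsigned int active = __activemask();               // lanes that want a slot
    int lane   = threadIdx.x & 31;
    int leader = __ffs(active) - 1;                      // lowest active lane
    int rank   = __popc(active & ((1u << lane) - 1));    // my position in the group

    int base = 0;
    if (lane == leader)
        base = atomicAdd(elements, __popc(active));      // one atomic for the warp
    base = __shfl_sync(active, base, leader);            // broadcast base offset

    return buffer + (base + rank) * 4;
}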

– user703016

I might end up going to hell for suggesting this, but how about allocating memory dynamically inside the kernel? See this question/answer for an example: CUDA allocate memory in __device__ function

You would then have to pass a shared memory array to each kernel, and after the kernel has run, each element of the array would either point to a piece of dynamically allocated memory or be NULL. After your thread blocks have run, you would run a final cleanup kernel, on a single thread, to build the final string.

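A minimal sketch of how that could look, assuming device-side malloc is available (compute capability 2.0 and later) and using an illustrative results array of per-thread pointers (these names are not from the answer):

__global__
void func(char* input, char** results) {
    unsigned int index = blockDim.x*blockIdx.x + threadIdx.x;

    char* dst = NULL;
    if(is_crc_correct(input + 4*index, 4)) {
        dst = (char*)malloc(4);           // device-side heap allocation
        memcpy(dst, input + 4*index, 4);
    }
    results[index] = dst;                 // NULL marks a rejected string
}

The cleanup pass would then walk results on a single thread, concatenate the non-NULL entries in order, and free each allocation.
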
– endian

Thanks, I didn't even know that the kernel can use malloc/free. I remember I've seen a compile error like "can't use host function in device code"; maybe it was the old CUDA 3.x. Now I tried it with CUDA 6.5 and it compiles. – aj3423 Nov 10 '15 at 10:35