1

the CUDA Programming Guide said that any atomic operation can be implemented using atomicCAS(), and gives an example of atomic double add:

__device__ float single(double *address,double val)
{
unsigned long long int *address_as_ull =(unsigned long long int*)address;
unsigned long long int assumed;
unsigned long long int old = *address_as_ull;

do
{
    assumed = old;
    old = atomicCAS(address_as_ull,assumed,__double_as_longlong(val + __longlong_as_double(assumed)));
}while(assumed !=old);
   return __longlong_as_double(old);
}

now,I face the problem that:

I want to write a function that can operate two variables address atomically.

for example: atomic add about two variable

input

double *address_1, int *address_2
double val_1,int val_2

result

*address_1 = *address_1+val_1;
*address_2 = *address_2+val_2;

how can I deal with the problem? thanks.

einpoklum
  • 118,144
  • 57
  • 340
  • 684
Xu Hui
  • 1,213
  • 1
  • 11
  • 24
  • 1
    What is wrong with calling the atomic addition function twice? From the description in your question, the two operations are independent. – talonmies Jan 11 '19 at 07:52
  • here is an simple example, actually in this case do the operation twice is ok ~ but in my situation, i need the atomic function with two address. do you have any suggestion? thanks~ – Xu Hui Jan 11 '19 at 08:00
  • Nothing like that exists – talonmies Jan 11 '19 at 08:03
  • I think using a structure to store the two variable may be help. you see the double is transfer into ull, but I have no idea how to deal with the unsigned long long int and structure. – Xu Hui Jan 11 '19 at 08:07

3 Answers3

2

In general, you cannot do this. The hardware does not support atomic changes to multiple locations in memory. While you can circumvent this if both variables are small enough to fit into the size of a single atomic operation - this approach will fail if you have more than 8 bytes overall. You will encounter the "too much milk" problem.

One thing you could do is have some kind of synchronization protocol for accessing these two values. For example, you could use a mutex, which only one thread can obtain, to safely know that nobody else is changing the values while that thread is working on them. See: Avoid taking a long time to finish the 'too much milk' scenario.

Of course, this is quite expensive in a GPU setting. You had probably better do one of the following (by increasing order of favorability):

  • Use a pointer or an index into a larger array, and instead of changing the structure atomically, switch the pointer atomically. This resolves the concurrency issue, but makes accesses slower.
  • Change your algorithm so that accesses can be separated and don't have to happen atomically.
  • Change your algorithm further, to avoid having multiple threads writing to a single complex data structure.
einpoklum
  • 118,144
  • 57
  • 340
  • 684
1

I think you're missing the point of the operation implemented here. In a+=b, the logical operation is a = a + b, but with CAS you avoid spurious changes to a between its read and its write. b is used once and not a problem.

In a = b + c, none of the values appear twice, so there's no need to protect against any changes in between.

MSalters
  • 173,980
  • 10
  • 155
  • 350
0

thanks all the guys reply me! I now have the solution. we can combine the two variable into a structure. so we can transfer "two variable with two address" into "one structure with one address". here is the code:

#include <stdio.h>
struct pair_t
{
    float x;
    int y;
};

__device__ float single(double *address,double val)
{   

    unsigned long long int *address_as_ull =(unsigned long long int*)address;
    unsigned long long int assumed;
    unsigned long long int old = *address_as_ull;

    do
    {
        assumed = old;
        old = atomicCAS(address_as_ull,assumed,__double_as_longlong(val + __longlong_as_double(assumed)));
    }while(assumed !=old);
    return __longlong_as_double(old);
}



__device__ void myadd(pair_t *address, double val_1 ,int val_2)
{   
    union myunion
    {  
        pair_t p;
        unsigned long long int ull;
    };

    unsigned long long int *address_as_ull;
    address_as_ull = (unsigned long long int *)address;

    union myunion assumed;
    union myunion old_value;
    union myunion new_value;

    old_value.p = *(pair_t *)address_as_ull;

    do
    {
        assumed = old_value;
        // cirtical area begin--------------------
        new_value.p.x = assumed.p.x+val_1;
        new_value.p.y = assumed.p.y+val_2;
        // cirtical area end----------------------

        old_value.ull = atomicCAS(address_as_ull,assumed.ull,new_value.ull);
    }while(assumed.ull !=old_value.ull);
}


__global__ void kernel (pair_t *p)
{
    myadd(p,1.5,2);
}

int main()
{
    pair_t p;
    p.x=0;
    p.y=0;
    pair_t *d_p = NULL;
    cudaMalloc((pair_t **)&d_p, sizeof(pair_t));
    cudaMemcpy(d_p, &p, sizeof(pair_t), cudaMemcpyHostToDevice);

    kernel<<<100, 100>>>(d_p);

    cudaMemcpy(&p, d_p, sizeof(pair_t), cudaMemcpyDeviceToHost);

    cudaDeviceSynchronize();
    printf("x=%lf\n", p.x);
    printf("y=%d\n", p.y);
    cudaDeviceReset();
    return 0;
}

and the solution is

x=15000.000000
y=20000

now everything will be ok~

Xu Hui
  • 1,213
  • 1
  • 11
  • 24
  • 1. This is convoluted. You don't have to go through doubles at all. 2. More generally - this only works since the total size of the structure is small enough for you to act on it atomically. – einpoklum Jan 11 '19 at 21:07