CUDA grid size limitations appear to be lower than they should be

Question

I have a GTX 780. It has compute capability 3.5, according to both Wikipedia and the output of code that queries the device directly. Both also report a limit of 2^31-1 (2147483647) on the grid's x dimension, i.e. the number of blocks. Yet the code below only successfully sets a[0]=1 if blocks < 2^16-1 (65535), which is the limit Wikipedia lists for compute capability 2.x and older.
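For reference, a device query of the kind mentioned above can be sketched as follows. This is a minimal sketch, not necessarily the code that was actually run; it assumes the GPU of interest is device 0.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  cudaDeviceProp prop;
  // Device index 0 is an assumption; adjust it on multi-GPU systems.
  cudaError_t err = cudaGetDeviceProperties(&prop, 0);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                 cudaGetErrorString(err));
    return 1;
  }
  // Report what the hardware itself supports.
  std::printf("name = %s\n", prop.name);
  std::printf("compute capability = %d.%d\n", prop.major, prop.minor);
  std::printf("maxGridSize = (%d, %d, %d)\n",
              prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
  return 0;
}
```

These values describe the hardware; they do not depend on the architecture the program was compiled for.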

#include <iostream>
#include <string>

#define print(x) cout << #x << " = " << x << endl;
#define arg_read(pos, init) argc>pos? stoi(argv[pos]): init;

using namespace std;

__global__ void f(int* a)
{
  a[0] = 1;
}

int main(int argc, char* argv[])
{
  int blocks = arg_read(1, 1);
  int* a;
  cudaMalloc((void**) &a, sizeof(int)); //allocate a on the device
  int b=100;
  cudaMemcpy(a, &b, sizeof(int), cudaMemcpyHostToDevice); //copy b to a
  f<<<blocks, 1>>>(a); //set a[0] = 1
  cudaMemcpy(&b, a, sizeof(int), cudaMemcpyDeviceToHost); //copy a back to b
  print(b);
}

Isn't it a race condition and undefined behavior when all threads try writing to the same location? – huseyin tugrul buyukisik Apr 15 '16 at 18:59
Compile with an appropriate architecture switch for your GPU, e.g. `-arch=sm_35`. – Robert Crovella Apr 15 '16 at 18:59
@huseyintugrulbuyukisik No, it is not undefined behavior if all threads are writing the **same value**. After the writes are complete, the location is guaranteed to contain the value written. – Robert Crovella Apr 15 '16 at 19:01

score 0 · Accepted Answer · edited May 23 '17 at 10:33

This was essentially the same problem as in the linked question: I needed to compile with the -arch=sm_35 flag so that nvcc targets compute capability 3.5.

nvcc -arch=sm_35 sandbox.cu -o sandbox.exe
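The code in the question never checks a CUDA return value, so a rejected kernel launch would go unnoticed and simply leave b at 100. A minimal sketch of the same program with error checking is shown below; the `check` helper is hypothetical and not part of the original code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print the CUDA error and exit if a call failed.
static void check(cudaError_t err, const char* what)
{
  if (err != cudaSuccess) {
    std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
    std::exit(1);
  }
}

__global__ void f(int* a)
{
  a[0] = 1; // every thread writes the same value
}

int main(int argc, char* argv[])
{
  // Number of blocks from the command line, defaulting to 1.
  int blocks = argc > 1 ? std::atoi(argv[1]) : 1;
  int* a = nullptr;
  int b = 100;

  check(cudaMalloc((void**)&a, sizeof(int)), "cudaMalloc");
  check(cudaMemcpy(a, &b, sizeof(int), cudaMemcpyHostToDevice),
        "cudaMemcpy to device");

  f<<<blocks, 1>>>(a);
  check(cudaGetLastError(), "kernel launch");         // launch was rejected
  check(cudaDeviceSynchronize(), "kernel execution"); // failure while running

  check(cudaMemcpy(&b, a, sizeof(int), cudaMemcpyDeviceToHost),
        "cudaMemcpy to host");
  std::printf("b = %d\n", b);
  check(cudaFree(a), "cudaFree");
  return 0;
}
```

With these checks, a grid size that the compiled target does not accept should surface as an error message at the "kernel launch" step instead of a silently unchanged b.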

answered Apr 15 '16 at 19:00 by David Lerner · edited May 23 '17 at 10:33 by Community

It's not the same problem as the question you link. The linked question has a block size that is too large (for any GPU). Once the block and grid sizes were reversed, the code would run on any GPU (it doesn't exceed any limits), and there is no mention of compiling with an arch switch there, nor any difference with that code if it is compiled with or without any arch switch. – Robert Crovella Apr 15 '16 at 19:09
