
I have a GTX 780. It has compute capability 3.5, according to both Wikipedia and the output of code querying the device directly. Both also give a maximum grid x-dimension of 2^31-1 (2147483647). Yet the code below only successfully sets a[0] = 1 when blocks stays under 65535 (2^16-1), the limit Wikipedia lists for compute capability 2.x and older.
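For reference, a minimal sketch of the kind of device query I'm using (assuming device 0 and the standard cudaGetDeviceProperties call; the output formatting is just for illustration):

#include <cuda_runtime.h>
#include <iostream>

int main()
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0); // query properties of device 0
  std::cout << "compute capability: " << prop.major << "." << prop.minor << std::endl;
  std::cout << "max grid x-dimension: " << prop.maxGridSize[0] << std::endl;
}

And here is the reproducer: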

#include <iostream>
#include <string>

#define print(x) cout << #x << " = " << x << endl
#define arg_read(pos, init) (argc > pos ? stoi(argv[pos]) : init)

using namespace std;

__global__ void f(int* a)
{
  a[0] = 1;
}

int main(int argc, char* argv[])
{
  int blocks = arg_read(1, 1);
  int* a;
  cudaMalloc((void**) &a, sizeof(int)); //allocate a on the device
  int b=100;
  cudaMemcpy(a, &b, sizeof(int), cudaMemcpyHostToDevice); //copy b to a
  f<<<blocks, 1>>>(a); //set a[0] = 1
  cudaMemcpy(&b, a, sizeof(int), cudaMemcpyDeviceToHost); //copy a back to b
  print(b);
}
– David Lerner

1 Answer


It was essentially the same problem as here. I needed to compile specifically with the -arch=sm_35 switch.

nvcc -arch=sm_35 sandbox.cu -o sandbox.exe
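A related note: the failed launch is silent in the code above because there is no error checking. A minimal sketch of checking the launch (as a drop-in replacement for the f<<<blocks, 1>>>(a) line in the question) would have made the failure visible right away:

f<<<blocks, 1>>>(a);
cudaError_t err = cudaGetLastError(); // reports launch-configuration errors (e.g. grid too large for the compiled arch)
if (err != cudaSuccess)
  cerr << "launch failed: " << cudaGetErrorString(err) << endl;
if (cudaDeviceSynchronize() != cudaSuccess) // reports errors raised while the kernel runs
  cerr << "kernel execution failed" << endl;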
– David Lerner
It's not the same problem as the question you link. The linked question has a block size that is too large (for any GPU). Once the block and grid sizes were reversed, the code would run on any GPU (it doesn't exceed any limits), and there is no mention of compiling with an arch switch there, nor any difference with that code if it is compiled with or without any arch switch. – Robert Crovella Apr 15 '16 at 19:09
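To illustrate the distinction in that comment (the numbers below are only examples, not taken from the linked question):

// Launch syntax is kernel<<<gridDim, blockDim>>>(...).
f<<<65536, 1>>>(a); // 65536 blocks of 1 thread: needs the cc 3.x grid x-dimension limit (and the matching -arch)
f<<<1, 65536>>>(a); // 1 block of 65536 threads: exceeds the per-block thread limit on any GPU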