I have a GTX 780. It has compute capability 3.5, according to both Wikipedia and the output of code that queries the device directly. Both also report a limit of 2^31-1 (2147483647) on the x-dimension of the grid, i.e. the number of blocks in x. Yet the code below only successfully sets a[0]=1 if blocks < 2^16-1 (65535). That is the limit Wikipedia lists for compute capability 2.x and older.
#include <iostream>
#include <string>
#define print(x) cout << #x << " = " << (x) << endl
#define arg_read(pos, init) (argc > (pos) ? stoi(argv[pos]) : (init))
using namespace std;
__global__ void f(int* a)
{
    a[0] = 1;
}
int main(int argc, char* argv[])
{
    int blocks = arg_read(1, 1);
    int* a;
    cudaMalloc((void**) &a, sizeof(int)); //allocate a on the device
    int b=100;
    cudaMemcpy(a, &b, sizeof(int), cudaMemcpyHostToDevice); //copy b to a
    f<<<blocks, 1>>>(a); //set a[0] = 1
    cudaMemcpy(&b, a, sizeof(int), cudaMemcpyDeviceToHost); //copy a back to b
    print(b);
}
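
The code above never checks any CUDA return values, so a rejected launch would go unnoticed and b would simply keep its old value of 100. Here is a minimal sketch of the same program with error checking added, which might help show whether the launch itself fails for large block counts. The CHECK macro is a hypothetical helper introduced for illustration, and querying device 0 is an assumption.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the original code): abort with a
// readable message if a CUDA runtime call returns an error.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err));                     \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

__global__ void f(int* a)
{
    a[0] = 1;
}

int main(int argc, char* argv[])
{
    int blocks = argc > 1 ? atoi(argv[1]) : 1;

    // Report what the runtime says about the device (device 0 assumed).
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    printf("compute capability = %d.%d\n", prop.major, prop.minor);
    printf("maxGridSize[0] = %d\n", prop.maxGridSize[0]);

    int* a;
    CHECK(cudaMalloc((void**)&a, sizeof(int)));
    int b = 100;
    CHECK(cudaMemcpy(a, &b, sizeof(int), cudaMemcpyHostToDevice));

    f<<<blocks, 1>>>(a);
    CHECK(cudaGetLastError());       // reports errors from the launch itself
    CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel runs

    CHECK(cudaMemcpy(&b, a, sizeof(int), cudaMemcpyDeviceToHost));
    printf("b = %d\n", b);
    CHECK(cudaFree(a));
    return 0;
}
```

If cudaGetLastError reports an error only when blocks exceeds 65535, the launch configuration is being rejected rather than the kernel failing. In that case the nvcc build flags, for example the target architecture, may be worth checking; that is a guess, not something confirmed here.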