I have this code here (modified based on the answer).
Info (output of --ptxas-options=-v):
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 46 registers, 120 bytes cmem[0], 176 bytes cmem[2], 76 bytes cmem[16]
I don't know what else I need to take into consideration in order to make it work for different combinations of the point counts "numPointsRs" and "numPointsRp".
When, for example, I run the code with Rs=10000 and Rp=100000 with block=(128,1,1), grid=(200,1), it works fine.
My computations:
46 registers * 128 threads = 5888 registers per block.
My card has a limit of 32768 registers, so 32768 / 5888 = 5 and some remainder => 5 blocks/SM (my card's limit is 6).
With the occupancy calculator I found that using 128 threads/block gives me 42% occupancy, so I am within the limits of my card.
Also, the number of threads per MP is 640 (the limit is 1536). (A quick sketch of this arithmetic is just below.)
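Here is roughly how I did that arithmetic (just a Python sketch using the limits I quoted above; the variable names are mine):

regs_per_thread   = 46       # from the ptxas output above
threads_per_block = 128
regs_per_sm       = 32768    # register limit per SM quoted above
max_threads_sm    = 1536     # max resident threads per SM quoted above

regs_per_block = regs_per_thread * threads_per_block        # 5888
blocks_per_sm  = regs_per_sm // regs_per_block               # 5 (register-limited)
threads_per_sm = blocks_per_sm * threads_per_block           # 640
occupancy      = threads_per_sm / float(max_threads_sm)      # ~0.42 (42%)

print(regs_per_block, blocks_per_sm, threads_per_sm, occupancy)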
Now, if I try to use Rs=100000 and Rp=100000 (with the same threads and blocks) it gives me the message in the title, with:
cuEventDestroy failed: launch timeout
cuModuleUnload failed: launch timeout
1) I don't know/understand what else needs to be computed.
2) I can't understand how to choose/compute the number of blocks. I can see that mostly people use (points + threads - 1)/threads, but that still doesn't work for me (see the sketch just below).
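For reference, this is that formula as I understand it (a rough Python sketch; the point count is just an example value):

threads_per_block = 128
num_points = 100000                                                       # e.g. numPointsRp
num_blocks = (num_points + threads_per_block - 1) // threads_per_block    # ceiling division
print(num_blocks)    # 782 -> block=(128,1,1), grid=(num_blocks,1)

Since the kernel below uses a grid-stride loop (k += blockDim.x*gridDim.x), a smaller fixed grid such as grid=(200,1) should still cover all the points (each thread just handles several of them), which is why I am not sure this formula is the real problem.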
--------------UPDATED----------------------------------------------
After adding driver.Context.synchronize(), the code works for many points (1000000)!
But what impact does this addition have on the code? (For many points the screen freezes for a minute or more.) Should I use it or not?
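This is where I put the call (the launch is the same as in the snapshot below; the synchronize line is the only addition):

func(Rs_gpu, np.int32(numPointsRs), Rp_gpu, np.int32(numPointsRp),
     J_gpu, M_gpu, np.complex64(kp), np.complex64(eta),
     Evec_gpu, Hvec_gpu, All_gpu,
     block=(128,1,1), grid=(200,1))
drv.Context.synchronize()    # host waits here until the kernel has finished

As far as I can tell this call only makes the host wait for the GPU; the kernel does the same amount of work with or without it.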
--------------UPDATED2----------------------------------------------
Now the code doesn't work again, without my having changed anything!
Snapshot of code:
import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
import cmath
import pycuda.driver as drv
import pycuda.tools as t
#---- Initialization and passing(allocate memory and transfer data) to GPU -------------------------
Rs_gpu=gpuarray.to_gpu(Rs)
Rp_gpu=gpuarray.to_gpu(Rp)
J_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))
M_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))
Evec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
Hvec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
All_gpu=gpuarray.to_gpu(np.ones(numPointsRp).astype(np.complex64))
#-----------------------------------------------------------------------------------
mod =SourceModule("""
#include <pycuda-complex.hpp>
#include <cmath>
#include <vector>
typedef pycuda::complex<float> cmplx;
typedef float fp3[3];
typedef cmplx cp3[3];
__device__ __constant__ float Pi;
extern "C"{
__device__ void computeEvec(fp3 Rs_mat[], int numPointsRs,
cp3 J[],
cp3 M[],
fp3 Rp,
cmplx kp,
cmplx eta,
cmplx *Evec,
cmplx *Hvec, cmplx *All)
{
int c = 0;                      // index over the source points in Rs
while (c < numPointsRs){
...
c++;
}
}
__global__ void computeEHfields(float *Rs_mat_, int numPointsRs,
float *Rp_mat_, int numPointsRp,
cmplx *J_,
cmplx *M_,
cmplx kp,
cmplx eta,
cmplx E[][3],
cmplx H[][3], cmplx *All )
{
fp3 * Rs_mat=(fp3 *)Rs_mat_;
fp3 * Rp_mat=(fp3 *)Rp_mat_;
cp3 * J=(cp3 *)J_;
cp3 * M=(cp3 *)M_;
int k=threadIdx.x+blockIdx.x*blockDim.x;
while (k<numPointsRp)
{
computeEvec( Rs_mat, numPointsRs, J, M, Rp_mat[k], kp, eta, E[k], H[k], All );
k+=blockDim.x*gridDim.x;
}
}
}
""" ,no_extern_c=1,options=['--ptxas-options=-v'])
#call the function(kernel)
func = mod.get_function("computeEHfields")
func(Rs_gpu, np.int32(numPointsRs), Rp_gpu, np.int32(numPointsRp),
     J_gpu, M_gpu, np.complex64(kp), np.complex64(eta),
     Evec_gpu, Hvec_gpu, All_gpu,
     block=(128,1,1), grid=(200,1))
#----- get data back from GPU-----
Rs=Rs_gpu.get()
Rp=Rp_gpu.get()
J=J_gpu.get()
M=M_gpu.get()
Evec=Evec_gpu.get()
Hvec=Hvec_gpu.get()
All=All_gpu.get()
My card:
Device 0: "GeForce GTX 560"
CUDA Driver Version / Runtime Version 4.20 / 4.10
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073283072 bytes)
( 0) Multiprocessors x (48) CUDA Cores/MP: 0 CUDA Cores  // deviceQuery prints 0 here, but the card has 336 CUDA cores => 7 MPs x 48 cores/MP