
I am using 63 registers/thread, so (32768 being the maximum per SM) I can use about 520 threads. I am using 512 threads in this example.
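(A quick check of that register arithmetic, nothing beyond the numbers above:)

registers_per_thread = 63        # reported by ptxas for my kernel
registers_per_sm     = 32768     # register file of a CC 2.x SM
print(registers_per_sm // registers_per_thread)   # 520, so 512 threads fit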

(The parallelism is in the function "computeEvec" inside the __global__ function computeEHfields.) The problems are:

1) The cuda-memcheck errors below.

2) When I use numPointsRp > 2000 it shows me "out of memory", but (unless I am doing something wrong) I have computed the global memory usage and it should fit.
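For reference, here is roughly how I estimate what my to_gpu calls transfer (a sketch using only the array shapes from the code below, with 10000 points on each side, the size where it fails):

import numpy as np

def input_output_bytes(numPointsRs, numPointsRp):
    # sizes of the arrays actually sent by the gpuarray.to_gpu calls below
    f32, c64 = 4, 8                      # bytes per float32 / complex64
    rs = numPointsRs*3*f32               # Rs
    rp = numPointsRp*3*f32               # Rp
    jm = 2*numPointsRs*3*c64             # J and M
    eh = 2*numPointsRp*3*c64             # Evec and Hvec
    al = numPointsRp*c64                 # All
    return rs+rp+jm+eh+al

print(input_output_bytes(10000, 10000))  # 1280000 bytes, ~1.3 MB: nowhere near 1 GB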

-------------------------------UPDATED---------------------------

I ran the program with cuda-memcheck and it gives me the following (only when numPointsRs > numPointsRp):

========= Invalid global read of size 4
=========     at 0x00000428 in computeEHfields
=========     by thread (2,0,0) in block (0,0,0)
=========     Address 0x4001076e0 is out of bounds
=========
========= Invalid global read of size 4
=========     at 0x00000428 in computeEHfields
=========     by thread (1,0,0) in block (0,0,0)
=========     Address 0x4001076e0 is out of bounds
=========
========= Invalid global read of size 4
=========     at 0x00000428 in computeEHfields
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x4001076e0 is out of bounds
=========
========= ERROR SUMMARY: 160 errors

-----------EDIT----------------------------

Also, sometimes (if I use only threads and not blocks; I haven't tested it with blocks), if for example I have numPointsRs=1000 and numPointsRp=100, then change to numPointsRp=200, and then change back to numPointsRp=100, I do not get the same results as the first run!
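A minimal way to rule out stale device memory between such runs would be to repeat the same output initializations from the code below before every launch:

# recreate the output arrays with their initial contents before each
# launch, so values left over from a previous run cannot leak through
Evec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
Hvec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
All_gpu=gpuarray.to_gpu(np.ones(numPointsRp).astype(np.complex64))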

import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
import cmath
import pycuda.driver as drv


# example sizes and placeholder constants so this listing runs stand-alone;
# numPointsRs/numPointsRp are varied throughout the question, and the real
# kp and eta come from my problem setup (the values here are placeholders)
numPointsRs=1000
numPointsRp=100
kp=np.complex64(1.0+0.1j)
eta=np.complex64(376.7)

Rs=np.zeros((numPointsRs,3)).astype(np.float32)
for k in range (numPointsRs): 
    Rs[k]=[0,k,0]

Rp=np.zeros((numPointsRp,3)).astype(np.float32)
for k in range (numPointsRp): 
    Rp[k]=[1+k,0,0]


#---- Initialization and passing(allocate memory and transfer data) to GPU -------------------------
Rs_gpu=gpuarray.to_gpu(Rs)
Rp_gpu=gpuarray.to_gpu(Rp)


J_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))
M_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))

Evec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
Hvec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
All_gpu=gpuarray.to_gpu(np.ones(numPointsRp).astype(np.complex64))


mod =SourceModule("""
#include <pycuda-complex.hpp>
#include <cmath>
#define RowRsSize %(numrs)d
#define RowRpSize %(numrp)d


typedef  pycuda::complex<float> cmplx;
extern "C"{


    __device__ void computeEvec(float Rs_mat[][3], int numPointsRs,
         cmplx J[][3],
         cmplx M[][3],
         float *Rp,
         cmplx kp,
         cmplx eta,
         cmplx *Evec,
         cmplx *Hvec, cmplx *All)
    {
        int c=0;
        while (c<numPointsRs){
            ...
            c++;
        }
    }


    __global__ void computeEHfields(float *Rs_mat_, int numPointsRs,
        float *Rp_mat_, int numPointsRp,
        cmplx *J_,
        cmplx *M_,
        cmplx kp,
        cmplx eta,
        cmplx E[][3],
        cmplx H[][3], cmplx *All )
    {
        float Rs_mat[RowRsSize][3];
        float Rp_mat[RowRpSize][3];

        cmplx J[RowRsSize][3];
        cmplx M[RowRsSize][3];

        int k=threadIdx.x+blockIdx.x*blockDim.x;

        while (k<numPointsRp)
        {
            computeEvec( Rs_mat, numPointsRs, J, M, Rp_mat[k], kp, eta, E[k], H[k], All );
            k+=blockDim.x*gridDim.x;
        }
    }
}

"""% { "numrs":numPointsRs, "numrp":numPointsRp},no_extern_c=1)


func = mod.get_function("computeEHfields")


func(Rs_gpu,np.int32(numPointsRs),Rp_gpu,np.int32(numPointsRp),J_gpu, M_gpu, np.complex64(kp), np.complex64(eta),Evec_gpu,Hvec_gpu, All_gpu, block=(128,1,1),grid=(200,1))

print(" \n")


#----- get data back from GPU-----
Rs=Rs_gpu.get()
Rp=Rp_gpu.get()
J=J_gpu.get()
M=M_gpu.get()
Evec=Evec_gpu.get()
Hvec=Hvec_gpu.get()
All=All_gpu.get()

--------------------GPU MODEL------------------------------------------------

Device 0: "GeForce GTX 560"
  CUDA Driver Version / Runtime Version          4.20 / 4.10
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 1024 MBytes (1073283072 bytes)
  ( 0) Multiprocessors x (48) CUDA Cores/MP:     0 CUDA Cores   // deviceQuery wrongly printed 0 here; the GTX 560 has 7 MPs x 48 cores/MP = 336 CUDA cores
George
  • Are you copying all the points to the GPU memory at once? What is the size of a point? – Tudor Aug 28 '12 at 12:58
  • The size is int. I am calling another (device) function inside the global function, and that is where I do the parallelism. – George Aug 28 '12 at 13:35
  • Did you calculate how much data you are copying to the device? Maybe you're exceeding the global memory size. – Tudor Aug 28 '12 at 13:37
  • Can you point me in some direction for doing that? Thank you! – George Aug 28 '12 at 13:42
  • Well just look into your program and calculate (by hand) how much memory you are copying. It's not that hard. – Tudor Aug 28 '12 at 13:47
  • OK, I thought it was something else! – George Aug 28 '12 at 13:54
  • Your question doesn't give enough details, your block/grid sizes are valid so there must be something in your kernel or host code causing the error (you don't even say where the error comes from). – Tom Aug 28 '12 at 14:07
  • @Tudor: I did the calculations. I don't exceed the memory. Thanks – George Aug 28 '12 at 14:42
  • @Tom: What other information do you want? Thanks! – George Aug 28 '12 at 14:43
  • @RogerDahl people shouldn't have to claim their votes -- if they did SO would not work. I always comment when I downvote but I definitely don't make it obvious in my comment that I down voted. There is at least one comment in this chain that could be attributed to the downvote, and the question certainly does not provide enough information to be answered. – harrism Aug 29 '12 at 03:17
  • But you are doing retaliatory upvoting, which in many cases just bumps bad questions up to the top of the list. This might help the ego of the poster, but it makes the task (my task) of supporting a developer community more difficult. – harrism Aug 29 '12 at 04:50
  • @harrism: I see it more as canceling out an anonymous and unexplained downvote that was allowed by a design flaw in SO. But I'm changing my mind on entering the fray on the anonymous comments issue. You have a point about the ordering of the questions and maybe many people feel like you do, and I'll just end up annoying everyone. I can always hope that the powers that be will change the way this works some time in the future :) – Roger Dahl Aug 29 '12 at 05:27
  • @George: you haven't said what GPU this is. Is it a Fermi or a Kepler? – talonmies Aug 29 '12 at 06:30
  • @talonmies: It's a Fermi (2.1) – George Aug 29 '12 at 10:14
  • @George: In that case you probably *are* running out of memory. Runtime heap on Fermi and Kepler can use a lot of memory. But I repeat what I said on your earlier, almost identical question. If you can't post a working example that reproduces the problem, it will be almost impossible for anyone to help you. – talonmies Aug 29 '12 at 10:33
  • @talonmies: OK, do you want me to post the whole working program (a little big) so you can run it, or only the global function? I also see that if I set R=10 the program doesn't run either, and shows me "pycuda._driver.LogicError: cuFuncSetBlockShape failed: invalid value". Thanks – George Aug 29 '12 at 10:36
  • @George: I don't want to see hundreds of lines of code. I want to see a *concise* case that reproduces the problem. If you can't do that, you haven't thought about this enough. And that last error means you have an invalid block dimension in your function call. – talonmies Aug 29 '12 at 11:23
  • @talonmies: OK, I put up some code. The parallelism is in the global function, in the function "computeEvec". Thanks! – George Aug 29 '12 at 11:31
  • @talonmies: OK, the last error (invalid value) was because I had left the block at (numPointsRp/20), which was for Rp=10000. – George Aug 29 '12 at 11:57
  • What is your GPU model name? Telling us the compute capability is not enough, because there are multiple GPU models (and memory sizes) with the same CC. – harrism Aug 31 '12 at 00:19
  • @harrism: I updated. About the GPU cores: deviceQuery returned "0", but I searched NVIDIA's specs and found it has 336 cores. – George Aug 31 '12 at 11:25
  • @harrism: I can see (for pointsRp=1000) that if I use block=(16,16,1) it runs OK, but if I use block=(100,10,1) it shows "out of resources". But the problem still remains for 10000 points, where it gives me "out of memory". – George Aug 31 '12 at 17:01
  • @George: The cuda-memcheck output in your recent edit is helpful. It confirms what I told you months ago in your older question: you have an index calculation problem somewhere leading to an out-of-bounds memory access. But if you can't show the *actual* kernel code you are running when that happens, we can't help solve this. If showing the actual code requires posting hundreds of lines of code, then start by removing parts of the kernel which are not causing a problem until you are left with something someone else could read, understand and compile, along with exact launch parameters. – talonmies Sep 01 '12 at 16:01
  • @talonmies: First of all, thanks for the interest. I updated; I hope you now have the information you want (I have the kernel which runs when this happens). Thanks – George Sep 01 '12 at 16:30
  • @George: Where is the kernel code? I see some code, but it contains lots of `...` which obviously is not the code which produces the errors you are asking about because it could not be compiled. – talonmies Sep 01 '12 at 17:14
  • @talonmies: Here is the whole, running code. I repeat that the part where the parallelism is, is in the global function, at the function 'computeEvec'. – George Sep 01 '12 at 18:32

2 Answers


Using R=1000 and then

block=(R/2,1,1) and grid=(1,1): everything is OK.

If I try R=10000 and

block=(R/20,1,1) and grid=(20,1), it shows me "out of memory".

I'm not familiar with pycuda and didn't read into your code too deeply. However, you have more blocks and more threads, so the kernel will use more:

  • local memory (probably the kernel's stack, it's allocated per thread),

  • shared memory (allocated per block), or

  • global memory that gets allocated based on the grid size (gridDim).

You can reduce the stack size by calling

cudaDeviceSetLimit(cudaLimitStackSize, N);

(the code is for the C runtime API, but the pycuda equivalent shouldn't be too hard to find).
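For what it's worth, a sketch of the pycuda side, assuming a pycuda version recent enough to wrap cuCtxSetLimit (pycuda.driver.Context.set_limit / get_limit and pycuda.driver.limit are the names in pycuda's driver documentation):

import pycuda.autoinit
import pycuda.driver as drv

# shrink the per-thread stack so the launch-time local-memory
# reservation (stack bytes x resident threads) gets smaller
drv.Context.set_limit(drv.limit.STACK_SIZE, 1024)    # bytes per thread
print(drv.Context.get_limit(drv.limit.STACK_SIZE))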

Dude
  • Hello and thanks for the help. I can see (from the ptxas information) that the program runs OK up to a stack frame of about 82000 bytes, but not beyond that. Also, reducing the stack size gives me the same results. – George Sep 01 '12 at 10:38
  • I also can't understand this: I parallelise over numPointsRp, but if I increase numPointsRs to 10000, for example, it shows me "cuLaunchKernel failed: invalid value". – George Sep 01 '12 at 13:55
  • A 1x1x1 grid makes no sense - you try to run all threads on just a single SM. Why do you have to copy the input redundantly to local memory? The code won't scale this way, so don't waste your time trying to make it run with hammer and crowbar. Instead try to really understand the device you are targeting, its types of memory, the C/C++ ABI and its limits, and rewrite your program accordingly (see the CUDA documentation, or maybe my answer in [this post](http://stackoverflow.com/questions/12172279/how-the-kernel-is-launched-in-cuda/12187565#12187565) can help to get you started). – Dude Sep 02 '12 at 00:30

When I use numPointsRp > 2000 it shows me "out of memory"

Now that we have some real code to work with, let's compile it and see what happens. Using RowRsSize=2000 and RowRpSize=200 and compiling with the CUDA 4.2 toolchain, I get:

nvcc -arch=sm_21 -Xcompiler="-D RowRsSize=2000 -D RowRpSize=200" -Xptxas="-v" -c -I./ kivekset.cu 
ptxas info    : Compiling entry function '_Z15computeEHfieldsPfiS_iPN6pycuda7complexIfEES3_S2_S2_PA3_S2_S5_S3_' for 'sm_21'
ptxas info    : Function properties for _Z15computeEHfieldsPfiS_iPN6pycuda7complexIfEES3_S2_S2_PA3_S2_S5_S3_
    122432 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 57 registers, 84 bytes cmem[0], 168 bytes cmem[2], 76 bytes cmem[16]

The key numbers are 57 registers and 122432 bytes of stack frame per thread. The occupancy calculator suggests that a block of 512 threads will have a maximum of 1 block per SM, and your GPU has 7 SMs. This gives a total of 122432 * 512 * 7 = 438796288 bytes of stack frame (local memory) to run your kernel, before you have allocated a single byte of memory for input and output using pyCUDA. On a GPU with 1Gb of memory, it isn't hard to imagine running out. Your kernel has an enormous local memory footprint. Start thinking about ways to reduce it.
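A quick cross-check of that arithmetic, using only the numbers quoted above:

stack_per_thread  = 122432   # "bytes stack frame" from ptxas
threads_per_block = 512
blocks_per_sm     = 1        # from the occupancy calculator
num_sms           = 7        # GTX 560

print(stack_per_thread * threads_per_block * blocks_per_sm * num_sms)
# 438796288 bytes: ~418 MiB of the 1 GB card gone before any pyCUDA allocation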


As I indicated in comments, it is absolutely unclear why every thread needs a complete copy of the input data in this kernel code. It results in a gigantic local memory footprint and there seems to be absolutely no reason why the code should be written in this way. You could, I suspect, modify the kernel to something like this:

typedef  pycuda::complex<float> cmplx;
typedef float fp3[3];
typedef cmplx cp3[3];

__global__  
void computeEHfields2(
        float *Rs_mat_, int numPointsRs,
        float *Rp_mat_, int numPointsRp,
        cmplx *J_,
        cmplx *M_,
        cmplx  kp, 
        cmplx  eta,
        cmplx E[][3],
        cmplx H[][3], 
        cmplx *All )
{

    fp3 * Rs_mat = (fp3 *)Rs_mat_;
    cp3 * J = (cp3 *)J_;
    cp3 * M = (cp3 *)M_;

    int k=threadIdx.x+blockIdx.x*blockDim.x;
    while (k<numPointsRp)  
    {
        fp3 * Rp_mat = (fp3 *)(Rp_mat_+k);
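        // NB: as written this advances Rp_mat_ by k floats, not by k
        // 3-float points; (fp3 *)Rp_mat_ + k would step in whole points
        // (see the NaN discussion in the comments below)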
        computeEvec2( Rs_mat, numPointsRs, J, M, *Rp_mat, kp, eta, E[k], H[k], All );
        k+=blockDim.x*gridDim.x;
    }
}

and the main __device__ function it calls to something like this:

__device__ void computeEvec2(
        fp3 Rs_mat[], int numPointsRs,   
        cp3 J[],
        cp3 M[],
        fp3   Rp,
        cmplx kp, 
        cmplx eta,
        cmplx *Evec,
        cmplx *Hvec, 
        cmplx *All)
{
 ....
}

and eliminate every byte of thread local memory without changing the functionality of the computational code at all.
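If the rewrite behaves as intended, recompiling with the same nvcc -Xptxas="-v" command shown above should show the 122432-byte per-thread stack frame collapse to almost nothing.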

talonmies
  • So it's badly designed code, right? I need to alter it in order to use shared memory? Can you give me some tips for that (in my code), from your experience I mean, without taking up your time? Also, I get 63 registers; why that difference? – George Sep 02 '12 at 12:46
  • And last, you said the 438796288 bytes are allocated before the program even runs. So when I run it, is the memory needed for my input matrices added on top of those 438796288 bytes? – George Sep 02 '12 at 12:49
  • On the memory question, yes, that is how it works. But this code is a complete train wreck. Why does *every* thread load a local memory copy of the complete input data? That is insanity, especially when most of the contents of `Rp_mat` loaded by any given thread are *never* used. Why have local memory copies at all? Why not just read from the arrays directly? So much of this code makes so little sense that it is impossible for me to even begin to suggest how to "fix" it. – talonmies Sep 02 '12 at 15:51
  • The problem is that I didn't originally design this code for CUDA but for C++; that's why it doesn't follow good parallelism practices. So I must design it again. Thank you very much for your help. (Can you give me an example of what you mean by "Why not just read from the arrays directly"?) – George Sep 02 '12 at 15:57
  • I think I made a mistake! I pressed the +100 bounty in order to give it to you, but it went to the other answer! A thousand sorrys, talonmies! – George Sep 02 '12 at 16:02
  • I flagged it and left a message for the moderators; I hope they will find a way! – George Sep 02 '12 at 16:12
  • @George: I really don't care about your bounty. It is probably because you edited your question so much that it was converted to a community wiki question, and so my answer was also made into a wiki entry, which is not eligible for reputation. But have a look at my edit, and then I suggest going away and thinking about this some more. It seems you are vastly overcomplicating things for reasons that are not obvious. – talonmies Sep 02 '12 at 16:45
  • Talonmies, thank you very much for your help again. I have one problem though: inside computeEHfields2, as you wrote it, you have "fp3 * Rp_mat = (fp3 *)(Rp_mat_+k);". When I run it I get some results with NaN values; when I omit it the results are OK. Also, you need to add "fp3 * Rp_mat = (fp3 *)Rp_mat_;". And finally, I don't get the ptxas info any more as before, only the first time I run the program (I am altering the parameters; I know it is cached). – George Sep 03 '12 at 09:39
  • One last thing (this would be very useful if I understood it): I am trying to make the code work for 1000000 Rp points. For (512,1,1) threads and (200,1) blocks (and some more combinations) it shows me "a clean-up operation failed (dead context maybe?)". So I compute: I have 57 registers * 512 threads = 29184 registers. My card's limit is 32768 registers, so 32768/29184 = 1 block/SM with some left over. (My card can have up to 6 blocks/SM, so I could try fewer threads per block, but then occupancy drops.) Let it be 1 block/SM. Now I have 1 block/SM * 512 threads/block * 200 blocks = 102400 threads. Here I give up; – George Sep 03 '12 at 16:11
  • I can't understand what more to do or why it isn't running. What else must I take into account? I have read that we can use blocks=(threads-1+pointsRp)/threads, but this doesn't work either... – George Sep 03 '12 at 16:13
  • @George: I wrote that code in 5 minutes in the browser without even trying to compile it. It is supposed to give you an idea about how you might make this work. It isn't intended to be copied and used verbatim. I have no idea if it would work or any way of testing it. I also have no intention of debugging your code in comments on this question, I am afraid. – talonmies Sep 03 '12 at 18:46