I am doing some fun testing of my NVIDIA GPU. I set up a small program that runs some operations on the Intel CPU and then runs them again by vectorizing them on the GPU. It is cool because I get about an 80x speedup when I use the GPU. However, with some simple programs I have been able to blow right past the memory constraint, and one of my programs trashes the kernel.
import tensorflow as tf
import numpy as np
from timeit import default_timer as timer
from numba import vectorize
import gc

# Compile as a CUDA ufunc: two float32 inputs, one float32 output
@vectorize(['float32(float32, float32)'], target='cuda')
def loopGPU(vStart, vEnd):
    vPosition = vStart
    vOriginal = vStart
    # Count up by one per iteration
    for vStart in range(vEnd):
        vPosition = vPosition + 1
    # Device-side print (limited formatting on the CUDA target)
    print('Counted from: ', vOriginal, ' to: ', vEnd, ' by: 1 ')
    return vPosition

def loop2aTrillion(vStart, vEnd):
    start = timer()
    vNewStart = loopGPU(vStart, vEnd)
    duration = timer() - start
    print('GPU Loop Time', duration)
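For the CPU side of the comparison (where the roughly 80x figure comes from), I just time the same kind of loop in plain Python; this is a simplified sketch, not my exact code:

from timeit import default_timer as timer

def loopCPU(vStart, vEnd):
    # Same counting loop as loopGPU, run as ordinary Python on the CPU
    vPosition = vStart
    for _ in range(int(vEnd)):
        vPosition = vPosition + 1
    return vPosition

def loop2aTrillionCPU(vStart, vEnd):
    start = timer()
    loopCPU(vStart, vEnd)
    print('CPU Loop Time', timer() - start)

And on the GPU: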
>>> loop2aTrillion(1, 100000000)
Counted from:  1.000000  to:  100000000.000000  by: 1
GPU Loop Time 3.7991681657728336
So it works great up to about 100 million. But if I go to a billion, or even 500 million, I get this amazing error from CUDA.
Here is my attempt to find where the breaking point is:
>>> loop2aTrillion(1, 100)
Counted from:  1.000000  to:  100.000000  by: 1
GPU Loop Time 0.10281342493556167
>>> loop2aTrillion(1, 1000000)
Counted from:  1.000000  to:  1000000.000000  by: 1
GPU Loop Time 0.048263507262902294
>>> loop2aTrillion(1, 100000000)
Counted from:  1.000000  to:  100000000.000000  by: 1
GPU Loop Time 3.7804056272377693
>>> loop2aTrillion(1, 100000000)
Counted from:  1.000000  to:  100000000.000000  by: 1
GPU Loop Time 3.797257584627573
>>> loop2aTrillion(1, 1000000000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in loop2aTrillion
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\numba\cuda\dispatcher.py", line 88, in __call__
    return CUDAUFuncMechanism.call(self.functions, args, kws)
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\numba\npyufunc\deviceufunc.py", line 311, in call
    return devout.copy_to_host().reshape(outshape)
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\numba\cuda\cudadrv\devices.py", line 212, in _require_cuda_context
    return fn(*args, **kws)
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\numba\cuda\cudadrv\devicearray.py", line 252, in copy_to_host
    _driver.device_to_host(hostary, self, self.alloc_size, stream=stream)
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\numba\cuda\cudadrv\driver.py", line 1776, in device_to_host
    fn(host_pointer(dst), device_pointer(src), size, *varargs)
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\numba\cuda\cudadrv\driver.py", line 288, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "D:\Anaconda\envs\tensorflow\lib\site-packages\numba\cuda\cudadrv\driver.py", line 323, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemcpyDtoH results in CUDA_ERROR_LAUNCH_FAILED
What is fun about this error is that the kernel is trashed and I have to start over. "It must be destroyed"... love it!!!!
CUDA_ERROR_LAUNCH_FAILED An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory. The context cannot be used, so it must be destroyed (and a new one should be created). All existing device memory allocations from this context are invalid and must be reconstructed if the program is to continue using CUDA.
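If I read that right, the Numba-level version of "destroy it and create a new one" would be something like the sketch below. reset_gpu is my own name, and I have not verified that this actually rescues a session after CUDA_ERROR_LAUNCH_FAILED; I am only assuming cuda.close() and cuda.select_device() behave as documented:

from numba import cuda

def reset_gpu():
    # Tear down the (now unusable) CUDA context owned by this thread...
    cuda.close()
    # ...then touch the device again so a fresh context gets created.
    cuda.select_device(0)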
OK, so now you know my background. Here are my questions.

1. How do I know how much is too much ("offline")? Is there some Python math or library that is good at telling me where I stand, memory-wise, while I am working interactively?
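Is something along these lines the right way to ask? (I am assuming cuda.current_context().get_memory_info() reports free and total bytes for the active context.)

from numba import cuda

# Ask the driver, via Numba, how much device memory is currently free
free_bytes, total_bytes = cuda.current_context().get_memory_info()
print('GPU memory free: %.0f MiB of %.0f MiB'
      % (free_bytes / 2**20, total_bytes / 2**20))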
2. Can I check for this situation "online"? Either before I call the vectorized function, can I estimate, live, whether I am going to blow up memory and trash my kernel? Or can I do it inside the vectorized procedure and get more control or insight into the memory allocation?
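What I am imagining is a rough pre-flight check like the sketch below. will_it_fit is a hypothetical helper, and the buffer count (two inputs plus one output for the ufunc) and safety margin are just my guesses about what gets allocated:

import numpy as np
from numba import cuda

def will_it_fit(n_elements, n_buffers=3, dtype=np.float32, safety=0.9):
    # Guess: the ufunc needs n_buffers device arrays of n_elements each,
    # and we keep a safety margin for CUDA's own allocations.
    needed = n_elements * np.dtype(dtype).itemsize * n_buffers
    free, _total = cuda.current_context().get_memory_info()
    return needed < free * safety

# Usage idea, before launching:
# if will_it_fit(1_000_000_000):
#     loop2aTrillion(1, 1000000000)
# else:
#     print('Would not fit on the device, skipping')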
3. Will a try/except block save me? Or is the memory allocation so deep inside CUDA that once it blows up, it is gone, and the Python wrapper cannot save the kernel... "It MUST BE DESTROYED"...
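This is the kind of thing I mean. My understanding (possibly wrong) is that the CudaAPIError is catchable in Python, but the context behind it is already dead by then, so all I could do here is log it and rebuild:

from numba.cuda.cudadrv.driver import CudaAPIError

try:
    loop2aTrillion(1, 1000000000)
except CudaAPIError as e:
    # The Python-level exception is catchable, but (per the docs quoted
    # above) the CUDA context is already invalid at this point, so this
    # only lets me log and clean up, not keep using the GPU as-is.
    print('CUDA call failed:', e)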
Thanks! As you can see, I am just trying to push how many calculations I can run through these GPUs.