storing values between iterations (cache-like mechanism) in pyCUDA

Question

Good morning all,

I am fairly new to CUDA/pyCUDA, so this probably has an easy solution using some mechanism that I don't know about.

I am employing pycuda to operate over pairs of values: I subtract the smallest from the biggest and then perform some time-consuming operations. As it must be repeated many times, it is well suited for GPUs.

However, most of the time the result of the subtraction is the same, so repeating the time-consuming operations makes no sense. What I do in the non-GPU version of my code is something like:

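myDictionary = {}  # cache shared across calls

def myFunction(A, B):
    index = A - B
    try:
        # Reuse the cached result when this difference has been seen before.
        value = myDictionary[index]
    except KeyError:
        # First occurrence: compute once and store for later calls.
        value = expensiveOperation(index)
        myDictionary[index] = value
    return value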

As accessing the dictionary is much faster than expensiveOperation, and the value is found most of the time, I obtain a significant time gain.

When porting this to GPUs, I can call myFunction(A,B) with a high degree of parallelism, which is great. However, I don't know how I could use this dictionary mechanism, or a similar one, to avoid redundant operations.

Any thoughts on this?

Thanks for your help

edit: I would like to know, is it possible to store the dictionary on the GPU, or should I copy it every time? If it's on the GPU, can it be accessed/edited by several cores at the same time? How should I implement it?

So your question is actually about memoisation mechanisms in GPU code? — talonmies, Feb 06 '14 at 07:20

score 1 · Answer 1 · answered Feb 04 '14 at 11:45


You could try this:

def myFunction(A, B):
    index = A - B
    # Membership test on the dict itself; no need to build the key list.
    if index in myDictionary:
        value = myDictionary[index]
    else:
        value = expensiveOperation(index)
        myDictionary[index] = value
    return value


venpa


The problem for me is not the "try/except", but how to implement the dictionary mechanism in pyCUDA. Should I pass it to the kernel every iteration? Is it possible to keep it in the GPU memory, thus avoiding the need to copy? Can it be concurrently accessed/updated by several GPU nodes at the same time? – Fary El Grande Feb 04 '14 at 13:09

score 0 · Accepted Answer · edited May 23 '17 at 11:49

It seems your question is about implementing some sort of memoisation facility inside GPU code. I don't think this is worth pursuing. On the GPU, arithmetic operations are almost free, but memory access is very expensive (and random memory access even more so). Performing a dictionary/hash table look-up in GPU memory to retrieve an arithmetic result from a cache is almost guaranteed to be slower than just calculating the result. It sounds counter-intuitive, but that is the reality of GPU computing.

In an interpreted language like Python, which is relatively slow, using a fast native memoisation mechanism makes a lot of sense, and memoising the results of a complete kernel function call also could yield useful performance benefits for expensive kernels. But memoisation inside CUDA doesn't seem all that useful.
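As a rough illustration of memoising a complete kernel call on the host side, here is a minimal sketch. It assumes the whole GPU computation for one index can be wrapped in a single Python function; run_expensive_kernel is a hypothetical stand-in written in pure Python so the example runs without a GPU, and in real code its body would be the pyCUDA kernel launch plus the copy back to the host. The names launch_count, cached_kernel_result and my_function are also made up for this example.

```python
from functools import lru_cache

launch_count = 0  # counts how often the stand-in "kernel" actually runs


def run_expensive_kernel(index):
    """Hypothetical stand-in for a full pyCUDA kernel launch plus copy-back."""
    global launch_count
    launch_count += 1
    # Placeholder arithmetic; the real work would happen on the GPU.
    return sum(i * i for i in range(abs(index) + 1))


@lru_cache(maxsize=None)
def cached_kernel_result(index):
    # The cache lives in host memory; only a cache miss reaches the kernel.
    return run_expensive_kernel(index)


def my_function(a, b):
    # Same cache key as in the question: the difference of the pair.
    return cached_kernel_result(a - b)


pairs = [(10, 7), (5, 2), (8, 5), (9, 1)]
results = [my_function(a, b) for a, b in pairs]
```

With these four pairs only two distinct differences occur, so the stand-in runs twice and the other two calls are served from the host-side cache. Note that lru_cache needs hashable arguments, so a NumPy array would have to be converted (for example to a tuple) before it could be used as a key.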
