In the worst case, does this sample allocate testCnt * xArray.Length bytes of storage in GPU global memory? How can we make sure that just one copy of the array is transferred to the device? The GpuManaged attribute seems to serve this purpose, but it doesn't solve our unexpected memory consumption.

void Worker(int ix, byte[] array)
{
    // process array - only read access
}

void Run()
{
    var xArray = new byte[100];
    var testCnt = 10;
    Gpu.Default.For(0, testCnt, ix => Worker(ix, xArray));
}

EDIT

The main question in a more precise form: does each worker thread get a fresh copy of xArray, or is there only one copy of xArray for all threads?

Attila Karoly

2 Answers

Your sample code should allocate 100 bytes of memory on the GPU and 100 bytes of memory on the CPU. (.NET adds a bit of overhead, but we can ignore that.)

Since you're using implicit memory, some resources need to be allocated to track that memory (basically, where it lives: CPU or GPU).

I assume what you're actually seeing is higher memory consumption on the CPU side.

A possible reason for that is kernel compilation happening on the fly. AleaGPU has to compile your IL code into LLVM; that LLVM is fed into the CUDA compiler, which in turn converts it into PTX. This happens when you run a kernel for the first time. All of the resources and unmanaged DLLs are loaded into memory.

That's possibly what you're seeing.

testCnt has no effect on the amount of memory being allocated.
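As a rough CPU-only analogy (plain .NET `Parallel.For`, not Alea GPU), the sketch below illustrates why the iteration count does not multiply the array: a lambda that captures `xArray` captures the reference, so every iteration works on the same storage. The class name `SharedArrayDemo` and the method `CountVisibleWrites` are hypothetical, and the worker writes to the array only to make the sharing observable. On the GPU the contents additionally have to be transferred to the device, but, as stated above, that allocation is 100 bytes regardless of testCnt.

```csharp
using System;
using System.Threading.Tasks;

public static class SharedArrayDemo
{
    // Each iteration writes to its own index, so there is no data race.
    static void Worker(int ix, byte[] array)
    {
        array[ix] = 1;
    }

    // Returns how many of the workers' writes are visible in the caller's array.
    public static int CountVisibleWrites(int testCnt)
    {
        var xArray = new byte[100];

        // The lambda captures the reference to xArray, not a copy of its contents.
        Parallel.For(0, testCnt, ix => Worker(ix, xArray));

        var count = 0;
        foreach (var b in xArray)
        {
            count += b;
        }
        return count;
    }

    public static void Main()
    {
        // If every iteration had received its own copy, no write would be
        // visible here and the count would be 0.
        Console.WriteLine(CountVisibleWrites(10)); // prints 10
    }
}
```

This only demonstrates the closure semantics on the host; it says nothing about how long a kernel runs or how Alea GPU schedules transfers.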

EDIT*

One suggestion is to use memory explicitly. It's faster and more efficient:

    private static void Run()
    {
        var input = Gpu.Default.AllocateDevice<byte>(100);
        var deviceptr = input.Ptr;

        Gpu.Default.For(0, input.Length, i => Worker(i, deviceptr));

        Console.WriteLine(string.Join(", ", Gpu.CopyToHost(input)));
    }

    private static void Worker(int ix, deviceptr<byte> array)
    {
        array[ix] = 10;
    }
redb
  • redb, thank you for your answer. I also think that testCnt shouldn't affect GPU memory consumption; however, working with an input size on the order of 100 kB and increasing testCnt, there's a certain point where the screen goes blank for a second, and when it returns a GPU error message appears. testCnt * inputSize is not far from the GPU memory capacity at this point. – Attila Karoly Oct 04 '17 at 07:25
  • What GPU are you using? Does it have enough memory? Is it an integrated GPU on your motherboard? Also, if you try the explicit memory usage as @redb suggested, does the screen still go blank? BTW, with explicit memory you should call `input.Dispose` or use the `using` keyword; otherwise the GC will collect your input, which results in an invalid `deviceptr`. – Xiang Zhang Oct 04 '17 at 08:49
  • Xiang, we've tested our code on various boards, all of which have enough memory provided the input array is transferred to the GPU only once. The explicit model also fails. Where in the explicit model do you suggest using input.Dispose? – Attila Karoly Oct 04 '17 at 09:11
  • @AttilaKaroly what is the error message that you get? – redb Oct 04 '17 at 12:09
  • It starts like this: Unhandled Exception: System.Exception: [CUDAError] CUDA_ERROR_UNKNOWN at Alea.CUDAInterop.cuSafeCall@2939.Invoke(String message) at A.cf5aded17df9f7cc4c132234dda010fa7.Copy@827-22.Invoke(Unit _arg9) at Alea.Memory.Copy(FSharpOption`1 streamOpt, Memory src, IntPtr srcOffset, Memory dst, IntPtr dstOffset, FSharpOption`1 lengthOpt) – Attila Karoly Oct 04 '17 at 14:10
  • Is there any possibility you could share a minimal code sample that reproduces this? Your `Worker` kernel is empty; without seeing its code it is very hard to help you. An error like this could be anything. – redb Oct 04 '17 at 14:14
  • About your edit: all threads get access to the same block of memory. The `xArray` contents are **not copied**. The only thing copied is the address of `xArray`. – redb Oct 04 '17 at 14:19
  • redb, my kernel code is rather long, and my latest idea is indeed that it takes too much time to execute: less than the kernel execution time limit, which is 10 s, but not far from it. My next step is to restructure the algorithm. Thanks to you and Xiang for your kind help. – Attila Karoly Oct 04 '17 at 14:36
Try using explicit memory:

static void Worker(int ix, byte[] array)
{
    // You must write something back. Note that I changed your Worker
    // function to static!
    array[ix] += 1;
}

void Run()
{
    var gpu = Gpu.Default;
    var hostArray = new byte[100];
    // set your host array
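    var deviceArray = gpu.Allocate<byte>(100);
    // deviceArray is of type byte[], but deviceArray.Length is 0;
    // use Gpu.ArrayGetLength to get the device-side length
    System.Diagnostics.Debug.Assert(deviceArray.Length == 0);
    System.Diagnostics.Debug.Assert(Gpu.ArrayGetLength(deviceArray) == 100);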
    Gpu.Copy(hostArray, deviceArray);
    var testCnt = 10;
    gpu.For(0, testCnt, ix => Worker(ix, deviceArray));
    // you must copy memory back
    Gpu.Copy(deviceArray, hostArray);
    // check your result in hostArray
    Gpu.Free(deviceArray);
}
Xiang Zhang