
I have a multithreaded application where I'm doing the following:

Thread1: Reads data from a camera and stores it in an ArrayFire array (CUDA backend).

Thread2: Uses ArrayFire to calculate results from the camera's data.

Thread3: Sends the results over the network; it calls the host function on the array.

The problem is that for the last part, sending the results over the network, I need to copy the data from GPU to CPU using the host function, and this causes jitter (varying delay).

Looking at the ArrayFire source code, I noticed that host uses CUDA stream synchronization, which (if I understand correctly) forces the single CUDA stream shared by all the threads to finish all of its queued work.
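
Here is a minimal sketch of what I believe host effectively does (simplified, not the actual ArrayFire source; hostLike, hostPtr, devicePtr and bytes are placeholders):

#include <arrayfire.h>
#include <af/cuda.h>       // afcu::getStream, afcu::getNativeId
#include <cuda_runtime.h>

void hostLike(void* hostPtr, const void* devicePtr, size_t bytes)
{
    // ArrayFire queues all work for the current device on a single stream.
    cudaStream_t afStream = afcu::getStream(afcu::getNativeId(af::getDevice()));
    // Block until everything queued on that stream has finished,
    // including any kernels another thread has in flight.
    cudaStreamSynchronize(afStream);
    cudaMemcpy(hostPtr, devicePtr, bytes, cudaMemcpyDeviceToHost);
}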

This makes Thread3 wait for Thread2 to finish its ArrayFire calculation (if one is running at the time), and that causes random jitter in Thread3, where I need to call host in order to copy the array to CPU memory.

Are my assumptions correct, and if so, any suggestions on what to do?

parti82
  • Check whether the _memcpyAsync host to device_ copy is synchronized. If you are using a _cudaStream_, you are probably using CUDA's asynchronous functions. I think your problem might be that some CUDA functions are not synchronized. – Jaehong Lee Jan 24 '20 at 05:43
  • I'm only using the ArrayFire API with the CUDA backend. ArrayFire uses CUDA streams internally (a single stream, if I'm not mistaken). I suspect that is the cause of my issues, but I don't know how to work around the problem. – parti82 Jan 24 '20 at 06:14
  • An answer to this question is going to be difficult without looking at the timeline of your application and analyzing the code. There are too many moving parts here, and we would be shooting in the dark to find a solution. Yes, ArrayFire uses one stream, but it's only going to be a problem if the memcpy is taking a significant portion of your time. We have done work with these types of applications for our clients, and it does require some finesse to get it right for each platform. I would suggest you look at the profiler and look for some clues. – Umar Arshad Jan 29 '20 at 06:59
  • Thanks @UmarArshad! I'm using an NVIDIA Tegra TX2, if that helps. Let's assume Thread1 is running at 500 Hz, meaning it stores a new frame from the camera every 2 ms (to pinned memory). Thread2 reads multiple frames at a time, let's assume 10 frames, so it does processing every 20 ms. Thread2's processing could take roughly 10 ms (for those 10 frames, which averages about 1 ms per frame). If Thread3 is sending a new frame to the network, worst case it might need to wait ~10 ms until it can send that single frame. Is there any way to add more CUDA streams to AF? Any suggestions? – parti82 Jan 29 '20 at 19:48

2 Answers


Allocate 4 device pointers and pass two to thread 1 and two to thread 3. These are the buffers you will use to move data between ArrayFire and your threads.

Create 2 streams using the CUDA API. One stream is going to be used by thread 1 and one by thread 3. Also call afcu::getStream to get the stream used by ArrayFire.
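
For illustration, a minimal sketch of that setup (error checking omitted; the variable names are placeholders):

#include <arrayfire.h>
#include <af/cuda.h>       // afcu::getStream, afcu::getNativeId
#include <cuda_runtime.h>

cudaStream_t uploadStream;    // used by thread 1
cudaStream_t downloadStream;  // used by thread 3
cudaStream_t afStream;        // ArrayFire's internal stream

void setupStreams()
{
    cudaStreamCreateWithFlags(&uploadStream, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&downloadStream, cudaStreamNonBlocking);
    afStream = afcu::getStream(afcu::getNativeId(af::getDevice()));
}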

Thread 1 is going to use its stream to cudaMemcpyAsync to one of the device pointers given to it (make sure you use pinned memory on the host). Once the copy is issued, record an event on your stream and call cudaStreamWaitEvent on ArrayFire's stream. This tells ArrayFire to wait for that event before moving forward with its computation. In the next iteration of Thread 1, write the data to the second pointer, and so on.
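
A sketch of one Thread 1 iteration, using the streams from the setup above (devInput, which, pinnedSrc and uploadDone are placeholders):

void thread1Iteration(void* devInput[2], int& which, const void* pinnedSrc,
                      size_t bytes, cudaEvent_t uploadDone)
{
    // Asynchronous host-to-device copy on thread 1's own stream.
    cudaMemcpyAsync(devInput[which], pinnedSrc, bytes,
                    cudaMemcpyHostToDevice, uploadStream);
    cudaEventRecord(uploadDone, uploadStream);
    // Make ArrayFire's stream (not the host) wait for the copy to finish.
    cudaStreamWaitEvent(afStream, uploadDone, 0);
    which = 1 - which;  // alternate between the two device buffers
}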

In the ArrayFire thread, call write on your input array to copy the data from that device pointer into it. You can then treat it like any other array. Once you are done, get the device pointer from the result array and call cudaMemcpyAsync to copy it to the pointer given to Thread 3. You will probably want to record and wait for events here as well.
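
A sketch of that step and the hand-off to Thread 3 (the computation is a stand-in; devInput, devOutput and resultDone are placeholders):

void processAndHandOff(void* devInput, void* devOutput, size_t bytes,
                       int width, int height, cudaEvent_t resultDone)
{
    af::array in(height, width, u8);
    in.write(static_cast<unsigned char*>(devInput), bytes, afDevice);

    af::array result = in;  // stand-in for your actual computation

    unsigned char* resPtr = result.device<unsigned char>();
    // Device-to-device copy on ArrayFire's stream, so it is ordered
    // after the computation that produced the result.
    cudaMemcpyAsync(devOutput, resPtr, bytes,
                    cudaMemcpyDeviceToDevice, afStream);
    cudaEventRecord(resultDone, afStream);
    result.unlock();
}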

This should give you sufficient overlap between the three threads.
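
For completeness, Thread 3's side could look like the following sketch: it orders its download after the result event and blocks only itself, never ArrayFire's stream (pinnedDst and downloadDone are placeholders):

void thread3Iteration(void* pinnedDst, const void* devOutput, size_t bytes,
                      cudaEvent_t resultDone, cudaEvent_t downloadDone)
{
    // Order the download after the result is ready, without blocking the host.
    cudaStreamWaitEvent(downloadStream, resultDone, 0);
    cudaMemcpyAsync(pinnedDst, devOutput, bytes,
                    cudaMemcpyDeviceToHost, downloadStream);
    cudaEventRecord(downloadDone, downloadStream);
    // Block only this thread until the copy lands; ArrayFire keeps running.
    cudaEventSynchronize(downloadDone);
    // pinnedDst now holds the result and can be sent over the network.
}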

Umar Arshad

I’ve tried your suggestion. It does not work reliably and there are memory access artifacts.

First the code:

Input thread:

cudaStream_t m_stream;
cudaEvent_t m_streamEvent;

cudaStreamCreateWithFlags(&m_stream, cudaStreamNonBlocking);
cudaEventCreate(&m_streamEvent);

int bytesPerPixel = 2;
int bytes = width * height * bytesPerPixel;

while(!stop)
{
    // Read from file to m_imageData

    // nextBuffer is pinned memory allocated with af::pinned
    nextBuffer = getWriteBuffer();

    cudaMemcpyAsync(nextBuffer, m_imageData.data() + m_imageOffset, bytes, cudaMemcpyHostToDevice, m_stream);
    cudaEventRecord(m_streamEvent, m_stream);
    // Makes m_stream wait for the event just recorded on m_stream itself;
    // this call does not block the host thread
    cudaStreamWaitEvent(m_stream, m_streamEvent, 0);

    m_imageOffset = (m_imageOffset + bytes) % m_imageData.size();
}

Processor thread:

Tiles x frames (a batch) from the ring buffer and pushes the result to a queue.

Output thread:

cudaStream_t m_stream;
cudaEvent_t m_streamEvent;

cudaStreamCreateWithFlags(&m_stream, cudaStreamNonBlocking);
cudaEventCreate(&m_streamEvent);

int rgbaBufferSize = width * 4;
auto m_rgbaPinnedBuffer = af::pinned<quint8>(rgbaBufferSize);

while(!stop)
{
    rgba = m_queue.dequeue();
    rgbaAfBuffer = rgba.device<quint8>();
    cudaMemcpyAsync(m_rgbaPinnedBuffer, rgbaAfBuffer, rgbaBufferSize, cudaMemcpyDeviceToHost, m_stream);
    cudaEventRecord(m_streamEvent, m_stream);
    // Enqueues the wait on m_stream only; this call returns immediately
    // and does not block the host thread
    cudaStreamWaitEvent(m_stream, m_streamEvent, 0);
    rgba.unlock();

    // Do something with m_rgbaPinnedBuffer
}

When launched with NVIDIA's profiler, I can see the AF stream, and I can see the output thread's stream with the device-to-host memcpys. I cannot see the input thread's stream with the host-to-device memcpys. I don't know why, although the stream and event creation are reported as successful.

When using af::host, there is no memory access issue, and I can see the cudaMemcpyAsync happening in the default stream. The output RGBA looks like this: [screenshot]

When using cudaMemcpyAsync, I can see the cudaMemcpyAsync in the stream timeline, but sometimes memory is repeated. It happens more when I increase the batch size, or when moving other application windows quickly steals GPU time. See the output RGBA: [screenshot]

Have you bumped into this kind of issue?