
I want to perform DMA using the dma_async_memcpy_buf_to_buf function, which is in the dmaengine.c file (linux/drivers/dma). For this, I added a function to the dmatest.c file (linux/drivers/dma) as follows:

/* chan is a memcpy-capable DMA channel obtained beforehand (see below) */
void foo(struct dma_chan *chan)
{
    int index = 0;
    dma_cookie_t cookie;
    size_t len = 0x20000;

    ktime_t start, end;
    s64 actual_time;

    u16 *dest;
    u16 *src;

    dest = kmalloc(len, GFP_KERNEL);
    src = kmalloc(len, GFP_KERNEL);
    if (!dest || !src)
        goto out;

    /* fill both buffers with known patterns */
    for (index = 0; index < len/2; index++)
    {
        dest[index] = 0xAA55;
        src[index] = 0xDEAD;
    }

    start = ktime_get();
    cookie = dma_async_memcpy_buf_to_buf(chan, dest, src, len);

    /* busy-wait until the DMA engine reports the transfer as complete */
    while (dma_async_is_tx_complete(chan, cookie, NULL, NULL) == DMA_IN_PROGRESS)
    {
        dma_sync_wait(chan, cookie);
    }
    end = ktime_get();
    actual_time = ktime_to_ns(ktime_sub(end, start));
    printk(KERN_INFO "Time taken for function() execution     dma: %lld\n", (long long)actual_time);

    memset(dest, 0, len);

    start = ktime_get();
    memcpy(dest, src, len);
    end = ktime_get();
    actual_time = ktime_to_ns(ktime_sub(end, start));
    printk(KERN_INFO "Time taken for function() execution non-dma: %lld\n", (long long)actual_time);

out:
    kfree(dest);
    kfree(src);
}
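
(chan itself is requested from the dmaengine core before calling foo, roughly along these lines, using dma_request_channel with a DMA_MEMCPY capability mask; this is a minimal sketch with error handling trimmed, not the exact setup code:)

#include <linux/dmaengine.h>

static struct dma_chan *request_memcpy_chan(void)
{
    dma_cap_mask_t mask;

    dma_cap_zero(mask);
    dma_cap_set(DMA_MEMCPY, mask);

    /* grab any channel that advertises the memcpy capability */
    return dma_request_channel(mask, NULL, NULL);
}

/* in module init:
 *     struct dma_chan *chan = request_memcpy_chan();
 *     if (chan) {
 *         foo(chan);
 *         dma_release_channel(chan);
 *     }
 */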

There are some issues with DMA:

  1. Interestingly, the memcpy execution time is less than that of dma_async_memcpy_buf_to_buf. Maybe this is related to a problem with ktime_get().

  2. Is my approach in the foo function correct for performing a DMA operation? I'm not sure about this.

  3. How can I measure the tick counts of the memcpy and dma_async_memcpy_buf_to_buf functions in terms of CPU usage?

  4. Finally, is a DMA operation possible at the application level? Up to now I have used it at the kernel level, as you can see above (dmatest.c is inserted as a kernel module).

Mustafat
  • Do you mean: "*IS* DMA operation possible at application level?" – Joe Aug 27 '14 at 07:46
  • Could you explain what exactly you are trying to achieve here, especially given the question about "application level"? Generally, trying to replace regular `memcpy` or COW mapping/sharing calls with `dma_async_memcpy_buf_to_buf` (for the sake of what? performance gains?) is really weird. Comparing async and sync kernel operations (especially with such a crude tool as `ktime_get()`) is also IMHO kind of pointless... – GreyCat Aug 27 '14 at 08:11
  • And, yeah, you aren't supposed to use such mechanisms with `0x20000` bytes. Usually it's worth talking about only when you're exceeding current CPU cache sizes by at least an order of magnitude (i.e. starting at tens of megabytes) *and* you somehow know that copying this data directly would badly affect the current cache / prefetcher state. – GreyCat Aug 27 '14 at 08:18
  • I want to measure the CPU time usage of a regular memcpy operation and of a DMA operation using dma_async_memcpy_buf_to_buf. The above code runs in the kernel as a module, but the time values obtained by ktime_get() show that the memcpy operation takes less time than DMA. Is there another utility to measure the timings? Apart from this, I want to learn whether it is possible to do DMA at the application level instead of in the kernel. @GreyCat – Mustafat Aug 27 '14 at 08:24

1 Answer


There are multiple issues in your question, which makes it kind of hard to answer exactly what you're asking:

  1. Yes, your general DMA operation invocation algorithm is correct.

  2. The fundamental point of using DMA operations instead of plain memcpy for copying memory is not a direct performance gain, but (a) a performance gain from preserving the CPU cache / prefetcher state (which would likely be garbled by a plain old memcpy executed on the CPU itself), and (b) a true background operation that leaves the CPU free to do other work (see the callback-based sketch after this list).

  3. Given (a), it's kind of pointless to use DMA operations on anything smaller than the CPU cache, i.e. less than dozens of megabytes. Typically it's done for fast off-CPU stream processing, i.e. moving data that would be produced/consumed by external devices anyway, such as fast network cards, video streaming / capturing / encoding hardware, etc.

  4. Comparing async and sync operations in terms of wall-clock elapsed time is wrong. There might be hundreds of threads / processes running, and nothing guarantees that you'll get scheduled on the next tick rather than several thousand ticks later.

  5. Using ktime_get for benchmarking purposes is wrong - it's fairly imprecise, especially for such short jobs. Profiling kernel code is in fact a pretty hard and complex task which is well beyond the scope of this question. A quick recommendation here would be to refrain from such micro-benchmarks altogether and profile a much bigger, more complete job - similar to what you're ultimately trying to achieve.

  6. Measuring "ticks" for modern CPUs is also kind of pointless, although you can use CPU vendor-specific tools, such as Intel's VTune.

  7. Using DMA copy operations at the application level is fairly pointless - at least I can't come up with a single viable scenario off the top of my head where it would be worth the trouble. It's not innately faster, and, more importantly, I seriously doubt that your application's performance bottleneck is memory copying. For that to be the case, you'd generally have to be doing everything else faster than regular memory copying, and I can't really think of anything at the application level that would be faster than memcpy. And if we're talking about communication with some other, off-CPU processing device, then it's automatically not the application level.

  8. Generally, memory copy performance is limited by memory speed, i.e. clock frequency and timings. You aren't going to get any miracle boost over regular memcpy in raw throughput, simply because memcpy executed on the CPU is already fast enough: the CPU usually runs at a 3x-5x-10x higher clock frequency than the memory.
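
To illustrate (b) from point 2: instead of spinning in dma_sync_wait(), a dmaengine client can queue the copy with a completion callback and go do something else. What follows is a rough sketch assuming the generic descriptor API (device_prep_dma_memcpy / dmaengine_submit / dma_async_issue_pending); buffer unmapping and most error handling are abbreviated, and names like start_async_copy are made up for the example:

#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>
#include <linux/completion.h>

static struct completion copy_done;

static void copy_done_cb(void *param)
{
    complete(param);    /* called from the DMA driver's completion path */
}

/* Queue one memcpy on @chan and return immediately; the CPU is free to do
 * other work until copy_done_cb() fires (unmapping omitted for brevity). */
static int start_async_copy(struct dma_chan *chan, void *dst, void *src, size_t len)
{
    struct device *dev = chan->device->dev;
    struct dma_async_tx_descriptor *tx;
    dma_addr_t dst_dma, src_dma;
    dma_cookie_t cookie;

    src_dma = dma_map_single(dev, src, len, DMA_TO_DEVICE);
    dst_dma = dma_map_single(dev, dst, len, DMA_FROM_DEVICE);

    tx = chan->device->device_prep_dma_memcpy(chan, dst_dma, src_dma, len,
                                               DMA_PREP_INTERRUPT | DMA_CTRL_ACK);
    if (!tx)
        return -ENOMEM;

    init_completion(&copy_done);
    tx->callback = copy_done_cb;
    tx->callback_param = &copy_done;

    cookie = dmaengine_submit(tx);
    if (dma_submit_error(cookie))
        return -EIO;

    dma_async_issue_pending(chan);    /* kick the hardware; do not wait here */
    return 0;
}

The caller can later wait_for_completion(&copy_done) only at the point where it actually needs the copied data, which is where DMA starts paying off compared to a synchronous memcpy.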

GreyCat
  • Thank you @GreyCat for the useful information. In fact, I need to test the DMA capability of the "Freescale P2041RDB". For this, I want to perform DMA using a USB storage device. There is documentation about this process under linux/Documentation/usb/dma.txt. According to this document, the usb_alloc_coherent function (linux/drivers/usb/core/usb.c) provides DMA capability. Anyway, when I plug in a USB storage device, I see that usb_alloc_coherent is called and returns the DMA address of a buffer. Does this mean the USB driver performs DMA when a USB device is plugged in? If so, how can I verify it? – Mustafat Aug 27 '14 at 12:41
  • It's a completely different question :) First of all, make sure you know the difference between "coherent" and "streaming" DMA - see https://www.kernel.org/doc/Documentation/DMA-API.txt if you're not sure. If "coherent" is still what you want, then just go and use it - it returns a valid pointer that can be used to read from/write to. As a very generic example, I can recommend checking out [LDD3 examples code for USB driver skeleton](https://github.com/hunterhu/ldd3-examples-updated/blob/master/ldd3-examples/usb/usb-skeleton.c). – GreyCat Aug 27 '14 at 14:55