OMAP3530: Loop runs slower on DSP than on ARM

Question

The OMAP3530 contains an ARM processor and a C64x+ DSP. I have a test loop that I expected to run faster on the DSP than on the ARM, but it does not.

Loop:

#define DIM 4
#define LIM 1000
#define MASK 3

int i, j;
uint32 arr[DIM][DIM] = {0};
uint32 rand[DIM][DIM] = {1, 5, 2, 7,
                         5, 4, 3, 8,
                         1, 2, 9, 3,
                         6, 6, 8, 4};

for (i = 0; i < LIM; i++)
    for (j = 0; j < LIM; j++)
        arr[i & MASK][j & MASK] += rand[i & MASK][j & MASK];

Benchmarks:

ARM: 5ms
DSP: 25ms

The point of the DSP is to handle simple arithmetic operations like this, so I would have expected it to be faster. I haven't done much configuration of the DSP, since I'm pretty inexperienced with it. I believe the cache is not configured, so I am looking into that, but I would welcome any other suggestions.

Could anybody advise on a possible solution?

EDIT - Changed the LIM value to 5000 to increase the # of iterations. New benchmarks:

ARM: 120ms
DSP: 530ms

How are you benchmarking? Are you measuring the DSP time on the ARM side, or do you use the processor cycle registers on the DSP? The transition from ARM to DSP and back takes a *lot* of time, and your function isn't large enough that I would expect the code itself to take any serious time. _itoll(TSCH, TSCL) will give you the number of cycles elapsed on the DSP as a 64-bit result. - Nils Pipenbrinck, Nov 15 '15 at 20:02
And yes, you definitely want the caches enabled unless you rely only on tightly coupled memory and do everything else with DMA. A cache miss can easily take more than 400 cycles. You could just as well execute 3200 instructions in that time (8 instructions per clock). - Nils Pipenbrinck, Nov 15 '15 at 20:05
@NilsPipenbrinck Benchmarking with the GP timer module on the ARM side. Transition from ARM to DSP and back is taking ~3.5ms (this is the time when no looping is done on the DSP, just using the IPC to send a 'start' message and a 'done' message). I've updated the question to make the loop longer (changed LIM to 5000). I think we can neglect the IPC time. - Voriki, Nov 15 '15 at 20:22
@NilsPipenbrinck About the TCM: This test case is part of a larger loop that EDMAs in rows from array A (defined on the ARM), works with them to modify arrays B and C (defined on the DSP), and then memcpys B and C back to the ARM. I'm new to this, but I don't think we explicitly use TCM, and it might show some benchmark improvements. Do you have any references on how to incorporate it? - Voriki, Nov 15 '15 at 20:25

Marcus Müller · Answer 1 · 2015-11-16T08:59:24.470

I've seen this happen before. Using the DSP pays off only in very specific scenarios, and a million additions is not one of them. The ARM Cortex-A8 is not bad at adding numbers, so you're taking code that runs very efficiently on the ARM and running it on a slower coprocessor. That simply won't speed things up.

The specific OMAP you're looking at has an ARM Cortex-A8 core with NEON, which means it has single-instruction-multiple-data (SIMD) multiply/accumulate instructions. In my experience, those should be even faster than letting the compiler implement your loop as efficiently as it can. Mileage might vary, though, assuming that somewhere down the line you're doing multiplications, too.
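
As a rough illustration of the SIMD idea applied to the loop in the question: because DIM is 4 and the indices are masked with 3, every four consecutive values of j touch one complete row, so the inner loop can add whole rows with a single 128-bit NEON addition. The sketch below is not the poster's code. The names add_row4 and accumulate_rows are made up for illustration, it assumes LIM is a multiple of 4, and it falls back to plain C when NEON is not available. Whether it actually beats the compiler's output would have to be measured.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

#define DIM 4
#define MASK 3

/* Hypothetical helper: dst[0..3] += src[0..3]. */
static void add_row4(uint32_t *dst, const uint32_t *src)
{
#if HAVE_NEON
    uint32x4_t d = vld1q_u32(dst);      /* load four 32-bit lanes */
    uint32x4_t s = vld1q_u32(src);
    vst1q_u32(dst, vaddq_u32(d, s));    /* lane-wise add, store back */
#else
    int k;
    for (k = 0; k < DIM; k++)           /* scalar fallback for non-NEON builds */
        dst[k] += src[k];
#endif
}

/* Produces the same sums as the question's loop, ASSUMING lim is a
 * multiple of 4 so that each step of j covers exactly one whole row. */
static void accumulate_rows(uint32_t lim,
                            uint32_t arr[DIM][DIM],
                            uint32_t src[DIM][DIM])
{
    uint32_t i, j;
    for (i = 0; i < lim; i++)
        for (j = 0; j < lim; j += DIM)
            add_row4(arr[i & MASK], src[i & MASK]);
}
```

Calling accumulate_rows(LIM, arr, rand) with the arrays from the question would stand in for the original double loop; the row-at-a-time form is a design choice that only works here because the array width matches the vector width.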

If you want to unleash the power of hand-optimized, intrinsics-rich, platform-specific code, have a look at VOLK, a spin-off from the GNU Radio project. VOLK is the Vector Optimized Library of Kernels; it provides a generic implementation and x86 (MMX/SSE2/AVX) implementations for most of its kernels, and a NEON implementation for some of them. Of specific interest to your problem might be the 16i_x5_add_quad_16i_x4 kernel.

In conclusion: unless you're sure the C64x+ has a lot of advantages over the rather capable ARM core, I wouldn't use it. You mention that this is part of a larger loop on the DSP, but you don't yet have the means to count the cycles your algorithm takes on the DSP. I'd recommend getting your development setup into a state where it's easy to judge how good your implementation is. The general-purpose timers on the ARM surely aren't a good benchmark for code running on the DSP.
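
One way to get such a measurement is to read the DSP's own time-stamp counter around the loop, as suggested in the comments on the question. The sketch below is an illustration under assumptions, not tested target code: start_ticks, read_ticks and time_loop are hypothetical names, the TSCL/TSCH registers are assumed to come from TI's c6x.h when building with the TI compiler (detected via _TMS320C6X), and on any other compiler it falls back to the standard clock() so the logic can be tried on a host.

```c
#include <stdint.h>
#include <time.h>

#if defined(_TMS320C6X)
#include <c6x.h>            /* TI compiler header: declares TSCL and TSCH */
#endif

#define DIM 4
#define MASK 3

/* Hypothetical helper: start the counter where that is needed. */
static void start_ticks(void)
{
#if defined(_TMS320C6X)
    TSCL = 0;               /* assumption: a write to TSCL starts the counter */
#endif
}

/* Hypothetical helper: DSP cycles on the C64x+, clock() ticks elsewhere. */
static uint64_t read_ticks(void)
{
#if defined(_TMS320C6X)
    uint32_t lo = TSCL;     /* read the low half first; TSCH is then read */
    uint32_t hi = TSCH;     /* as the matching snapshot of the upper half */
    return ((uint64_t)hi << 32) | lo;
#else
    return (uint64_t)clock();
#endif
}

/* Runs the loop from the question and returns the elapsed ticks. */
static uint64_t time_loop(uint32_t lim,
                          uint32_t arr[DIM][DIM],
                          uint32_t src[DIM][DIM])
{
    uint32_t i, j;
    uint64_t t0, t1;

    start_ticks();
    t0 = read_ticks();
    for (i = 0; i < lim; i++)
        for (j = 0; j < lim; j++)
            arr[i & MASK][j & MASK] += src[i & MASK][j & MASK];
    t1 = read_ticks();
    return t1 - t0;
}
```

On the DSP the returned value is in CPU cycles, which excludes the ARM-to-DSP messaging time entirely; on the host fallback it is in clock() ticks, so only the DSP figure is comparable with cycle estimates.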
