The OMAP3530 combines an ARM processor and a C64x+ DSP. I have a test loop that I expected to run faster on the DSP than on the ARM, but it actually runs slower.
Loop:
    #define DIM  4
    #define LIM  1000
    #define MASK 3

    int i, j;
    uint32 arr[DIM][DIM] = {0};
    uint32 rand[DIM][DIM] = {{1, 5, 2, 7},
                             {5, 4, 3, 8},
                             {1, 2, 9, 3},
                             {6, 6, 8, 4}};

    for (i = 0; i < LIM; i++)
        for (j = 0; j < LIM; j++)
            arr[i & MASK][j & MASK] += rand[i & MASK][j & MASK];
Benchmarks:
ARM: 5ms
DSP: 25ms
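For reference, here is a minimal sketch of how timings like these might be taken in portable C. It is not necessarily how the numbers above were measured. It assumes `uint32_t` from `<stdint.h>` in place of the `uint32` typedef, and the standard `clock()` function as the timer (its resolution and meaning depend on the toolchain). The names `addend`, `run_loop` and `time_loop_ms` are made up for illustration.

```c
#include <stdint.h>
#include <time.h>

#define DIM  4
#define LIM  1000
#define MASK 3

/* Hypothetical name: called addend instead of rand so it cannot clash
   with the C library's rand() if <stdlib.h> is ever included. */
static const uint32_t addend[DIM][DIM] = {
    {1, 5, 2, 7},
    {5, 4, 3, 8},
    {1, 2, 9, 3},
    {6, 6, 8, 4}
};

static uint32_t arr[DIM][DIM];

/* Zeroes arr, then runs the LIM x LIM accumulation once. */
static void run_loop(void)
{
    int i, j;

    for (i = 0; i < DIM; i++)
        for (j = 0; j < DIM; j++)
            arr[i][j] = 0;

    for (i = 0; i < LIM; i++)
        for (j = 0; j < LIM; j++)
            arr[i & MASK][j & MASK] += addend[i & MASK][j & MASK];
}

/* Returns the processor time of one run_loop() call in milliseconds,
   as reported by clock(). Assumption: clock() is usable on the target. */
static double time_loop_ms(void)
{
    clock_t start, end;

    start = clock();
    run_loop();
    end = clock();

    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_loop_ms()` from `main` on each core and printing the result would give figures comparable to the ones above. Checking a few `arr` entries afterwards also confirms that both builds did the same work and that the compiler did not remove the loop.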
The whole point of the DSP is to handle simple arithmetic like this, so I expected it to be faster. I haven't done much configuration on the DSP, since I'm fairly inexperienced with it. I believe the cache is not configured, so I am looking into that, but I would welcome any other suggestions.
Could anybody advise on a possible solution?
EDIT: Changed the LIM value to 5000 to increase the number of iterations. New benchmarks:
ARM: 120ms
DSP: 530ms
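To compare the two runs, it may help to express each benchmark as an average cost per inner-loop iteration. The loop body executes LIM x LIM times, so the first run is about 5 ns per iteration on the ARM and 25 ns on the DSP. The second run is about 4.8 ns on the ARM and 21.2 ns on the DSP. The small helper below is only a sketch of that arithmetic, and `ns_per_iteration` is a made-up name.

```c
/* Average time per inner-loop iteration in nanoseconds.
   lim is the LIM value used for the run and total_ms is the measured
   wall time in milliseconds; the loop body runs lim * lim times. */
static double ns_per_iteration(long lim, double total_ms)
{
    double iterations = (double)lim * (double)lim;

    /* 1 ms = 1.0e6 ns */
    return total_ms * 1.0e6 / iterations;
}
```

For example, `ns_per_iteration(5000, 530.0)` gives the DSP figure for the second run. The per-iteration cost stays roughly constant on both cores, so the gap does not look like a fixed start-up overhead.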