iPad 2 NEON floating-point speed versus iPad 1

Question

When testing NEON instructions on iPad 1 and iPad 2, I see almost no speedup in the NEON code on iPad 2, whereas most other functions run much faster on iPad 2 than on iPad 1.

This applies to instructions such as VMUL, VLD1, VADD and VSUB operating on quad-word registers such as q0, in code such as an FFT.

In addition, I notice that Apple's own FFT function, vDSP_fft_zrip, does not speed up on iPad 2 either.

So the question is: does the iPad 2 NEON engine execute quad-word SIMD instructions any faster than the iPad 1 NEON engine?
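For reference, the kind of quad-word kernel described in the question can be sketched with NEON intrinsics. This is not the asker's code: the function name `butterfly_f32` and the butterfly-style arithmetic are hypothetical, chosen only because they use VLD1, VMUL, VADD and VSUB on q registers. A scalar fallback is included as an assumption so the sketch also builds on non-ARM hosts.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical butterfly-style kernel (illustration only):
 *   sum[i]  = a[i] + b[i] * w[i]
 *   diff[i] = a[i] - b[i] * w[i]
 * n must be a multiple of 4 so each step fills one quad-word register. */
void butterfly_f32(const float *a, const float *b, const float *w,
                   float *sum, float *diff, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
#if HAVE_NEON
        float32x4_t va = vld1q_f32(a + i);      /* VLD1: load 4 floats into a q register */
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vw = vld1q_f32(w + i);
        float32x4_t t  = vmulq_f32(vb, vw);     /* VMUL.F32 on ARMv7 */
        vst1q_f32(sum + i,  vaddq_f32(va, t));  /* VADD.F32, then store */
        vst1q_f32(diff + i, vsubq_f32(va, t));  /* VSUB.F32, then store */
#else
        /* Scalar fallback: same arithmetic, one lane at a time. */
        for (size_t k = i; k < i + 4; ++k) {
            float t = b[k] * w[k];
            sum[k]  = a[k] + t;
            diff[k] = a[k] - t;
        }
#endif
    }
}
```

Calling this in a loop over a buffer small enough to stay in cache, and timing it on both devices, is one way to compare the NEON units themselves rather than the memory systems.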

The "VFP" tag here on StackOverflow indicates "Visual FoxPro"; you probably want to remove it from your question. – Tamar E. Granor, Jun 28 '11 at 20:59

score 1 · Answer 1 · answered Nov 04 '11 at 13:14

The NEON unit on the A4 was extraordinarily powerful compared to the rest of the core. The rest of the core on the A5 is much improved from A4, but the NEON unit's performance is more-or-less unchanged. What you are observing is expected.

Of course, there are now two cores, so if you can take advantage of both of them, you can still see significant speedups. Also, double-precision computation on the A5 is vastly improved from the A4, as it is now fully pipelined.
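Taking advantage of both cores means splitting the work yourself. The following is a minimal sketch using POSIX threads, not anything from the answer: the names `slice_t`, `mul_slice` and `mul_two_cores` are hypothetical, and the plain multiply loop stands in for whatever NEON kernel you actually run. Whether the two threads land on separate cores is up to the OS scheduler.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* One contiguous slice of the arrays, handled by one thread. */
typedef struct {
    const float *x;
    const float *y;
    float *out;
    size_t count;
} slice_t;

/* Worker: out[i] = x[i] * y[i] over its slice (stand-in for a NEON kernel). */
static void *mul_slice(void *arg)
{
    slice_t *s = (slice_t *)arg;
    for (size_t i = 0; i < s->count; ++i)
        s->out[i] = s->x[i] * s->y[i];
    return NULL;
}

/* Split n elements across two threads; returns 0 on success, -1 on failure. */
int mul_two_cores(const float *x, const float *y, float *out, size_t n)
{
    assert(x != NULL && y != NULL && out != NULL);

    /* Round the split point down to a multiple of 4 elements so the
     * second half starts at an index that is a multiple of 4. */
    size_t half = (n / 2) & ~(size_t)3;
    slice_t lo = { x, y, out, half };
    slice_t hi = { x + half, y + half, out + half, n - half };

    pthread_t worker;
    if (pthread_create(&worker, NULL, mul_slice, &hi) != 0)
        return -1;
    mul_slice(&lo);  /* the calling thread handles the first half */
    return pthread_join(worker, NULL) == 0 ? 0 : -1;
}
```

The two slices write to disjoint parts of `out`, so no locking is needed. For small arrays the cost of creating a thread can outweigh the gain, so this only pays off on large inputs.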

score 0 · Answer 2 · answered Nov 02 '11 at 11:46

NEON will remain the same for quite a while, even on the recently introduced 64-bit ARM.

NEON doesn't benefit much from increased clock speed. NEON is already so fast that it spends the majority of the function execution time waiting for the data from memory.

Jake 'Alquimista' LEE


Well-written NEON code should **not** be spending most of its time waiting for data. If you find yourself in that situation, look for ways to do more work with the data you are loading. – Stephen Canon Nov 04 '11 at 13:16
@StephenCanon That's the theory. In reality, memory is much slower than you would like. My very well-written image scaling routine, with zero hazards and dual-issue scheduling everywhere possible, runs just as fast as memcpy, and it spends the majority of its execution time waiting for data from memory. – Jake 'Alquimista' LEE Nov 04 '11 at 16:56
Of course, that's entirely possible. However, if the image scaling code can be combined with other operations on image tiles, so that the data is coming from L1 cache instead of from memory, then you will not (in general) see that sort of effect. – Stephen Canon Nov 04 '11 at 17:23
@StephenCanon Why are you trying to teach me such self-explanatory things? What makes you think I don't know what you know? How do you know there was actually more work to do with the pixels besides the scaling in MY job? Why don't you answer other people's questions with your optimized code? What does data already loaded into a register have to do with the L1 cache? And most of all, why can't you admit anything? – Jake 'Alquimista' LEE Nov 04 '11 at 18:50
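The suggestion made in the comments above, combining operations on tiles so that later steps read from L1 cache instead of main memory, can be sketched as follows. This is an illustration under stated assumptions, not either commenter's code: the function names, the gain-and-bias operations, and the `TILE` size of 1024 floats are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile size in floats (4 KiB); pick one that fits in L1. */
#define TILE 1024

/* Two full passes: each loop streams the whole buffer through memory. */
void scale_then_offset_two_pass(float *buf, size_t n, float gain, float bias)
{
    assert(buf != NULL);
    for (size_t i = 0; i < n; ++i) buf[i] *= gain;
    for (size_t i = 0; i < n; ++i) buf[i] += bias;
}

/* Tiled: both steps run on one tile before moving on, so the second
 * step reads data the first step has just brought into cache. */
void scale_then_offset_tiled(float *buf, size_t n, float gain, float bias)
{
    assert(buf != NULL);
    for (size_t start = 0; start < n; start += TILE) {
        size_t end = (start + TILE < n) ? start + TILE : n;
        for (size_t i = start; i < end; ++i) buf[i] *= gain;
        for (size_t i = start; i < end; ++i) buf[i] += bias;
    }
}
```

Both functions produce the same values; only the order of memory accesses differs. If there is genuinely only one operation to perform on the data, as in the scaling-only case described above, there is nothing to combine and the routine remains limited by memory bandwidth.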
