
Is there any significant or fundamental difference between the penalty of cache misses and branch mispredictions for ARM and x86/64 processors?

I understand that mileage may vary depending on the concrete model and the overall configuration of the machine. But I'm still wondering if there's anything.

rotanov
  • AFAIK no because these are known optimized subjects (theoretically speaking). – m0skit0 Jan 18 '16 at 07:39
  • Just among the x86/64 processor implementations, and more importantly motherboard implementations, your results will vary, so it's kind of pointless worrying beyond that about other architectures. ARM implementations are going to vary even more than the x86 ones vary among themselves, as ARM only covers a fraction of the chip; the memory interface is someone else's IP, as is arbitration with peripherals and other buses. – old_timer Jan 18 '16 at 18:30

1 Answer


Fundamentally, a ~32MHz 3-stage Cortex-M0 pipeline works the same way as a ~3GHz 40-stage NetBurst P4 pipeline: if the next instruction/data isn't available yet, you're just going to have to wait until it is.

Actual cycle counts, timing, and everything else will depend on many different microarchitecture/system/implementation details and vary hugely even within a single architecture (compare said NetBurst P4 to a 486DX-40, or said Cortex-M0 to an X-Gene 2, for example).

Notlikethat
  • 20,095
  • 3
  • 40
  • 77
  • IIRC, mispredict penalty on a modern CPU like Intel Haswell is something like 15 cycles when running from the uop cache, 19 when instructions are coming from the decoders directly. The mispredict penalty isn't the full pipeline length, it's fetch up to the point where mispredicts can be detected. (And modern Intel CPUs are clever about not flushing things that don't need to be flushed when a mispredict is detected.) You can find numbers for many x86 microarches at http://agner.org/optimize/. – Peter Cordes Jan 18 '16 at 10:56