Two versions of the same algorithm yield different total instruction fetch counts (Ir) and cycle estimates (CEst) under valgrind/cachegrind. The difference is roughly 27% in Ir and almost 30% in CEst. The process run time, however, is very similar (it is actually slightly shorter for the version that cachegrind reports as slower):
version 1:
Ir: 146,328,018,245 CEst: 152,553,736,055 timing: 17.93 s
version 2:
Ir: 185,221,836,610 CEst: 197,531,381,950 timing: 17.53 s
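For reference, the relative differences can be worked out directly from the figures above. This is just a quick sketch in Python; the variable and function names (`v1`, `v2`, `rel_diff`) are my own and not part of any tool's output:

```python
# Figures copied from the measurements above.
v1 = {"Ir": 146_328_018_245, "CEst": 152_553_736_055, "time_s": 17.93}
v2 = {"Ir": 185_221_836_610, "CEst": 197_531_381_950, "time_s": 17.53}


def rel_diff(a, b):
    """Relative change going from a to b, in percent."""
    return (b - a) / a * 100.0


# Positive values mean version 2 is higher than version 1.
diffs = {key: rel_diff(v1[key], v2[key]) for key in v1}

for key, value in diffs.items():
    print(f"{key}: {value:+.1f}%")
```

The simulated counters go up by well over a quarter from version 1 to version 2, while the measured run time moves by only a few percent in the opposite direction.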
Is this behaviour expected? How can I learn more about why version 1 takes slightly longer to run even though it executes fewer instructions?