
Problem:

I converted MMX code to the corresponding SSE2 code, expecting almost a 1.5x-2x speedup. But both took exactly the same time. Why is that?

Scenario:

I am learning SIMD instruction sets and comparing their performance. I took an array operation Z = X^2 + Y^2, where X and Y are large one-dimensional arrays of type "char". The values of X and Y are restricted to be less than 10, so that Z is always < 255 (1 byte) and there is no overflow to worry about.
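For reference, the scalar version of this operation is essentially the following (a sketch with hypothetical names; the actual code is in the gist linked below):

```cpp
#include <cassert>
#include <cstddef>

// Scalar reference: Z[i] = X[i]^2 + Y[i]^2, one byte at a time.
// Inputs are < 10, so 9*9 + 9*9 = 162 always fits in an unsigned char.
void z_scalar(const unsigned char* X, const unsigned char* Y,
              unsigned char* Z, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        Z[i] = static_cast<unsigned char>(X[i] * X[i] + Y[i] * Y[i]);
}
```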

I wrote the C++ code first and timed it. Then I wrote the corresponding assembly code (~3x speedup over C++). Then I wrote its MMX code (~12x vs. C++). Then I converted the MMX code into SSE2, and it runs at exactly the same speed as the MMX code. Theoretically, I expected SSE2 to give a ~2x speedup over MMX.

For the conversion from MMX to SSE2, I changed all mmx registers to xmm registers, updated a couple of data-movement instructions, and so on.

My MMX and SSE codes are pasted here : https://gist.github.com/abidrahmank/5281486 (I don't want to paste them all here)
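In C intrinsics form, the 16-bytes-per-iteration SSE2 step is roughly equivalent to the following (a sketch of the same logic, not the exact assembly in the gist — bytes are widened to 16-bit lanes before squaring, then packed back down):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>

// Z[i] = X[i]^2 + Y[i]^2, 16 bytes per iteration (n must be a multiple of 16).
// Inputs < 10 guarantee every result fits back into one unsigned byte.
void z_sse2(const unsigned char* X, const unsigned char* Y,
            unsigned char* Z, std::size_t n) {
    const __m128i zero = _mm_setzero_si128();
    for (std::size_t i = 0; i < n; i += 16) {
        __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(X + i));
        __m128i y = _mm_loadu_si128(reinterpret_cast<const __m128i*>(Y + i));
        // Widen unsigned bytes to 16-bit lanes so the squares don't overflow.
        __m128i xlo = _mm_unpacklo_epi8(x, zero), xhi = _mm_unpackhi_epi8(x, zero);
        __m128i ylo = _mm_unpacklo_epi8(y, zero), yhi = _mm_unpackhi_epi8(y, zero);
        __m128i zlo = _mm_add_epi16(_mm_mullo_epi16(xlo, xlo),
                                    _mm_mullo_epi16(ylo, ylo));
        __m128i zhi = _mm_add_epi16(_mm_mullo_epi16(xhi, xhi),
                                    _mm_mullo_epi16(yhi, yhi));
        // Pack the 16-bit sums back down to bytes with unsigned saturation.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(Z + i),
                         _mm_packus_epi16(zlo, zhi));
    }
}
```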

These functions are later called from main.cpp file where arrays are passed as arguments.

What I have done:

1 - I went through some optimization manuals from Intel and other websites. The main pitfall with SSE2 code is 16-byte memory alignment. When I manually checked the addresses, they were all 16-byte aligned. I tried both MOVDQU and MOVDQA, but both give the same result and no speedup compared to MMX.
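A minimal way to verify 16-byte alignment, for reference (a sketch; `alignas` here is C++11 — on Visual C++ 2010 the equivalents would be `__declspec(align(16))` or `_aligned_malloc`):

```cpp
#include <cassert>
#include <cstdint>

// True if p is suitable for MOVDQA / _mm_load_si128 (16-byte aligned):
// the low four address bits must all be zero.
bool is_aligned_16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15) == 0;
}
```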

2 - I stepped through in debug mode and checked each register value as the instructions executed. They execute exactly as I expected, i.e. 16 bytes are read in and the resulting 16 bytes are written out.

Resources:

I am using an Intel Core i5 processor with Windows 7 and Visual C++ 2010.

Question:

So the final question is: why is there no performance improvement for the SSE2 code compared to the MMX code? Am I doing anything wrong in the SSE code, or is there another explanation?

Mysticial
Abid Rahman K
  • There's an `emms` in your SSE code, but that shouldn't be causing this. How long are your arrays? – harold Mar 31 '13 at 19:54
  • @harold: arrays are big, size = 100000000 bytes. I made it big to see the time. Otherwise it happens in 0 time. – Abid Rahman K Mar 31 '13 at 20:08
  • That's a problem though. That's much bigger than the cache, so the bottleneck is main-memory throughput. – harold Mar 31 '13 at 20:09
  • can you explain a little more? – Abid Rahman K Mar 31 '13 at 20:11
  • The array doesn't even nearly fit in the cache, so it has to come from main memory. That's very slow. Apparently slower than the calculation you're doing, and that isn't really a surprise. I suggest you change the arrays so that they all at least fit in L3, and then run the benchmark several times over the same array if necessary. – harold Mar 31 '13 at 20:20
  • ok. But I thought since it is same in both MMX and SSE, it should show some difference, right? – Abid Rahman K Mar 31 '13 at 20:23
  • @harold : Yeah, you are right, changing array size results in difference in performance. – Abid Rahman K Apr 01 '13 at 01:46

1 Answer


Harold’s comment was absolutely correct. The arrays that you are processing do not fit into cache on your machine, so your computation is entirely load/store bound.

I timed the throughput of your computation on a current-generation i7 for various buffer lengths, and also the throughput of the same routine with everything except for the loads and stores removed:

[Plot: throughput vs. buffer length, for the full computation and for the loads and stores alone]

What we observe here is that once the buffer gets so big that it is out of the L3 cache, the throughput of your computation exactly matches the achieved load/store bandwidth. This tells us that how you process the data makes essentially no difference (unless you make it significantly slower); the speed of computation is limited by the ability of the processor to move data to/from memory.
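A back-of-envelope check makes the point concrete (assuming, purely for illustration, a sustained memory bandwidth on the order of 16 GB/s — the real figure depends on the machine):

```cpp
#include <cassert>

// With N-byte X, Y and Z, each pass over the data moves about 3*N bytes
// (two arrays read, one written). At a sustained bandwidth B (bytes/s),
// the floor on runtime is 3*N / B seconds, regardless of SIMD width.
double min_seconds(double n_bytes, double bandwidth_bytes_per_s) {
    return 3.0 * n_bytes / bandwidth_bytes_per_s;
}
```

For the 100,000,000-byte arrays in the question, that floor is already tens of milliseconds; any arithmetic that runs faster than this is simply hidden behind the memory traffic.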

If you do your timing on smaller arrays, you will see a difference between your two implementations.
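Such a benchmark could be sketched as follows (a hypothetical `seconds_per_pass` helper; `std::chrono` is C++11 — on Visual C++ 2010 you would use `clock()` or `QueryPerformanceCounter` instead):

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Time `reps` passes over cache-resident buffers; return seconds per pass.
// `fn` stands for any of the C++/MMX/SSE2 routines under test.
template <typename F>
double seconds_per_pass(F fn, std::size_t n, int reps) {
    std::vector<unsigned char> X(n, 3), Y(n, 4), Z(n);
    fn(X.data(), Y.data(), Z.data(), n);  // warm-up: fault pages, fill caches
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        fn(X.data(), Y.data(), Z.data(), n);
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    return dt.count() / reps;
}
```

Keeping `n` well under the L3 size and averaging over many repetitions is what lets the compute difference between the MMX and SSE2 versions show through.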

Stephen Canon
  • Yeah, you are right. I changed the arrays to 1MB size and now there is a difference. But one more problem: now my SSE2 code is `3-5x` faster than MMX. Can that be possible? I thought it should be nearly ~2x. (I am calling C++, MMX and SSE in the same file, one after another. Could it be due to some cache effect?) – Abid Rahman K Apr 01 '13 at 01:40
  • One more question: in your plot, for very large data the arithmetic version becomes equal to the load/store-only version. Why is that? – Abid Rahman K Apr 01 '13 at 01:43
  • Nice graph, what tool did you use to produce it? – Cameron May 22 '16 at 21:30
  • @Cameron: Numbers on OS X, nothing fancy. – Stephen Canon May 23 '16 at 04:04