
I use VS2012 and want to test the relative efficiency of SSE and AVX. The SSE and AVX code is almost identical, except that the SSE version uses __m128 and the AVX version uses __m256. I expected the AVX code to be two times faster than the SSE code, but the test results show their speeds are almost the same.

I tried setting /arch:AVX, /arch:SSE, or leaving the option unset, and tried commenting out either the SSE or the AVX code; however I test it, the SSE code takes about 2138 ms and the AVX code about 2106 ms. The outer for loop is only there to increase the running time:

#include "testfun.h"
#include <iostream>
#include <time.h> 
#include <malloc.h>
#include "immintrin.h"
using namespace std;
#define dataLen  800000

void testfun()
{
    float *buf1 = reinterpret_cast<float*>(_aligned_malloc( sizeof(float)*dataLen, 32 ));
    float *buf2 = reinterpret_cast<float*>(_aligned_malloc( sizeof(float)*dataLen, 32 ));
    for(int i=0; i<dataLen; i++)
    {
        buf1[i] = 1;
        buf2[i] = 1;
    }
    double timePassed;
    int t = clock();
    float sum = 0;

    //=========================SSE CODE=====================================
    __m128 *p1 = (__m128 *)buf1;
    __m128 *p2 = (__m128 *)buf2;
    __m128 _result = _mm_set_ps1(0.0f);

    for(int j=0; j<10000; j++)
    {
        p1 = (__m128 *)buf1;
        p2 = (__m128 *)buf2;
        _result = _mm_sub_ps(_mm_set_ps(j,0,0,0), _result);

        for(int i=0; i<dataLen/4; i++)
        {
            _result = _mm_add_ps(_mm_mul_ps(*p1, *p2), _result);
            p1++;
            p2++;
        }
    }

    sum = _result.m128_f32[0]+_result.m128_f32[1]+_result.m128_f32[2]+_result.m128_f32[3];
    timePassed = clock() - t;
    std::cout<<std::fixed<<"SSE calculate result : "<<sum<<std::endl;
    std::cout<<"SSE time used: "<<timePassed<<"ms"<<std::endl;

    //=========================AVX CODE=====================================
    t = clock();
    __m256 *pp1;
    __m256 *pp2;
    __m256 _rresult = _mm256_setzero_ps();
    sum = 0;

    for(int j=0; j<10000; j++)
    {
        pp1 = (__m256*) buf1;
        pp2 = (__m256*) buf2;
        _rresult = _mm256_sub_ps(_mm256_set_ps(j,0,0,0,0,0,0,0), _rresult);

        for(int i=0; i<dataLen/8; i++)
        {
            _rresult = _mm256_add_ps(_mm256_mul_ps(*pp1, *pp2), _rresult);
            pp1++;
            pp2++;
        }
    }

    sum = _rresult.m256_f32[0]+_rresult.m256_f32[1]+_rresult.m256_f32[2]+_rresult.m256_f32[3]
        + _rresult.m256_f32[4]+_rresult.m256_f32[5]+_rresult.m256_f32[6]+_rresult.m256_f32[7];
    timePassed = clock() - t;
    std::cout<<std::fixed<<"AVX calculate result : "<<sum<<std::endl;
    std::cout<<"AVX time used: "<<timePassed<<"ms"<<std::endl;

    _aligned_free(buf1);
    _aligned_free(buf2);

}


1 Answer


You are most likely just bandwidth-limited, since you only have two arithmetic instructions in your loop and you have two loads. If you reduce the size of your data set so that it fits in cache you should then see a difference in performance (since you'll have much greater load bandwidth and reduced latency for loads from cache).

(Also, your timing numbers seem very high - make sure that you are using the release build, i.e. that you have optimisation enabled, otherwise your results will be misleading.)
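For example, a cache-resident variant of the question's kernel might look like the sketch below. The function name, sizes, and iteration count are illustrative, not from the original post; the `m256_f32` accessor is MSVC-specific, matching the question's toolchain:

```cpp
#include <malloc.h>
#include <immintrin.h>

// Sketch: the same multiply-accumulate kernel, but with a working set small
// enough to stay resident in L1, so the test measures arithmetic throughput
// rather than DRAM bandwidth.
float dot_avx_cached()
{
    const int n = 4000;   // 2 arrays * 4000 floats * 4 bytes = 32 kB total
    float *a = reinterpret_cast<float*>(_aligned_malloc(sizeof(float) * n, 32));
    float *b = reinterpret_cast<float*>(_aligned_malloc(sizeof(float) * n, 32));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 1.0f; }

    __m256 acc = _mm256_setzero_ps();
    for (int j = 0; j < 2000000; j++)          // many repeats over the same hot data
    {
        const __m256 *pa = (const __m256 *)a;
        const __m256 *pb = (const __m256 *)b;
        for (int i = 0; i < n / 8; i++)
            acc = _mm256_add_ps(_mm256_mul_ps(pa[i], pb[i]), acc);
    }

    float sum = 0.0f;
    for (int k = 0; k < 8; k++)
        sum += acc.m256_f32[k];                // MSVC-specific element access

    _aligned_free(a);
    _aligned_free(b);
    return sum;
}
```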

  • +1 Also, CPUs like Bulldozer combine their two SSE units per module to execute one AVX operation, so AVX may not be faster even without memory operations. I didn't try it. – huseyin tugrul buyukisik Aug 30 '13 at 10:24
  • You are right: if I set `dataLen` to 4000 and remove the _mm_sub_ps and _mm_set_ps operations above the inner for loop, then AVX is about two times faster than SSE (in release mode, with "Advanced Vector Extensions (/arch:AVX)" set). I still don't quite understand why `dataLen` has such a big influence on the result. In practice, the array to be calculated is always larger than 4000. Do you mean that if the data set fits in the cache it is loaded only once in total, but if it is larger than the cache it has to be loaded again on every pass of the outer loop? – myej Sep 03 '13 at 02:16
  • @myej: these days CPUs are much faster than memory, which is why we have increasingly large caches. To get best performance you need to minimise cache misses, which means doing as much work as possible on your data before it gets evicted from cache. If you only do a small number of operations then the cost of the cache miss and relatively low bandwidth DRAM access outweighs any computational optimisation. – Paul R Sep 03 '13 at 05:48
  • My CPU's level-1 cache is 32 kB, which can hold only 8000 floats; that fits the test. If the two arrays are each larger than 4000 floats, the AVX time gradually approaches the SSE time. – myej Sep 04 '13 at 10:37
  • The important thing is to optimise your software design so that you combine operations on data while it's in cache, i.e. instead of doing `func1(data); func2(data); func3(data);` where each func makes a complete pass through `data`, you *strip-mine* or *tile* the data, so that `func1`, `func2`, `func3` are called within a loop and they each process a cache-friendly subset of the data on each iteration. (This assumes of course that you have other operations on your data that you can combine like this.) – Paul R Sep 04 '13 at 10:46
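For illustration, here is a minimal sketch of the strip-mining/tiling pattern described in the last comment. The stage functions `func1`–`func3`, the tile size, and `process_tiled` are all hypothetical placeholders, not from the original discussion:

```cpp
#include <cstddef>

// Hypothetical per-element stages standing in for func1/func2/func3.
static void func1(float *d, std::size_t n) { for (std::size_t i = 0; i < n; ++i) d[i] += 1.0f; }
static void func2(float *d, std::size_t n) { for (std::size_t i = 0; i < n; ++i) d[i] *= 2.0f; }
static void func3(float *d, std::size_t n) { for (std::size_t i = 0; i < n; ++i) d[i] -= 3.0f; }

// Instead of three full passes over the whole array (each reloading it from
// DRAM), run all three stages over one cache-sized tile at a time.
void process_tiled(float *data, std::size_t len)
{
    const std::size_t tile = 4096;   // 4096 floats = 16 kB: comfortably L1-sized
    for (std::size_t i = 0; i < len; i += tile)
    {
        const std::size_t n = (len - i < tile) ? (len - i) : tile;
        func1(data + i, n);          // tile is pulled into cache here...
        func2(data + i, n);          // ...and these two passes hit in cache
        func3(data + i, n);
    }
}
```

Each tile is touched by all three stages before the loop moves on, so the data is fetched from DRAM only once instead of once per stage.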