
I've been implementing basic math operations using AltiVec as a way to learn SIMD for an upcoming project. Also, just to see the performance benefit, I track how long it takes to perform the operations, but I came across something odd.

The first thing I did was add two vectors together and subtract two vectors. This works fine. The next thing I did was multiply two vectors together. However, multiplying is faster than adding, even though adding uses fewer clock cycles than multiplying according to what my particular CPU's datasheet says about the instructions being used.

I have two arrays of 10,485,760 elements each and run them through these two routines:

void av_AddValues(int32_t* intArrayA, int32_t* intArrayB, int32_t* outputBuffer, int size)
{
  int iterations = size / (sizeof(__vector int32_t) / sizeof(int32_t));

  __vector int32_t* tempA = (__vector int32_t *) intArrayA;
  __vector int32_t* tempB = (__vector int32_t *) intArrayB;
  __vector int32_t* tempOut = (__vector int32_t *) outputBuffer;
  for(int i = 0; i < iterations; i++)
  {
    __vector int32_t sum = vec_add(*tempA, *tempB);
    vec_st(sum, 0, tempOut);

    tempA++;
    tempB++;
    tempOut++;
  }
}

void av_MultiplyValues(int16_t* intArrayA, int16_t* intArrayB, int32_t* outputBuffer, int size)
{
  int iterations = size / (sizeof(__vector int16_t) / sizeof(int16_t));

  __vector int16_t* tempA = (__vector int16_t *) intArrayA;
  __vector int16_t* tempB = (__vector int16_t *) intArrayB;
  __vector int32_t* tempOut = (__vector int32_t *) outputBuffer;

  for(int i = 0; i < iterations; i++)
  {
    __vector int32_t productEven = vec_mule(*tempA, *tempB);
    __vector int32_t productOdd = vec_mulo(*tempA, *tempB);

    __vector int32_t mergedProductHigh = vec_mergeh(productEven, productOdd);
    __vector int32_t mergedProductLow = vec_mergel(productEven, productOdd);

    vec_st(mergedProductHigh, 0, tempOut);
    tempOut++;
    vec_st(mergedProductLow, 0, tempOut);

    tempA++;
    tempB++;
    tempOut++;
  }
}
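
For reference, here is a scalar sketch of what one iteration of the multiply routine computes (scalar_MultiplyStep is a hypothetical helper, not part of the benchmark): vec_mule multiplies the even-indexed halfwords, vec_mulo the odd-indexed ones, and vec_mergeh/vec_mergel interleave the two product vectors so the widened results land back in source order.

void scalar_MultiplyStep(const int16_t a[8], const int16_t b[8], int32_t out[8])
{
  // Net effect of vec_mule + vec_mulo + vec_mergeh/vec_mergel:
  // each 16-bit pair is multiplied into a 32-bit product, in order.
  for(int i = 0; i < 8; i++)
    out[i] = (int32_t) a[i] * (int32_t) b[i];
}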

On my particular platform, av_AddValues takes 81ms and av_MultiplyValues takes 48ms (times recorded using std::chrono::high_resolution_clock).

Why does multiplying take less time to process than adding?

I don't think adding 32-bit values versus multiplying 16-bit values makes a difference, considering a __vector always processes 16 bytes of data per instruction.

My first thought was that since adding numbers is such a trivial operation, the CPU finishes it faster than it can fetch data from memory, whereas with multiplying, the fetch latency is hidden because the CPU is kept busy doing work and never has to wait as long.

Is this a correct assumption to make?

Full code:

#include <chrono>
#include <random>
#include <limits>

#include <iostream>
#include <cassert>
#include <cstring>
#include <cstdint>
#include <malloc.h>

#include <altivec.h>
#undef vector

void GenerateRandom16bitValues(int16_t* inputABuffer, int16_t* inputBBuffer, int32_t* outputBuffer, int size);
void GenerateRandom32bitValues(int32_t* inputABuffer, int32_t* inputBBuffer, int32_t* outputBuffer, int size);
void TestAdd();
void TestMultiply();
void av_AddValues(int32_t* intArrayA, int32_t* intArrayB, int32_t* outputBuffer, int size);
void av_MultiplyValues(int16_t* intArrayA, int16_t* intArrayB, int32_t* outputBuffer, int size);

int main()
{
  TestAdd();
  TestMultiply();
}

void GenerateRandom16bitValues(int16_t* inputABuffer, int16_t* inputBBuffer, int32_t* outputBuffer, int size)
{
  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_int_distribution<> dis(std::numeric_limits<int16_t>::min(), std::numeric_limits<int16_t>::max());

  for(int i = 0; i < size; i++)
  {
    inputABuffer[i] = dis(gen);
    inputBBuffer[i] = dis(gen);
    outputBuffer[i] = 0;
  }
}

void GenerateRandom32bitValues(int32_t* inputABuffer, int32_t* inputBBuffer, int32_t* outputBuffer, int size)
{
  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_int_distribution<> dis(std::numeric_limits<int32_t>::min(), std::numeric_limits<int32_t>::max());

  for(int i = 0; i < size; i++)
  {
    inputABuffer[i] = dis(gen);
    inputBBuffer[i] = dis(gen);
    outputBuffer[i] = 0;
  }
}

void TestAdd()
{
    int size = 10'485'760;
    int bytes = size * sizeof(int32_t);

    int32_t* inputABuffer = (int32_t*) memalign(64, bytes);
    int32_t* inputBBuffer = (int32_t*) memalign(64, bytes);
    int32_t* outputBuffer = (int32_t*) memalign(64, bytes);
    assert(inputABuffer != nullptr);
    assert(inputBBuffer != nullptr);
    assert(outputBuffer != nullptr);

    GenerateRandom32bitValues(inputABuffer, inputBBuffer, outputBuffer, size);

    for(int i = 0; i < 20; i++)
    {
      auto start = std::chrono::high_resolution_clock::now();
      av_AddValues(inputABuffer, inputBBuffer, outputBuffer, size);
      auto end = std::chrono::high_resolution_clock::now();
      auto diff = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

      for(int k = 0; k < size; k++)
      {
        assert(outputBuffer[k] == (inputABuffer[k] + inputBBuffer[k]));
      }

      std::cout << "Vector Sum - " << diff.count() << "ms\n";
memset(outputBuffer, 0, bytes);
    }
}

void TestMultiply()
{
    int size = 10'485'760;
    int16_t* inputABuffer = (int16_t*) memalign(64, size * sizeof(int16_t));
    int16_t* inputBBuffer = (int16_t*) memalign(64, size * sizeof(int16_t));
    int32_t* outputBuffer = (int32_t*) memalign(64, size * sizeof(int32_t));
    assert(inputABuffer != nullptr);
    assert(inputBBuffer != nullptr);
    assert(outputBuffer != nullptr);

    GenerateRandom16bitValues(inputABuffer, inputBBuffer, outputBuffer, size);

    for(int i = 0; i < 20; i++)
    {
      auto start = std::chrono::high_resolution_clock::now();
      av_MultiplyValues(inputABuffer, inputBBuffer, outputBuffer, size);
      auto end = std::chrono::high_resolution_clock::now();
      auto diff = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

      for(int k = 0; k < size; k++)
      {
        assert(outputBuffer[k] == (inputABuffer[k] * inputBBuffer[k]));
      }

      std::cout << "Vector product - " << diff.count() << "ms\n";
memset(outputBuffer, 0, size * sizeof(int32_t));
    }
}

void av_AddValues(int32_t* intArrayA, int32_t* intArrayB, int32_t* outputBuffer, int size)
{
  int iterations = size / (sizeof(__vector int32_t) / sizeof(int32_t));

  __vector int32_t* tempA = (__vector int32_t *) intArrayA;
  __vector int32_t* tempB = (__vector int32_t *) intArrayB;
  __vector int32_t* tempOut = (__vector int32_t *) outputBuffer;

  for(int i = 0; i < iterations; i++)
  {
    __vector int32_t sum = vec_add(*tempA, *tempB);
    vec_st(sum, 0, tempOut);

    tempA++;
    tempB++;
    tempOut++;
  }
}

void av_MultiplyValues(int16_t* intArrayA, int16_t* intArrayB, int32_t* outputBuffer, int size)
{
  int iterations = size / (sizeof(__vector int16_t) / sizeof(int16_t));
  __vector int16_t* tempA = (__vector int16_t *) intArrayA;
  __vector int16_t* tempB = (__vector int16_t *) intArrayB;
  __vector int32_t* tempOut = (__vector int32_t *) outputBuffer;
  for(int i = 0; i < iterations; i++)
  {
    __vector int32_t productEven = vec_mule(*tempA, *tempB);
    __vector int32_t productOdd = vec_mulo(*tempA, *tempB);

    __vector int32_t mergedProductHigh = vec_mergeh(productEven, productOdd);
    __vector int32_t mergedProductLow = vec_mergel(productEven, productOdd);

    vec_st(mergedProductHigh, 0, tempOut);
    tempOut++;
    vec_st(mergedProductLow, 0, tempOut);

    tempA++;
    tempB++;
    tempOut++;
  }
}

Output of perf stat and perf record:

  Adding
   Performance counter stats for './alti':

         2151.146080      task-clock (msec)         #    0.999 CPUs utilized          
                   9      context-switches          #    0.004 K/sec                  
                   0      cpu-migrations            #    0.000 K/sec                  
               30957      page-faults               #    0.014 M/sec                  
          3871497132      cycles                    #    1.800 GHz                    
     <not supported>      stalled-cycles-frontend  
     <not supported>      stalled-cycles-backend   
          1504538891      instructions              #    0.39  insns per cycle        
           234038234      branches                  #  108.797 M/sec                  
              687912      branch-misses             #    0.29% of all branches        
           270305159      L1-dcache-loads           #  125.656 M/sec                  
            79819113      L1-dcache-load-misses     #   29.53% of all L1-dcache hits  
     <not supported>      LLC-loads                
     <not supported>      LLC-load-misses          

         2.152697186 seconds time elapsed


  CPU Utilization
    76.04%  alti     alti                 [.] av_AddValues    

  Multiply

  Performance counter stats for './alti':

         1583.016640      task-clock (msec)         #    0.999 CPUs utilized          
                   4      context-switches          #    0.003 K/sec                  
                   0      cpu-migrations            #    0.000 K/sec                  
               20717      page-faults               #    0.013 M/sec                  
          2849050875      cycles                    #    1.800 GHz                    
     <not supported>      stalled-cycles-frontend  
     <not supported>      stalled-cycles-backend   
          1520409634      instructions              #    0.53  insns per cycle        
           179185029      branches                  #  113.192 M/sec                  
              535437      branch-misses             #    0.30% of all branches        
           205341530      L1-dcache-loads           #  129.715 M/sec                  
            27124936      L1-dcache-load-misses     #   13.21% of all L1-dcache hits  
     <not supported>      LLC-loads                
     <not supported>      LLC-load-misses          

         1.584145737 seconds time elapsed


  CPU Utilization
    60.35%  alti     alti               [.] av_MultiplyValues       
shaboinkin
    *How* are you measuring this? How *often* did you measure this? In what *order* are you running the two tests? Post a [MCVE] – EOF May 31 '17 at 04:47
  • Those times seem very high - are you compiling with optimisation enabled (e.g. `-O3`) ? Also, what CPU are you using and what is the clock speed ? – Paul R May 31 '17 at 10:44
  • @EOF I edited my post to contain a working example. At first I was only running it once, but I now loop through the two routines I'm measuring and the times are consistent. Adding takes 81ms and multiplying takes 48ms. As stated in my post, I was simply using std::chrono::high_resolution_clock to measure the times. Is there a better alternative? – shaboinkin May 31 '17 at 11:46
  • @PaulR, I'm using NXP's T2080 board which contains a quad core e6500 CPU at 1.8GHz. I was using -O2, not -O3. – shaboinkin May 31 '17 at 11:46
  • @shaboinkin: OK - presumably this is gcc, but `-O2` should be OK. I would try `-O3` anyway. I think you might also want to try manually unrolling the add loop by a factor of 2, as you have a lot of overhead on each iteration for just one add instruction. – Paul R May 31 '17 at 12:16
  • @PaulR I understand loop unrolling but not with a potentially variable number of iterations that needs to be done. Could you explain what you mean by "a factor of 2"? Does this mean to do additional adds within each iteration? Add, store, increment pointers, add, store, increment, then loop again? – shaboinkin May 31 '17 at 12:41
  • Yes, manually unrolling the add loop by a factor of 2 would be 4 x vec_ld, 2 x vec_add, 2 x vec_st. Use hard-coded offsets for the loads and stores and then only increment the pointers once per loop iteration (`+= 2`). This helps to bury some of the load/store latencies and also reduces the loop overheads (see the sketch after these comments). – Paul R May 31 '17 at 13:04
  • @PaulR I tried -O3 with no difference. Unrolling by 2 didn't make a difference either, so I did it by a factor of 4 and, curiously, it took about 1ms longer than without the unrolling. – shaboinkin May 31 '17 at 16:40
  • It's possible that the compiler is already unrolling the loop, so it may not help. I think the next thing I would do is look at the generated code and/or run the code under a (sampling) profiler. – Paul R May 31 '17 at 17:06
  • @PaulR That's what I just did actually. I edited my post to include what I recorded using the perf tool. So while fewer instructions are required to perform the addition, it seems that memory access is the bottleneck, no? – shaboinkin May 31 '17 at 17:30
  • Yes, it could be a DRAM bandwidth bottleneck, although I would have thought this would be much the same limitation for both routines. Try reducing the total size of your data to something that will fit comfortably in L2 cache (you'll need to increase the no of iterations to still get a reasonable timing interval of course). – Paul R May 31 '17 at 20:44
  • My two cents: isn't it related to the fact that you multiply int16 and add int32? A vector of int16 holds twice as many elements as a vector of int32... – Regis Portalez Jun 01 '17 at 09:05
  • @RegisPortalez My understanding is that each vector instruction always processes 16 bytes of data. The vec_add intrinsic, regardless of whether it processes 16 bytes of chars, shorts, or ints, uses an instruction (vadduwm for ints) which requires only 1 cycle. And vec_mule/vec_mulo for processing shorts map to an instruction, vmuleuh, which requires 4 cycles. Multiplying chars requires a different instruction; however, it also requires 4 cycles. – shaboinkin Jun 01 '17 at 12:06
  • In your code, you allocate size * sizeof(int16), which is half of size * sizeof(int32). Since your code is memory bound and TestMultiply reads half as much input data, it runs close to twice as fast (not quite, since the output array has the same size). You probably see about a 1.5x speedup. – Regis Portalez Jun 01 '17 at 14:10
  • @RegisPortalez wow..not sure how I completely missed that. – shaboinkin Jun 01 '17 at 19:10
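
A minimal sketch of the 2x unroll Paul R describes, assuming size is a multiple of 8 elements (hypothetical function name, untested on the T2080):

void av_AddValues_Unrolled2(int32_t* intArrayA, int32_t* intArrayB, int32_t* outputBuffer, int size)
{
  int iterations = size / 8; // two 4-element vectors per iteration

  for(int i = 0; i < iterations; i++)
  {
    // Explicit loads, with a hard-coded 16-byte offset for the second vector of each pair
    __vector int32_t a0 = vec_ld(0,  (__vector int32_t *) intArrayA);
    __vector int32_t a1 = vec_ld(16, (__vector int32_t *) intArrayA);
    __vector int32_t b0 = vec_ld(0,  (__vector int32_t *) intArrayB);
    __vector int32_t b1 = vec_ld(16, (__vector int32_t *) intArrayB);

    vec_st(vec_add(a0, b0), 0,  (__vector int32_t *) outputBuffer);
    vec_st(vec_add(a1, b1), 16, (__vector int32_t *) outputBuffer);

    // Advance each pointer once per iteration (8 int32_t = two vectors)
    intArrayA += 8;
    intArrayB += 8;
    outputBuffer += 8;
  }
}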

1 Answer


It's related to the sizes of your buffers.

In one case (TestAdd):

int size = 10'485'760;
int bytes = size * sizeof(int32_t);

int32_t* inputABuffer = (int32_t*) memalign(64, bytes);
int32_t* inputBBuffer = (int32_t*) memalign(64, bytes);
int32_t* outputBuffer = (int32_t*) memalign(64, bytes);

You allocate 3 * size * 4 bytes (sizeof(int32_t) = 4).

In the other (TestMultiply):

int size = 10'485'760;
int16_t* inputABuffer = (int16_t*) memalign(64, size * sizeof(int16_t));
int16_t* inputBBuffer = (int16_t*) memalign(64, size * sizeof(int16_t));
int32_t* outputBuffer = (int32_t*) memalign(64, size * sizeof(int32_t));

You allocate size * 4 + 2 * size * 2 bytes (sizeof(int16_t) = 2).

Since this code is entirely memory bound, the second routine touches (3 * 4) / (4 + 2 * 2) = 1.5x less memory, so you would expect it to be about 1.5x faster.
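
A quick back-of-the-envelope check of that ratio, using the constants from the code above:

constexpr long long n = 10'485'760;               // elements per buffer
constexpr long long addBytes = 3 * n * 4;         // three int32_t buffers: ~125.8 MB
constexpr long long mulBytes = 2 * n * 2 + n * 4; // two int16_t inputs + one int32_t output: ~83.9 MB
static_assert(2 * addBytes == 3 * mulBytes, "traffic ratio is exactly 3:2, i.e. 1.5x");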

This is in line with your measurements: 2.15 s / 1.5 = 1.43 s, which is close to the measured 1.58 s.

Regis Portalez