I have the following C++ code to perform the multiply-and-accumulate steps of a fully connected layer (without the bias). Basically, I just do a dot product between a vector (the inputs) and a matrix (the weights). I used AVX vectors to speed up the operation.

const float* src = inputs[0]->buffer();
const float* scl = weights->buffer();
float* dst = outputs[0]->buffer();

SizeVector in_dims = inputs[0]->getTensorDesc().getDims();
SizeVector out_dims = outputs[0]->getTensorDesc().getDims();

const int in_neurons = static_cast<int>(in_dims[1]);
const int out_neurons = static_cast<int>(out_dims[1]);    

for(size_t n = 0; n < out_neurons; n++){
    float accum = 0.0;
    alignas(16) float temp[4] = {0, 0, 0, 0};
    float *p = temp;

    __m128 in, ws, dp;

    for(size_t i = 0; i < in_neurons; i+=4){

        // gather the weights for output neuron n (the weight matrix is stored with stride out_neurons)
        temp[0] = scl[(i+0)*out_neurons + n];
        temp[1] = scl[(i+1)*out_neurons + n];
        temp[2] = scl[(i+2)*out_neurons + n];
        temp[3] = scl[(i+3)*out_neurons + n];

        // load input neurons sequentially
        in = _mm_load_ps(&src[i]);

        // load weights
        ws = _mm_load_ps(p);

        // dot product
        dp = _mm_dp_ps(in, ws, 0xff);

        // accumulate lane 0, which holds the 4-element dot product
        accum += _mm_cvtss_f32(dp);
    }
    // save the final result
    dst[n] = accum;
}

It works, but the speedup is far from what I expected. As you can see below, a convolutional layer with 24x more operations than my custom dot-product layer takes less time. This makes no sense, and there should be much more room for improvement. What are my major mistakes when trying to use AVX? (I'm new to AVX programming, so I don't fully understand where I should start looking to fully optimize the code.)

**Convolutional Layer, Fully Optimized (AVX)**
Layer: CONV3-32 
Input: 28x28x32 = 25K   
Weights: (3*3*32)*32 = 9K   
Number of MACs: 3*3*27*27*32*32 = 7M    
Execution Time on OpenVINO framework: 0.049 ms

**My Custom Dot Product Layer, Far From Optimized (AVX)**
Layer: FC
Inputs: 1x1x512
Outputs: 576    
Weights: 3*3*64*512 = 295K  
Number of MACs: 295K    
Execution Time on OpenVINO framework: 0.197 ms

Thanks for all help in advance!

  • Please specify the command-line that you used to build your application. Did you enable optimizations? – PaulMcKenzie Sep 24 '19 at 15:37
  • Hi PaulMcKenzie, I used the build_samples script from the OpenVINO framework, which uses Visual Studio. I set it to Visual Studio 2019 with /O2 optimizations. Cmd line: inference_engine\samples\build_samples_msvc.bat VS2019 – César Gouveia Sep 24 '19 at 15:48
  • @CésarGouveia Please read how (and why) to provide a [mre]. If possible, avoid any 3rd-party dependencies. – chtz Sep 24 '19 at 16:02
  • Regarding performance: `[v]dpps` is only useful in very few situations. For big dot products you should only have `vfmadd` (or `vmulps` and `vaddps`, if you don't have FMA) in the inner loop and a few horizontal reductions at the end. – chtz Sep 24 '19 at 16:06
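
A minimal sketch of what that last comment describes (not from the thread; it assumes an FMA-capable CPU and a length that is a multiple of 8, and the function name `dot` is made up): only FMA in the inner loop of a single long dot product, followed by one horizontal reduction at the end.

#include <immintrin.h>
#include <cstddef>

// Single long dot product: only fused multiply-adds in the hot loop; the
// horizontal reduction of the 8 partial sums happens once, after the loop.
static float dot(const float* a, const float* b, std::size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    for (std::size_t i = 0; i < n; i += 8)   // n assumed to be a multiple of 8
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i), acc);

    // reduce the 8 lanes to a single float
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_movehdup_ps(s));
    return _mm_cvtss_f32(s);
}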

1 Answer

Addendum: What you are doing is actually a matrix-vector product. It is well understood how to implement this efficiently (although, with caching and instruction-level parallelism, it is not completely trivial). The rest of the answer just shows a very simple vectorized implementation.


You can drastically simplify your implementation by incrementing `n+=8` and `i+=1` (assuming `out_neurons` is a multiple of 8; otherwise some special handling is needed for the last elements), i.e., you can accumulate 8 `dst` values at once.

A very simple implementation assuming FMA is available (otherwise use multiplication and addition):

#include <immintrin.h> // AVX/FMA intrinsics

void dot_product(const float* src, const float* scl, float* dst,
                 const int in_neurons, const int out_neurons)
{
    for(int n = 0; n < out_neurons; n+=8){
        __m256 accum = _mm256_setzero_ps();

        for(int i = 0; i < in_neurons; i++){
            // broadcast src[i] and multiply-accumulate it onto 8 consecutive weights
            accum = _mm256_fmadd_ps(_mm256_loadu_ps(&scl[i*out_neurons+n]), _mm256_set1_ps(src[i]), accum);
        }
        // save the result (8 output neurons at once)
        _mm256_storeu_ps(dst + n, accum);
    }
}

This could still be optimized, e.g., by accumulating 2, 4, or 8 `dst` packets inside the inner loop, which would not only save some broadcast operations (the `_mm256_set1_ps` instruction) but also hide the latency of the FMA instruction.

Godbolt-Link, if you want to play around with the code: https://godbolt.org/z/mm-YHi
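
For illustration only (not part of the answer or the Godbolt link), here is a sketch of that unrolling with four accumulators, assuming FMA support and `out_neurons` being a multiple of 32; the name `dot_product_x4` is made up. Each broadcast of `src[i]` is reused by four independent FMA chains, which amortizes the broadcast and hides the FMA latency:

void dot_product_x4(const float* src, const float* scl, float* dst,
                    const int in_neurons, const int out_neurons)
{
    for (int n = 0; n < out_neurons; n += 32) {
        // four independent accumulators -> four independent FMA dependency chains
        __m256 acc0 = _mm256_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps();
        __m256 acc2 = _mm256_setzero_ps();
        __m256 acc3 = _mm256_setzero_ps();

        for (int i = 0; i < in_neurons; i++) {
            const __m256 s = _mm256_set1_ps(src[i]);   // one broadcast, reused 4 times
            const float* w = &scl[i * out_neurons + n];
            acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(w +  0), s, acc0);
            acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(w +  8), s, acc1);
            acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(w + 16), s, acc2);
            acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(w + 24), s, acc3);
        }
        _mm256_storeu_ps(dst + n +  0, acc0);
        _mm256_storeu_ps(dst + n +  8, acc1);
        _mm256_storeu_ps(dst + n + 16, acc2);
        _mm256_storeu_ps(dst + n + 24, acc3);
    }
}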

chtz
  • Hi @chtz, first of all thank you for the code sample. I just modified my code to work with your sample, but it does not work out of the box. My question is: fmadd produces 8 partial accumulations and saves them in accum, so don't we need to add these partial values together at the end, or am I thinking wrong? – César Gouveia Oct 01 '19 at 14:44
  • Unless I made some mistake and assuming `out_neurons` is a multiple of 8, my code should be (mathematically) equivalent to your code (there will be numerical differences, due to the different order of additions). I could have tested this, if you had provided a [mre] ... – chtz Oct 01 '19 at 16:32
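
For reference, a scalar sketch (not from the thread) of what the answer's vectorized loop computes: lane k of `accum` holds the complete dot product for output neuron `n+k`, so no horizontal addition across lanes is needed, only the final 8-wide store.

// Scalar equivalent of the vectorized dot_product() above (sketch):
// each output neuron accumulates its own full dot product, matching one lane.
for (int n = 0; n < out_neurons; n++) {
    float acc = 0.0f;
    for (int i = 0; i < in_neurons; i++)
        acc += scl[i * out_neurons + n] * src[i];
    dst[n] = acc;
}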