How would I use SSE to make sparse float matrix convolution faster?

Question

I have been given this piece of C code as part of an assignment. My task is to use OpenMP and Intel SSE to make it run faster. I understand the logic behind SSE and OpenMP; however, I cannot wrap my head around what my approach should be.

I have included the code below. Any help is appreciated.

void team_conv_sparse(float *** image, struct sparse_matrix *** kernels,
               float *** output, int width, int height,
               int nchannels, int nkernels, int kernel_order) {

    int h, w, x, y, c, m, index;
    float value;

    // initialize the output matrix to zero
    for ( m = 0; m < nkernels; m++ ) {
        for ( h = 0; h < height; h++ ) {
            for ( w = 0; w < width; w++ ) {
                output[m][h][w] = 0.0;
            }
        }
    }

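    // print the problem size (the loop counters are stale here and c is never assigned)
    DEBUGGING(fprintf(stderr, "w=%d, h=%d, c=%d\n", width, height, nchannels));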

    // now compute multichannel, multikernel convolution
//  #pragma omp parallel for
    for ( w = 0; w < width; w++ ) {
        for ( h = 0; h < height; h++ ) {
            for ( x = 0; x < kernel_order; x++) {
                for ( y = 0; y < kernel_order; y++ ) {
                    struct sparse_matrix * kernel = kernels[x][y];
                    for ( m = 0; m < nkernels; m++ ) {
                        for ( index = kernel->kernel_starts[m]; index < kernel->kernel_starts[m+1]; index++ ) {
                            int this_c = kernel->channel_numbers[index];
                            assert( (this_c >= 0) && (this_c < nchannels) );
                            value = kernel->values[index];
                            output[m][h][w] += image[w+x][h+y][this_c] * value;
                        }
                    } // m
                } // y
            } // x
        } // h
    }// w
}

Does `#pragma omp simd` help on any of the loops to get the compiler to auto-vectorize for you (with `#define NDEBUG` of course)? Did you google for existing Q&As about manually vectorizing convolution? Also, are you stuck using that nasty triple-pointer layout instead of contiguous memory arrays with 3D / 2D index calculations? If that's necessary for sparseness, I don't see where you're checking that pointer elements are non-NULL, or any kind of ragged / variable-length row handling. e.g. your output has `nkernels * height * width` elements and is gaining nothing from that indirection — Peter Cordes, Apr 03 '20 at 13:29
Vectorising sparse vector operations is not easy since you are not accessing consecutive memory locations. You may find some inspiration in articles such as [this one](https://www.mcs.anl.gov/~hongzh/publication/zhang-2018/ICPP_KNL_final.pdf). — Hristo Iliev, Apr 03 '20 at 13:35
The obvious way would be with AVX2 or AVX512 gather loads, or manual gathers into a SIMD vector for the multiply/add reduction into `output[m][h][w]`. But that would probably be pretty bad. So auto-vectorization is probably not going to be good. Probably you can vectorize over multiple uses of the same `image[]` element, but unfortunately it's the last index that's varying inside your inner-most loop. Maybe you can vectorize over `w` to store 4 contiguous `output` results? — Peter Cordes, Apr 03 '20 at 13:43
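A minimal sketch of the "vectorize over `w`" idea from the comment above, assuming the input row is contiguous in `w` (which is not true of the question's `image[w+x][h+y][c]` layout, so the image would have to be rearranged first). The function name `accumulate_row_sse` and the demo helper are hypothetical; only the SSE intrinsics are real. One kernel value is broadcast and applied to four neighbouring pixels, so four contiguous `output` elements are loaded, updated and stored at once.

```c
#include <assert.h>
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_*_ps */

/* Hypothetical helper: out_row[w] += in_row[w] * value for w in [0, n).
   Assumes both rows are contiguous in w; no alignment is required
   because unaligned loads and stores are used. */
void accumulate_row_sse(float *out_row, const float *in_row, float value, int n)
{
    assert(n >= 0);
    const __m128 v = _mm_set1_ps(value);          /* broadcast the kernel value */
    int w = 0;

    /* Main loop: four output pixels per iteration. */
    for (; w + 4 <= n; w += 4) {
        __m128 in  = _mm_loadu_ps(in_row + w);    /* 4 contiguous inputs  */
        __m128 out = _mm_loadu_ps(out_row + w);   /* 4 contiguous outputs */
        out = _mm_add_ps(out, _mm_mul_ps(in, v)); /* out += in * value    */
        _mm_storeu_ps(out_row + w, out);
    }

    /* Scalar tail for widths that are not a multiple of 4. */
    for (; w < n; w++)
        out_row[w] += in_row[w] * value;
}

/* Tiny illustrative check: 6 elements, so the tail loop runs too. */
float demo_accumulate(int i)
{
    float in[6]  = { 1, 2, 3, 4, 5, 6 };
    float out[6] = { 1, 1, 1, 1, 1, 1 };
    accumulate_row_sse(out, in, 2.0f, 6);
    return out[i];
}
```

This avoids gathers entirely because the varying index is `w` rather than the channel. Whether it beats the compiler's own auto-vectorization of the same loop would need measuring.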
Even without vectorization putting the `w` and `h` loops in the inside should help. Mostly because `value` and `this_c` don't depend on them. And if you are able to change the store-order of `image` to `[c][h][w]` your compiler would likely be able to auto-vectorize the inner loop (the `w`-loop). — chtz, Apr 03 '20 at 14:19
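The loop reordering described in the comment above could look roughly like the sketch below. It makes several assumptions that are not in the question: the image is stored as one flat `[c][h][w]` array, the output as one flat `[m][h][w]` array, and `struct sparse_matrix` contains just the three fields the original loop uses. The function name `conv_sparse_reordered` and the demo helper are hypothetical. Putting `m` outermost is an extra choice made here so that an OpenMP `parallel for` has no write conflicts; the pragma is simply ignored when OpenMP is not enabled.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed definition: only the fields the question's loop touches. */
struct sparse_matrix {
    int   *kernel_starts;    /* nkernels + 1 offsets into the arrays below */
    int   *channel_numbers;  /* channel index of each stored value */
    float *values;           /* the non-zero kernel values */
};

/* image_chw: flat [nchannels][height+kernel_order-1][width+kernel_order-1]
   output:    flat [nkernels][height][width] */
void conv_sparse_reordered(const float *image_chw,
                           struct sparse_matrix ***kernels, float *output,
                           int width, int height, int nchannels,
                           int nkernels, int kernel_order)
{
    const size_t iw = (size_t)(width + kernel_order - 1);
    const size_t ih = (size_t)(height + kernel_order - 1);
    const size_t plane = (size_t)height * (size_t)width;

    /* Each m writes only its own output plane, so iterations are independent. */
    #pragma omp parallel for
    for (int m = 0; m < nkernels; m++) {
        float *out_m = output + (size_t)m * plane;
        for (size_t i = 0; i < plane; i++)
            out_m[i] = 0.0f;

        for (int x = 0; x < kernel_order; x++) {
            for (int y = 0; y < kernel_order; y++) {
                const struct sparse_matrix *kernel = kernels[x][y];
                for (int index = kernel->kernel_starts[m];
                     index < kernel->kernel_starts[m + 1]; index++) {
                    /* value and this_c are loaded once per non-zero entry */
                    const int this_c = kernel->channel_numbers[index];
                    assert(this_c >= 0 && this_c < nchannels);
                    const float value = kernel->values[index];
                    const float *img_c = image_chw + (size_t)this_c * ih * iw;
                    for (int h = 0; h < height; h++) {
                        const float *in_row = img_c + (size_t)(h + y) * iw + (size_t)x;
                        float *out_row = out_m + (size_t)h * (size_t)width;
                        /* Contiguous in w: a candidate for auto-vectorization */
                        for (int w = 0; w < width; w++)
                            out_row[w] += in_row[w] * value;
                    }
                }
            }
        }
    }
}

/* Tiny illustrative case: 1 channel, 1 kernel, kernel_order 2, 2x1 output. */
float demo_reordered(int w)
{
    float image[2 * 3] = { 1, 2, 3,
                           4, 5, 6 };           /* [c=0][h][w] */
    int starts[2] = { 0, 1 };
    int chans[1] = { 0 };
    float v00[1] = { 1.0f }, v01[1] = { 10.0f };
    float v10[1] = { 100.0f }, v11[1] = { 1000.0f };
    struct sparse_matrix k00 = { starts, chans, v00 };
    struct sparse_matrix k01 = { starts, chans, v01 };
    struct sparse_matrix k10 = { starts, chans, v10 };
    struct sparse_matrix k11 = { starts, chans, v11 };
    struct sparse_matrix *row0[2] = { &k00, &k01 };   /* x = 0 */
    struct sparse_matrix *row1[2] = { &k10, &k11 };   /* x = 1 */
    struct sparse_matrix **kernels[2] = { row0, row1 };
    float out[2];
    conv_sparse_reordered(image, kernels, out, 2, 1, 1, 1, 2);
    return out[w];
}
```

Note that this version adds the products in a different order from the original, so results can differ in the last bits of floating-point rounding. The input image also has to be converted to the `[c][h][w]` layout first, and that cost counts against any speedup.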
