How to multiply a 3x3 convolution kernel with an image

Question

There is a 3x3 convolution kernel and an image represented as an array of integer pixel values.

The convolution kernels are defined as follows:

//compound convolutional kernels
//                                | 1, 0,  1|
// convolutional kernel H = src x | 0, 0,  0|
//                                |-1, 0, -1|

//                                | 1, 0, -1|
// convolutional kernel V = src x | 0, 0,  0|
//                                | 1, 0, -1|

// combined convolution kernel = kernel H + kernel V

for(int inc=0; inc<height-2; inc++)
{
    //load 16 px from each of the 3 lines into registers
    str1_16pxs = _mm_loadu_si128((__m128i*)(src_all_str));
    str2_16pxs = _mm_loadu_si128((__m128i*)(src2_all_str));
    str3_16pxs = _mm_loadu_si128((__m128i*)(src3_all_str));

    //widen the first 8 px of each line from 8 to 16 bits
    str1_16pxs_pack1st_8to16 = _mm_cvtepu8_epi16(str1_16pxs);
    str2_16pxs_pack1st_8to16 = _mm_cvtepu8_epi16(str2_16pxs);
    str3_16pxs_pack1st_8to16 = _mm_cvtepu8_epi16(str3_16pxs);

//---!
        //there is we make the first convolution for 8px's
        //... How ???
//---

    //sum the first 8 px (widened to 16 bits) of the 3 lines vertically
    sum1_str12_vert_16pxs_pack1st_8to16  = _mm_add_epi16(str1_16pxs_pack1st_8to16,           str2_16pxs_pack1st_8to16);
    sum1_str123_vert_16pxs_pack1st_8to16 = _mm_add_epi16(sum1_str12_vert_16pxs_pack1st_8to16,str3_16pxs_pack1st_8to16);

    for(int jnc=0; jnc<(width >> 4); jnc++)
    {
        str1_16pxs_plus_8pxs = _mm_srli_si128(str1_16pxs, 8);
        str2_16pxs_plus_8pxs = _mm_srli_si128(str2_16pxs, 8);
        str3_16pxs_plus_8pxs = _mm_srli_si128(str3_16pxs, 8);

        //pack 2nd 8to16 registers (+8px's)
        str1_16pxs_pack2nd_8to16 = _mm_cvtepu8_epi16(str1_16pxs_plus_8pxs);
        str2_16pxs_pack2nd_8to16 = _mm_cvtepu8_epi16(str2_16pxs_plus_8pxs);
        str3_16pxs_pack2nd_8to16 = _mm_cvtepu8_epi16(str3_16pxs_plus_8pxs);

//---!
            //do convolution for the remaining 8px's and so on until the end of the read line
            //... How ???
//---

        //sum the second 8 px (widened to 16 bits) of the 3 lines vertically
        sum1_str12_vert_16pxs_pack2nd_8to16  = _mm_add_epi16(str1_16pxs_pack2nd_8to16,           str2_16pxs_pack2nd_8to16);
        sum1_str123_vert_16pxs_pack2nd_8to16 = _mm_add_epi16(sum1_str12_vert_16pxs_pack2nd_8to16,str3_16pxs_pack2nd_8to16);

        //advance to the next 16 px
        src_all_str += 16;
        src2_all_str += 16;
        src3_all_str += 16;

        //...

        _mm_store_si128((__m128i*)(dst_all_str), res);
        dst_all_str += 8;

    }//for(jnc)

}//for(inc)
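
For reference, the operation asked about can be written as a plain scalar loop before any SIMD is attempted. The sketch below is not from the original post: the names `conv3x3_scalar` and `kKernelHV` are hypothetical, and it assumes the kernel is applied without flipping, anchored at the top-left pixel of each 3x3 window, with no border handling, so the output has (width - 2) x (height - 2) values. Adding the two kernels above element by element gives a combined kernel with only two nonzero coefficients, 2 at the top-left and -2 at the bottom-right.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combined kernel H + V from the question (element-wise sum):
 *   | 2, 0,  0|
 *   | 0, 0,  0|
 *   | 0, 0, -2|
 */
static const int16_t kKernelHV[3][3] = {
    { 2, 0,  0},
    { 0, 0,  0},
    { 0, 0, -2},
};

/* Scalar reference (hypothetical helper, not the asker's code).
 * Applies a 3x3 kernel to an 8-bit image without flipping the kernel.
 * dst must hold (width - 2) * (height - 2) values, stored row by row.
 * Assumption: the kernel is small enough that every sum fits in int16_t. */
void conv3x3_scalar(const uint8_t *src, int width, int height,
                    const int16_t k[3][3], int16_t *dst)
{
    assert(width >= 3 && height >= 3);
    const int out_w = width - 2;

    for (int y = 0; y < height - 2; y++) {
        for (int x = 0; x < out_w; x++) {
            int sum = 0;
            /* Weighted sum over the 3x3 window whose top-left pixel is (x, y). */
            for (int ky = 0; ky < 3; ky++)
                for (int kx = 0; kx < 3; kx++)
                    sum += k[ky][kx] * src[(size_t)(y + ky) * width + (x + kx)];
            dst[(size_t)y * out_w + x] = (int16_t)sum;
        }
    }
}
```

A loop like this is mainly useful as a baseline to check a vectorized version against, since each output here depends only on the three source lines the question's loop already loads.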

Look in the code: //---! //there is we make the first convolution for 8px's //... How ??? //--- — Georgy, May 18 '18 at 14:08
The code block is for code. The question and a description of what you have and what you want should go in the text. — Kami Kaze, May 18 '18 at 14:09
Please read [ask], especially the link about homework questions. — chtz, May 18 '18 at 22:43
How should I explain it? There is an array of pixels represented by integer values, and there is a 3x3 convolution kernel. After I read three lines from the array, I need to apply the convolution kernel to them. I do not know how to apply the convolution kernel to an array of integer values. The "How ???" comments mark the places where this operation should be performed. — Georgy, May 19 '18 at 09:59
These slides are interesting: https://pdfs.semanticscholar.org/presentation/17a5/501f7bb321844b85be535f4e7e196b5aaa33.pdf. They have an example of a 3x3 box filter using SSE2 intrinsics, with discussion about tiling for cache locality. (But mostly it's talking about Halide, which apparently lets you express an algo more simply and still get an optimized implementation.) — Peter Cordes, Jun 07 '18 at 10:21

score -2 · Answer 1 · answered Jun 07 '18 at 09:03


So, sample code:

void SSEcode_Conv3x3 (unsigned char *src, int width, int height, short *dst) 
{
// Do nothing unless width is a multiple of 16
if (width & 0xF) return;

unsigned char* src_line1 = src;
unsigned char* src_line3 = src + 2 * width;

__m128i zero = _mm_setzero_si128();

for (int i = 0; i < height - 2; i++) 
{
    __m128i line1 = _mm_load_si128((__m128i*)src_line1);
    __m128i line3 = _mm_load_si128((__m128i*)src_line3);
     for (int j = 0; j < width / 16 - 1; j++)
     {
        src_line1 += 16;
        src_line3 += 16;

        __m128i line1next = _mm_load_si128((__m128i*)src_line1);
        __m128i line3next = _mm_load_si128((__m128i*)src_line3);

       //blablabla
#ifdef USE_CORE_H
_mm_add_epi16
_mm_add_epi16
_mm_sub_epi16
#endif
       //blablabla

       _mm_store_si128((__m128i*)(dst + 8), res);
       line1 = line1next;
       line3 = line3next;

       dst += 16;
     }//for (j)

     src_line1 += 16;
     src_line3 += 16;

     //blablabla

     _mm_store_si128((__m128i*)(dst + 8), res);
     dst += 16;
}//for (i)

}

It took a long time to write the code. I'm new, so it's a pity that nobody well versed in the CE helped with the intrinsics. :(
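
One possible way to fill in the elided computation for the combined kernel H + V is sketched below. It is not the 342-line implementation mentioned in the comments. It assumes the combined kernel has only the two nonzero coefficients 2 (top-left) and -2 (bottom-right), that the kernel is not flipped, and that the output is (width - 2) x (height - 2) with no border handling. The function name `conv3x3_hv_sse2` is hypothetical. It uses only SSE2 intrinsics from `<emmintrin.h>` and unaligned loads, so it needs an x86 target but places no alignment or width requirements on the buffers.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics, x86 only */
#include <stddef.h>
#include <stdint.h>

/* Sketch: out[y][x] = 2 * src[y][x] - 2 * src[y + 2][x + 2].
 * dst must hold (width - 2) * (height - 2) values, stored row by row. */
void conv3x3_hv_sse2(const uint8_t *src, int width, int height, int16_t *dst)
{
    assert(width >= 3 && height >= 3);
    const int out_w = width - 2;
    const __m128i zero = _mm_setzero_si128();

    for (int y = 0; y < height - 2; y++) {
        const uint8_t *row1 = src + (size_t)y * width;           /* top-left tap */
        const uint8_t *row3 = src + (size_t)(y + 2) * width + 2; /* bottom-right tap */
        int16_t *out = dst + (size_t)y * out_w;
        int x = 0;

        /* 16 outputs per iteration; the loads stay inside the two source rows. */
        for (; x + 16 <= out_w; x += 16) {
            __m128i a = _mm_loadu_si128((const __m128i *)(row1 + x));
            __m128i b = _mm_loadu_si128((const __m128i *)(row3 + x));

            /* Zero-extend the 8-bit pixels to 16 bits (low and high halves). */
            __m128i a_lo = _mm_unpacklo_epi8(a, zero);
            __m128i a_hi = _mm_unpackhi_epi8(a, zero);
            __m128i b_lo = _mm_unpacklo_epi8(b, zero);
            __m128i b_hi = _mm_unpackhi_epi8(b, zero);

            /* 2*a - 2*b computed as (a - b) << 1; the result lies in [-510, 510]. */
            __m128i r_lo = _mm_slli_epi16(_mm_sub_epi16(a_lo, b_lo), 1);
            __m128i r_hi = _mm_slli_epi16(_mm_sub_epi16(a_hi, b_hi), 1);

            _mm_storeu_si128((__m128i *)(out + x), r_lo);
            _mm_storeu_si128((__m128i *)(out + x + 8), r_hi);
        }

        /* Scalar tail for the remaining outputs of this row. */
        for (; x < out_w; x++)
            out[x] = (int16_t)(2 * (row1[x] - row3[x]));
    }
}
```

The unaligned load at `row3 + x` replaces the approach of loading the next 16-byte block and shifting across registers, which trades some load bandwidth for much simpler code. For kernel H alone, the same pattern would instead add the pixels at x and x + 2 within each row and subtract the line-3 sum from the line-1 sum, which matches the add, add, sub sequence named under `USE_CORE_H`.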

answered Jun 07 '18 at 09:03

Georgy


Did you have a copy/paste error? Your code won't even compile: you have `_mm_add_epi16` on a line by itself with no args. – Peter Cordes Jun 07 '18 at 09:23
Peter Cordes, it is sample code. :) The actual code takes 342 lines. – Georgy Jun 07 '18 at 10:20
Then this isn't an answer to your question. Why did you even post this without even a link to your real implementation? It's not useful at all to anyone who finds this Q&A while looking for a 3x3 image convolution. – Peter Cordes Jun 07 '18 at 10:22
Peter Cordes, for example, I can write an abstract explanation of what to do. – Georgy Jun 07 '18 at 11:36
Please use the edit link on your question to add additional information. The Post Answer button should be used only for complete answers to the question. - [From Review](/review/low-quality-posts/19954285) – Flovdis Jun 07 '18 at 11:58
@Flovdis: this is intended as an answer, it just isn't a good or complete answer. It's an outline of how to load the src and store the dst, with the actual computation being shown only as function names without showing which args to use. I guess something like doubling the loaded line1 and line3 data with add, then subtracting them, because the convolution only has 2 nonzero coefficients, and they're 2 and -2. The answer also doesn't say anything about why you'd use any of the instructions it does, so it's just a code-dump which doesn't help people adapt it for other cases. – Peter Cordes Jun 07 '18 at 12:03
@Georgy The main reason why people flagged the original question and this "answer" for deletion is that they do not follow the guidelines of this site. The original question does not contain an actual question. So how should people know what you are asking? Please edit your question to make a proper question out of it. Also, make sure your answer is edited to make your points clear. If you leave it in this state, the question and answer will most likely get deleted. – Flovdis Jun 07 '18 at 12:14
@PeterCordes You already edited the question, so maybe you could make a second edit to make clear what the question is. – Flovdis Jun 07 '18 at 12:15
@Flovdis: I'm not really sure. AFAICT, it's just asking for a SIMD 3x3 convolution implementation, but doesn't provide a sample scalar implementation. I'm not an expert on image-processing algos; I searched for 3x3 SSE convolution but surprisingly didn't find many existing implementations. IDK if it's specific to the kernel or if other terms would be needed to find example implementations. Actually now that you mention it, this answer would be an improvement to the framework code in the question, because it has types for its variables and a function signature. – Peter Cordes Jun 07 '18 at 12:29
