Let's say I want to compute a running horizontal average along the x-axis of an image.

Func g, h;
g(x,y) = (img(x-1,y) + img(x,y) + img(x+1,y))/3.f;
h(x,y) = cast<uint8_t>(g(x,y) + 0.5f);

Using float32 for g(x,y) seems like overkill, but I do care about precision, so I would rather avoid an integer division.
Can I use float16_t instead of float32 to gain more throughput?

Could it be done this way?

Expr three = cast<float16_t>(3.f);
Expr point5 = cast<float16_t>(0.5f);
g(x,y) = (img(x-1,y) + img(x,y) + img(x+1,y))/three;
h(x,y) = cast<uint8_t>(g(x,y) + point5);

I'm going to use an auto scheduler to do the job. It seems that AVX2 has the ability to process float16_t in parallel. Will there be a problem if this piece of code is generated for the target x86_64-sse4.1?

prgbenz

1 Answer

float16 conversions exist on avx2, but it doesn't actually do float16 math in parallel, so it'll be slow. I recommend using uint16 instead for this sort of thing. It's actually more precise than using floats for the code you've given:

Func in16, g;
in16(x, y) = cast<uint16_t>(img(x, y));
g(x,y) = in16(x-1,y) + in16(x,y) + in16(x+1,y);
h(x,y) = cast<uint8_t>((g(x,y) + 1)/3);
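
To see why adding 1 before dividing by 3 rounds to nearest, here is a small plain C++ sketch (no Halide involved) that compares the integer formula against rounding the exact quotient for every possible sum of three 8-bit pixels. The function names avg3_int and count_mismatches are made up for this illustration.

```cpp
#include <cmath>
#include <cstdint>

// Rounded average of three 8-bit pixels using only integer math,
// mirroring the (sum + 1) / 3 expression in the snippet above.
inline uint8_t avg3_int(uint8_t a, uint8_t b, uint8_t c) {
    // The sum is at most 3 * 255 = 765, so it fits easily in 16 bits.
    uint16_t sum = static_cast<uint16_t>(a + b + c);
    return static_cast<uint8_t>((sum + 1) / 3);
}

// Counts how many of the possible sums (0..765) give a different result
// from rounding the exact quotient sum / 3 to the nearest integer.
inline int count_mismatches() {
    int mismatches = 0;
    for (int s = 0; s <= 765; ++s) {
        long by_int = (s + 1) / 3;           // integer formula
        long by_ref = std::lround(s / 3.0);  // reference in double precision
        if (by_int != by_ref) ++mismatches;
    }
    return mismatches;
}
```

A sum divided by 3 can only have a fractional part of 0, 1/3 or 2/3, so there are never ties, and adding 1 before the truncating division pushes only the 2/3 case up to the next integer.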

The division operation will use the x86 vector instruction pmulhuw, so it'll be fast.
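
For readers unfamiliar with the trick, the sketch below shows in scalar C++ how a division by 3 can be replaced with a "multiply and keep the high 16 bits" step, which is what pmulhuw does per 16-bit lane. The constant 0xAAAB and the extra shift by 1 are a standard choice for dividing 16-bit unsigned values by 3; whether Halide emits exactly this constant is an assumption, and the helper names are hypothetical.

```cpp
#include <cstdint>

// High 16 bits of an unsigned 16x16-bit multiply: the per-lane result of pmulhuw.
inline uint16_t mulhi_u16(uint16_t a, uint16_t b) {
    return static_cast<uint16_t>((static_cast<uint32_t>(a) * b) >> 16);
}

// Division by 3 without a divide instruction.
// 0xAAAB * 3 = 2^17 + 1, so (x * 0xAAAB) >> 17 equals x / 3 for all 16-bit x.
inline uint16_t div3_u16(uint16_t x) {
    return static_cast<uint16_t>(mulhi_u16(x, 0xAAAB) >> 1);
}

// Exhaustively checks the identity over the whole 16-bit range.
inline bool div3_matches_everywhere() {
    for (uint32_t x = 0; x <= 0xFFFF; ++x) {
        if (div3_u16(static_cast<uint16_t>(x)) != x / 3) return false;
    }
    return true;
}
```

Since the multiply and the shift are both cheap vector operations, eight or sixteen pixels can be divided at once, depending on the vector width.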

Andrew Adams