
To process 8-bit pixels (for things like gamma correction) without losing information, we normally upsample the values, work in 16 bits or so, and then downsample the result back to 8 bits.

Now, this is a somewhat new area for me, so please excuse incorrect terminology etc.

For my needs I have chosen to work in "non-standard" Q15, where I only use the upper half of the range (0.0-1.0), and 0x8000 represents 1.0 instead of -1.0. This makes it much easier to calculate things in C.

But I ran into a problem with SSSE3. It has the PMULHRSW instruction, which multiplies Q15 numbers, but it assumes the "standard" Q15 range of [-1, 1-2⁻¹⁵], so multiplying (my) 0x8000 (1.0) by 0x4000 (0.5) gives 0xC000 (-0.5), because it treats 0x8000 as -1. This is quite annoying.
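To make the wrap-around concrete, here is a minimal scalar sketch of what one 16-bit lane of PMULHRSW computes, assuming the documented behaviour (32-bit product, shift right by 14, add 1, keep bits 16..1). The helper name `pmulhrsw_lane` is hypothetical and operands are passed as raw 16-bit patterns; it is a model for experimenting, not the instruction itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one 16-bit lane of PMULHRSW.
   Inputs and output are raw 16-bit patterns, interpreted as signed Q15. */
static uint16_t pmulhrsw_lane(uint16_t a_bits, uint16_t b_bits)
{
    /* Sign-extend the 16-bit patterns to 32-bit signed values. */
    int32_t a = (a_bits & 0x8000u) ? (int32_t)a_bits - 0x10000 : (int32_t)a_bits;
    int32_t b = (b_bits & 0x8000u) ? (int32_t)b_bits - 0x10000 : (int32_t)b_bits;

    /* Work on the two's-complement bit pattern of the product so that
       the shifts below are well defined in C. */
    uint32_t product = (uint32_t)(a * b);

    /* temp = (product >> 14) + 1; the result is bits 16..1 of temp. */
    uint32_t temp = (product >> 14) + 1u;
    return (uint16_t)(temp >> 1);
}
```

With this model, `pmulhrsw_lane(0x8000, 0x4000)` yields `0xC000`, i.e. -0.5 under the standard signed interpretation, which is the behaviour described above.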

What am I doing wrong? Should I keep my pixel values in the 0000-7FFF range? Doesn't this kind of defeat the purpose of it being a fixed-point format? Is there a way around this? Maybe some trick?

Is there some kind of definitive treatise on Q15 which discusses all this?

Jens Björnhager
Alex
  • Well you could always throw in a special case just to handle 0x8000.. apart from that, I don't know. – harold Aug 29 '12 at 16:55
  • I know, but a special case in a tight inner loop kills the speed advantage, plus to do it for 4 channels at the same time is more hassle than it's worth. – Alex Aug 29 '12 at 16:57
  • it would probably still be faster than the C code. It just takes a shift and a pblendvb. Or would it actually be correct to always AND with 0x7FFF? – harold Aug 29 '12 at 17:07
  • I've also run into this problem and I agree that it's quite annoying. It's one of a number of cases where AltiVec got it right and SSE is broken (IMNVHO). – Paul R Aug 29 '12 at 18:06
  • I have used premultiplied values with success in a weighted average scenario. On the left side there were values in the range `0` to `0x7fff`, which represented the weights going from `0.0` to `1.0` (exactly); on the other side were the values to be weighted, which were arbitrary numbers generally lower than 1.0. I multiplied the values by a factor of 32768.0/32767.0 (which is basically equivalent to adding 1 to each value greater than 2^14) and there was actually no loss of precision over the whole range, thanks to the rounding that PMULHRSW does. – Gunther Piez Aug 29 '12 at 21:21
  • Paul R, could you please elaborate on the difference between AltiVec and SSE? – Alex Aug 30 '12 at 03:20
  • I've been running some tests and it seems that PMULHRSW is useless for precise calculations. It doesn't actually treat 7FFF as 1.0, it treats it as 7FFF/8000=0.999969482421875. Multiplying 7FFF by 7FFF gives 7FFE, which is logical, since we're actually multiplying 0.999969482421875 by itself. So, to keep this "1.0", at least one of the terms would have to be "8000", but 7FFF*8000=8001, because 8000 is -1. So, PMULHRSW can't multiply 1.0 by 1.0. Either I'm missing something, or it is, indeed, broken. – Alex Aug 30 '12 at 11:58
  • drhirsch: how do you quickly divide by 0x7FFF in SSE? – Alex Aug 31 '12 at 09:10
  • @Alex: You don't. I had a table of fixed values (the weights) on the left hand side, which I _pre_ multiplied, meaning I did the correction only once. This was possible in my special scenario, I don't know if it does something for you - I just wanted to add an idea. – Gunther Piez Sep 05 '12 at 08:29
  • Thinking about it again, a "division" by Q15 `0x7fff` is actually possible if you limit yourself to the range of positive values: Compare with `0x3fff` and add 1 if greater than. Needs two instructions. – Gunther Piez Sep 05 '12 at 08:35
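The compare-and-add "division" from the last comment can be sketched in scalar C as follows. Both function names are hypothetical, and the second function is only an exact-rounding reference to compare against. In SSE the same idea would presumably map to a PCMPGTW against `0x3fff` followed by subtracting the resulting all-ones mask, but that mapping is an assumption rather than something spelled out in the comment.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: scale a non-negative Q15 value (0..0x7FFF) by
   32768/32767, rounded to nearest, using only a compare and an add. */
static uint16_t scale_by_32768_over_32767(uint16_t v)
{
    return (uint16_t)(v + (v > 0x3FFFu ? 1u : 0u));
}

/* Reference: the same scaling with exact integer arithmetic,
   computing floor(v * 32768 / 32767 + 0.5). */
static uint16_t scale_reference(uint16_t v)
{
    uint32_t numerator = 2u * (uint32_t)v * 32768u + 32767u;
    return (uint16_t)(numerator / (2u * 32767u));
}
```

Over the whole range `0`..`0x7fff` the two functions agree, and `0x7fff` maps to `0x8000`, so the result has to be read as an unsigned pattern.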

2 Answers


Personally, I'd go with the solution of limiting the max value to 0x7FFF (~0.99something).

  • You don't have to jump through hoops getting the processor to work the way you'd like it
  • You don't have to spend a long time documenting the ins and outs of your "weird" code, as operating over 0-0x7FFF will be immediately recognisable to the readers of your code - Q-format is understood (in my experience) to run from -1.0 to +1.0-one lsb. The arithmetic doesn't work out so well otherwise, as the value of 1 lsb is different on each side of the 0!

Unless you can imagine yourself successfully arguing, to a panel of argumentative code reviewers, that that extra bit is critical to the operation of the algorithm rather than just "the last 0.01% of performance", stick to code everyone can understand, and which maps to the hardware you have available.


Alternatively, rearrange your previous operation so that the pixels all come out as the negative of what you originally had, or change the following operations to take in the negative of what you previously sent them. Then use values from -1.0 to 0.0 in Q15 format.

Martin Thompson
  • This is what I'll probably end up doing. What this means though is that PMULHRSW cannot multiply by 1.0, so the result will always have a bias towards black (http://bourt.com/blog/?p=448). – Alex Sep 03 '12 at 10:44
  • I just found this and was about to suggest the second option. +1 for that. – Stephen Canon Sep 03 '12 at 17:13

If you are sure that you won’t use any number “bigger” than 0x8000, the only problem arises when at least one of the factors is 0x8000 (–1, though you wish it were 1).

In this case the solution is rather simple:

    pmulhrsw xmm0, xmm1
    psignw xmm0, xmm0

Or, absolutely equivalent in our case (Thanks, Peter Cordes!):

    pmulhrsw xmm0, xmm1
    pabsw xmm0, xmm0

This flips the negative results of multiplying by –1 back to their positive values.
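As a rough check of this idea, the following scalar C sketch models one lane of the multiply followed by the absolute value, with results read as unsigned patterns where `0x8000` means 1.0. All three function names are hypothetical, and the lane models assume the documented behaviour of PMULHRSW and PABSW; this is a sketch for experimenting, not a drop-in replacement for the SIMD code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one 16-bit lane of PMULHRSW
   (raw bit patterns in, raw bit pattern out). */
static uint16_t pmulhrsw_lane(uint16_t a_bits, uint16_t b_bits)
{
    int32_t a = (a_bits & 0x8000u) ? (int32_t)a_bits - 0x10000 : (int32_t)a_bits;
    int32_t b = (b_bits & 0x8000u) ? (int32_t)b_bits - 0x10000 : (int32_t)b_bits;
    uint32_t product = (uint32_t)(a * b);   /* two's-complement pattern */
    uint32_t temp = (product >> 14) + 1u;   /* round */
    return (uint16_t)(temp >> 1);           /* bits 16..1 of temp */
}

/* Hypothetical scalar model of one lane of PABSW:
   negative patterns are negated, and 0x8000 stays 0x8000. */
static uint16_t pabsw_lane(uint16_t x)
{
    return (x & 0x8000u) ? (uint16_t)(0x10000u - x) : x;
}

/* Multiply two values in the "0x0000..0x8000 means 0.0..1.0" convention. */
static uint16_t mul_unit_q15(uint16_t a, uint16_t b)
{
    return pabsw_lane(pmulhrsw_lane(a, b));
}
```

Under this model, `mul_unit_q15(0x8000, 0x4000)` gives `0x4000` and `mul_unit_q15(0x8000, 0x8000)` gives `0x8000`, so multiplying by 1.0 leaves the other operand unchanged as long as both inputs stay within `0x0000`..`0x8000`.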

Zoltán Bíró
  • SSSE3 [`pabsw`](http://felixcloutier.com/x86/PABSB:PABSW:PABSD.html) might work too (as a copy-and-abs). It produces the result as an unsigned integer, so the `0x8000` corner case stays as `0x8000` with either `pabsw` or `psignw` if it ever appears as the *result* of a multiply. (If that's even possible). – Peter Cordes Sep 17 '17 at 05:30