
Though GPUs are meant primarily for floating-point data types, I'd be interested in how fast a GPU can process bitwise operations. These are among the fastest possible operations on a CPU, but does the GPU emulate bitwise operations, or are they fully implemented in hardware? I'm planning to use them inside shader programs written in GLSL. I'd also assume that if bitwise operations run at full performance, integer data types should as well, but I'd like confirmation on that.

To be more precise, the targeted versions are OpenGL 3.2 and GLSL 1.50. The hardware that should run this is any Radeon HD graphics card and GeForce series 8 and newer. If there are major changes in newer versions of OpenGL and GLSL related to the processing speed of bitwise operations/integers, I'd be glad if you'd point them out.
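
For illustration, here is roughly the kind of thing I want to do (just a sketch; the texture, variable names and bit masks are placeholders, not real code from my project):

#version 150

uniform usampler2D uFlags;   // hypothetical texture holding packed bit flags
in vec2 vTexCoord;
out vec4 fragColor;

void main()
{
    uint flags = texture(uFlags, vTexCoord).r;

    // plain 32-bit integer bitwise tests and shifts
    bool visible  = (flags & 0x1u) != 0u;
    uint category = (flags >> 4) & 0xFu;

    fragColor = visible ? vec4(float(category) / 15.0) : vec4(0.0);
}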

Raven
  • You need to specify a particular GPU architecture, or at least OpenGL version. Right now this question is horribly vague. – Ben Voigt Dec 30 '11 at 22:21
  • @BenVoigt Updated. Is it precise enough, or do you need a specific architecture code name (they change them with almost every new card)? – Raven Dec 30 '11 at 22:42
  • Raven: There are some huge changes between Radeon HD 1xxx and HD 7xxx, but that extra information is a big improvement. Assuming that you're looking at cards which advertise OpenGL 3.2 support (or later), that's probably clear enough. – Ben Voigt Dec 30 '11 at 23:10

1 Answer


This question was partially answered in Integer calculations on GPU.

In short, modern GPUs have equivalent INT and FP performance for 32-bit data, so your logical operations will run at the same speed.

From a programming perspective you will lose performance if you are dealing with SCALAR integer data. GPUs like working with PARALLEL and PACKED operations, so something like this runs at full speed:

for(int i=0; i<LEN_VEC4; i++)
    VEC4[i] = VEC4[i] * VEC4[i]; // (x,y,z,w) * (x,y,z,w)

If you're doing something like...

for(int i=0; i<LEN_VEC4; i++)
    VEC4[i].w = (VEC4[i].x & 0xF0F0F0F0) | (VEC4[i].z ^ 0x0F0F0F0F) ^ VEC4[i].w;

...doing many different operations on elements of the same vector, you will run into performance problems.
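
For instance (just a sketch, assuming GLSL 1.50 and placeholder names), the same kind of bitwise work stays packed if you let the component-wise integer operators act on a whole uvec4 at once:

#version 150

flat in uvec4 inBits;    // hypothetical packed integer input
out vec4 fragColor;

void main()
{
    // &, |, ^ and shifts apply component-wise to all four lanes at once,
    // so this stays a packed operation instead of per-component scalar work
    uvec4 masked = (inBits & uvec4(0xF0F0F0F0u)) | ((inBits >> 4) & uvec4(0x0F0F0F0Fu));

    fragColor = vec4(masked & uvec4(0xFFu)) / 255.0;
}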

Louis Ricci
  • Thanks for your answer. In combination with the linked post it is sufficient, but I have one more question. As written, INT and FP performance should be the same, but there is nothing like bitwise operations for FP (or at least it would be strange to do them). So what exactly is said to be equal: addition and so on? And if that's the case, are bitwise ops (e.g. shifting) faster than math ops (adding, etc.) for INT data types, or is the performance also equal? – Raven Jan 03 '12 at 18:43
  • Whether "X bit shift left by 1" is faster than "X + X" is pretty architecture dependent. I'd hope that some optimization would occur when your shader is compiled (unless you're writing it in GPU assembly). "X divide by 2" is of course slower than "X bit shift right by 1", simply because there is more logic involved in a divide than in a bit shift. – Louis Ricci Jan 04 '12 at 12:44 (see the sketch after these comments)
  • "GPUs like working with PARALLEL and PACKED operations." The most recent GPUs of NVidia and AMD are scalar architectures. So the performance for purely scalar operations is in fact higher than for vector operations. – datenwolf Jan 04 '12 at 12:45
  • @datenwolf Good to know, but I can confirm that at least in OpenGL 3.2 they're working with packed formats. I noticed this when I was building a big array of variables and the maximum size was the same for scalars and 4D vectors. The only conclusion I could draw is that all data was stored in vectors. As you said, this should be different for the most modern GPUs. – Raven Jan 04 '12 at 18:27
  • That happens if you use the std140 layout instead of std430. – avl_sweden Nov 23 '17 at 05:42
  • GPUs aren't scalar - they're vectorized on *all operations that are possible to do*. You won't get any additional vectorization gains from using vectors, except for a very, very few instructions that don't include general float operations, because each "thread" on the GPU is actually a SIMD lane. – lahwran Mar 19 '18 at 04:32
  • This answer is misleading. When you optimize for a vector machine (or when a shader is compiled), you do _not_ line up linear algebra vectors with SIMD vector registers. Instead, you put one register with the x components of various different linalg vectors, another with all the y components, etc. That way, acting separately on different mathematical vector components, adding across them (dot product, normalization), or working with scalars, can still be parallelized. – hegel5000 May 10 '23 at 15:27
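
Regarding the shift-versus-divide point in the comments above, a minimal GLSL 1.50 sketch (the input name is a placeholder, and a driver compiler may well fold the constant division into a shift on its own):

#version 150

flat in ivec4 packedData;   // hypothetical integer input
out vec4 fragColor;

void main()
{
    uint u = uint(packedData.x);

    // for unsigned values these two are equivalent
    uint byShift  = u >> 1u;
    uint byDivide = u / 2u;

    // for negative signed values they differ: >> sign-extends and rounds
    // toward negative infinity, / rounds toward zero
    int s       = packedData.y;
    int sShift  = s >> 1;
    int sDivide = s / 2;

    fragColor = vec4(float(byShift + byDivide), float(sShift + sDivide), 0.0, 1.0);
}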