Will this sine approximation be faster than the built-in Cg shader sin function?

Question

I have some functions that are not really sines, but they are a lot quicker than conventional processing; they are simple parabola functions.

Will the following be faster on a graphics processor than the built-in graphics sine function?

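    // Parabolic sine approximation. Note the period is 4.8, not 2*pi,
    // so this tracks sin(xx * pi / 2.4) rather than sin(xx).
    float par(float xx)
    {
        // Position inside the current half-period, centred on zero: [-1.2, 1.2).
        half xd = fmod(abs(xx), 2.4) - 1.2;
        if (fmod(abs(xx), 4.8) > 2.4) { xd = -xd * xd + 2.88; } // second half-period: negative lobe
        else                          { xd = xd * xd; }         // first half-period: positive lobe
        xd = -xd * 0.694444444 + 1;   // 0.694444444 is 1 / 1.44, i.e. 1 / (1.2 * 1.2)
        if (xx < 0) { xd = -xd; }     // odd symmetry, like sin
        return xd;
    }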
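
Before asking about speed, it can help to know how closely the parabola tracks a true sine. The sketch below is a CPU-side Python port, not shader code. It assumes the function is meant to approximate sin(x * pi / 2.4), because its period is 4.8 rather than 2*pi. `par` mirrors the Cg function above; `max_error` is a hypothetical helper added only for this comparison.

```python
import math

def par(xx: float) -> float:
    """Python port of the Cg parabola approximation (period 4.8)."""
    ax = abs(xx)
    xd = math.fmod(ax, 2.4) - 1.2      # position in half-period, in [-1.2, 1.2)
    if math.fmod(ax, 4.8) > 2.4:
        xd = -xd * xd + 2.88           # second half-period: negative lobe
    else:
        xd = xd * xd                   # first half-period: positive lobe
    xd = -xd * 0.694444444 + 1.0       # 0.694444444 is 1 / 1.44
    return -xd if xx < 0 else xd

def max_error(samples: int = 4801) -> float:
    """Largest absolute difference from sin(x * pi / 2.4) over one period."""
    worst = 0.0
    for i in range(samples):
        x = 4.8 * i / (samples - 1)
        worst = max(worst, abs(par(x) - math.sin(x * math.pi / 2.4)))
    return worst

if __name__ == "__main__":
    print(f"max abs error over one period: {max_error():.4f}")
```

This only measures accuracy on the CPU; it says nothing about GPU timing, which is what the question is about.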

performance-of-different-cg-glsl-hlsl-functions http://stackoverflow.com/questions/8415251/performance-of-different-cg-glsl-hlsl-functions — bandybabboon, Oct 17 '13 at 09:53
I am jumping from DSP to GPU coding. Why mark me down for this question? Explain yourself; it's antisocial. A sine takes about 40-70 cycles on a CPU, and the parabola above takes about 10 cycles on a CPU, so why shouldn't I ask this about the GPU as I am coding my first GPU shaders? — bandybabboon, Feb 24 '14 at 02:54

Strings · Accepted Answer · 2013-09-21T11:03:29.483

MAIN ANSWER

There is absolutely no way your function will be faster than the built-in sin/cos functions on any graphics card.

The shader instructions sin, cos and tan are single-cycle instructions on just about EVERY graphics card ever manufactured. You certainly cannot purchase a graphics card today where they aren't single-cycle.

To put your question in perspective - on a graphics card, it takes the same time to multiply 2 numbers (mul instruction) as it does to get the sine (sin function) - a single GPU cycle.

When writing your shaders have a look at the command line options for your compiler. There will be options to output the assembly code generated, and most compilers even provide totals for the shortest path (number of instructions and cycles) and the longest path. These totals are not guaranteed durations because things like fetch can stall a pipeline, but they answer the type of question you are now asking.

Shader instructions do vary from card to card, but I think the longest single instruction is 4 GPU cycles.

If you look at the shader compiler's assembly output for your function, you will see that it issues lots of instructions and uses lots of cycles. You are then asking whether that could be executed more quickly than a single-cycle instruction.
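
The comparison above can be made concrete with a rough tally. The sketch below is Python, not compiler output: the instruction list for `par` is a hand-written guess at the longest path a compiler might emit, and the one-cycle-per-instruction cost simply follows the claim made in this answer rather than any measured figure. All names here are hypothetical; the real numbers should come from your own compiler's assembly listing.

```python
# Hypothetical instruction tally. Every name and cost below is an assumption
# made for illustration; none of it is real compiler output.
CYCLES_PER_INSTRUCTION = 1  # this answer's claim for basic maths and sin

# Hand-written guess at the longest path through par().
PAR_INSTRUCTIONS = [
    "abs",   # abs(xx)
    "fmod",  # fmod(abs(xx), 2.4); may expand to several instructions
    "sub",   # ... - 1.2
    "fmod",  # fmod(abs(xx), 4.8)
    "cmp",   # > 2.4
    "mad",   # -xd * xd + 2.88 (the longer side of the branch)
    "mad",   # -xd * 0.694444444 + 1
    "cmp",   # xx < 0
    "neg",   # xd = -xd
]

BUILTIN_INSTRUCTIONS = ["sin"]

def estimated_cycles(instructions: list) -> int:
    """Sum cycles assuming every instruction costs the same."""
    return len(instructions) * CYCLES_PER_INSTRUCTION

if __name__ == "__main__":
    print("par():", estimated_cycles(PAR_INSTRUCTIONS), "cycles (estimate)")
    print("sin():", estimated_cycles(BUILTIN_INSTRUCTIONS), "cycles (estimate)")
```

Even if individual costs differ on a given card, the shape of the argument is the same: many instructions on one side, one instruction on the other.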

The whole purpose of graphics chips is that they are very fast and very parallel at running their instruction sets (however complex those instructions may be on other processors). When programming shaders, focus your code on what the processor is designed to do. Shader programming is a different mindset from the programming you do elsewhere in software development, but once you start thinking about counting cycles and minimizing fetch stalls, you'll soon start to unlock the true power of shader processing.

Best of luck.

Thanks, that's very enlightening. Is there a way of knowing how many cycles all the maths functions take? For example, on a GTX 760, how would I assess the computational load of square root or atan relative to a multiplication? Is it possible that the graphics card could do 20 multiplications in 1 cycle, and only 4 sines? — bandybabboon, Sep 21 '13 at 07:56
Looking at the output assembly for a shader you write should help you with this concept. You can change your program slightly and see how the GPU instructions change. I have given another answer below which I hope will also help conceptually. I don't know anywhere that lists instruction specifics per GPU, but I know they do vary. I am guessing that information is not publicly available, but basic maths are all 1 instruction in 1 cycle. Texture fetching is where multiple cycles per instruction and wait cycles become important to the GPU. — Strings, Sep 21 '13 at 11:02
Interesting question, but that probably applies to desktop graphics. What about mobile devices with OpenGL ES 2.0 implementations? — Relok, Jan 13 '14 at 02:05

score 3 · Answer 2 · edited Jun 20 '20 at 09:12

SUPPLEMENTAL CONCEPTUAL HELP

Before I begin, I should explain that I do not work, and have never worked, for a GPU manufacturer. Some of what I say below may be factually wrong, but it is how I understand it as a programmer.

Below is an image of a modern GPU. This image shows 8 general-purpose pipes, each containing 8 queues, so it can process 64 single-instruction operations per clock cycle.

Old GPUs had a fixed, non-programmable pipeline, and we are not really interested in those. Middle-generation GPUs had specific pipes to run vertex programs, and different pipes for pixel shading. Modern GPUs have general-purpose pipes that can run any type of program (including tessellation, compute, etc.).

The arbitration and allocation probes decide which pipes should run which programs, and what inputs should be sent to them, so that as much of the processor as possible is being used each cycle. As programmers we have nothing to do with these, and so this is a total black box to me.

We are writing the programs that control the pipes. So imagine the AA probe has decided to use pipe0 as a pixel shader (I assume your program is doing something with colour, as you are not worried about rounding, which would cause verts to jump about). It will then pick 8 pixels that require the same program (see texture) and load them into the process buffers. All 8 pixels are then run in parallel, one instruction at a time, until the program is completed and the pipe is given back to the AA probe to be given a new job. If there are fewer than 8 pixels that need that program, the pipe is run with some of the process buffers empty, and the chip is underutilized. There isn't much you can do about this, but it is why zooming out to single-pixel objects, all with different textures, over your screen kills the GPU.
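
The cost of partly filled pipes can be sketched with a little arithmetic. The Python below is a toy model of the description above, not a model of any real scheduler. It assumes a pipe width of 8 and that pixels can share a pipe only when they need the same program. `PIPE_WIDTH` and `utilization` are hypothetical names, not part of any GPU API.

```python
import math
from collections import Counter

PIPE_WIDTH = 8  # assumed number of process buffers per pipe, per the text

def utilization(programs_per_pixel):
    """Fraction of process buffers doing useful work.

    programs_per_pixel: one program identifier per pixel to shade.
    Pixels are batched only with others that run the same program.
    """
    counts = Counter(programs_per_pixel)
    # Each program needs ceil(n / width) pipe runs, the last possibly part empty.
    runs = sum(math.ceil(n / PIPE_WIDTH) for n in counts.values())
    if runs == 0:
        return 0.0
    return len(programs_per_pixel) / (runs * PIPE_WIDTH)

if __name__ == "__main__":
    same = ["wood"] * 64                           # 64 pixels, one program
    scattered = [f"object{i}" for i in range(64)]  # 64 single-pixel objects
    print(utilization(same))       # 1.0
    print(utilization(scattered))  # 0.125
```

Under these assumptions, the scattered case uses one buffer in eight, which is the underutilization the paragraph describes.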

So in one cycle one computational pipe can do 8 muls for 8 pixels or 8 sins for 8 pixels, but it has to run every instruction for every pixel in sequence. That is the reason if statements are so costly for shader programs: pixels that pass the condition are processed, and pixels that fail still have to wait out the cycles while the passing pixels are processed.
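
The effect of an if statement on a pipe can be sketched as follows. This Python toy model assumes 8 lanes per pipe, one cycle per instruction, and that a side of the branch is skipped only when no lane needs it; real compilers and hardware may handle short branches differently. `LANES` and `lockstep_cycles` are hypothetical names used only for illustration.

```python
LANES = 8  # assumed lanes per pipe, per the text

def lockstep_cycles(conditions, then_cost, else_cost):
    """Cycles for one pipe to run `if c: A else: B` across its lanes.

    conditions: one boolean per lane (at most LANES of them).
    then_cost / else_cost: instruction counts for each side, assuming
    one cycle per instruction.
    """
    if len(conditions) > LANES:
        raise ValueError("more pixels than lanes")
    cycles = 0
    if any(conditions):       # at least one lane takes the 'then' side
        cycles += then_cost   # every lane waits while it runs
    if not all(conditions):   # at least one lane takes the 'else' side
        cycles += else_cost
    return cycles

if __name__ == "__main__":
    print(lockstep_cycles([True] * 8, 4, 6))            # 4: all lanes agree
    print(lockstep_cycles([False] * 8, 4, 6))           # 6: all lanes agree
    print(lockstep_cycles([True] * 7 + [False], 4, 6))  # 10: lanes disagree
```

A single pixel that disagrees with its neighbours makes the whole pipe pay for both sides of the branch.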

Obviously, every place I have said pixel, it could be a vert, or a CU element.

The only other thing that I can think to mention here is precision. When you lower the precision, a processing buffer can be packed more densely. So if you are using half precision everywhere, instead of the GPU processing 64 numbers per cycle it can do 128, and so on.
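
The packing arithmetic can be written out as a short sketch. The Python below assumes a fixed buffer that holds 64 full-precision (32-bit) values per cycle, as in the description above, and that narrower values pack perfectly; whether a particular GPU actually gains throughput from half precision is hardware dependent. `BUFFER_BITS` and `values_per_cycle` are hypothetical names.

```python
# Assumed: room for 64 full-precision (32-bit) values per cycle.
BUFFER_BITS = 64 * 32

def values_per_cycle(bits_per_value: int) -> int:
    """How many values of a given width fit in the assumed buffer."""
    return BUFFER_BITS // bits_per_value

if __name__ == "__main__":
    print(values_per_cycle(32))  # 64  (float)
    print(values_per_cycle(16))  # 128 (half)
```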

That's roughly how a GPU works. I certainly found understanding the architecture made a lot more sense of why shader programs are the way they are.

(Image caption: Architecture of a modern Graphics Chip)
