SUPPLEMENTAL CONCEPTUAL HELP
Before I begin, I should explain that I do not work, and have never worked, for a GPU manufacturer. Some of what I say below may be factually wrong, but it is how I understand it as a programmer.
Below is an image of a modern GPU. The image shows 8 general-purpose pipes, each containing 8 queues, so it can process 64 single-instruction operations per clock cycle.
Old GPUs had a fixed, non-programmable pipeline, and we are not really interested in those.
Middle GPUs had specific pipes to run vertex programs, and different pipes for pixel shading.
Modern GPUs have general-purpose pipes that can run any type of program (including tessellation, compute, etc.).
The arbitration and allocation (AA) probes decide which pipes should run which programs, and what inputs should be sent to them, so that as much of the processor as possible is being used each cycle. As programmers we have nothing to do with these, so this part is a total black box to me.
We are writing the programs that control the pipes. So imagine the AA probe has decided to use pipe0 as a pixel shader (I assume your program is doing something with colour, as you are not worried about rounding, which would cause verts to jump about). It will then pick 8 pixels that require the same program (see texture) and load them into the process buffers. All 8 pixels are then run in parallel, one instruction at a time, until the program is completed, and the pipe is handed back to the AA probe to be given a new job. If there are fewer than 8 pixels that need that program, the pipe is run with some of the process buffers empty, and the chip is underutilized. There isn't much you can do about this, but it is why zooming out until your screen is covered in single-pixel objects, all with different textures, kills the GPU.
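To make the underutilization point concrete, here is a small Python sketch of the batching I described. It is only a toy model of my mental picture, not how any real scheduler works: the function names and the idea of labelling each pixel with a program name are my own assumptions for illustration.

```python
from collections import defaultdict

LANES = 8  # process buffers per pipe, as in the description above


def pack_into_batches(pixel_programs, lanes=LANES):
    """Group pixels by the program they need, then split each group
    into batches of at most `lanes` pixels (one batch = one pipe job)."""
    by_program = defaultdict(list)
    for pixel_id, program in enumerate(pixel_programs):
        by_program[program].append(pixel_id)

    batches = []
    for program, pixels in by_program.items():
        for start in range(0, len(pixels), lanes):
            batches.append((program, pixels[start:start + lanes]))
    return batches


def utilization(batches, lanes=LANES):
    """Fraction of process buffers that hold a pixel, over all batches."""
    if not batches:
        return 0.0
    used = sum(len(pixels) for _, pixels in batches)
    return used / (len(batches) * lanes)


# 16 pixels sharing one program fill two pipe jobs completely.
same = pack_into_batches(["brick"] * 16)

# 16 single-pixel objects, each with its own program, need 16 pipe jobs
# that each run with 7 of the 8 buffers empty.
different = pack_into_batches([f"texture{i}" for i in range(16)])
```

In this toy model the first case reaches a utilization of 1.0 with 2 jobs, while the second needs 16 jobs at 0.125, which is the "zoomed out" situation described above.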
So in one cycle one computational pipe can do 8 muls (one for each of 8 pixels) or 8 sins (one for each of 8 pixels), but it has to run every instruction for every pixel in lockstep. That is the reason if statements are so costly in shader programs: pixels that pass the condition are processed, while pixels that fail still have to wait out the cycles in which the passing pixels are processed.
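The cost of an if statement can be sketched the same way. The function below is a hypothetical cycle counter based purely on the model above (a pipe pays for a branch whenever at least one of its pixels takes it); the name `branch_cycles` and the cycle costs are made up for illustration.

```python
def branch_cycles(conditions, then_cost, else_cost):
    """Cycles one pipe spends on an if/else, given the per-pixel
    condition results and the cost of each side of the branch.

    Every pixel in the pipe moves in lockstep, so the pipe pays for a
    side of the branch as soon as any one pixel needs it. Pixels that
    do not need that side simply sit idle for those cycles.
    """
    cycles = 0
    if any(conditions):        # at least one pixel passes the condition
        cycles += then_cost
    if not all(conditions):    # at least one pixel fails the condition
        cycles += else_cost
    return cycles


all_pass = [True] * 8
all_fail = [False] * 8
one_odd_pixel = [True] * 7 + [False]

uniform_then = branch_cycles(all_pass, then_cost=10, else_cost=4)
uniform_else = branch_cycles(all_fail, then_cost=10, else_cost=4)
divergent = branch_cycles(one_odd_pixel, then_cost=10, else_cost=4)
```

With these made-up costs, a pipe where all pixels agree pays for only one side (10 or 4 cycles), but a single disagreeing pixel makes the whole pipe pay for both sides (14 cycles).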
Obviously, everywhere I have said pixel, it could just as well be a vert or a CU element.
The only other thing that I can think to mention here is precision. Lowering the precision allows a processing buffer to be packed more densely. So if you are using half precision everywhere, instead of the GPU processing 64 numbers per cycle it can do 128, and so on.
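The precision arithmetic can be written out as follows. This is a sketch under the assumption that a process buffer is sized for one 32-bit float and that smaller values pack into it perfectly; real hardware may not scale this cleanly. The constants mirror the 8 pipes and 8 queues from the image, and `values_per_cycle` is a name I made up.

```python
import struct

PIPES = 8            # general-purpose pipes, as in the image
LANES_PER_PIPE = 8   # queues (process buffers) per pipe

# Sizes in bytes of IEEE 754 single and half precision values.
FULL_PRECISION_BYTES = struct.calcsize("<f")  # 32-bit float
HALF_PRECISION_BYTES = struct.calcsize("<e")  # 16-bit float


def values_per_cycle(value_bytes):
    """Numbers processed per cycle if each buffer is sized for one
    full-precision float and smaller values pack into it perfectly
    (an idealised assumption)."""
    packing = FULL_PRECISION_BYTES // value_bytes
    return PIPES * LANES_PER_PIPE * packing


full_rate = values_per_cycle(FULL_PRECISION_BYTES)
half_rate = values_per_cycle(HALF_PRECISION_BYTES)
```

Under those assumptions `full_rate` is 64 and `half_rate` is 128, matching the numbers above.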
That's roughly how a GPU works. I certainly found that understanding the architecture made it much clearer why shader programs are the way they are.
