Trouble understanding GPU disassembly

Question

I'm trying to write a raycasting shader in GLSL, and it's unbearably slow. So I installed AMD's "GPU Shader Analyzer" to look at what is actually generated. I've got it from 2 FPS up to 12, but that's still not fantastic.

I feel like I could improve it, but I'm stuck at three points.

Weird Underscores: I get what ADD R1.x, R0.x, -C6.x does: it subtracts C6.x from R0.x, and stores the result in R1.x. Similarly with MULADD R4.x, R1.x, R2.w, R4.x: it multiplies R1.x and R2.w, adds R4.x, and stores the result in R4.x. But sometimes I get calls like MUL __, PV16.x, C1.x, and I can't figure out what the underscores mean.
Trailing "E"s: Usually my multiplications are turned into MUL a, b, c. But sometimes I see MUL_e a, b, c. This also happens with SQRT_e, RSQ_e and RCP_e.
Magic: I just plain don't get these instructions.
LOOP_DX10 i0 FAIL_JUMP_ADDR(10) VALID_PIX Begin loop. But what are the parameters?
ALU_BREAK: ADDR(48) CNT(3) No idea.
SETGT_INT R0.y, 350, R3.y My for loop has i < 350, but what're the others?
PREDNE_INT __, R0.y, 0.0f Maybe set i to 0? But why floating-point 0?
ALU_PUSH_BEFORE: ADDR(51) CNT(34) Push makes me think of the stack?
PREDGT __, R0.x, R3.x No clue.
JUMP POP_CNT(1) ADDR(8) VALID_PIX Unconditional jump, but what's POP_CNT?
ALU: ADDR(85) CNT(1) Whoosh.
BREAK ADDR(9) Jump to 9?
POP (1) ADDR(8) Removes the frame from the stack? Why 8?
ENDLOOP i0 PASS_JUMP_ADDR(2) Ends the loop starting with LOOP_DX10.
CNDE_INT R0.x, R2.z, 0.0f, 1065353216 x = q ? a : b, but I don't know which variable is which.

Could someone please explain these? I can't find any documentation for the first two, and I don't understand the docs for the last. I've never done any assembly before, unfortunately.

score 1 · Accepted Answer · answered May 21 '13 at 20:13

I've found this document and this document describing the assembly language; they explain some of the mnemonics you have found in the disassembly.
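One detail from the question can be checked without the manuals. The literal 1065353216 in the CNDE_INT line looks like the raw bit pattern of the float 1.0 printed as an integer, and 0.0f has the same bits as the integer 0, which would explain why a floating-point zero shows up in an integer compare. Below is a minimal Python sketch of that check. The helper names are my own, and the cnde_int function is only my reading of "conditional move if equal to zero" from the instruction-set documentation, so treat it as an assumption rather than a statement of what the hardware does.

```python
import struct


def float_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit unsigned integer as an IEEE-754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bits_from_float(value: float) -> int:
    """Reinterpret a single-precision float as its 32-bit unsigned bit pattern."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def cnde_int(src0: int, src1: int, src2: int) -> int:
    # ASSUMPTION (hypothetical model, not taken from the question):
    # dst = src1 if src0 == 0 else src2, all operands treated as raw 32-bit values.
    return src1 if src0 == 0 else src2


if __name__ == "__main__":
    # The constant from "CNDE_INT R0.x, R2.z, 0.0f, 1065353216"
    print(hex(1065353216))
    print(float_from_bits(1065353216))
    # 0.0f and integer 0 share the same bit pattern
    print(bits_from_float(0.0))
    # Under the assumed semantics: R0.x = (R2.z == 0) ? 0.0 : 1.0
    for r2z in (0, 7):
        print(r2z, float_from_bits(cnde_int(r2z, bits_from_float(0.0), 1065353216)))
```

If that reading is right, the CNDE_INT line is just turning an integer condition into a float 0.0 or 1.0, which is what a GLSL expression like float(someBool) would typically compile to.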

At this level, the assembly is very specific to the hardware; since you used AMD tools, I looked for documentation on AMD devices. I wouldn't be surprised if NVIDIA uses a different instruction set.

Since you have tagged the question with glsl, maybe you are on the wrong track. The OpenGL Shading Language is used for portability, since it's an open industry standard; by using assembly instead, you couple the program to a specific graphics card family. For instance, my programs run on Linux and Windows, and on a wide range of GPUs from NVIDIA, AMD and Intel (it was not easy, but it was satisfying).

If you still want portability, and you are brave enough to write GPU assembly, you can implement programs using ARB assembly (vertex and fragment), but I've never tried it (and you have now given me good inspiration to start another journey).

Thanks! It looks like I've got some reading to do now, but from the quick skim, it addresses what I was confused about. :) And I was writing it in GLSL, but looking at the assembly to see where the bottleneck is (or if it's impossible to speed up more). — Henry Swanson, May 21 '13 at 21:42
