
To put the question another way, if one were to try to reimplement OpenGL or DirectX (or an analogue) using GPGPU (CUDA, OpenCL), where and why would it be slower than the stock implementations on NVIDIA and AMD cards?

I can see how vertex/fragment/geometry/tessellation shaders could be made nice and fast using GPGPU, but what about things like generating the list of fragments to be rendered, clipping, texture sampling and so on?

I'm asking purely for academic interest.

DaedalusFall
  • I'm pretty damn sure modern OpenGL and DirectX use your graphics chip extensively. What is your question exactly? – Mat Oct 30 '11 at 13:17
  • Of course they use your graphics chip extensively. GPGPU such as CUDA and OpenCL also uses your graphics chip extensively. But the 'chip' is made up of various parts with different functions. Some of those parts are inherently programmable (e.g. via vertex shaders or OpenCL). My question, then, is: which parts (if any) of the graphics pipeline use other, fixed-function parts of the chip whose function could not be implemented using, say, OpenCL (at least not at the same speed)? I'm guessing that things like texture compression and sampling are examples of this, but I don't know. – DaedalusFall Oct 30 '11 at 13:28
  • You might want to have a look at this [excellent series of blog posts](http://fgiesen.wordpress.com/category/graphics-pipeline/) from Fabian Giesen to help you along. – Bart Oct 30 '11 at 14:55
  • @Bart: I haven't seen enough of your link yet to tell whether it's what I want to know, but it definitely looks interesting. Thanks. – DaedalusFall Oct 31 '11 at 10:59

3 Answers

14

Modern GPUs still have lots of fixed-function hardware which is hidden from the compute APIs. This includes the blending stages, the triangle rasterization and a lot of on-chip queues. The shaders of course all map well to CUDA/OpenCL -- after all, shaders and the compute languages all use the same part of the GPU: the general-purpose shader cores. Think of those units as a bunch of very wide SIMD CPUs (for instance, a GTX 580 has 16 cores, each with a 32-wide SIMD unit).
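
To make the "shaders map well to compute" point concrete, here is a minimal sketch (the names and the trivial shading math are made up for illustration, not taken from any driver) of a fragment-shader-style pass written as a plain CUDA kernel; one thread shades one pixel, and the 32-thread warps line up with the 32-wide SIMD units described above.

```cpp
// Minimal sketch, not real driver code: a fragment-shader-style pass written
// as a plain CUDA kernel. Names and the trivial shading math are made up.
#include <cuda_runtime.h>

struct Pixel { float r, g, b, a; };

__global__ void shadePixels(Pixel* framebuffer, int width, int height)
{
    // One thread per pixel; each 32-thread warp maps onto one 32-wide SIMD unit.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Stand-in "shader": a simple colour gradient.
    Pixel p;
    p.r = (float)x / width;
    p.g = (float)y / height;
    p.b = 0.5f;
    p.a = 1.0f;
    framebuffer[y * width + x] = p;
}

int main()
{
    const int width = 1024, height = 768;
    Pixel* fb = 0;
    cudaMalloc((void**)&fb, width * height * sizeof(Pixel));

    dim3 block(32, 8);                              // 32 wide in x to match the warp width
    dim3 grid((width + 31) / 32, (height + 7) / 8);
    shadePixels<<<grid, block>>>(fb, width, height);
    cudaDeviceSynchronize();

    cudaFree(fb);
    return 0;
}
```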

You do get access to the texture units via shaders, though, so there's no need to implement that part in "compute". If you did, your performance would most likely suffer, as you wouldn't get access to the texture caches, which are optimized for spatial locality.
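
For illustration, here is a sketch of how a compute kernel can still go through the hardware texture units, using CUDA's texture object API (which appeared in CUDA versions newer than existed when this was written; the cudaArray setup and all names are assumptions): the addressing, bilinear filtering and texture cache are all provided by the fixed-function samplers.

```cpp
// Sketch only: reading through the hardware texture units from a CUDA kernel
// via the texture object API. The float cudaArray holding the texels is
// assumed to be created and filled elsewhere.
#include <cuda_runtime.h>

__global__ void sampleKernel(cudaTextureObject_t tex, float* out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    // Normalised coordinates; bilinear filtering is done by the texture unit,
    // and the reads go through the spatially-optimized texture cache.
    float u = (x + 0.5f) / w;
    float v = (y + 0.5f) / h;
    out[y * w + x] = tex2D<float>(tex, u, v);
}

// Host side: wrap an existing float cudaArray in a texture object.
cudaTextureObject_t makeTexture(cudaArray_t array)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = array;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0]   = cudaAddressModeClamp;
    texDesc.addressMode[1]   = cudaAddressModeClamp;
    texDesc.filterMode       = cudaFilterModeLinear;   // hardware bilinear filter
    texDesc.readMode         = cudaReadModeElementType;
    texDesc.normalizedCoords = 1;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, 0);
    return tex;
}
```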

You shouldn't underestimate the amount of work required for rasterization. This is a major problem, and if you throw the whole GPU at it you get roughly 25% of the raster hardware's performance (see: High-Performance Software Rasterization on GPUs). That includes the blending costs, which are usually also handled by fixed-function units.
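
To give a feel for why software rasterization is so costly, here is a deliberately naive sketch (this is not the scheme of the cited paper, which adds tiling, binning and careful load balancing): one thread tests one pixel against one triangle using edge functions. The fixed-function rasterizer does all of this, plus attribute setup and coverage handling, essentially for free.

```cpp
// Deliberately naive sketch: brute-force edge-function rasterization of one
// triangle, one thread per pixel. Real GPU software rasterizers (like the
// cited paper) layer tiling, binning and load balancing on top of this.
#include <cuda_runtime.h>

struct Tri { float2 a, b, c; };

__device__ float edgeFunc(float2 p, float2 q, float2 r)
{
    // Twice the signed area of (p, q, r); the sign says which side of pq r is on.
    return (q.x - p.x) * (r.y - p.y) - (q.y - p.y) * (r.x - p.x);
}

__global__ void rasterize(Tri tri, unsigned int* coverage, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float2 p = make_float2(x + 0.5f, y + 0.5f);   // sample at the pixel centre
    float e0 = edgeFunc(tri.a, tri.b, p);
    float e1 = edgeFunc(tri.b, tri.c, p);
    float e2 = edgeFunc(tri.c, tri.a, p);

    // Inside if the pixel centre lies on the same side of all three edges.
    bool inside = (e0 >= 0 && e1 >= 0 && e2 >= 0) ||
                  (e0 <= 0 && e1 <= 0 && e2 <= 0);
    if (inside)
        coverage[y * w + x] = 0xFFFFFFFFu;
}
```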

Tessellation also has a fixed-function part which is difficult to emulate efficiently, as it amplifies the input by up to 1:4096, and you surely don't want to reserve that much memory up front.
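
A back-of-the-envelope sketch of what pre-allocating for the worst case would mean for a compute-based emulation (every number below is an assumption, not a measured figure):

```cpp
// Back-of-the-envelope only; every number here is an assumption.
#include <cstdio>

int main()
{
    const long long patches          = 100000;  // hypothetical input patch count
    const long long maxAmplification = 4096;    // worst-case 1:4096 amplification
    const long long bytesPerVertex   = 32;      // e.g. position + normal + uv

    long long worstCaseBytes = patches * maxAmplification * bytesPerVertex;
    std::printf("Worst-case reservation: %.1f GiB\n",
                worstCaseBytes / (1024.0 * 1024.0 * 1024.0));
    // ~12.2 GiB for just 100k patches -- far beyond a 2011-era GPU's memory,
    // which is why the fixed-function tessellator streams its output on-chip.
    return 0;
}
```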

Next, you get lots of performance penalties because you don't have access to framebuffer compression; there is again dedicated hardware for this which is "hidden" from you when you're in compute-only mode. Finally, as you don't have any on-chip queues, it will be difficult to reach the same utilization that the "graphics pipeline" gets (for instance, the graphics pipeline can easily buffer output from vertex shaders depending on shader load; in compute you can't switch between shaders that flexibly).

Anteru
  • Don't forget about all the attribute uploading logic. The free conversion from normalized values to floats, the post-T&L cache, etc. – Nicol Bolas Oct 30 '11 at 16:56
  • Maybe it would be better to refer to programmable and fixed-function parts of the GPU. After all, the rasterization stage is a functional part of the GPU, just as the shader units are. – datenwolf Oct 30 '11 at 17:24
  • That pretty much answers my question. I must admit I didn't think triangle rasterization would be a problem at all; thanks for the link. You say there's no need to implement texture units in "compute", but there's no need to do any of this in "compute", so why stop shy of texture units? Maybe I want to write my own texture compression formats and implement previously unheard-of sampling techniques :). Good answer, thanks. – DaedalusFall Oct 31 '11 at 11:07
  • I thought you wanted to use CUDA/OpenCL ("compute"), where you do get access to the hardware texture units but not to much of the rest. (The idea of doing it all in compute would be that you can modify any part, which would surely be cool ;), and just use the exposed hardware. The same argument has been made for Larrabee, which is basically compute + texture units.) – Anteru Oct 31 '11 at 12:50
  • The reason my mind wandered to this line of enquiry was that I recently got a new AMD card, and the (Linux) drivers for it were sucky. NVIDIA and AMD both seem to have better (more stable) Linux support for GPGPU than for general desktop graphics (from my limited observation). My thought was whether a GL implementation for X could be written that just used the exposed compute capability. That way you could make the drivers more stable and more uniform (which may require not using the hw tex units). Of course, even if it were possible, I have neither the requisite knowledge nor the time! – DaedalusFall Oct 31 '11 at 16:18
1

An interesting source code link: http://code.google.com/p/cudaraster/

And the corresponding research paper: http://research.nvidia.com/sites/default/files/publications/laine2011hpg_paper.pdf

Some researchers at Nvidia have tried to implement and benchmark exactly what was asked in this post: an open-source implementation of "High-Performance Software Rasterization on GPUs".

And it is open source, for "purely academic interest": it is a limited subset of OpenGL, mainly for benchmarking triangle rasterization.

0

> To put the question another way, if one were to try to reimplement OpenGL or DirectX (or an analogue) using GPGPU (CUDA, OpenCL)

Do you realize that, before CUDA and OpenCL existed, GPGPU was done with shaders accessed through DirectX or OpenGL?

Reimplementing OpenGL on top of OpenCL or CUDA would introduce unnecessary complexity. On a system that supports OpenCL or CUDA, the OpenGL and DirectX drivers will share a lot of code with the OpenCL and/or CUDA driver, since they access the same piece of hardware.

Update

On a modern GPU, the whole pipeline runs on the hardware; that's what the GPU is for. What's done on the CPU is bookkeeping and data management. Bookkeeping is the whole transformation matrix setup (i.e. determining the transformation matrices and assigning them to the proper registers of the GPU), geometry data upload (transferring geometry and image data to GPU memory), shader compilation and, last but not least, "pulling the trigger", i.e. sending the commands that make the GPU execute the prepared program to draw nice things. The GPU will then fetch the geometry and image data from memory by itself and process it as per the shaders and the parameters in the registers (= uniforms).
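
As a rough illustration of that bookkeeping, here is a host-side sketch using standard OpenGL calls (context creation, GLEW initialization, error checking, and the actual GLSL sources, uniform name and vertex data are all assumed to exist elsewhere):

```cpp
// Host-side sketch of the bookkeeping: shader compilation, geometry upload,
// uniform ("register") setup and the draw call. A GL context, GLEW init,
// error checking, and the GLSL sources / vertex data are assumed elsewhere.
#include <GL/glew.h>

extern const char* vertexSrc;     // assumed GLSL vertex shader source
extern const char* fragmentSrc;   // assumed GLSL fragment shader source
extern const float* vertexData;   // assumed geometry, 3 floats per vertex
extern int vertexCount;

void setupAndDraw(const float* mvpMatrix)
{
    // 1. Shader compilation: the driver turns GLSL into GPU code.
    GLuint vs = glCreateShader(GL_VERTEX_SHADER);
    glShaderSource(vs, 1, &vertexSrc, 0);
    glCompileShader(vs);
    GLuint fs = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(fs, 1, &fragmentSrc, 0);
    glCompileShader(fs);
    GLuint prog = glCreateProgram();
    glAttachShader(prog, vs);
    glAttachShader(prog, fs);
    glLinkProgram(prog);

    // 2. Geometry upload: copy vertex data into GPU memory.
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, vertexCount * 3 * sizeof(float),
                 vertexData, GL_STATIC_DRAW);
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, 0);

    // 3. "Register" setup: hand the transformation matrix to the GPU as a uniform.
    glUseProgram(prog);
    glUniformMatrix4fv(glGetUniformLocation(prog, "mvp"), 1, GL_FALSE, mvpMatrix);

    // 4. Pull the trigger: from here on the GPU fetches and processes the data itself.
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
}
```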

datenwolf
  • Maybe better suited as a comment, as "don't do it, it's not of practical value" answers never really answer the question in any way. Even more so when the question is "purely for academic interest" and just wants to know which parts of a modern graphics pipeline use dedicated hardware. – Christian Rau Oct 30 '11 at 14:11
  • I wasn't talking about what happens on the CPU side, only about the difference between what is available to OpenCL/CUDA and what the card as a whole can do. – DaedalusFall Oct 31 '11 at 10:58