
I came across a description of rasterization, and it basically says that when an object is projected onto the screen, a scan takes place over all the pixels of the window/screen, deciding whether each pixel/fragment lies within the triangle; if it does, the pixel/fragment goes on to further processing, such as colouring etc.

Now, since I am studying OpenGL, and I know that OpenGL probably has its own implementation of this process, I was wondering whether this also takes place in OpenGL, given the "scan conversion" process of vertices that I have read about in an OpenGL tutorial.

Another related question: I know that the image/screen/window of pixels is an image, or 2D array of pixels, also known as the default framebuffer, and that it is linear.

So what I am wondering is: if that is the case, how does projecting the 3 vertices of a triangle determine which pixels are covered inside it?

Does the rasterizer draw the edges of the triangle first and then scan through the 2D array of pixels (the default framebuffer) to see whether each point lies between the lines using some mathematical method, or does some other, simpler process happen?

genpfault
  • No, it doesn't draw edges first. That would make adjacent triangles overlap by one pixel, which would cause trouble with semi-transparent ones. Also, it would be slow. What it probably does is scan over the smallest rectangle enclosing the triangle and use three cross products (or maybe dot products) to see if a pixel is inside the triangle. – HolyBlackCat Apr 12 '16 at 18:54
  • I guess something like this would work: `bool inside(ivec2 a, ivec2 b, ivec2 c, ivec2 p) {return cross(p-a,b-a)>0 && cross(p-b,c-b)>0 && cross(p-c,a-c)>0;}`. – HolyBlackCat Apr 12 '16 at 19:00
  • What if it was using linear interpolation to determine the points on the edges, and then using those points to scan across from one edge to the other side of the triangle? That seems like a possibility? – gettingfaster Apr 12 '16 at 19:01
  • Yes, that's another legit way. But AFAIK it's usually used in software rendering. For videocards it's a lot faster to do parallel per-pixel computations. – HolyBlackCat Apr 12 '16 at 19:16
  • And parallel per-pixel computation is what you described? Where the smallest rectangle/square encloses the triangle, and then each pixel within that interval/line segment is checked by something like a dot-product/cross-product point-in-triangle test? – gettingfaster Apr 12 '16 at 19:20
  • 2
    @gettingfaster: OpenGL does not specify the details of how scan conversion of a triangle takes place. Whether it does edges first or scanlines or whatever, that's all up to the implementation. OpenGL does specify certain things about the results (connectivity guarantees and the like), but for the kinds of things you're talking about, the specification does not say. – Nicol Bolas Apr 12 '16 at 19:30
  • But surely we have some idea of what happens? From what I gather there are not that many ways to do it: it seems to be either linear interpolation or parallel per-pixel computations, and the various graphics tutorials don't seem to describe any other way. In the end it's bound by the mathematical methods that exist out there, coupled with the way the framebuffer holds the pixel information/data (i.e. linearly, as an array). – gettingfaster Apr 12 '16 at 19:34
  • @gettingfaster: If it seems like there are not that many ways to do it... chew on this. The memory in a framebuffer is not organized linearly. It's common for there to be some kind of hierarchical structure for optimization--for example, divide the framebuffer into tiles, and then test to see which tiles the fragment hits, and then test pixels within each tile. If a fragment fills an entire tile, then as an optimization, the depth buffer for that tile may be computed using the plane equation for that triangle. Assumptions based on how you would do it in software are probably wrong. – Dietrich Epp Apr 12 '16 at 19:40
  • But even if it's divided into tiles in a structure for, as you say, "optimization", that still doesn't stop it being accessed in a linear fashion from the beginning of window coordinates to the end, i.e. 0,0 -> N,N; all window coordinates are arranged like that. Further complicating it would only make things expensive and doesn't seem practical. I can't see why anyone would arrange a framebuffer in tiles; it would just complicate things and result in very expensive processing. Computers have their limits and maximum speeds; in the abstract everything is easy, but in reality, no. – gettingfaster Apr 12 '16 at 19:47
  • "A new GPU-based scan-conversion algorithm implemented using OpenGL is described. The compute performance of this new algorithm running on a modern GPU is compared to the performance of three common scan-conversion algorithms (nearest-neighbor, linear interpolation and bilinear interpolation) implemented in software using a modern CPU. The quality of the images produced by the algorithm, as measured by signal-to-noise power, is also compared to the quality of the images produced using these three common scan-conversion algorithms." www.ncbi.nlm.nih.gov/pubmed/21710829 , looks like i was right – gettingfaster Apr 12 '16 at 19:50
  • @gettingfaster: That article is about volumetric rendering, not about triangle rasterization. Completely different subject. – Dietrich Epp Apr 12 '16 at 19:51
  • http://www.cse.wustl.edu/~jain/cse567-08/ftp/scan/ so it's linear interpolation, bilinear interpolation and nearest neighbour – gettingfaster Apr 12 '16 at 19:52
  • @gettingfaster: Again, that's about *ultrasonic scan conversion* as in it's a technique specific to equipment they use at hospitals to see inside patients. – Dietrich Epp Apr 12 '16 at 19:53
  • Yes, but the point is that if it's being used here, there and everywhere, then it shows there are not many variations going around; otherwise, if it were so broad and simple to obtain these results, everybody would have their own version different from everyone else's. If they are using it everywhere, then why would an OpenGL implementation need to be so different? Just for the sake of being different? Clearly it's efficient and works well, as the mathematics shows, so it should do the job for the discrete values that computers produce. – gettingfaster Apr 12 '16 at 19:58
  • Algorithms are about speed and efficiency, not "let's see how complicated and different from everyone else we can be". It's not a creativity contest; it's about speed, simplicity and efficiency. After all, OpenGL programs have to process thousands of vertices per frame, not just one. – gettingfaster Apr 12 '16 at 20:00
  • 2
    @gettingfaster: These algorithms you linked, they are not how GPUs are doing it. You have wrong ideas of how GPUs implement raster conversion. There is not some kind of program executing; the whole rasterization happens in a hugely parallel, out of order, purely hardwired fashion. Also the memory is not organized linearly in a GPU. The data structures and layouts are totally different to what you'd see on a GPU. Memory is not addressed by a linear address by by a tile coordinate. – datenwolf Apr 12 '16 at 22:06

3 Answers


and I do know that OpenGL probably has its own implementation of this process

OpenGL is just a specification document. What runs on a computer is an OpenGL implementation, most of the time as part of a GPU driver. The actual workload is carried out by a GPU…

this also takes place in OpenGL, given the "scan conversion" process of vertices that I have read about in an OpenGL tutorial

Most likely not. As a matter of fact, last weekend I was attending a Khronos (the group that specifies OpenGL) event hosted by AMD, and one of AMD's GPU engineers was lamenting that newbies have the scanline algorithm in mind with OpenGL, Direct3D, Mantle, Vulkan, etc., while GPUs do something entirely different.

2D array of pixels, also known as the default framebuffer, that is linear

Actually, the memory layout of pixels as used internally by the GPU is not linear (i.e. row-by-row) but follows a pattern that gives efficient localized access. For linear access, GPUs have extremely efficient copy engines that allow for practically zero-overhead conversion between the internal and linear formats.

The exact layout used internally is a detail only the GPU engineers have insight into, though. But the fact that memory is organized not linearly but in a localized fashion is also one reason the traditional scanline algorithm is not used by GPUs.
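To give a purely illustrative example of such a localized pattern, here is a sketch of Z-order (Morton) addressing, where the bits of the x and y pixel coordinates are interleaved so that pixels close in both dimensions land at nearby addresses. The actual layouts GPUs use are proprietary; this function is just a stand-in, not any vendor's format:

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of x and y into a single Z-order (Morton) address.

    Pixels that are close in both x and y end up close in memory, which is
    the kind of localized access pattern described above. Illustrative only;
    real GPU layouts are proprietary.
    """
    idx = 0
    for i in range(bits):
        idx |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        idx |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return idx

# The four pixels of a 2x2 block are contiguous:
# (0,0) -> 0, (1,0) -> 1, (0,1) -> 2, (1,1) -> 3
```

Note how a row-by-row layout would scatter a 2x2 pixel block across two distant rows of memory, while here the block occupies four consecutive addresses.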

So what I am wondering is: if that is the case, how would projecting the 3 vertices of a triangle define which pixels are covered inside it?

Any method that satisfies the requirements of the OpenGL specification is allowed. The details are part of the OpenGL implementation, i.e. usually the combination of particular GPU model and driver version.

datenwolf

The scanline algorithm is what people did in software back in the 1990s, before modern GPUs. GPU developers figured out rather quickly that the algorithms you use for software rendering are vastly different from the algorithms you would implement in a VLSI implementation with billions of transistors. Algorithms optimized for hardware implementation tend to look fairly alien to anyone who comes from a software background anyway.

Another thing I'd like to clear up is that OpenGL doesn't say anything about "how" you render, it's just "what" you render. OpenGL implementations are free to do it however they please. We can find out "what" by reading the OpenGL standard, but "how" is buried in secrets kept by the GPU vendors.

Finally, before we start, the articles you linked are unrelated. They are about how ultrasonic scans work.

What do we know about scan conversion?

  • Scan conversion has as input a number of primitives. For our purposes, let's assume that they're all triangles (which is increasingly true these days).

  • Every triangle must be clipped by the clipping planes. This can add up to three additional sides to the triangle, in the worst case (turning it into a hexagon). This has to happen before perspective projection.

  • Every primitive must go through perspective projection. This process takes each vertex with homogeneous coordinates (X, Y, Z, W) and converts it to (X/W, Y/W, Z/W).

  • The framebuffer is usually organized hierarchically into tiles, not linearly the way you would in software. Furthermore, the processing might be done at more than one hierarchical level. The reason we use linear organization in software is that it takes extra cycles to compute memory addresses in a hierarchical layout. VLSI implementations do not suffer from this problem, however: they can simply wire up the bits of a register however they want to form an address from it.

So you can see that in software, tiles are "complicated and slow" but in hardware they're "easy and fast".
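As a toy illustration of the bounding-box-plus-edge-function test discussed in the comments (serial Python rather than parallel hardware, and not what any particular GPU does), a sketch:

```python
def edge(ax, ay, bx, by, px, py):
    # Twice the signed area of triangle (a, b, p); >= 0 when p is on or
    # to the left of the directed edge a -> b.
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def rasterize(tri, width, height):
    """Return the pixels whose centers a counter-clockwise triangle covers.

    Only the triangle's bounding box is scanned, and each pixel center is
    tested against the three edge functions. Hardware evaluates such tests
    for many pixels (or whole tiles) at once; this serial loop is just a
    sketch of the math.
    """
    (ax, ay), (bx, by), (cx, cy) = tri
    x0 = max(int(min(ax, bx, cx)), 0)
    x1 = min(int(max(ax, bx, cx)) + 1, width)
    y0 = max(int(min(ay, by, cy)), 0)
    y1 = min(int(max(ay, by, cy)) + 1, height)
    covered = []
    for y in range(y0, y1):
        for x in range(x0, x1):
            px, py = x + 0.5, y + 0.5  # sample at the pixel center
            if (edge(ax, ay, bx, by, px, py) >= 0 and
                    edge(bx, by, cx, cy, px, py) >= 0 and
                    edge(cx, cy, ax, ay, px, py) >= 0):
                covered.append((x, y))
    return covered
```

A real rasterizer additionally applies fill rules (e.g. the "top-left" rule) so that a pixel on an edge shared by two triangles is drawn exactly once, which is precisely why "draw the edges first, then fill" would cause the double-coverage problems mentioned in the comments.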

Some notes looking at the R5xx manual:

The R5xx series is positively ancient (2005) but the documentation is available online (search for "R5xx_Acceleration_v1.5.pdf"). It mentions two scan converters, so the pipeline looks something like this:

primitive output -> coarse scan converter -> quad scan converter -> fragment shader

The coarse scan converter appears to operate on larger tiles of configurable size (8x8 to 32x32), and has multiple selectable modes, an "intercept based" and a "bounding box based" mode.

The quad scan converter then takes the output of the coarse scan converter and outputs individual quads, which are groups of four samples. The depth values for each quad may be represented as four discrete values or as a plane equation. The plane equation allows the entire quad to be discarded quickly if the corresponding quad in the depth buffer is also specified as a plane equation. This is called "early Z" and it is a common optimization.
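The plane-equation trick rests on the observation that depth varies linearly across a triangle in screen space, so it can be written as z = a·x + b·y + c. A hedged Python sketch of deriving those coefficients from three projected vertices (the R5xx docs describe storing such an equation per quad, not this code):

```python
def depth_plane(p0, p1, p2):
    """Fit z = a*x + b*y + c through three screen-space (x, y, z) vertices.

    Illustrative math only: hardware derives and stores such coefficients
    so a whole quad's or tile's depth can be compared at once instead of
    sample by sample.
    """
    (x0, y0, z0), (x1, y1, z1), (x2, y2, z2) = p0, p1, p2
    # Normal of the triangle's plane via a cross product of two edges.
    ux, uy, uz = x1 - x0, y1 - y0, z1 - z0
    vx, vy, vz = x2 - x0, y2 - y0, z2 - z0
    nx = uy * vz - uz * vy
    ny = uz * vx - ux * vz
    nz = ux * vy - uy * vx
    # Plane nx*(x - x0) + ny*(y - y0) + nz*(z - z0) = 0, solved for z.
    a = -nx / nz
    b = -ny / nz
    c = z0 - a * x0 - b * y0
    return a, b, c
```

With the plane in hand, the depth at any sample position is one multiply-add per axis, and comparing two planes can reject an entire quad at once.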

The fragment shader then works on one quad at a time. The quad might contain samples outside the triangle, which will then get discarded.

It's worth mentioning again that this is an old graphics card. Modern graphics cards are more complicated. For example, the R5xx doesn't even let you sample textures from the vertex shaders.

If you want an even more radically different picture, look up the PowerVR GPU implementations which use something called "tile-based deferred rendering". These modern and powerful GPUs are optimized for low cost and low power consumption, and they challenge a lot of your assumptions about how renderers work.

Dietrich Epp
  • While you're right that the names for all these methods have changed, one thing hasn't, and that is the maths behind it all. All the calculations that happen are based on mathematical ideas, so whether they call it "PowerVR" or "tile-based deferred rendering" or anything else, in the end it all comes down to the old school of maths, and in that school nothing has changed; the methods are what they are, the same and constant. http://http.developer.nvidia.com/GPUGems3/gpugems3_ch34.html In the end they always use the mathematical models for all transforms. – gettingfaster Apr 23 '16 at 23:30
  • The only difference is that a GPU can run everything in parallel, but in the end it does the same thing the CPU does, just with a multitude of threads where the work gets divided over various transistors. – gettingfaster Apr 23 '16 at 23:34

Quoting from GPU Gems: Parallel Prefix Sum (Scan) with CUDA, it describes how OpenGL does its scan and compares it with CUDA, which I think suffices as the answer to my question:

Prior to the introduction of CUDA, several researchers implemented scan using graphics APIs such as OpenGL and Direct3D (see Section 39.3.4 for more). To demonstrate the advantages CUDA has over these APIs for computations like scan, in this section we briefly describe the work-efficient OpenGL inclusive-scan implementation of Sengupta et al. (2006). Their implementation is a hybrid algorithm that performs a configurable number of reduce steps as shown in Algorithm 5. It then runs the double-buffered version of the sum scan algorithm previously shown in Algorithm 2 on the result of the reduce step. Finally it performs the down-sweep as shown in Algorithm 6.

Example 5. The Reduce Step of the OpenGL Scan Algorithm

1: for d = 1 to log2 n do
2:     for all k = 1 to n/2^d – 1 in parallel do
3:         a[d][k] = a[d – 1][2k] + a[d – 1][2k + 1]
Example 6. The Down-Sweep Step of the OpenGL Scan Algorithm

1: for d = log2 n – 1 down to 0 do
2:     for all k = 0 to n/2^d – 1 in parallel do
3:         if k > 0 then
4:             if k mod 2 ≠ 0 then
5:                 a[d][k] = a[d + 1][k/2]
6:             else
7:                 a[d][k] = a[d + 1][k/2 – 1]

The OpenGL scan computation is implemented using pixel shaders, and each a[d] array is a two-dimensional texture on the GPU. Writing to these arrays is performed using render-to-texture in OpenGL. Thus, each loop iteration in Algorithm 5 and Algorithm 2 requires reading from one texture and writing to another.
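For what it's worth, the reduce/down-sweep pair quoted above is the structure of the classic work-efficient exclusive prefix sum. A serial Python emulation of that structure, assuming a power-of-two input length (this only illustrates the prefix-sum algorithm the passage describes, which is a different thing from triangle scan conversion):

```python
def exclusive_scan(a):
    """Work-efficient exclusive prefix sum, emulated serially.

    Mirrors the up-sweep (reduce) / down-sweep structure of the quoted
    algorithm; assumes len(a) is a power of two.
    """
    n = len(a)
    out = list(a)
    # Up-sweep: build partial sums in a balanced tree, total ends at out[n-1].
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):
            out[i + 2 * d - 1] += out[i + d - 1]
        d *= 2
    # Down-sweep: clear the root, then push prefixes back down the tree.
    out[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):
            t = out[i + d - 1]
            out[i + d - 1] = out[i + 2 * d - 1]
            out[i + 2 * d - 1] += t
        d //= 2
    return out

# exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]) -> [0, 3, 4, 11, 11, 15, 16, 22]
```

On a GPU, each level of the tree runs as one parallel pass; the serial loops here only make the data flow easy to follow.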

The main advantages CUDA has over OpenGL are its on-chip shared memory, thread synchronization functionality, and scatter writes to memory, which are not exposed to OpenGL pixel shaders. CUDA divides the work of a large scan into many blocks, and each block is processed entirely on-chip by a single multiprocessor before any data is written to off-chip memory. In OpenGL, all memory updates are off-chip memory updates. Thus, the bandwidth used by the OpenGL implementation is much higher and therefore performance is lower, as shown previously in Figure 39-7.

fospathi
  • Heck, why would you post and accept an answer that doesn't answer your own question? The problem described in the article is not related to how primitives are rasterized in any way. – Yakov Galka Dec 02 '19 at 23:54