The first thing you need to understand is that "context switching" is different on GPUs (and, in general, most Heterogeneous architectures) than it is on CPU/Host architectures. When you submit a task to the GPU—in this case, "render my image"—the GPU will solely work on that task until completion.
There are a few details I'm abstracting away, naturally: NVidia hardware will try to schedule smaller tasks on unused cores, and all three major vendors (AMD, Intel, NVidia) have fine-tuned behaviors that complicate the above generalization, but as a matter of principle, you should assume that any task submitted to the GPU will consume the GPU's entire resources until it completes.
On its own, that's not a big problem.
But on Windows (and most consumer Operating Systems), if the GPU spends too much time on a single task, the OS will assume that the GPU isn't responding, and will do one of several things (or possibly some combination of them):
- Crash: doesn't happen so much anymore, but on older systems I have bluescreened my computers with over-ambitious Mandelbrot renders
- Reset the driver: which means you'll lose all OpenGL state, and is essentially unrecoverable from the program's perspective
- Abort the operation: Some newer device drivers are clever enough to simply kill the task rather than killing the entire context state. But this can depend on the specific API you're using: my OpenGL/GLSL based Mandelbrot programs tend to crash the driver, whereas my OpenCL programs usually have more elegant failures.
- Let it go to completion, without issue: This will only happen if the GPU in question is not the one the Operating System is using to drive the display. So this is only an option if you have more than one graphics card in your system and you explicitly ensure that rendering happens on the card not used by the OS, or if the card in question is a compute card that doesn't have display duties at all. In OpenGL, this is basically a non-starter, but if you were using OpenCL or Vulkan, this might be a potential work-around.
The exact timing varies, but you should generally assume that if a single task takes more than about 2 seconds (the default limit of Windows' Timeout Detection and Recovery mechanism), it'll crash the program.
So how do you fix this problem? Well, if this were an OpenCL-based render, it would be pretty easy:
std::vector<cl_event> events;
for(int32_t x = 0; x < WIDTH; x += KERNEL_SIZE) {
    for(int32_t y = 0; y < HEIGHT; y += KERNEL_SIZE) {
        int32_t render_start[2] = {x, y};
        int32_t render_end[2] = {std::min(WIDTH, x + KERNEL_SIZE), std::min(HEIGHT, y + KERNEL_SIZE)};
        events.emplace_back();
        //I'm abstracting the clEnqueueNDRangeKernel call
        submit_task(queue, kernel, render_start, render_end, &events.back(), /*...*/);
    }
}
//Block until every tile has finished
clWaitForEvents(static_cast<cl_uint>(events.size()), events.data());
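(submit_task isn't an OpenCL function, just my shorthand for the clEnqueueNDRangeKernel call; a minimal sketch of what it could forward to, ignoring the arguments I elided above, looks like this:)

//Hypothetical wrapper: runs the kernel over one tile by passing the tile's
//offset and size as the global offset/size of a 2D NDRange
void submit_task(cl_command_queue queue, cl_kernel kernel,
                 const int32_t render_start[2], const int32_t render_end[2],
                 cl_event* event) {
    size_t offset[2] = {static_cast<size_t>(render_start[0]), static_cast<size_t>(render_start[1])};
    size_t size[2] = {static_cast<size_t>(render_end[0] - render_start[0]),
                      static_cast<size_t>(render_end[1] - render_start[1])};
    //A null local size lets the driver pick the work-group shape
    clEnqueueNDRangeKernel(queue, kernel, 2, offset, size, nullptr, 0, nullptr, event);
}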
In OpenGL, you can use the same basic principle, but things are a bit more complicated because of how abstract the OpenGL model is. Drivers are wont to bundle multiple draw calls into a single command to the underlying hardware, so you need to explicitly make them behave themselves; otherwise the driver will batch everything together and you'll get the exact same problem, even though you've specifically written the code to break up the task.
for(int32_t x = 0; x < WIDTH; x += KERNEL_SIZE) {
    for(int32_t y = 0; y < HEIGHT; y += KERNEL_SIZE) {
        int32_t render_start[2] = {x, y};
        int32_t render_end[2] = {std::min(WIDTH, x + KERNEL_SIZE), std::min(HEIGHT, y + KERNEL_SIZE)};
        render_portion_of_image(render_start, render_end);
        //The call to glFinish is the important part: otherwise, even breaking up
        //the task like this, the driver might still try to bundle everything together!
        glFinish();
    }
}
The exact appearance of render_portion_of_image is something you'll need to design yourself, but the basic idea is to tell the program that only the pixels between render_start and render_end are to be rendered.
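As a minimal sketch of one way to do it, assuming your renderer draws a full-screen quad and the fragment shader does the escape-time iteration (draw_fullscreen_quad is a hypothetical helper), a scissor rectangle limits the expensive work to the requested tile:

void render_portion_of_image(const int32_t render_start[2], const int32_t render_end[2]) {
    //Restrict rasterization to the requested tile, so only those pixels run
    //the expensive per-pixel iteration in the fragment shader
    glEnable(GL_SCISSOR_TEST);
    glScissor(render_start[0], render_start[1],
              render_end[0] - render_start[0], render_end[1] - render_start[1]);
    //Assumed helper: binds the Mandelbrot shader and draws a quad covering the framebuffer
    draw_fullscreen_quad();
    glDisable(GL_SCISSOR_TEST);
}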
You might be wondering what the value of KERNEL_SIZE should be. That's something you'll have to experiment with on your own, as it depends entirely on how powerful your graphics card is. The value should be
- Small enough that no single task will ever take more than a fixed amount of time (I usually aim for 50 milliseconds, but as long as you keep it below half a second, it's usually safe)
- Large enough that you're not submitting hundreds of thousands of tiny tasks to the GPU. At a certain point, you'll spend more time synchronizing the Host←→GPU interface than actually doing work on the GPU, and since GPU architectures often have hundreds or even thousands of cores, if your tasks are too small, you'll lose speed simply by not saturating all the cores.
In my personal experience, the best way to determine it is to run a handful of "testing" renders before the program starts: render a 32x32 image of the central bulb of the Mandelbrot Set at 10,000 iterations of the escape algorithm (all at once, with no breaking up of the task), and see how long it takes. The algorithm I use essentially looks like this:
int32_t KERNEL_SIZE = 32;
std::chrono::nanoseconds duration{0};
while(KERNEL_SIZE < 2048 && duration < std::chrono::milliseconds(50)) {
    //duration_of is some code I've written to time the task. It's best to use GPU-based
    //profiling, as it'll be more accurate than host-profiling.
    duration = duration_of([&]{ render_whole_image(KERNEL_SIZE); });
    if(duration < std::chrono::milliseconds(50)) {
        if(is_power_of_2(KERNEL_SIZE)) KERNEL_SIZE += KERNEL_SIZE / 2;
        else KERNEL_SIZE += KERNEL_SIZE / 3;
    }
}
final_kernel_size = KERNEL_SIZE;
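If it helps, a minimal host-side sketch of duration_of could look like the following (as the comment above says, GPU-side timers such as OpenCL profiling events are more accurate; this host version only measures real GPU time if the callable ends with a glFinish or clFinish):

#include <chrono>
#include <utility>

//Times an arbitrary callable with the host's monotonic clock. The callable
//must synchronize with the GPU itself (glFinish/clFinish), or this will only
//measure how long it took to submit the work, not to run it.
template<typename F>
std::chrono::nanoseconds duration_of(F&& task) {
    auto start = std::chrono::steady_clock::now();
    std::forward<F>(task)();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
}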
The last thing I'd recommend is to use OpenCL for the heavy lifting of rendering the Mandelbrot set itself, and use OpenGL (including the OpenGL←→OpenCL Interop API!) to actually display the image on screen. OpenCL is, on a technical level, going to be neither faster nor slower than OpenGL, but it gives you a lot of control over the operations you perform, and it's easier to reason about what the GPU is doing (and what you need to do to alter its behavior) when you're using a more explicit API than OpenGL. You could, if you want to stick to a single API, use Vulkan instead, but since Vulkan is extremely low-level and thus very complicated to use, I don't recommend that unless you're up to the challenge.
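As a rough sketch of what that interop flow looks like (this assumes a cl_context created with cl_khr_gl_sharing enabled and an existing OpenGL texture tex that you then draw as a textured quad; render_into_gl_texture and enqueue_kernels are names I'm making up here):

#include <functional>
#include <CL/cl_gl.h> //clCreateFromGLTexture, clEnqueueAcquireGLObjects, etc.

//Hands an existing OpenGL texture to OpenCL, lets the (tiled) Mandelbrot
//kernels write into it, then hands it back to OpenGL for display.
//Assumes the OpenGL headers are already included for GLuint/GL_TEXTURE_2D.
void render_into_gl_texture(cl_context ctx, cl_command_queue queue, GLuint tex,
                            const std::function<void(cl_mem)>& enqueue_kernels) {
    cl_int err = CL_SUCCESS;
    cl_mem image = clCreateFromGLTexture(ctx, CL_MEM_WRITE_ONLY, GL_TEXTURE_2D, 0, tex, &err);
    glFinish(); //make sure OpenGL is done with the texture before OpenCL touches it
    clEnqueueAcquireGLObjects(queue, 1, &image, 0, nullptr, nullptr);
    enqueue_kernels(image); //submit the per-tile kernels that write into the image
    clEnqueueReleaseGLObjects(queue, 1, &image, 0, nullptr, nullptr);
    clFinish(queue); //make sure OpenCL is done before OpenGL samples the texture
    clReleaseMemObject(image);
}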
EDIT: A few other things:
- I'd have multiple versions of the program: one that renders with floats, and another that renders with doubles. In my version of this program, I actually have a version that uses two float values to simulate a double, as described here (see the sketch after this list). On most hardware this is slower, but on certain architectures (particularly NVidia's Maxwell architecture), if float processing is sufficiently fast, it can actually outperform double by sheer magnitude: on some GPU architectures, floats are 32x faster than doubles.
- You might be tempted to have an "adaptive" algorithm that dynamically adjusts the kernel size on the fly. This is more trouble than it's worth, and the time spent on the host reevaluating the next kernel size will outweigh any slight performance gains you'd otherwise achieve.
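For the two-floats-as-a-double idea, the core trick is roughly the following (a minimal sketch of just the addition; a real implementation also needs multiplication, subtraction, and comparisons, and in shader/kernel code you have to stop the compiler from reassociating the arithmetic, e.g. by avoiding fast-math options):

//"float-float" (a.k.a. double-single) value: hi holds the leading bits,
//lo holds the rounding error, so the represented value is hi + lo
struct ff { float hi; float lo; };

//Addition via Knuth's two-sum: it recovers the low-order bits that a plain
//float addition would round away
ff ff_add(ff a, ff b) {
    float s = a.hi + b.hi;
    float v = s - a.hi;
    float e = (a.hi - (s - v)) + (b.hi - v); //exact rounding error of s
    float lo = e + a.lo + b.lo;
    float hi = s + lo;
    return { hi, lo - (hi - s) };
}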