The first thing you need to understand is that "context switching" is different on GPUs (and, in general, most Heterogeneous architectures) than it is on CPU/Host architectures. When you submit a task to the GPU—in this case, "render my image"—the GPU will solely work on that task until completion.
There are a few details I'm abstracting away, naturally: NVidia hardware will try to schedule smaller tasks on unused cores, and all three major vendors (AMD, Intel, NVidia) have fine-tuned behaviors that complicate the above generalization, but as a matter of principle, you should assume that any task submitted to the GPU will consume the GPU's entire resources until it completes.
On its own, that's not a big problem.
But on Windows (and most consumer Operating Systems), if the GPU spends too much time on a single task, the OS will assume that the GPU isn't responding, and will do one of several things (or possibly some combination of them):
- Crash: doesn't happen so much anymore, but on older systems I have bluescreened my computers with over-ambitious Mandelbrot renders
- Reset the driver: which means you'll lose all OpenGL state, and is essentially unrecoverable from the program's perspective
- Abort the operation: Some newer device drivers are clever enough to simply kill the task rather than killing the entire context state. But this can depend on the specific API you're using: my OpenGL/GLSL based Mandelbrot programs tend to crash the driver, whereas my OpenCL programs usually have more elegant failures.
- Let it go to completion, without issue: This will only happen if the GPU in question is not the one the Operating System is using to drive the display. So this is only an option if you have more than one graphics card in your system and you explicitly ensure that rendering happens on the card not used by the OS, or if the card in question is a compute card that doesn't have display duties at all. In OpenGL, this is basically a non-starter, but if you were using OpenCL or Vulkan, this might be a potential work-around.
The exact timing varies, but you should generally assume that if a single task takes more than about 2 seconds (the default limit of Windows' Timeout Detection and Recovery mechanism), it'll crash the program.
So how do you fix this problem? Well, if this were an OpenCL-based render, it would be pretty easy:
std::vector<cl_event> events;
for(int32_t x = 0; x < WIDTH; x += KERNEL_SIZE) {
    for(int32_t y = 0; y < HEIGHT; y += KERNEL_SIZE) {
        int32_t render_start[2] = {x, y};
        int32_t render_end[2] = {std::min(WIDTH, x + KERNEL_SIZE), std::min(HEIGHT, y + KERNEL_SIZE)};
        events.emplace_back();
        //I'm abstracting the clEnqueueNDRangeKernel call
        submit_task(queue, kernel, render_start, render_end, &events.back(), /*...*/);
    }
}
//Block until every tile has finished
clWaitForEvents(static_cast<cl_uint>(events.size()), events.data());
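(submit_task isn't an OpenCL function, just my shorthand for the clEnqueueNDRangeKernel call; a minimal sketch of what it could forward to, ignoring the arguments I elided above, looks like this:)

//Hypothetical wrapper: runs the kernel over one tile by passing the tile's
//offset and size as the global offset/size of a 2D NDRange
void submit_task(cl_command_queue queue, cl_kernel kernel,
                 const int32_t render_start[2], const int32_t render_end[2],
                 cl_event* event) {
    size_t offset[2] = {static_cast<size_t>(render_start[0]), static_cast<size_t>(render_start[1])};
    size_t size[2] = {static_cast<size_t>(render_end[0] - render_start[0]),
                      static_cast<size_t>(render_end[1] - render_start[1])};
    //A null local size lets the driver pick the work-group shape
    clEnqueueNDRangeKernel(queue, kernel, 2, offset, size, nullptr, 0, nullptr, event);
}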
In OpenGL, you can use the same basic principle, but things are a bit more complicated because of how abstract the OpenGL model is. Drivers are wont to bundle multiple draw calls into a single command to the underlying hardware, so you need to explicitly make them behave themselves; otherwise the driver will batch everything together and you'll get the exact same problem, even though you've specifically written the code to break up the task.
for(int32_t x = 0; x < WIDTH; x += KERNEL_SIZE) {
    for(int32_t y = 0; y < HEIGHT; y += KERNEL_SIZE) {
        int32_t render_start[2] = {x, y};
        int32_t render_end[2] = {std::min(WIDTH, x + KERNEL_SIZE), std::min(HEIGHT, y + KERNEL_SIZE)};
        render_portion_of_image(render_start, render_end);
        //The call to glFinish is the important part: otherwise, even breaking up
        //the task like this, the driver might still try to bundle everything together!
        glFinish();
    }
}
The exact appearance of render_portion_of_image is something you'll need to design yourself, but the basic idea is to tell the program that only the pixels between render_start and render_end are to be rendered.
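As a minimal sketch of one way to do it, assuming your renderer draws a full-screen quad and the fragment shader does the escape-time iteration (draw_fullscreen_quad is a hypothetical helper), a scissor rectangle limits the expensive work to the requested tile:

void render_portion_of_image(const int32_t render_start[2], const int32_t render_end[2]) {
    //Restrict rasterization to the requested tile, so only those pixels run
    //the expensive per-pixel iteration in the fragment shader
    glEnable(GL_SCISSOR_TEST);
    glScissor(render_start[0], render_start[1],
              render_end[0] - render_start[0], render_end[1] - render_start[1]);
    //Assumed helper: binds the Mandelbrot shader and draws a quad covering the framebuffer
    draw_fullscreen_quad();
    glDisable(GL_SCISSOR_TEST);
}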
You might be wondering what the value of KERNEL_SIZE should be. That's something you'll have to experiment with on your own, as it depends entirely on how powerful your graphics card is. The value should be
- Small enough that no single task will ever take more than a fixed amount of time (I usually aim for 50 milliseconds, but as long as you keep it below half a second, it's usually safe)
- Large enough that you're not submitting hundreds of thousands of tiny tasks to the GPU. At a certain point, you'll spend more time synchronizing the Host←→GPU interface than actually doing work on the GPU, and since GPU architectures often have hundreds or even thousands of cores, if your tasks are too small, you'll lose speed simply by not saturating all the cores.
In my personal experience, the best way to determine it is to run a handful of "testing" renders before the program starts: render a 32x32 image of the central bulb of the Mandelbrot Set at 10,000 iterations of the escape algorithm (all at once, with no breaking up of the task), and see how long it takes. The algorithm I use essentially looks like this:
int32_t KERNEL_SIZE = 32;
std::chrono::nanoseconds duration{0};
while(KERNEL_SIZE < 2048 && duration < std::chrono::milliseconds(50)) {
    //duration_of is some code I've written to time the task. It's best to use GPU-based
    //profiling, as it'll be more accurate than host-profiling.
    duration = duration_of([&]{ render_whole_image(KERNEL_SIZE); });
    if(duration < std::chrono::milliseconds(50)) {
        if(is_power_of_2(KERNEL_SIZE)) KERNEL_SIZE += KERNEL_SIZE / 2;
        else KERNEL_SIZE += KERNEL_SIZE / 3;
    }
}
final_kernel_size = KERNEL_SIZE;
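If it helps, a minimal host-side sketch of duration_of could look like the following (as the comment above says, GPU-side timers such as OpenCL profiling events are more accurate; this host version only measures real GPU time if the callable ends with a glFinish or clFinish):

#include <chrono>
#include <utility>

//Times an arbitrary callable with the host's monotonic clock. The callable
//must synchronize with the GPU itself (glFinish/clFinish), or this will only
//measure how long it took to submit the work, not to run it.
template<typename F>
std::chrono::nanoseconds duration_of(F&& task) {
    auto start = std::chrono::steady_clock::now();
    std::forward<F>(task)();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
}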
The last thing I'd recommend is to use OpenCL for the heavy lifting of rendering the Mandelbrot set itself, and use OpenGL (including the OpenGL←→OpenCL Interop API!) to actually display the image on screen. OpenCL is, on a technical level, going to be neither faster nor slower than OpenGL, but it gives you a lot of control over the operations you perform, and it's easier to reason about what the GPU is doing (and what you need to do to alter its behavior) when you're using a more explicit API than OpenGL. You could, if you want to stick to a single API, use Vulkan instead, but since Vulkan is extremely low-level and thus very complicated to use, I don't recommend that unless you're up to the challenge.
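As a rough sketch of what that interop flow looks like (this assumes a cl_context created with cl_khr_gl_sharing enabled and an existing OpenGL texture tex that you then draw as a textured quad; render_into_gl_texture and enqueue_kernels are names I'm making up here):

#include <functional>
#include <CL/cl_gl.h> //clCreateFromGLTexture, clEnqueueAcquireGLObjects, etc.

//Hands an existing OpenGL texture to OpenCL, lets the (tiled) Mandelbrot
//kernels write into it, then hands it back to OpenGL for display.
//Assumes the OpenGL headers are already included for GLuint/GL_TEXTURE_2D.
void render_into_gl_texture(cl_context ctx, cl_command_queue queue, GLuint tex,
                            const std::function<void(cl_mem)>& enqueue_kernels) {
    cl_int err = CL_SUCCESS;
    cl_mem image = clCreateFromGLTexture(ctx, CL_MEM_WRITE_ONLY, GL_TEXTURE_2D, 0, tex, &err);
    glFinish(); //make sure OpenGL is done with the texture before OpenCL touches it
    clEnqueueAcquireGLObjects(queue, 1, &image, 0, nullptr, nullptr);
    enqueue_kernels(image); //submit the per-tile kernels that write into the image
    clEnqueueReleaseGLObjects(queue, 1, &image, 0, nullptr, nullptr);
    clFinish(queue); //make sure OpenCL is done before OpenGL samples the texture
    clReleaseMemObject(image);
}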
EDIT: A few other things:
- I'd have multiple versions of the program: one that renders with floats, and another that renders with doubles. In my version of this program, I actually have a version that uses two float values to simulate a double, as described here (see the sketch after this list). On most hardware this is slower, but on certain architectures (particularly NVidia's Maxwell architecture), if float processing is sufficiently fast, it can actually outperform double by sheer magnitude: on some GPU architectures, floats are 32x faster than doubles.
- You might be tempted to have an "adaptive" algorithm that dynamically adjusts the kernel size on the fly. This is more trouble than it's worth, and the time spent on the host reevaluating the next kernel size will outweigh any slight performance gains you'd otherwise achieve.
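For the two-floats-as-a-double idea, the core trick is roughly the following (a minimal sketch of just the addition; a real implementation also needs multiplication, subtraction, and comparisons, and in shader/kernel code you have to stop the compiler from reassociating the arithmetic, e.g. by avoiding fast-math options):

//"float-float" (a.k.a. double-single) value: hi holds the leading bits,
//lo holds the rounding error, so the represented value is hi + lo
struct ff { float hi; float lo; };

//Addition via Knuth's two-sum: it recovers the low-order bits that a plain
//float addition would round away
ff ff_add(ff a, ff b) {
    float s = a.hi + b.hi;
    float v = s - a.hi;
    float e = (a.hi - (s - v)) + (b.hi - v); //exact rounding error of s
    float lo = e + a.lo + b.lo;
    float hi = s + lo;
    return { hi, lo - (hi - s) };
}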