Performant method of drawing text onto a png file?

Question

I need to draw a two-dimensional grid of squares with centered text on them onto a (transparent) PNG file. The tiles need a sufficiently high resolution so that the text does not get pixelated too much.

For testing purposes I create a 2048x2048px 32-bit (transparency) PNG Image with 128x128px tiles like for example that one:

The problem is that I need to do this with reasonable performance. All methods I have tried so far took more than 100ms to complete, while I need this to take less than 10ms at most. Apart from that, the program generating these images needs to be cross-platform and support WebAssembly (but even if you have, for example, an idea how to do this using POSIX threads etc., I would gladly take that as a starting point, too).

Net5 Implementation

using System.Diagnostics;
using System;
using System.Drawing;

namespace ImageGeneratorBenchmark
{
    class Program
    {
        static int rowColCount = 16;
        static int tileSize = 128;
        static void Main(string[] args)
        {
            var watch = Stopwatch.StartNew();

            Bitmap bitmap = new Bitmap(rowColCount * tileSize, rowColCount * tileSize);
            Graphics graphics = Graphics.FromImage(bitmap);

            Brush[] usedBrushes = { Brushes.Blue, Brushes.Red, Brushes.Green, Brushes.Orange, Brushes.Yellow };

            int totalCount = rowColCount * rowColCount;
            Random random = new Random();

            StringFormat format = new StringFormat();
            format.LineAlignment = StringAlignment.Center;
            format.Alignment = StringAlignment.Center;

            for (int i = 0; i < totalCount; i++)
            {
                int x = i % rowColCount * tileSize;
                int y = i / rowColCount * tileSize;

                graphics.FillRectangle(usedBrushes[random.Next(0, usedBrushes.Length)], x, y, tileSize, tileSize);
                graphics.DrawString(i.ToString(), SystemFonts.DefaultFont, Brushes.Black, x + tileSize / 2, y + tileSize / 2, format);
            }

            bitmap.Save("Test.png");

            watch.Stop();
            Console.WriteLine($"Output took {watch.ElapsedMilliseconds} ms.");
        }
    }
}

This takes around 115ms on my machine. I am using the System.Drawing.Common nuget here.

Saving the bitmap takes roughly 55ms and drawing to the graphics object in the loop takes roughly 60ms, of which 40ms can be attributed to drawing the text.

Rust Implementation

use std::path::Path;
use std::time::Instant;
use image::{Rgba, RgbaImage};
use imageproc::{drawing::{draw_text_mut, draw_filled_rect_mut, text_size}, rect::Rect};
use rusttype::{Font, Scale};
use rand::Rng;

#[derive(Default)]
struct TextureAtlas {
    segment_size: u16, // The side length of the tile
    row_col_count: u8, // The amount of tiles in horizontal and vertical direction
    current_segment: u32 // Points to the next segment, that will be used 
}

fn main() {
    let before = Instant::now();

    let mut atlas = TextureAtlas {
        segment_size: 128,
        row_col_count: 16,
        ..Default::default()
    };

    let path = Path::new("test.png");
    let colors = vec![Rgba([132u8, 132u8, 132u8, 255u8]), Rgba([132u8, 255u8, 32u8, 120u8]), Rgba([200u8, 255u8, 132u8, 255u8]), Rgba([255u8, 0u8, 0u8, 255u8])];

    let mut image = RgbaImage::new(2048, 2048);

    let font = Vec::from(include_bytes!("../assets/DejaVuSans.ttf") as &[u8]);
    let font = Font::try_from_vec(font).unwrap();

    let font_size = 40.0;
    let scale = Scale {
        x: font_size,
        y: font_size,
    };

    // Draw random color rects for benchmarking
    for i in 0..256 {
        let rand_num = rand::thread_rng().gen_range(0..colors.len());

        draw_filled_rect_mut(
            &mut image, 
            Rect::at((atlas.current_segment as i32 % atlas.row_col_count as i32) * atlas.segment_size as i32, (atlas.current_segment as i32 / atlas.row_col_count as i32) * atlas.segment_size as i32)
                .of_size(atlas.segment_size.into(), atlas.segment_size.into()), 
            colors[rand_num]);

        let number = i.to_string();
        //let text = &number[..];
        let text = number.as_str(); // Somehow this conversion takes ~15ms here for 255 iterations, whereas it should normally only be less than 1us
        let (w, h) = text_size(scale, &font, text);
        draw_text_mut(
            &mut image, 
            Rgba([0u8, 0u8, 0u8, 255u8]), 
            (atlas.current_segment % atlas.row_col_count as u32) * atlas.segment_size as u32 + atlas.segment_size as u32 / 2 - w as u32 / 2, 
            (atlas.current_segment / atlas.row_col_count as u32) * atlas.segment_size as u32 + atlas.segment_size as u32 / 2 - h as u32 / 2, 
            scale, 
            &font, 
            text);

        atlas.current_segment += 1;
    }

    image.save(path).unwrap();

    println!("Output took {:?}", before.elapsed());
}

For Rust I was using the imageproc crate. Previously I used the piet-common crate, but the output took more than 300ms. With the imageproc crate I got around 110ms in release mode, which is on par with the C# version, but I think it will perform better with webassembly.

When I used a static string instead of converting the number from the loop (see comment) I got below 100ms execution time. For Rust drawing to the image only takes around 30ms, but saving it takes 80ms.

C++ Implementation

#include <iostream>
#include <cstdlib>
#include <ctime> // for time(), used to seed rand()
#define cimg_display 0
#define cimg_use_png
#include "CImg.h"
#include <chrono>
#include <string>

using namespace cimg_library;
using namespace std;

/* Generate random numbers in an inclusive range. */
int random(int min, int max)
{
    static bool first = true;
    if (first)
    {
        srand(time(NULL));
        first = false;
    }
    return min + rand() % ((max + 1) - min);
}

int main() {
    auto t1 = std::chrono::high_resolution_clock::now();

    static int tile_size = 128;
    static int row_col_count = 16;

    // Create 2048x2048px image.
    CImg<unsigned char> image(tile_size*row_col_count, tile_size*row_col_count, 1, 3);

    // Make some colours.
    unsigned char cyan[] = { 0, 255, 255 };
    unsigned char black[] = { 0, 0, 0 };
    unsigned char yellow[] = { 255, 255, 0 };
    unsigned char red[] = { 255, 0, 0 };
    unsigned char green[] = { 0, 255, 0 };
    unsigned char orange[] = { 255, 165, 0 };

    unsigned char colors [] = { // This is terrible, but I don't know C++ very well.
        cyan[0], cyan[1], cyan[2],
        yellow[0], yellow[1], yellow[2],
        red[0], red[1], red[2],
        green[0], green[1], green[2],
        orange[0], orange[1], orange[2],
    };

    int total_count = row_col_count * row_col_count;

    for (size_t i = 0; i < total_count; i++)
    {
        int x = i % row_col_count * tile_size;
        int y = i / row_col_count * tile_size;

        int random_color_index = random(0, 4);
        unsigned char current_color [] = { colors[random_color_index * 3], colors[random_color_index * 3 + 1], colors[random_color_index * 3 + 2] };

        image.draw_rectangle(x, y, x + tile_size, y + tile_size, current_color, 1.0); // Force use of transparency. -> Does not work. Always outputs 24bit PNGs.

        auto s = std::to_string(i);

        CImg<unsigned char> imgtext;
        unsigned char color = 1;
        imgtext.draw_text(0, 0, s.c_str(), &color, 0, 1, 40); // Measure the text by drawing to an empty instance, so that the bounding box will be set automatically.

        image.draw_text(x + tile_size / 2 - imgtext.width() / 2, y + tile_size / 2 - imgtext.height() / 2, s.c_str(), black, 0, 1, 40);
    }

    // Save result image as PNG (libpng and GraphicsMagick are required).
    image.save_png("Test.png");

    auto t2 = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();

    std::cout << "Output took " << duration << "ms.";
    getchar();
}

I also reimplemented the same program in C++ using CImg. For .png output libpng and GraphicsMagick are required, too. I am not very fluent in C++ and I did not even bother optimizing, because the save operation took ~200ms in Release mode, whereas the whole Image generation which is currently very unoptimized took only 30ms. So this solution also falls way short of my goal.

Where I am right now

A graph of where I am right now. I will update this when I make some progress.

Why I am trying to do this and why it bothers me so much

I was asked in the comments to give a bit more context. I know this question is getting a bit bloated, but if you are interested, read on...

So basically I need to build a Texture Atlas for a .gltf file. I need to generate a .gltf file from data and the primitives in the .gltf file will be assigned a texture based on the input data, too. In order to optimize for a small amount of draw calls I am putting as much geometry as possible into one single primitive and then use texture coordinates to map the texture to the model. Now GPUs have a maximum size, that the texture can have. I will use 2048x2048 pixels, because the majority of devices supports at least that. That means, that if I have more than 256 objects, I need to add a new primitive to the .gltf and generate another texture atlas. In some cases one texture atlas might be sufficient, in other cases I need up to 15-20.

The textures will have a (semi-)transparent background, maybe text and maybe some lines / hatches or simple symbols, that can be drawn with a path.

I have the whole system set up in Rust already and the .gltf generation is really efficient: I can generate 54000 vertices (=1500 boxes for example) in about 10ms, which is a common case. Now for this I need to generate 6 texture atlases, which is not really a problem on a multi-core system (7 threads: one for the .gltf, six for the textures). The problem is that generating one takes about 100ms (or now 55ms), which makes the whole process more than 5 times slower.

Unfortunately it gets even worse, because another common case is 15000 objects. Generating the vertices (plus a lot of custom attributes actually) and assembling the .gltf still only takes 96ms (540000 vertices / 20MB .gltf), but in that time I need to generate 59 texture atlases. I am working on an 8-core system, so at that point it becomes impossible to run them all in parallel and I will have to generate ~9 atlases per thread (which means 55ms*9 = 495ms), so again this is 5 times as much and actually creates a quite noticeable lag. In reality it currently takes more than 2.5 s, because I have not yet updated to use the faster code and there seems to be additional slowdown.
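To make the atlas arithmetic above concrete, here is a minimal Rust sketch. The type `AtlasLayout` and its method names are hypothetical (they are not part of my actual code), and it assumes one tile per object, filled row by row. With 2048px atlases and 128px tiles it reproduces the numbers from the text: 256 tiles per atlas, 6 atlases for 1500 objects and 59 atlases for 15000 objects.

```rust
/// Hypothetical helper describing one square atlas made of square tiles.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct AtlasLayout {
    /// Side length of the atlas texture in pixels (2048 in the question).
    pub atlas_px: u32,
    /// Side length of one tile in pixels (128 in the question).
    pub tile_px: u32,
}

impl AtlasLayout {
    /// Number of tiles along one side of the atlas.
    pub fn tiles_per_side(&self) -> u32 {
        self.atlas_px / self.tile_px
    }

    /// Number of tiles that fit into one atlas.
    pub fn tiles_per_atlas(&self) -> u32 {
        self.tiles_per_side() * self.tiles_per_side()
    }

    /// Number of atlases needed for `objects` tiles (rounded up).
    pub fn atlas_count(&self, objects: u32) -> u32 {
        let per_atlas = self.tiles_per_atlas();
        (objects + per_atlas - 1) / per_atlas
    }

    /// Returns the atlas index and the normalized rectangle
    /// `[u_min, v_min, u_max, v_max]` of the tile used by `object`.
    pub fn locate(&self, object: u32) -> (u32, [f32; 4]) {
        let side = self.tiles_per_side();
        let slot = object % self.tiles_per_atlas();
        let (col, row) = (slot % side, slot / side);
        let step = 1.0 / side as f32;
        let rect = [
            col as f32 * step,
            row as f32 * step,
            (col + 1) as f32 * step,
            (row + 1) as f32 * step,
        ];
        (object / self.tiles_per_atlas(), rect)
    }
}
```

For example, `AtlasLayout { atlas_px: 2048, tile_px: 128 }.locate(257)` places object 257 in the second atlas (index 1), in the second tile of its first row.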

What I need to do

I do understand that it will take some time to write out 4194304 32-bit pixels. But as far as I can see, because I am only writing to different parts of the image (for example only to the upper tile and so on) it should be possible to build a program that does this using multiple threads. That is what I would like to try and I would take any hint on how to make my Rust program run faster.
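A minimal sketch of what I have in mind, using only the Rust standard library: the pixel buffer is split into disjoint horizontal bands (one band per row of tiles) and every band is filled on its own scoped thread, so no locking is needed. The function name `fill_grid_parallel` is hypothetical, the colors are picked by tile index instead of randomly, and text drawing and PNG encoding are left out. Whether this actually beats a single thread for work this small would have to be measured.

```rust
use std::thread;

/// Fills a `tiles x tiles` grid of square tiles into an RGBA8 buffer.
/// Each row of tiles is a contiguous band of bytes, so the bands can be
/// handed to separate threads as non-overlapping mutable slices.
pub fn fill_grid_parallel(tiles: usize, tile_px: usize, colors: &[[u8; 4]]) -> Vec<u8> {
    assert!(tiles > 0 && tile_px > 0 && !colors.is_empty());

    let row_bytes = tiles * tile_px * 4; // one pixel row of the whole image
    let band_bytes = row_bytes * tile_px; // one full row of tiles
    let mut pixels = vec![0u8; band_bytes * tiles];

    thread::scope(|scope| {
        for (band_idx, band) in pixels.chunks_mut(band_bytes).enumerate() {
            scope.spawn(move || {
                for line in band.chunks_mut(row_bytes) {
                    for (tile_x, tile) in line.chunks_mut(tile_px * 4).enumerate() {
                        // Deterministic stand-in for the random color choice.
                        let color = colors[(band_idx * tiles + tile_x) % colors.len()];
                        for px in tile.chunks_exact_mut(4) {
                            px.copy_from_slice(&color);
                        }
                    }
                }
            });
        }
    });

    pixels
}
```

Calling `fill_grid_parallel(16, 128, &colors)` would produce the 2048x2048 test image from above with 16 threads, one per row of tiles.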

If it helps I would also be willing to rewrite this in C or any other language, that can be compiled to wasm and can be called via Rust's FFI. So if you have suggestions for more performant libraries I would be very thankful for that too.

Edit

Update 1: I made all the suggested improvements for the C# version from the comments. Thanks for all of them. It is now at 115ms and almost exactly as fast as the Rust version, which makes me believe I am sort of hitting a dead end there and I would really need to find a way to parallelize this in order to make significant further improvements...

Update 2: Thanks to @pinkfloydx33 I was able to run the binary with around 60ms (including the first run) after publishing it with dotnet publish -p:PublishReadyToRun=true --runtime win10-x64 --configuration Release.

In the meantime I also tried other methods myself, namely Python with Pillow (~400ms), C# and Rust both with Skia (~314ms and ~260ms) and I also reimplemented the program in C++ using CImg (and libpng as well as GraphicsMagick).

I like all the effort put into this question. However, it is sadly off-topic. — vallentin, Feb 17 '21 at 11:53
Can you tell whether the FillRectangle or the DrawString is significantly more expensive? Try using only either one for a test. — PMF, Feb 17 '21 at 11:56
What is your purpose in doing this? Do you want to implement it in C# language or another language? — Meysam Asadi, Feb 17 '21 at 11:58
Sorry if I didn’t mention this too clearly, but I would want to compile it to wasm from any language that can do so. — frankenapps, Feb 17 '21 at 11:59
Not that it's gonna help much but you can cache the calculation of `Math.Pow()` outside the loop rather than recalculation each time. `i.ToString` cached maybe? Standard practice: avoid creating a `new Random`, especially in each iteration. Create it outside the loop or as a static field. I **doubt** it would be faster than GDI/native but you maybe could `LockBits` and manually color in the rectangles and use native for the text only — pinkfloydx33, Feb 17 '21 at 12:00
You could try to replace the measuring part by using a centered stringformat and the rectangle — TaW, Feb 17 '21 at 12:06
You might also try this: https://stackoverflow.com/a/26498/491907 to just center the text vertically and horizontally rather than recalculation each time — pinkfloydx33, Feb 17 '21 at 12:06
@pinkfloydx33: LockBits/UnlockBits is orders of magnitude faster than using the default drawing methods, but of course you can't use the GDI functions then. For drawing rectangles, that would likely do, but for drawing text that's quite a bit of work to do manually. — PMF, Feb 17 '21 at 12:12
Your original solution runs at ~380ms on my machine; I can get it down to ~70ms (on my machine) just using the suggestions from above. You might want to try that out and see how it fares (looks like your machine is likely better than mine). I tried using LockBits+unsafe to draw the rectangles and then GDI for the text but it does worse — pinkfloydx33, Feb 17 '21 at 13:18
I used all suggestions except LockBits and Updated the question (115ms) with text and rects via GDI. When you ran the original solution, did you try running it multiple times? There seems to be some caching or whatever that brought it down to 140 ms after it initially took 380ms for me, too. — frankenapps, Feb 17 '21 at 13:22
Don't use threads. With these kinds of timings, you would lose more time synchronizing the threads than you could possibly win from parallelizing over multiple cores. — Jmb, Feb 17 '21 at 13:32
Yes you have to run it multiple times for the JIT to kick in. Run each method about 15-20 times in a loop FIRST then discard and run your method again (poor man's benchmark). Also make sure you are running a RELEASE build and you are NOT doing it from within visual studio (run from command line) — pinkfloydx33, Feb 17 '21 at 13:36
Also, you should use the overloads of both `FillRectangle` and `DrawString` that accept a `Rectangle`. For some reason I got better results there — pinkfloydx33, Feb 17 '21 at 13:37
@pinkfloydx33 Yes I do run in Release mode (but there is no difference time-wise) and run it from the terminal. Interestingly using the `Rectangle` and `PointF` (for `DrawString`) overloads had no effect for me. — frankenapps, Feb 17 '21 at 13:46
Here's a profile of the Rust version: https://share.firefox.dev/3audyzg. You can see that the bulk of the time is spent saving the png and not doing the drawing. — Jeff Muizelaar, Feb 17 '21 at 15:10
Changing the program to output a '.bmp' file brought the Rust run time down from >100ms to 61ms but half the time is still spent saving the .bmp file. https://share.firefox.dev/37nqLYE — Jeff Muizelaar, Feb 17 '21 at 15:16
@JeffMuizelaar Yes, changing it to .bmp definitely saves about 25%. Unfortunately that does not really help me, because I need a PNG or JPG (takes >200ms). It is interesting though, that the .bmp has 16MB of data written out in 33ms, while it takes 60ms for the 388KB .png. By the way: .tga files are another 10 - 15% faster for whatever reason (file size is exactly the same as with .bmp). — frankenapps, Feb 17 '21 at 15:29
I've been able to get it down on my machine to 40-50ms by using LockBits, and Span by Filling a buffer of the width of the image x 1px and copying it `size` times and repeating. Gonna mess with it some more after work and see if I can tweak anything more out of it. But I suspect it's gonna always be the Save step that kills it — pinkfloydx33, Feb 17 '21 at 20:12
@frankenapps I've got it now to 4-5ms for **all** of the drawing (boxes and text), though this depended on using a generic monospace font rather than the system default. The saving still takes ~35ms which I was able to tweak a bit by saving to a `FileStream` using a custom buffer size. If these sound Ok to you, I'll post an answer but I don't think you're going to get any faster... the Save to file or stream is all handled by native interop — pinkfloydx33, Feb 17 '21 at 23:15
@pinkfloydx33 Yes, please. As I said I gladly take any help I can get. At this point it's less than a third of what it has been, so I think that is already some quite impressive progress, thank you very much. I still don't quite understand why it takes 35ms to write out ~100KB of data. When I can write a 500KB File like this `byte[] bytes = new byte [500000]; Array.Fill(bytes, 20); System.IO.File.WriteAllBytesAsync("Data.png", bytes);` in 5ms. It seems there still is something going on behind the scenes... — frankenapps, Feb 18 '21 at 07:40
@pinkfloydx33 How dynamic is this whole thing? Meaning, are there sets of outputs with the same number of squares, text, etc.? Caching images might be an option if it's not too dynamic in nature. e.g. if the word HELLO appears many times (in or across images) and it can be on a blue background, well there's a nice chunk of processing saved. — Kit, Feb 18 '21 at 14:57
@Kit I think you're asking the wrong person... I'm not the OP. But that's essentially how my answer below sped things up. Figure out what 1px high looks like then block copy for the next N rows. I have no clue if the OP is really just generating random boxes or what though — pinkfloydx33, Feb 18 '21 at 15:31
@frankenapps Care to comment re my comments above? I'm just curious at this point. — Kit, Feb 18 '21 at 16:15
@Kit At that point, the question went so far off-topic, I did not mind to also include a pretty thorough explanation, on why I am trying to do this. — frankenapps, Feb 18 '21 at 17:06
I tried to use Direct2D to at least get good performance on Windows, but without great success. My last resort for now is to maybe try to use the .ktx texture format. But I won't have much time for now to move on with this... — frankenapps, Feb 18 '21 at 17:30
@frankenapps I was able to get this down to <20ms on my machine. But I ended up using an indexed pixel format. The numbers aren't as clear as they could be, but you could likely tweak it out. Since you can't use a graphics object on an indexed bitmap I just drew a second bitmap with text and then looped over the pixels translating them onto the indexed form. Because of the slightly smaller size of the format and no GDI objects, saving the image is a lot faster. If that's of interest to you I can amend my answer. — pinkfloydx33, Feb 28 '21 at 12:50
Wow, Yes I am definitely still interested. As I said I did not have much time for this lately, but I still haven’t been able to get much further with this and I will have to optimize it eventually because it still bothers me a lot. Thank you for all the time you have invested to help me with this topic. — frankenapps, Feb 28 '21 at 13:54

pinkfloydx33 · Accepted Answer · 2021-03-03T11:54:03.770

I was able to get all of the drawing (creating the grid and the text) down to 4-5ms by:

Caching values where possible (Random, StringFormat, Math.Pow)
Using ArrayPool for scratch buffer
Using the DrawString overload accepting a StringFormat with the following options:
- Alignment and LineAlignment for centering (in lieu of manually calculating)
- FormatFlags and Trimming options that disable things like overflow/wrapping since we are just writing small numbers (this had an impact, though negligible)
Using a custom Font from the GenericMonospace font family instead of SystemFonts.DefaultFont
- This shaved off ~15ms
Fiddling with various Graphics options, such as TextRenderingHint and SmoothingMode
- I got varying results so you may want to fiddle some more
An array of Color and the ToArgb function to create an int representing the 4x bytes of the pixel's color
Using LockBits, (semi-)unsafe code and Span to
- Fill a buffer representing 1px high and size * count px wide (the entire image width) with the int representing the ARGB values of the random colors
- Copy that buffer size times (now representing an entire square in height)
- Rinse/Repeat
- unsafe was required to create a Span<> from the locked bit's Scan0 pointer
Finally, using GDI/native to draw the text over the graphic
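Since the question's atlas generator is written in Rust, here is a rough Rust sketch of the row-buffer part of the steps above (not the GDI text drawing). The function names are hypothetical, `pack_argb` mirrors the 0xAARRGGBB layout that `Color.ToArgb()` returns, and the `pick` callback stands in for the random color choice. The idea is the same as in the C# code that follows: build one pixel row for a whole row of squares, then block-copy it `size` times.

```rust
/// Packs a color into one integer using the 0xAARRGGBB layout.
pub fn pack_argb(a: u8, r: u8, g: u8, b: u8) -> u32 {
    u32::from_be_bytes([a, r, g, b])
}

/// Fills a `count x count` grid of `size`-pixel squares.
/// `pick` receives the square's index and returns its packed color.
pub fn fill_grid_by_rows(
    count: usize,
    size: usize,
    mut pick: impl FnMut(usize) -> u32,
) -> Vec<u32> {
    assert!(count > 0 && size > 0);

    let ints_per_row = count * size;
    let mut image = vec![0u32; ints_per_row * ints_per_row];
    let mut row = vec![0u32; ints_per_row]; // scratch buffer, 1px high

    for (tile_y, band) in image.chunks_mut(ints_per_row * size).enumerate() {
        // Fill the scratch buffer with one pixel row of colored squares.
        for (tile_x, tile) in row.chunks_mut(size).enumerate() {
            tile.fill(pick(tile_y * count + tile_x));
        }
        // Duplicate that row until the squares have their full height.
        for line in band.chunks_mut(ints_per_row) {
            line.copy_from_slice(&row);
        }
    }

    image
}
```

The color callback runs only `count * count` times; everything else is `fill` and `copy_from_slice` on contiguous slices, which is what makes this cheaper than setting pixels one by one.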

I was then able to shave a little bit of time off of the actual saving process by using the Image.Save(Stream) overload. I used a FileStream with a custom buffer-size of 16kb (over the default 4kb) which seemed to be the sweet spot. This brought the total end-to-end time down to around 40ms (on my machine).
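For the Rust side, the closest equivalent I know of to a `FileStream` with a custom buffer size is `BufWriter::with_capacity`. The sketch below is only an illustration under assumptions: `write_chunks` and `save_chunks` are hypothetical names, the chunks stand in for whatever small writes an encoder issues, and the 16 KiB value is simply the size that worked for me in C#. Whether it helps at all depends on how the encoder writes its output.

```rust
use std::fs::File;
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Writes all chunks through a buffer of `capacity` bytes and returns the
/// underlying sink once the buffer has been flushed.
pub fn write_chunks<'a, W: Write>(
    sink: W,
    capacity: usize,
    chunks: impl IntoIterator<Item = &'a [u8]>,
) -> io::Result<W> {
    let mut writer = BufWriter::with_capacity(capacity, sink);
    for chunk in chunks {
        writer.write_all(chunk)?;
    }
    // `into_inner` flushes the remaining bytes before handing back the sink.
    writer.into_inner().map_err(|e| e.into_error())
}

/// Same as above, but targeting a file with a 16 KiB buffer.
pub fn save_chunks<'a>(
    path: &Path,
    chunks: impl IntoIterator<Item = &'a [u8]>,
) -> io::Result<()> {
    let file = File::create(path)?;
    write_chunks(file, 16 * 1024, chunks)?;
    Ok(())
}
```

Because `write_chunks` accepts any `Write` implementation, the same code can be timed against an in-memory `Vec<u8>` to separate encoding cost from file I/O.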

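// Usings required by this snippet (place them at the top of the file):
using System;
using System.Buffers;
using System.Drawing;
using System.Drawing.Drawing2D;
using System.Drawing.Imaging;
using System.Drawing.Text;
using System.IO;
using System.Runtime.InteropServices;

// The members below belong inside a class, e.g. Program.
private static readonly Random Random = new();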
private static readonly Color[] UsedColors = { Color.Blue, Color.Red, Color.Green, Color.Orange, Color.Yellow };
private static readonly StringFormat Format = new()
{
    Alignment = StringAlignment.Center, 
    LineAlignment = StringAlignment.Center,
    FormatFlags = StringFormatFlags.NoWrap | StringFormatFlags.FitBlackBox | StringFormatFlags.NoClip,
    Trimming = StringTrimming.None, HotkeyPrefix = HotkeyPrefix.None
};

private static unsafe void DrawGrid(int count, int size, bool save)
{

    var intsPerRow = size * count;
    var sizePerFullRow = intsPerRow * size;
    var colorsLen = UsedColors.Length;

    using var bitmap = new Bitmap(intsPerRow, intsPerRow, PixelFormat.Format32bppArgb);

    var bmpData = bitmap.LockBits(new Rectangle(0, 0, bitmap.Width, bitmap.Height), ImageLockMode.WriteOnly, PixelFormat.Format32bppArgb);

    var byteSpan = new Span<byte>(bmpData.Scan0.ToPointer(), Math.Abs(bmpData.Stride) * bmpData.Height);
    var intSpan = MemoryMarshal.Cast<byte, int>(byteSpan);

    var arr = ArrayPool<int>.Shared.Rent(intsPerRow);
    var buff = arr.AsSpan(0, intsPerRow);

    for (int y = 0, offset = 0; y < count; ++y)
    {
        // fill buffer with an entire 1px row of colors
        for (var bOffset = 0; bOffset < intsPerRow; bOffset += size)
            buff.Slice(bOffset, size).Fill(UsedColors[Random.Next(0, colorsLen)].ToArgb());

        // duplicate the pixel high row until we've created a row of squares in full
        var len = offset + sizePerFullRow;
        for ( ; offset < len; offset += intsPerRow)
            buff.CopyTo(intSpan.Slice(offset, intsPerRow));
    }

    ArrayPool<int>.Shared.Return(arr);

    bitmap.UnlockBits(bmpData);

    using var graphics = Graphics.FromImage(bitmap);

    graphics.TextRenderingHint = TextRenderingHint.ClearTypeGridFit;

    // some or all of these may not even matter?
    // you may try removing/modifying the rest
    graphics.CompositingQuality = CompositingQuality.HighSpeed;
    graphics.InterpolationMode = InterpolationMode.Default;
    graphics.SmoothingMode = SmoothingMode.HighSpeed;
    graphics.PixelOffsetMode = PixelOffsetMode.HighSpeed;
    
    // Font wraps a native handle, so dispose it when the method exits.
    using var font = new Font(FontFamily.GenericMonospace, 14, FontStyle.Regular);

    var lenSquares = count * count;
    for (var i = 0; i < lenSquares; ++i)
    {
        var x = i % count * size;
        var y = i / count * size;

        var rect = new Rectangle(x, y, size, size);
        graphics.DrawString(i.ToString(), font, Brushes.Black, rect, Format);   
    }

    if (save)
    {
        using var fs = new FileStream("Test.png", FileMode.Create, FileAccess.Write, FileShare.Write, 16 * 1024);
        bitmap.Save(fs, ImageFormat.Png);
    }
}

Here are the timings (in ms) using a Stopwatch in Release mode, run outside of Visual Studio. At least the first 1 or 2 timings should be ignored since the methods aren't fully jitted yet. Your mileage will vary depending on your PC, etc.

Image generation only:

Elapsed: 38
Elapsed: 6
Elapsed: 4
Elapsed: 4
Elapsed: 4
Elapsed: 4
Elapsed: 5
Elapsed: 4
Elapsed: 5
Elapsed: 4
Elapsed: 4

Image Generation and saving:

Elapsed: 95
Elapsed: 48
Elapsed: 41
Elapsed: 40
Elapsed: 37
Elapsed: 42
Elapsed: 42
Elapsed: 39
Elapsed: 38
Elapsed: 40
Elapsed: 41
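The way these numbers were collected (call the method repeatedly and ignore the first runs) can be sketched as a tiny harness. It is written in Rust here to match the question's atlas generator; `measure` and `median` are hypothetical helper names. Rust is compiled ahead of time, so there is no JIT to wait for, but discarding a few warm-up runs still tends to make the timings more stable.

```rust
use std::time::{Duration, Instant};

/// Runs `work` `warmup` times without timing it, then times `runs` more calls.
pub fn measure(warmup: usize, runs: usize, mut work: impl FnMut()) -> Vec<Duration> {
    for _ in 0..warmup {
        work();
    }
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .collect()
}

/// Returns the median sample (upper median for an even count),
/// or `None` if there are no samples.
pub fn median(samples: &[Duration]) -> Option<Duration> {
    let mut sorted = samples.to_vec();
    sorted.sort();
    sorted.get(sorted.len() / 2).copied()
}
```

Something like `median(&measure(20, 11, || draw_grid()))` would then report one representative number per variant, where `draw_grid` is a placeholder for the code under test.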

I don't think there is anything that can be done about the slow save. I reviewed the source code of Image.Save. It calls into Native/GDI, passing in a Handle to the Stream, the native image pointer and the Guid representing PNG's ImageCodecInfo (encoder). Any slowness is going to be on that end. Update: I have verified that you get the same slow speed when saving to a MemoryStream so this has nothing to do with the fact you are saving to a file and everything to do with what's going on behind the scenes with GDI/native.

I also attempted to get the Image drawing down further using direct unsafe (pointers) and/or tricks with Unsafe and MemoryMarshal (ex. CopyBlock) as well as unrolling the loops. Those methods either produced identical results or worse and made things a bit harder to follow.

Note: Publishing as a console application with PublishReadyToRun=true seems to help a bit as well.

Update

I realize that the above is just an example, so this may not apply to your end goal. Upon further, extensive review I found that the bulk of the time spent is actually part of Image::Save. It doesn't matter what type of Stream we are saving to, even MemoryStream exhibits the same slowness (obviously disregarding file I/O). I am confident this is related to having GDI objects in the Image/Graphics--in our case the text from DrawString.

As a "simple" test I updated the above so that drawing of the text happened on a secondary image of all white. Without saving that image, I then looped over its individual pixels and, based on the rough color (since we have anti-aliasing to deal with), I manually set the corresponding pixel on the primary bitmap. The entire end-to-end process took sub 20ms on my machine. The rendered image wasn't perfect since it was a quick test, but it proves that you can do parts of this manually and still achieve really low times. The problem is the text drawing, but we can leverage GDI without actually using it in our final image. You just need to find the sweet spot. I also tried using an indexed format, and populating the palette with colors beforehand also appeared to help some. Anyways, just food for thought.
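A rough Rust sketch of that pixel loop, under assumptions: the text layer is taken to be a grayscale buffer (255 = white background, 0 = black text) of the same dimensions as the RGBA image, and `stamp_text` is a hypothetical name. My quick test only set pixels based on the rough color; the variant below additionally darkens by coverage so that anti-aliased edges stay smooth, which is a swapped-in refinement and not what I measured.

```rust
/// Stamps black text from a grayscale layer onto an RGBA8 image.
/// Pixels at or above `threshold` count as background and are skipped.
pub fn stamp_text(rgba: &mut [u8], text_gray: &[u8], threshold: u8) {
    assert_eq!(rgba.len(), text_gray.len() * 4);

    for (px, &gray) in rgba.chunks_exact_mut(4).zip(text_gray) {
        if gray >= threshold {
            continue; // (near) white: keep the tile color as it is
        }
        let ink = 255 - gray as u16; // text coverage, 1..=255
        for channel in &mut px[..3] {
            // Black ink: scale the existing color by (1 - coverage).
            *channel = (*channel as u16 * (255 - ink) / 255) as u8;
        }
        // Text should stay visible on (semi-)transparent tiles.
        px[3] = px[3].max(ink as u8);
    }
}
```

With a threshold close to 255, most of the image is skipped by the first branch, so the loop cost is dominated by a single pass over the text layer.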

May I ask what .Net version you use? I use .Net5 and run the program from the command line with `dotnet run --configuration Release` and I get an overall best time with saving of 103ms and without a minimum of 40ms (so 60ms for saving). I doubt it's only my machine, which is that much worse. I am on Windows 10 btw. — frankenapps, Feb 18 '21 at 12:02
Net5.0 in release mode on a 4-5 year old i7 3.4ghz. How are you timing? You need to loop over the method calling it many times. The first few results are going to be slow because the JIT hasn't optimized/replaced the method body yet — pinkfloydx33, Feb 18 '21 at 12:05
Ah, I see. I thought calling the program multiple times would be sufficient, but it's not. When I call the method 20 times before measuring it takes 60ms. — frankenapps, Feb 18 '21 at 12:07
Correct the JIT doesn't happen between program runs, it's within a single run of the application. — pinkfloydx33, Feb 18 '21 at 12:08
Actually that gave me another idea. I will try to build it using `CoreRT` and see if it maybe performs even better on the first run then. — frankenapps, Feb 18 '21 at 12:09
If you're building a net5 application (vs library) consider compiling it with ReadyToRun enabled. https://learn.microsoft.com/en-us/dotnet/core/deploying/ready-to-run — pinkfloydx33, Feb 18 '21 at 12:10
Also the chosen font, font size and TextRenderingHint had definite impacts. You can try changing the hint to not use clear text and pick a smaller font for example. The buffer size of the file stream could also impact saving. 16kb seemed to be the sweet spot for me, but you might try 8k or 32kb for example — pinkfloydx33, Feb 18 '21 at 12:13
@frankenapps please keep me posted on your results. Regretfully I've spent a bit too much time on this already so I can't really invest any more to try and tune it further. However I find myself somewhat obsessed with the outcome and hope that you hit your goal (or close) — pinkfloydx33, Feb 18 '21 at 12:20
I have just updated the question with my most recent findings. Unfortunately all I tried so far did not get me anywhere near where I am now thanks to your answer. I totally understand that you already invested way too much of your time. Thank you very much. This is already more than I expected to be possible. — frankenapps, Feb 18 '21 at 12:42
@frankenapps I'm out for my morning walk. I've made a few more micro optimizations that I missed. Please check the update (hopefully it still compiles! If not I'll rollback when I get home) — pinkfloydx33, Feb 18 '21 at 12:46
There was a compilation error that I fixed. It now runs slightly faster! — pinkfloydx33, Feb 18 '21 at 12:50
@frankenapps `FontFamily.GenericSansSerif` seems to be a little quicker though I'm not keen on the quality. A smaller font sped up as well (if that's ok for you). Also note, that the default buffer size for `Stream.CopyTo` is `81920` suggesting that it *could* be a better buffer size in your case for the `FileStream`. Anyways, I urge you to tweak the various properties that I've commented on both in code comments and *these* comments to see what/where you can improve — pinkfloydx33, Feb 18 '21 at 13:01