3

What is the fastest way we can count how many transparent pixels exist in CIImage/UIImage?

For example:

[example image]

My first thought, if we're speaking about efficiency, is to use a Metal kernel (via CIColorKernel or the like), but I can't understand how to use it to output a "count".

Also other ideas I had in mind:

  1. Use some kind of average color: the "redder" the average, the more non-transparent pixels? Maybe some linear calculation depending on the image size (using the CIAreaAverage CIFilter)?
  2. Count pixels one by one and check the RGBA values?
  3. Use Metal's parallel capabilities, similar to this post: Counting coloured pixels on the GPU - Theory?
  4. Scale down the image and then count? Or do any of the approaches above on the scaled-down version, and multiply the result back up according to the scale-down proportions?

What is the fastest way to achieve this count?

Roi Mulia

3 Answers

4

To answer your question of how to do it in Metal: you would use a device atomic_int.

Essentially you create an MTLBuffer holding a 32-bit integer (to match the kernel's atomic_int), pass it to your kernel, and increment it with atomic_fetch_add_explicit.

Create buffer once:

// Int32 so the size matches the kernel's 32-bit atomic_int
var bristleCounter: Int32 = 0
counterBuffer = device.makeBuffer(bytes: &bristleCounter, length: MemoryLayout<Int32>.size, options: [.storageModeShared])

Reset counter to 0 and binding counter buffer:

var z: Int32 = 0
counterBuffer.contents().copyMemory(from: &z, byteCount: MemoryLayout<Int32>.size)
kernelEncoder.setBuffer(counterBuffer, offset: 0, index: 0)

Kernel:

kernel void myKernel (device atomic_int *counter [[buffer(0)]]) {}

Increment counter in Kernel (and get the value):

int newCounterValue = atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
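
For completeness, the glue between these snippets is the usual compute-pipeline setup and dispatch. The following is only a sketch using the names from this answer (kernelBuffer, kernelEncoder, counterBuffer); imageWidth/imageHeight, the one-thread-per-pixel grid, and the texture binding are assumptions, not part of the original answer:

// Sketch: build the pipeline for the kernel above and dispatch one thread per pixel.
let device = MTLCreateSystemDefaultDevice()!
let library = device.makeDefaultLibrary()!
let pipelineState = try! device.makeComputePipelineState(function: library.makeFunction(name: "myKernel")!)

let commandQueue = device.makeCommandQueue()!
let kernelBuffer = commandQueue.makeCommandBuffer()!      // command buffer
let kernelEncoder = kernelBuffer.makeComputeCommandEncoder()!
kernelEncoder.setComputePipelineState(pipelineState)
// counterBuffer is bound at index 0 as shown above; a real kernel would also
// take the input texture (e.g. kernelEncoder.setTexture(sourceTexture, index: 0))
// and only increment the counter for pixels whose alpha is 0.

// Launch one thread per pixel (imageWidth/imageHeight assumed known).
let threadsPerGroup = MTLSize(width: 16, height: 16, depth: 1)
let threadgroups = MTLSize(width: (imageWidth + 15) / 16,
                           height: (imageHeight + 15) / 16,
                           depth: 1)
kernelEncoder.dispatchThreadgroups(threadgroups, threadsPerThreadgroup: threadsPerGroup)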

Get the counter on the CPU side:

kernelEncoder.endEncoding()
kernelBuffer.commit()
kernelBuffer.waitUntilCompleted()
    
//Counter from kernel now in counterBuffer
let counterValue = counterBuffer.contents().load(as: Int32.self)
print("Counter: \(counterValue)")
Jeshua Lacock
  • The problem with this is that each of the hundreds of GPU cores needs to read and write _the same value_ from the global address space. Even when using `atomic` intrinsics, you still (a) block any parallel execution since only one core can access the value at a time and (b) cause a lot of latency when accessing global memory. – Frank Rupprecht Jun 24 '21 at 13:23
  • You want to race? On any modern chip it is plenty fast even with large images. The question originally asks about how it could be implemented in Metal so it is pertinent even if you think your approach is faster (and I am assuming that is only a guess). – Jeshua Lacock Jun 24 '21 at 21:32
  • Metal is also the most flexible approach. Additional functionality might not be needed at the moment, but with this approach it would be quite straightforward to implement additional capabilities as needed. It is infinitely customizable. – Jeshua Lacock Jun 24 '21 at 21:35
  • I didn't mean to personally offend you, I'm sorry! The question was asking about the most efficient solution, so I thought it right to spend 20 minutes to list the alternatives and discuss their pros and cons. If desired I can probably invest more time to write sample code, but as I said in my answer, it would ideally depend on the surrounding use case (where the data is coming from and where the result is used). – Frank Rupprecht Jun 25 '21 at 09:45
  • And you are right, the solution you provided is definitely working and the compiler and scheduler might help to make it run reasonably fast. However, I still don't think it's a good solution since it violates multiple GPU programming best practices. – Frank Rupprecht Jun 25 '21 at 09:49
  • Well, without providing performance comparisons or source code, it's really just all theory. In practice, in my experience, using Metal to do tasks like this is plenty fast for real-time applications. It's not that I took your downvote personally, it's that this question asks how it could be done in Metal and has the metal tag, and I provided complete and fully working code which does not deserve a downvote in the spirit of SO. – Jeshua Lacock Jun 25 '21 at 18:15
3

What you want to perform is a reduction operation, which is not necessarily well-suited for the GPU due to its massively parallel nature. I'd recommend not writing a reduction operation for the GPU yourself, but rather use some highly optimized built-in APIs that Apple provides (like CIAreaAverage or the corresponding Metal Performance Shaders).

The most efficient way depends a bit on your use case, specifically where the image comes from (loaded via UIImage/CGImage or the result of a Core Image pipeline?) and where you'd need the resulting count (on the CPU/Swift side or as an input for another Core Image filter?).
It also depends on whether the pixels can be semi-transparent (alpha neither 0.0 nor 1.0).

If the image is on the GPU and/or the count should be used on the GPU, I'd recommend using CIAreaAverage. The alpha value of the result reflects the fraction of opaque pixels, so 1 - alpha gives you the fraction of transparent pixels. Note that this only works if there are no semi-transparent pixels.
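
A minimal sketch of that approach could look like the following (image is assumed to be your input CIImage; the 8-bit readback is enough for a percentage, use .RGBAf if you need more precision):

import CoreImage

// Average the whole image down to a single pixel, then read its alpha.
// With only fully transparent (0) and fully opaque (1) pixels, the average
// alpha equals the fraction of opaque pixels.
let filter = CIFilter(name: "CIAreaAverage", parameters: [
    kCIInputImageKey: image,
    kCIInputExtentKey: CIVector(cgRect: image.extent)
])!
let averaged = filter.outputImage!

let context = CIContext(options: [.workingColorSpace: NSNull()])
var pixel = [UInt8](repeating: 0, count: 4)
context.render(averaged,
               toBitmap: &pixel,
               rowBytes: 4,
               bounds: CGRect(x: 0, y: 0, width: 1, height: 1),
               format: .RGBA8,
               colorSpace: nil)

let opaqueFraction = Double(pixel[3]) / 255.0
let transparentPixelCount = (1.0 - opaqueFraction) * Double(image.extent.width) * Double(image.extent.height)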

The next best solution is probably just iterating over the pixel data on the CPU. It might be a few million pixels, but the operation itself is very fast, so this should take almost no time. You could even use multi-threading by splitting the image into chunks and using concurrentPerform(...) of DispatchQueue.
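
A sketch of that CPU approach, assuming you have an RGBA8 CGImage (e.g. uiImage.cgImage); the function name and chunk count are illustrative:

import CoreGraphics
import Dispatch

// Draw the image into a known RGBA8 layout, then count pixels with alpha == 0,
// splitting the rows into chunks that are processed in parallel.
func countTransparentPixels(in cgImage: CGImage) -> Int {
    let width = cgImage.width, height = cgImage.height
    guard let context = CGContext(data: nil,
                                  width: width, height: height,
                                  bitsPerComponent: 8,
                                  bytesPerRow: width * 4,
                                  space: CGColorSpaceCreateDeviceRGB(),
                                  bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue)
    else { return 0 }
    context.draw(cgImage, in: CGRect(x: 0, y: 0, width: width, height: height))
    guard let data = context.data?.assumingMemoryBound(to: UInt8.self) else { return 0 }
    let bytesPerRow = context.bytesPerRow

    let chunks = 8
    let rowsPerChunk = (height + chunks - 1) / chunks
    var counts = [Int](repeating: 0, count: chunks)
    counts.withUnsafeMutableBufferPointer { buffer in
        DispatchQueue.concurrentPerform(iterations: chunks) { chunk in
            let startRow = min(chunk * rowsPerChunk, height)
            let endRow = min(startRow + rowsPerChunk, height)
            var count = 0
            for row in startRow..<endRow {
                for col in 0..<width where data[row * bytesPerRow + col * 4 + 3] == 0 {
                    count += 1
                }
            }
            buffer[chunk] = count
        }
    }
    return counts.reduce(0, +)
}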

A last, but probably overkill solution would be to use Accelerate (this would make @FlexMonkey happy): Load the image's pixel data into a vDSP buffer and use the sum or average methods to calculate the percentage using the CPU's vector units.
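
A rough sketch of that last idea, assuming you already have the alpha channel as a planar 8-bit buffer (see the interleaved-to-planar note in the comments below) and purely 0/255 alpha values:

import Accelerate

// Sum the alpha bytes with vDSP; for 0/255 values, sum / 255 is the number
// of opaque pixels.
func opaquePixelCount(alphaBytes: [UInt8]) -> Int {
    var values = [Double](repeating: 0, count: alphaBytes.count)
    vDSP_vfltu8D(alphaBytes, 1, &values, 1, vDSP_Length(alphaBytes.count)) // UInt8 -> Double
    var sum = 0.0
    vDSP_sveD(values, 1, &sum, vDSP_Length(values.count))                  // vector sum
    return Int((sum / 255.0).rounded())
}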

Clarification

When I was saying that a reduction operation is "not necessarily well-suited for the GPU", I meant to say that it's rather complicated to implement in an efficient way and by far not as straightforward as a sequential algorithm.

The check whether a pixel is transparent or not can be done in parallel, sure, but the results need to be gathered into a single value, which requires multiple GPU cores reading and writing values into the same memory. This usually requires some synchronization (and thereby hinders parallel execution) and incurs latency cost due to access to the shared or global memory space. That's why efficient gather algorithms for the GPU usually follow a multi-step tree-based approach. I can highly recommend reading NVIDIA's publications on the topic (e.g. here and here). That's also why I recommended using built-in APIs when possible since Apple's Metal team knows how to best optimize these algorithms for their hardware.

There is also an example reduction implementation in Apple's Metal Shading Language Specification (pp. 158) that uses simd_shuffle intrinsics for efficiently communicating intermediate values down the tree. The general principle is the same as described by NVIDIA's publications linked above, though.

Frank Rupprecht
  • I would add that if you're using Accelerate to effectively do a `popcount` on the alpha channel and your image data is interleaved (i.e. RGBARGBA... rather than planar buffers for each color), there is an overhead to using a non-unit stride. You can easily convert interleaved to planar (see: https://developer.apple.com/documentation/accelerate/optimizing_image-processing_performance) and call `vDSP_sve` on the alpha buffer. – Flex Monkey Jun 21 '21 at 08:32
  • ...also, Apple has a nice article that discusses integrating Accelerate into a Core Image workflow: https://developer.apple.com/documentation/accelerate/reading_from_and_writing_to_core_video_pixel_buffers – Flex Monkey Jun 21 '21 at 08:39
  • Thanks, Simon! Adding to that: if you only really need a binary mask, Roi, you might want to consider using a single-channel drawing target for it (using `kCGImageAlphaOnly` as bitmap info, for instance). Then you don't need the "interleaved to planar" step Simon mentioned above. – Frank Rupprecht Jun 21 '21 at 09:09
  • Counting pixels is indeed a massively parallel operation, so I do not understand why you state it is not suited for a GPU. In fact, just about any pixel-based operation is suitable for the GPU since it can be broken up into individual pixels or kernels. – Jeshua Lacock Jun 22 '21 at 05:47
  • You are right, I was a bit too fuzzy with my wording. I added a clarification to my answer. – Frank Rupprecht Jun 24 '21 at 13:17
  • Also, you amended your answer, but it starts out stating demonstrably false information. – Jeshua Lacock Jun 24 '21 at 20:09
  • I'm glad that you found a fast solution, Jeshua. But I still don't think that my claim that a reduction operation is not an inherently good match for a SIMD device like the GPU is false. There are definitely ways to implement that in an efficient manner (see Apple's example I added to my answer), but it is not straightforward. That's why I recommended using built-in high-level APIs to do this when possible. – Frank Rupprecht Jun 25 '21 at 09:31
  • If it works well enough for realtime applications, it is a good match for the GPU, IMHO. If you had source code, I could compare performance, otherwise it is all just theory. – Jeshua Lacock Jun 25 '21 at 18:29
0

If the image contains semi-transparent pixels, it can easily be preprocessed to make all pixels with alpha below a certain threshold fully transparent, and all others fully opaque. Then CIAreaAverage can be applied, as was originally suggested in the question, and finally the approximate number of fully opaque pixels can be calculated by multiplying the alpha component of the result by the image size in pixels.

For pre-processing we could use a trivial CIColorKernel like this:

half4 clampAlpha(coreimage::sample_t color) {
    half4 out = half4(color);
    out.a = step(half(0.99), out.a);
    return  out;
}

(Choose whatever threshold you like instead of 0.99)
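
Loading and applying that kernel from Swift, and chaining it into CIAreaAverage, could look roughly like this (a sketch; it assumes the kernel was compiled into "default.metallib" with the Core Image compiler flags, and that inputImage is your CIImage):

import CoreImage

// Build the color kernel from the compiled Metal library.
let url = Bundle.main.url(forResource: "default", withExtension: "metallib")!
let data = try! Data(contentsOf: url)
let clampAlpha = try! CIColorKernel(functionName: "clampAlpha", fromMetalLibraryData: data)

// Threshold the alpha, then average everything down to a single pixel.
let thresholded = clampAlpha.apply(extent: inputImage.extent, arguments: [inputImage])!
let output = CIFilter(name: "CIAreaAverage", parameters: [
    kCIInputImageKey: thresholded,
    kCIInputExtentKey: CIVector(cgRect: inputImage.extent)
])!.outputImage!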

To get the alpha component out of the output of CIAreaAverage we could do something like this:

let context = CIContext(options: [.workingColorSpace: NSNull(), .outputColorSpace: NSNull()])
var color: [Float] = [0, 0, 0, 0]
context.render(output,
               toBitmap: &color,
               rowBytes: MemoryLayout<Float>.size * 4,
               bounds: CGRect(origin: .zero, size: CGSize(width: 1, height: 1)),
               format: .RGBAf,
               colorSpace: nil)

// color[3] contains alpha component of the result
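
From there, the approximate pixel counts follow directly (a sketch, reusing the assumed inputImage from above):

let totalPixels = inputImage.extent.width * inputImage.extent.height
let opaquePixels = Int((CGFloat(color[3]) * totalPixels).rounded())
let transparentPixels = Int(totalPixels) - opaquePixels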

With that approach everything is done on the GPU while taking advantage of its inherent parallelism.

BTW, check out this app: https://apps.apple.com/us/app/filter-magic/id1594986951. It lets you play with every single Core Image filter out there.

Vadim Dagman