Counting coloured pixels on the GPU - Theory

Question

I have a 128 by 128 pixel image.

It's broken down into an 8 by 8 grid.

Each grid block contains 16 by 16 pixels.

Requirement

I want to count how many black pixels my image contains.

The straight forward way:

I could do this by going row by row, column by column, over the whole image and checking if the pixel was black or not.

The GPU way

...but I'd like to know if using the GPU, I could break down the image into chunks/blocks and count all pixels in each block and then sum the results.

For example:

If you look at the top left of the image:

First block, 'A1' (Row A, Column 1) contains a grid of 16 by 16 pixels, I know by counting them manually, there are 16 blacks pixels.

Second block: 'A2', (Row A, Column 2) contains a grid of 16 by 16 pixels, I know by counting them manually, there are 62 blacks pixels.

All other blocks for this example are blank/empty.

If I ran my image through my program, I should get the answer: 16 + 62 = 78 Black pixels.

Reasoning

It's my understanding that the GPU can operate on a lot of data in parallel, effectively running a small program on a chunk of data spread across multiple GPU threads. I'm not worried about speed/performance, I'd just like to know if this is something the GPU can/could do?

This would appear to be a special case of a histogram, something GPUs are well suited for. Try a search for "GPU" + "hsitogram" with your favorite internet search engine. — njuffa, Jun 03 '17 at 17:59
Related: https://stackoverflow.com/questions/15782325/cuda-programming-histogram — sanmai, Jun 23 '17 at 03:18
Even more related: https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/ — sanmai, Jun 23 '17 at 03:18
You know you can do the same in the CPU, right? Nothing prevents you from spawning a thread for each rect and then adding the results. — Fernando Mazzon, Jun 23 '17 at 12:57
@FernandoMazzon - Sure, but I'm specifically looking for an answer that uses the GPU. — Chris, Jun 23 '17 at 19:54
This is tagged as `swift`. What platform are you working on? — pevasquez, Jun 25 '17 at 16:51
@Chris please consider marking my answer as correct if it was helpful. — pevasquez, Jun 30 '17 at 22:34

score 6 · Accepted Answer · answered Jun 26 '17 at 15:59

Indeed, General Purpose GPUs (such as those in Apple devices from the A8 on, for example) are not only capable but also intended to be able to solve such parallel data processing problems.

Apple introduced Data-parallel-processing using Metal in their platforms, and with some simple code you can solve problems like yours using the GPU. Even if this can also be done using other frameworks, I am including some code for the Metal+Swift case as proof of concept.

The following runs as a Swift command line tool on OS X Sierra, and was built using Xcode 9 (yup, I know it's beta). You can get the full project from my github repo.

As main.swift:

import Foundation
import Metal
import CoreGraphics
import AppKit

guard FileManager.default.fileExists(atPath: "./testImage.png") else {
    print("./testImage.png does not exist")
    exit(1)
}

let url = URL(fileURLWithPath: "./testImage.png")
let imageData = try Data(contentsOf: url)

guard let image = NSImage(data: imageData),
    let imageRef = image.cgImage(forProposedRect: nil, context: nil, hints: nil) else {
    print("Failed to load image data")
    exit(1)
}

let bytesPerPixel = 4
let bytesPerRow = bytesPerPixel * imageRef.width

var rawData = [UInt8](repeating: 0, count: Int(bytesPerRow * imageRef.height))

let bitmapInfo = CGBitmapInfo(rawValue: CGImageAlphaInfo.premultipliedFirst.rawValue).union(.byteOrder32Big)
let colorSpace = CGColorSpaceCreateDeviceRGB()

let context = CGContext(data: &rawData,
                        width: imageRef.width,
                        height: imageRef.height,
                        bitsPerComponent: 8,
                        bytesPerRow: bytesPerRow,
                        space: colorSpace,
                        bitmapInfo: bitmapInfo.rawValue)

let fullRect = CGRect(x: 0, y: 0, width: CGFloat(imageRef.width), height: CGFloat(imageRef.height))
context?.draw(imageRef, in: fullRect, byTiling: false)

// Get access to iPhone or iPad GPU
guard let device = MTLCreateSystemDefaultDevice() else {
    exit(1)
}

let textureDescriptor = MTLTextureDescriptor.texture2DDescriptor(
    pixelFormat: .rgba8Unorm,
    width: Int(imageRef.width),
    height: Int(imageRef.height),
    mipmapped: true)

let texture = device.makeTexture(descriptor: textureDescriptor)

let region = MTLRegionMake2D(0, 0, Int(imageRef.width), Int(imageRef.height))
texture.replace(region: region, mipmapLevel: 0, withBytes: &rawData, bytesPerRow: Int(bytesPerRow))

// Queue to handle an ordered list of command buffers
let commandQueue = device.makeCommandQueue()

// Buffer for storing encoded commands that are sent to GPU
let commandBuffer = commandQueue.makeCommandBuffer()

// Access to Metal functions that are stored in Shaders.metal file, e.g. sigmoid()
guard let defaultLibrary = device.makeDefaultLibrary() else {
    print("Failed to create default metal shader library")
    exit(1)
}

// Encoder for GPU commands
let computeCommandEncoder = commandBuffer.makeComputeCommandEncoder()

// hardcoded to 16 for now (recommendation: read about threadExecutionWidth)
var threadsPerGroup = MTLSize(width:16, height:16, depth:1)
var numThreadgroups = MTLSizeMake(texture.width / threadsPerGroup.width,
                                  texture.height / threadsPerGroup.height,
                                  1);

// b. set up a compute pipeline with Sigmoid function and add it to encoder
let countBlackProgram = defaultLibrary.makeFunction(name: "countBlack")
let computePipelineState = try device.makeComputePipelineState(function: countBlackProgram!)
computeCommandEncoder.setComputePipelineState(computePipelineState)


// set the input texture for the countBlack() function, e.g. inArray
// atIndex: 0 here corresponds to texture(0) in the countBlack() function
computeCommandEncoder.setTexture(texture, index: 0)

// create the output vector for the countBlack() function, e.g. counter
// atIndex: 1 here corresponds to buffer(0) in the Sigmoid function
var counterBuffer = device.makeBuffer(length: MemoryLayout<UInt32>.size,
                                        options: .storageModeShared)
computeCommandEncoder.setBuffer(counterBuffer, offset: 0, index: 0)

computeCommandEncoder.dispatchThreadgroups(numThreadgroups, threadsPerThreadgroup: threadsPerGroup)

computeCommandEncoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

// a. Get GPU data
// outVectorBuffer.contents() returns UnsafeMutablePointer roughly equivalent to char* in C
var data = NSData(bytesNoCopy: counterBuffer.contents(),
                  length: MemoryLayout<UInt32>.size,
                  freeWhenDone: false)
// b. prepare Swift array large enough to receive data from GPU
var finalResultArray = [UInt32](repeating: 0, count: 1)

// c. get data from GPU into Swift array
data.getBytes(&finalResultArray, length: MemoryLayout<UInt>.size)

print("Found \(finalResultArray[0]) non-white pixels")

// d. YOU'RE ALL SET!

Also, in Shaders.metal:

#include <metal_stdlib>
using namespace metal;

kernel void
countBlack(texture2d<float, access::read> inArray [[texture(0)]],
           volatile device uint *counter [[buffer(0)]],
           uint2 gid [[thread_position_in_grid]]) {

    // Atomic as we need to sync between threadgroups
    device atomic_uint *atomicBuffer = (device atomic_uint *)counter;
    float3 inColor = inArray.read(gid).rgb;
    if(inColor.r != 1.0 || inColor.g != 1.0 || inColor.b != 1.0) {
        atomic_fetch_add_explicit(atomicBuffer, 1, memory_order_relaxed);
    }
}

I used the question to learn a bit about Metal and data-parallel computing, so most of the code was used as boilerplate from articles online and edited. Please take the time to visit the sources mentioned below for some more examples. Also, the code is pretty much hardcoded for this particular problem, but you shouldn't have a lot of trouble adapting it.

Sources:

http://flexmonkey.blogspot.com.ar/2016/05/histogram-equalisation-with-metal.html

http://metalbyexample.com/introduction-to-compute/

http://memkite.com/blog/2014/12/15/data-parallel-programming-with-metal-and-swift-for-iphoneipad-gpu/

Is the default behavior of Metal Shader is not parallel? – Roi Mulia Jun 20 '21 at 09:25 — Roi Mulia, Jun 20 '21 at 09:25

score 3 · Answer 2 · answered Jun 26 '17 at 00:34

Your Question: I'd just like to know if this is something the GPU can/could do?

The Answer: Yes, GPU can handle your computation. All the numbers look very GPU friendly:

warp size: 32 (16x2)
Maximum number of threads per block: 1024 (8x128) (8x8x16)
Maximum number of threads per multiprocessor: 2048 ...etc

You could try many block/thread configurations to get an optimum performance.

The Procedure: Generally, using the GPU means that you copy data from CPU memory to GPU memory, you then perform calculations on the GPU, and finally you copy back the result to the CPU for further calculations. An important think to consider is that all this data transfer is done through the PCI-e link between CPU and GPU, that is very slow compared to both.

My Opinion: In this case, by the time it takes to copy the image to the GPU memory, you would have the result even if you used a lone CPU thread of computation. This is due because your process is not math/computationally intensive. You are just reading the data and comparing it to a black color and then adding an accumulator or counter to get a total (which itself raises a race condition that you'd have to resolve).

My Advice: If after analyzing (profiling) your whole program you think that this routine of getting the black pixel count is a real bottleneck, try:

a divide-and-conquer recursive algorithm, or
parallelizing your calculations in the multiple CPU cores.

score 2 · Answer 3 · answered Jun 05 '17 at 14:33

There is a lot a GPU can do here.

I am not sure if you are looking for an algorithm here but I can point you to a wide used GPU library which implements an efficient counting procedure. Take a look at the count function within the thrust library: https://thrust.github.io/doc/group__counting.html

It works taking as an input a predicate function. It counts the number of occurrences of the input data which satisfy the predicate.

The follows counts the number of elements in data which are equals to zero.

template <typename T>
struct zero_pixel{
  __host__ __device__ bool operator()(const T &x) const {return x == 0;}
};
thrust::count_if(data.begin(), data.end(), zero_pixel<T>())

A working example here: https://github.com/thrust/thrust/blob/master/testing/count.cu

You should code a predicate which tests whether a pixel is black or not (depending on what a pixel is for you (it could be an RGB triplet and in this case, the predicate should be a bit more elaborate).

I would also linearize the pixels into a linear and iterable data structure (but that depends on what your data actually is).

If you are interested in the histogram approach what you can do is sort the pixels of the image (using any GPU efficient algorithm or, why not the thrust implementation of sort, thrust::sort(...) ) data in order to group equal elements together and then perform a reduction by key thrust::reduce_by_key.

Take a look at this example: https://github.com/thrust/thrust/blob/master/examples/histogram.cu

Note that the histogram approach is a bit more costly because it solves a bigger problem (counts the number of occurrences of all the unique elements).

Counting coloured pixels on the GPU - Theory

3 Answers3

Linked