I am aware of MPSImageHistogram, but I'd like to implement the algorithm myself to understand Metal better. However, I run into a thread synchronization problem when many threads try to accumulate pixel values into the same histogram bins, and I have no clue how to solve it. On iOS, I think I have a couple of viable options, including programmable blending and thread group sharing. Unfortunately, as far as I can tell, those are not available on macOS.
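To make the problem concrete, here is a minimal CPU-side sketch in C++ (not Metal) of the accumulation I have in mind. The names `histogram_atomic` and `kBins` are made up for illustration, and I am assuming 8-bit single-channel pixel values. Several threads increment shared bins, and an atomic add keeps concurrent increments to the same bin from being lost. With a plain `bins[v] += 1`, two threads hitting the same bin could overwrite each other's update, which is the race I am running into.

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kBins = 256;  // one bin per 8-bit value (assumption)

// Hypothetical helper: counts 8-bit pixel values into kBins bins using
// num_threads CPU threads that all write to the same shared bins.
std::array<std::uint32_t, kBins> histogram_atomic(
    const std::vector<std::uint8_t>& pixels, unsigned num_threads) {
    assert(num_threads > 0);

    // Shared bins. Each increment is an atomic read-modify-write.
    std::array<std::atomic<std::uint32_t>, kBins> bins;
    for (auto& b : bins) b.store(0, std::memory_order_relaxed);

    // Split the pixels into contiguous chunks, one per thread.
    const std::size_t chunk = (pixels.size() + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(pixels.size(), t * chunk);
        const std::size_t end = std::min(pixels.size(), begin + chunk);
        workers.emplace_back([&pixels, &bins, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                // Only the count matters, so relaxed ordering is enough here.
                bins[pixels[i]].fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();

    // Copy the atomic counters into a plain array for the caller.
    std::array<std::uint32_t, kBins> result{};
    for (std::size_t i = 0; i < kBins; ++i) {
        result[i] = bins[i].load(std::memory_order_relaxed);
    }
    return result;
}
```

Calling `histogram_atomic(pixels, 4)` on a buffer of 8-bit values returns the per-value counts. What I am looking for is the equivalent of that atomic increment, or a better alternative, inside a Metal compute kernel on macOS.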
I would appreciate any general tips or directions for approaching the problem on macOS, whether for thread synchronization or for computing an image histogram.