When working with Metal shaders / compute kernels on iOS or macOS, MTLComputePipelineState has a limit of maxTotalThreadsPerThreadgroup.
This limit can be queried after the pipeline state is created. It depends on the GPU hardware characteristics, the OS version, and your Metal kernel code.
- What aspects of Metal kernel code impact MTLComputePipelineState's maxTotalThreadsPerThreadgroup?
- What can be done to increase the value given a fixed hardware / OS combination?
For example:
- Register usage?
- Length of code?
- Forced inlining?
(The question isn't how to calculate the optimal sizes, it's about how to modify code to achieve the largest threadgroup.)
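To make the question concrete, here is a minimal sketch of how I'm querying the limit and comparing kernel variants. It assumes runtime compilation from source via makeLibrary(source:options:). The helper threadgroupLimit, the KernelNotFound error, and the two kernels "light" and "heavy" are made-up names for illustration only. The kernels are compiled but never dispatched, and whether the two reported values actually differ depends on the device, the OS, and the compiler.

```swift
import Metal

// Hypothetical error type for this sketch.
struct KernelNotFound: Error {}

// Hypothetical helper: compiles MSL source at runtime, builds a compute
// pipeline for the named kernel, and returns the limit for that pipeline.
func threadgroupLimit(device: MTLDevice, source: String, kernelName: String) throws -> Int {
    let library = try device.makeLibrary(source: source, options: nil)
    guard let function = library.makeFunction(name: kernelName) else {
        throw KernelNotFound()
    }
    let pipeline = try device.makeComputePipelineState(function: function)
    // The limit is only known once the pipeline state exists.
    return pipeline.maxTotalThreadsPerThreadgroup
}

// Variant 1: trivial kernel with very little per-thread state.
let lightSource = """
#include <metal_stdlib>
using namespace metal;
kernel void light(device float *data [[buffer(0)]],
                  uint id [[thread_position_in_grid]]) {
    data[id] = data[id] * 2.0f;
}
"""

// Variant 2: kernel with a per-thread array, intended to raise per-thread
// resource usage (an assumption; the compiler may optimize it differently).
let heavySource = """
#include <metal_stdlib>
using namespace metal;
kernel void heavy(device float *data [[buffer(0)]],
                  uint id [[thread_position_in_grid]]) {
    float acc[64];
    for (uint i = 0; i < 64; ++i) {
        acc[i] = data[id] * float(i + 1);
    }
    float sum = 0.0f;
    for (uint i = 0; i < 64; ++i) {
        sum += acc[i] * acc[(i * 7 + 3) % 64];
    }
    data[id] = sum;
}
"""

guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("No Metal device available")
}

do {
    for (name, source) in [("light", lightSource), ("heavy", heavySource)] {
        let limit = try threadgroupLimit(device: device, source: source, kernelName: name)
        print("\(name): maxTotalThreadsPerThreadgroup = \(limit)")
    }
} catch {
    print("Failed to build pipeline: \(error)")
}
```

Comparing variants like this on a fixed device / OS is the only way I have found so far to see which code changes move the value.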
Link to Apple's docs for MTLComputePipelineState: https://developer.apple.com/documentation/metal/mtlcomputepipelinestate/1414927-maxtotalthreadsperthreadgroup
Link to Apple's docs for "Calculating Threadgroup and Grid Sizes": https://developer.apple.com/documentation/metal/calculating_threadgroup_and_grid_sizes?language=objc