0

I have a kernel Metal function which basically looks like this:

struct Matrix {
    half arr[562500]; //enough to store 750x750 matrix
};

struct Output {
    half arr[12288];
};


kernel void compute_features(device Output& buffer [[ buffer(0) ]],
                             const device Matrix& mtx_0 [[ buffer(1) ]],
                             const device Matrix& mtx_1 [[ buffer(2) ]],
                             constant short2& matSize [[ buffer(3) ]],
                             constant float& offset [[ buffer(4) ]],
                             ushort2 gid [[ thread_position_in_grid ]]) {


for (int i = 0; i < 12; i++) {
    for (int j = 0; j < 12; j++) {

        int mat_id = i * matSize.x + j;

        half matrixValue_0 = mtx_0.mat[mat_id];
        half matrixValue_1 = mtx_1.mat[mat_id] - offset;     

        short someId_0 = 0;
        short someId_1 = 0;
        short someId_2 = 0;
        short someId_3 = 0;  //those ids will be calculated at the code below
        half value = 0.h;  //this value will be calculated at the code below

        //some math where `someId` and `value` are calculated with usage of `matrixValue_0` and `matrixValue_1`

        if (some_condition0) {
            buffer.arr[someId_0] += value;
        }

        if (some_condition1) {
            buffer.arr[someId_1] += value;
        }

        if (some_condition2) {
            buffer.arr[someId_2] += value;
        }

        if (some_condition3) {
            buffer.arr[someId_3] += value;
        }
     }
}

I understand that this code has its down-sides - dynamic indexing and big loop. But unfortunately the algorithm I'm trying to express can not be implemented differently at that point.

Now, this code runs very good at iPhone 7+, it takes around 200 us per iteration, and I'm very happy with this number.

BUT, I tried to run the exact same algorithm on iPhone XR and I was surprised to see that this algorithm takes around 1.0-1.2 ms to complete.

With the help of XCode and it's magnificent GPU pipeline debugging tool I found out that my bottlenecks are:

1)

    half matrixValue_0 = mtx_0.mat[mat_id];
    half matrixValue_1 = mtx_1.mat[mat_id] - offset;

It seems that significant part of processing time are spent in Memory Load operation.

2)

if (some_condition0) {
    buffer[someId_0] += value;
}

if (some_condition1) {
    buffer[someId_1] += value;
}

if (some_condition2) {
    buffer[someId_2] += value;
}

if (some_condition3) {
    buffer[someId_3] += value;
}

The major processing time are spent for Memory Store operation.

For me it seems like iPhone XR quite struggles operating with device memory because bottle-necks are in places where I work with containers from device memory.

I understand that I'm using dynamic indexing - compiler can not really predict what address in the container will be loaded/stored in certain iteration. But the code works very good on iPhone 7+, but not on iPhone XR.

I suspect that it might have something to do with byte alignment. Can it be somehow related to that?

I would love to hear some suggestions on this. Thanks in advance!

Eugene Alexeev
  • 1,152
  • 12
  • 32
  • If you suspect that byte alignment plays a role, have you tried different alignments on your buffer? Also, have you played around with different threadgroup sizes? I'm not a Metal expert but one thing that I would maybe try is `buffer[someId_x] += value * condition;` instead of branching to avoid Metal having to execute unnecessary code paths. – Palle Dec 05 '19 at 20:54
  • Hello @Palle and thank you for your response! I suspected it but it seems that `Output` and `Matrix` are byte-aligned containers. So now I'm not sure that my suspicion is correct. It must be some different thing – Eugene Alexeev Dec 06 '19 at 08:12
  • Here is something you might try. Could you change the logic so that the write operations are not conditional? That might help with performance. What I mean is have a 4 local vars that are zero by default and then these are conditionally set in the if blocks. Then unconditionally increment the 4 writes so that when the if blocks are false the output is simply incremented by zero. – MoDJ Jan 11 '20 at 23:30

0 Answers0