
I am trying to compute the sum of a large array in parallel with Metal and Swift.

Is there a good way to do it?

My plan is to divide the array into sub-arrays, compute the sum of each sub-array in parallel, and then, when the parallel computation is finished, compute the sum of the sub-sums.

For example, if I have

array = [a_0, ..., a_n]

I divide the array into sub-arrays:

array_1 = [a_0, ..., a_i],
array_2 = [a_(i+1), ..., a_2i],
...
array_(n/i) = [a_(n-i+1), ..., a_n]

The sums of these sub-arrays are computed in parallel, and I get

sum_1, sum_2, sum_3, ..., sum_(n/i)

At the end I just compute the sum of the sub-sums.

I created an application which runs my Metal shader, but there are some things I don't quite understand.

        var array:[[Float]] = [[1,2,3], [4,5,6], [7,8,9]]

        // get device
        let device: MTLDevice! = MTLCreateSystemDefaultDevice()

        // get library
        let defaultLibrary:MTLLibrary! = device.newDefaultLibrary()

        // queue
        let commandQueue:MTLCommandQueue! = device.newCommandQueue()

        // function
        let kernerFunction: MTLFunction! = defaultLibrary.newFunctionWithName("calculateSum")

        // pipeline with function
        let pipelineState: MTLComputePipelineState! = try device.newComputePipelineStateWithFunction(kernerFunction)

        // buffer for function
        let commandBuffer:MTLCommandBuffer! = commandQueue.commandBuffer()

        // encode function
        let commandEncoder:MTLComputeCommandEncoder = commandBuffer.computeCommandEncoder()

        // add function to encode
        commandEncoder.setComputePipelineState(pipelineState)

        // options
        let resourceOption = MTLResourceOptions()

        let arrayBiteLength = array.count * array[0].count * sizeofValue(array[0][0])

        let arrayBuffer = device.newBufferWithBytes(&array, length: arrayBiteLength, options: resourceOption)

        commandEncoder.setBuffer(arrayBuffer, offset: 0, atIndex: 0)

        var result:[Float] = [0,0,0]

        let resultBiteLenght = sizeofValue(result[0])

        let resultBuffer = device.newBufferWithBytes(&result, length: resultBiteLenght, options: resourceOption)

        commandEncoder.setBuffer(resultBuffer, offset: 0, atIndex: 1)

        let threadGroupSize = MTLSize(width: 1, height: 1, depth: 1)

        let threadGroups = MTLSize(width: (array.count), height: 1, depth: 1)

        commandEncoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadGroupSize)

        commandEncoder.endEncoding()

        commandBuffer.commit()

        commandBuffer.waitUntilCompleted()

        let data = NSData(bytesNoCopy: resultBuffer.contents(), length: sizeof(Float), freeWhenDone: false)

        data.getBytes(&result, length: result.count * sizeof(Float))

        print(result)

This is my Swift code.

My shader is:

kernel void calculateSum(const device float *inFloat [[buffer(0)]],
                     device float *result [[buffer(1)]],
                     uint id [[ thread_position_in_grid ]]) {


    float * f = inFloat[id];
    float sum = 0;
    for (int i = 0 ; i < 3 ; ++i) {
        sum = sum + f[i];
    }

    result = sum;
}

I don't know how to define that inFloat is an array of arrays. I don't know exactly what threadGroupSize and threadGroups are. And I don't know what device and uint are in the shader parameters.

Is this right approach?

asked by Marko Zadravec; edited by Jonathan Hall

3 Answers


I took the time to create a fully working example of this problem with Metal. The explanation is in the comments:

let count = 10_000_000
let elementsPerSum = 10_000

// Data type, has to be the same as in the shader
typealias DataType = CInt

let device = MTLCreateSystemDefaultDevice()!
let library = device.makeDefaultLibrary()! // the shader is compiled into the default library
let parsum = library.makeFunction(name: "parsum")!
let pipeline = try! device.makeComputePipelineState(function: parsum)

// Our data, randomly generated:
var data = (0..<count).map{ _ in DataType(arc4random_uniform(100)) }
var dataCount = CUnsignedInt(count)
var elementsPerSumC = CUnsignedInt(elementsPerSum)
// Number of individual results = count / elementsPerSum (rounded up):
let resultsCount = (count + elementsPerSum - 1) / elementsPerSum

// Our data in a buffer (copied):
let dataBuffer = device.makeBuffer(bytes: &data, length: MemoryLayout<DataType>.stride * count, options: [])!
// A buffer for individual results (zero initialized)
let resultsBuffer = device.makeBuffer(length: MemoryLayout<DataType>.stride * resultsCount, options: [])!
// Our results in convenient form to compute the actual result later:
let pointer = resultsBuffer.contents().bindMemory(to: DataType.self, capacity: resultsCount)
let results = UnsafeBufferPointer<DataType>(start: pointer, count: resultsCount)

let queue = device.makeCommandQueue()!
let cmds = queue.makeCommandBuffer()!
let encoder = cmds.makeComputeCommandEncoder()!

encoder.setComputePipelineState(pipeline)

encoder.setBuffer(dataBuffer, offset: 0, index: 0)

encoder.setBytes(&dataCount, length: MemoryLayout<CUnsignedInt>.size, index: 1)
encoder.setBuffer(resultsBuffer, offset: 0, index: 2)
encoder.setBytes(&elementsPerSumC, length: MemoryLayout<CUnsignedInt>.size, index: 3)

// The sum has to be computed `resultsCount` times => the number of threadgroups is `resultsCount` / `threadExecutionWidth` (rounded up), because each threadgroup processes `threadExecutionWidth` threads
let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)

// Each threadgroup processes `threadExecutionWidth` threads; for performance the only important thing is that this width is a multiple of `threadExecutionWidth` (here exactly 1 times)
let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)

encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()

var start, end : UInt64
var result : DataType = 0

start = mach_absolute_time()
cmds.commit()
cmds.waitUntilCompleted()
for elem in results {
    result += elem
}

end = mach_absolute_time()

print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
result = 0

start = mach_absolute_time()
data.withUnsafeBufferPointer { buffer in
    for elem in buffer {
        result += elem
    }
}
end = mach_absolute_time()

print("CPU result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")

I used my Mac to test it, but it should work just fine on iOS.

Output:

Metal result: 494936505, time: 0.024611456
CPU result: 494936505, time: 0.163341018

The Metal version is about 7 times faster. I'm sure you can get more speed if you implement something like divide-and-conquer with cutoff or whatever.

answered by Kametrixom; edited by Regexident
  • Thanks, this is really great. Just one question: if I understand this shader correctly, the sub-sum for each resultIndex will be calculated 1000 times (the same result)? Or do I misunderstand how the shader is called sequentially? – Marko Zadravec Jul 03 '16 at 05:17
  • @MarkoZadravec 10,000 is the number of elements each individual task sums up. This means if the data size is 1,000,000 there are going to be 100 tasks. The results array holds the sums from these 100 tasks. The shader only knows its task number (`resultIndex`) and how many items it needs to sum up (`elementsPerSum`): the task number says where in the results buffer it needs to store the sum, and by multiplying those two numbers it gets the first of the 10,000 elements it needs to sum up. I know this is not the best explanation, maybe you can read some articles to find out more – Kametrixom Jul 03 '16 at 05:25
  • No, I understand the arithmetic of how you get the start and end position (and I mistook 100 for 10,000). My question is: each block (thread) has size 10,000, and inside a block I calculate the start and end position and compute the sum. But is the shader called only once per block, or is it called for each element in the block? Because in the latter case we would calculate one sub-sum 10,000 times (the result would always be the same, because we calculate the same start and end position). If that is true, wouldn't it be better to have an array of arrays of integers instead of an array of integers? – Marko Zadravec Jul 03 '16 at 05:35
  • @MarkoZadravec the shader is called once per sub-sum/block with its corresponding index. This happens with the `dispatch` method, which dispatches threadgroups * threadsPerThreadgroup threads in total, giving each one its threadgroup index and thread_in_threadgroup index. An array of arrays is the same as a single array, since memory is just linear, not 2-dimensional. Also, in Swift an array is just a reference, so this wouldn't even work. – Kametrixom Jul 03 '16 at 05:41
  • Ok, thanks. I was confused because you get the whole array and indexes (and because you can't print to the log in a Metal shader), so I thought it was called n times. – Marko Zadravec Jul 03 '16 at 05:44
  • Is there a reason that you use CUnsignedInt? If I try to change it to either Int or CInt I get the wrong sum from the GPU. Is there a reason for this? – Marko Zadravec Jul 03 '16 at 16:53
  • @MarkoZadravec `DataType` has to be the same type as the shader uses => if you change it to e.g. CInt, you need to change it to `int` in the shader as well. Remember to always use the Swift equivalents with the `C` prefix, as Metal uses C data types. I updated my answer with `CInt` and fixed the types, have a look at [the changes](http://stackoverflow.com/posts/38165750/revisions) – Kametrixom Jul 03 '16 at 17:16
  • Kametrixom, thank you for your answer. I have some additional questions regarding this : http://stackoverflow.com/questions/38232640/swift-metal-parallel-sum-calculation-of-array-on-ios. – Marko Zadravec Jul 06 '16 at 19:55
  • You might get better performance if you change the line `sums[resultIndex] += data[dataIndex]; ` to sum the values in a local variable and then write that sum to `sums` in only one write operation. You would have less memory accesses. – Mathias Claassen Dec 07 '16 at 16:35
  • Using a multithreaded vector sum on an 8-thread CPU my timings are: Metal lap time: 177.420020, CPU MT lap time: 42.018056, i.e. the CPU is 4 times faster. But using GPU matrix arithmetic the GPU is 20 times faster in the equivalent operation. – roberto Mar 28 '19 at 18:08

The accepted answer is annoyingly missing the kernel that was written for it. The source is here, but below is the full program and shader that can be run as a Swift command-line application.

/*
 * Command line Metal Compute Shader for data processing
 */

import Metal
import Foundation
//------------------------------------------------------------------------------
let count = 10_000_000
let elementsPerSum = 10_000
//------------------------------------------------------------------------------
typealias DataType = CInt // Data type, has to be the same as in the shader
//------------------------------------------------------------------------------
let device = MTLCreateSystemDefaultDevice()!
let library = device.makeDefaultLibrary()!
let parsum = library.makeFunction(name: "parsum")!
let pipeline = try! device.makeComputePipelineState(function: parsum)
//------------------------------------------------------------------------------
// Our data, randomly generated:
var data = (0..<count).map{ _ in DataType(arc4random_uniform(100)) }
var dataCount = CUnsignedInt(count)
var elementsPerSumC = CUnsignedInt(elementsPerSum)
// Number of individual results = count / elementsPerSum (rounded up):
let resultsCount = (count + elementsPerSum - 1) / elementsPerSum
//------------------------------------------------------------------------------
// Our data in a buffer (copied):
let dataBuffer = device.makeBuffer(bytes: &data, length: MemoryLayout<DataType>.stride * count, options: [])!
// A buffer for individual results (zero initialized)
let resultsBuffer = device.makeBuffer(length: MemoryLayout<DataType>.stride * resultsCount, options: [])!
// Our results in convenient form to compute the actual result later:
let pointer = resultsBuffer.contents().bindMemory(to: DataType.self, capacity: resultsCount)
let results = UnsafeBufferPointer<DataType>(start: pointer, count: resultsCount)
//------------------------------------------------------------------------------
let queue = device.makeCommandQueue()!
let cmds = queue.makeCommandBuffer()!
let encoder = cmds.makeComputeCommandEncoder()!
//------------------------------------------------------------------------------
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(dataBuffer, offset: 0, index: 0)
encoder.setBytes(&dataCount, length: MemoryLayout<CUnsignedInt>.size, index: 1)
encoder.setBuffer(resultsBuffer, offset: 0, index: 2)
encoder.setBytes(&elementsPerSumC, length: MemoryLayout<CUnsignedInt>.size, index: 3)
//------------------------------------------------------------------------------
// The sum has to be computed `resultsCount` times => the number of threadgroups is `resultsCount` / `threadExecutionWidth` (rounded up), because each threadgroup processes `threadExecutionWidth` threads
let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)

// Each threadgroup processes `threadExecutionWidth` threads; for performance the only important thing is that this width is a multiple of `threadExecutionWidth` (here exactly 1 times)
let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)
//------------------------------------------------------------------------------
encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()
//------------------------------------------------------------------------------
var start, end : UInt64
var result : DataType = 0
//------------------------------------------------------------------------------
start = mach_absolute_time()
cmds.commit()
cmds.waitUntilCompleted()
for elem in results {
    result += elem
}

end = mach_absolute_time()
//------------------------------------------------------------------------------
print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
//------------------------------------------------------------------------------
result = 0

start = mach_absolute_time()
data.withUnsafeBufferPointer { buffer in
    for elem in buffer {
        result += elem
    }
}
end = mach_absolute_time()

print("CPU result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
//------------------------------------------------------------------------------
The Metal shader:

#include <metal_stdlib>
using namespace metal;

typedef unsigned int uint;
typedef int DataType;

kernel void parsum(const device DataType* data [[ buffer(0) ]],
                   const device uint& dataLength [[ buffer(1) ]],
                   device DataType* sums [[ buffer(2) ]],
                   const device uint& elementsPerSum [[ buffer(3) ]],
                   
                   const uint tgPos [[ threadgroup_position_in_grid ]],
                   const uint tPerTg [[ threads_per_threadgroup ]],
                   const uint tPos [[ thread_position_in_threadgroup ]]) {
    
    uint resultIndex = tgPos * tPerTg + tPos;
    
    uint dataIndex = resultIndex * elementsPerSum; // Where the summation should begin
    uint endIndex = dataIndex + elementsPerSum < dataLength ? dataIndex + elementsPerSum : dataLength; // The index where summation should end
    
    for (; dataIndex < endIndex; dataIndex++)
        sums[resultIndex] += data[dataIndex];
}

Objective-C

The same Swift command-line programme, but in Objective-C

#import <Foundation/Foundation.h>
#import <Metal/Metal.h>
#import <mach/mach_time.h> // for mach_absolute_time()

typedef int DataType;

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        unsigned int count = 10000000;
        unsigned int elementsPerSum = 10000;
        //----------------------------------------------------------------------
        id<MTLDevice> device  = MTLCreateSystemDefaultDevice();
        id<MTLLibrary>library = [device newDefaultLibrary];
        
        id<MTLFunction>parsum = [library newFunctionWithName:@"parsum"];
        id<MTLComputePipelineState> pipeline = [device newComputePipelineStateWithFunction:parsum error:nil];
        //----------------------------------------------------------------------
        DataType* data = (DataType*) malloc(sizeof(DataType) * count);
        for (int i = 0; i < count; i++){
            data[i] = arc4random_uniform(100);
        }
        unsigned int dataCount = count;
        unsigned int elementsPerSumC = elementsPerSum;
        unsigned int resultsCount = (count + elementsPerSum - 1) / elementsPerSum;
        //------------------------------------------------------------------------------
        id<MTLBuffer>dataBuffer = [device newBufferWithBytes:data
                                                      length:(sizeof(DataType) * count)
                                                     options:MTLResourceStorageModeManaged];
        
        id<MTLBuffer>resultsBuffer = [device newBufferWithLength:(sizeof(DataType) * resultsCount)
                                                         options:0];
        
        DataType* results = resultsBuffer.contents;
        //----------------------------------------------------------------------
        id<MTLCommandQueue>queue = [device newCommandQueue];
        id<MTLCommandBuffer>cmds = [queue commandBuffer];
        id<MTLComputeCommandEncoder> encoder = [cmds computeCommandEncoder];
        //----------------------------------------------------------------------
        [encoder setComputePipelineState:pipeline];
        [encoder setBuffer:dataBuffer offset:0 atIndex:0];
        [encoder setBytes:&dataCount length:sizeof(unsigned int) atIndex:1];
        [encoder setBuffer:resultsBuffer offset:0 atIndex:2];
        [encoder setBytes:&elementsPerSumC length:sizeof(unsigned int) atIndex:3];
        //----------------------------------------------------------------------
        MTLSize threadgroupsPerGrid =
        {
            (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth,
            1,
            1
        };
        
        MTLSize threadsPerThreadgroup =
        {
            pipeline.threadExecutionWidth,
            1,
            1
        };
        //----------------------------------------------------------------------
        [encoder dispatchThreadgroups:threadgroupsPerGrid threadsPerThreadgroup:threadsPerThreadgroup];
        [encoder endEncoding];
        //----------------------------------------------------------------------
        uint64_t start, end;
        DataType result = 0;
        
        start = mach_absolute_time();
        [cmds commit];
        [cmds waitUntilCompleted];

        for (int i = 0; i < resultsCount; i++){
            result += results[i];
        }

        end = mach_absolute_time();

        NSLog(@"Metal Result %d. time %f", result, (float)(end - start)/(float)(NSEC_PER_SEC));
        //----------------------------------------------------------------------
        result = 0;

        start = mach_absolute_time();

        for (int i = 0; i < count; i++){
            result += data[i];
        }

        end = mach_absolute_time();
        NSLog(@"CPU Result %d. time %f", result, (float)(end - start)/(float)(NSEC_PER_SEC));

        //------------------------------------------------------------------------------
        free(data);
    }
    return 0;
}

answered by fdcpp
  • Thank you for this! I found that you now need to use `let device = MTLCopyAllDevices()[0]` when setting the device, because the compiler complains that the program is non-interactive when using `MTLCreateSystemDefaultDevice()`. – Blark Dec 20 '22 at 18:00

I've been running the app on a GT 740 (384 cores) vs. an i7-4790 with a multithreaded vector sum implementation, and here are my figures:

Metal lap time: 19.959092
CPU MT lap time: 4.353881

That's a 5:1 ratio in favor of the CPU, so unless you have a powerful GPU, using shaders is not worth it.

I've been testing the same code on an i7-3610QM with an Intel HD 4000 iGPU, and surprisingly the results are much better for Metal: 2:1.

Edit: after tweaking the thread parameters I've finally improved GPU performance; it's now up to 16× the CPU.

answered by roberto