1

I am trying to speed up my current implementation of a function that converts [UInt32] to a [UInt8] which in turn is split up into [[UInt8]] with 6 arrays at each index.

My implementation:

extension Array {
func splitBy(subSize: Int) -> [[Element]] {
    return 0.stride(to: self.count, by: subSize).map { startIndex in
        let endIndex = startIndex.advancedBy(subSize, limit: self.count)
        return Array(self[startIndex ..< endIndex])
    }
  }
}



func convertWordToBytes(fullW : [UInt32]) -> [[UInt8]] {
    var combined8 = [UInt8]()

    //Convert 17 [UInt32] to 68 [UInt8]
    for i in 0...16{
        _ = 24.stride(through: 0, by: -8).map {
            combined8.append(UInt8(truncatingBitPattern: fullW[i] >> UInt32($0)))
        }
    }

    //Split [UInt8] to [[UInt8]] with 6 values at each index.
    let combined48 = combined8.splitBy(6) 

    return combined48
}

This function will be iterated millions of times in my program and its speed is a huge burden.

Anyone got any ideas? Thanks

p0ppy
  • 13
  • 3
  • 1
    You may prefer to post this at http://codereview.stackexchange.com – rmaddy Oct 28 '16 at 16:08
  • Your code is in Swift 2. Do you want to maintain it as Swift 2 or update to Swift 3 at the same time? – Code Different Oct 28 '16 at 16:59
  • This computer is too old so need to keep it in Swift 2 as for now. – p0ppy Oct 28 '16 at 18:01
  • use vector math/simd libs in accelerate framework you might be able hook in the broken but useable opencl implementation. this kernel would be small and crush uint8s easily in the pipeline much faster than cpu threading – μολὼν.λαβέ Mar 14 '17 at 00:12

1 Answers1

1

If you profile (Cmd + I) your code, you will see that most of the time is on various "copy to buffer" functions. This happens when you append a new element to the array but it has run out of its initial allocated space so it must be moved to a location on the heap with more memory. Morals of the lesson: heap allocation is slow but unavoidable with arrays. Do it as few times as possible.

Try this:

func convertWordToBytes2(fullW: [UInt32]) -> [[UInt8]] {
    let subSize = 6

    // We allocate the array only once per run since allocation is so slow
    // There will only be assignment to it after
    var combined48 = [UInt8](count: fullW.count * 4, repeatedValue: 0).splitBy(subSize)

    var row = 0
    var col = 0

    for i in 0...16 {
        for j in 24.stride(through: 0, by: -8) {
            let value = UInt8(truncatingBitPattern: fullW[i] >> UInt32(j))
            combined48[row][col] = value

            col += 1
            if col >= subSize {
                row += 1
                col = 0
            }
        }
    }

    return combined48
}

Benchmark code:

let testCases = (0..<1_000_000).map { _ in
    (0..<17).map { _ in arc4random() }
}

testCases.forEach {
    convertWordToBytes($0)
    convertWordToBytes2($0)
}

Result (on my 2012 iMac)

Weight          Self Weight         Symbol Name
9.35 s   53.2%  412.00 ms           specialized convertWordToBytes([UInt32]) -> [[UInt8]]
3.28 s   18.6%  344.00 ms           specialized convertWordToBytes2([UInt32]) -> [[UInt8]]

By eliminating multiple allocations, we already reduced the run time by 60%. But each test case is independent, which lends itself perfectly to parallel processing with today's multi-core CPU. A modified loop...:

dispatch_apply(testCases.count, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0)) { i in
    convertWordToBytes2(testCases[i])
}

... will shave about 1 second off the wall time when executed on my quad-core i7 with 8 threads:

Weight    Self Weight       Symbol Name
2.28 s    6.4%  0 s         _dispatch_worker_thread3  0x58467
2.24 s    6.3%  0 s         _dispatch_worker_thread3  0x58463
2.22 s    6.2%  0 s         _dispatch_worker_thread3  0x58464
2.21 s    6.2%  0 s         _dispatch_worker_thread3  0x58466
2.21 s    6.2%  0 s         _dispatch_worker_thread3  0x58465
2.21 s    6.2%  0 s         _dispatch_worker_thread3  0x58461
2.18 s    6.1%  0 s         _dispatch_worker_thread3  0x58462

The time saving is not as much as I hoped for. Apparently there's some contention when accessing the heap memory. For anything even faster, you should explore a C-based solution.

Code Different
  • 90,614
  • 16
  • 144
  • 163
  • Thank you very much for doing this! This made my code so much faster. Special thanks for the benchmarking and explanations. – p0ppy Oct 31 '16 at 10:41