I hadn't expected much when I posted this question, but it turns out that for large matrices, such as 128x10000 for example, there is a much better way. I wrote a warpReduce kernel with a block size of 32, which lets it do the whole reduction using shuffle xor.
For a 128x100000 matrix over 100 iterations, the first version, which used 64 blocks per grid (and 32 threads per block), took 0.5s. The CUB row reduce took 0.25s.
When I increased the blocks per grid to 256, I got a nearly 4x speedup, to about 0.15s. At 384 blocks per grid it takes 0.11s, and increasing the number of blocks further does not seem to improve performance from there.
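My guess as to why it flattens out around that point: the GTX 970 has 13 SMs and Maxwell allows up to 32 resident blocks per SM, so roughly 13 * 32 = 416 of these one-warp blocks can be resident at once. Somewhere around there the card is fully occupied, and adding more blocks just shortens the grid-stride loop over the columns per block without putting more work in flight.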
For the problem sizes that I am interested in, the improvements are not nearly as dramatic.
For 128xN matrices over 10000 iterations:

For N = 1024: 1s vs 0.82s in favor of warpReduce.
For N = 512: 0.914s vs 0.873s in favor of warpReduce.
For N = 256: 0.94s vs 0.78s in favor of warpReduce.
For N = 160: 0.88s vs 0.77s in favor of warpReduce.

For small matrices it seems that any speedup from the extra parallelism gets eaten away by kernel launch overhead.
It was tested using a GTX 970.
It is likely that the figures would have been different on Kepler and earlier Nvidia cards, as Maxwell raised the limit on resident blocks per SM from 16 to 32, which improves multiprocessor occupancy for these one-warp blocks.
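As a rough occupancy check, assuming the published per-SM limits (64 resident warps on both architectures, 16 resident blocks on Kepler vs 32 on Maxwell) and the 32-thread blocks used here:

Kepler:  16 blocks/SM * 1 warp/block = 16 warps -> 16/64 = 25% theoretical occupancy
Maxwell: 32 blocks/SM * 1 warp/block = 32 warps -> 32/64 = 50% theoretical occupancy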
I am satisfied with this, as the performance improvement is nice and I actually had not been able to write a kernel that uses shared memory correctly before reaching for the CUB block reduce. I had forgotten how painful CUDA can be sometimes. It is amazing that the version that uses no shared memory is so competitive.
Here are the two modules that I tested with:
// Column means and standard deviations with Alea's CUB-style block reduce.
// Each 128-thread block strides over the columns; within a column the
// threads stride over the rows and BlockReduce combines the partial sums.
type rowModule(target) =
    inherit GPUModule(target)

    let grid_size = 64
    let block_size = 128

    let blockReducer = BlockReduce.RakingCommutativeOnly<float32>(dim3(block_size,1,1), worker.Device.Arch)

    [<Kernel;ReflectedDefinition>]
    member this.Kernel (num_rows:int) (num_cols:int) (x:deviceptr<float32>) (means:deviceptr<float32>) (stds:deviceptr<float32>) =
        // col is the column this block is currently working on.
        let mutable col = blockIdx.x
        let temp_storage = blockReducer.TempStorage.AllocateShared()
        let column_mean_shared = __shared__.Variable()
        while col < num_cols do
            // row is the row index; each thread strides over the rows.
            let mutable row = threadIdx.x
            let mutable acc = 0.0f
            while row < num_rows do
                // idx is the absolute index in the column-major array.
                let idx = row + col * num_rows
                acc <- acc + x.[idx]
                // Increment the row index.
                row <- row + blockDim.x
            // Only thread 0 gets the full block-wide sum back from Reduce.
            let column_mean_partial = blockReducer.Reduce(temp_storage, acc, fun a b -> a + b) / (float32 num_rows)
            if threadIdx.x = 0 then
                means.[col] <- column_mean_partial
                column_mean_shared := column_mean_partial
            __syncthreads()
            // Broadcast the mean to every thread in the block.
            let column_mean = !column_mean_shared
            row <- threadIdx.x
            acc <- 0.0f
            while row < num_rows do
                // idx is the absolute index in the column-major array.
                let idx = row + col * num_rows
                // Accumulate the squared deviations from the mean.
                acc <- acc + (x.[idx]-column_mean)*(x.[idx]-column_mean)
                // Increment the row index.
                row <- row + blockDim.x
            let variance_sum = blockReducer.Reduce(temp_storage, acc, fun a b -> a + b) / (float32 num_rows)
            if threadIdx.x = 0 then stds.[col] <- sqrt(variance_sum)
            // Barrier so the next iteration's Reduce does not reuse
            // temp_storage while it is still being read.
            __syncthreads()
            col <- col + gridDim.x

    member this.Apply((dmat: dM), (means: dM), (stds: dM)) =
        let lp = LaunchParam(grid_size, block_size)
        this.GPULaunch <@ this.Kernel @> lp dmat.num_rows dmat.num_cols dmat.dArray.Ptr means.dArray.Ptr stds.dArray.Ptr
// The same computation with 32-thread (single-warp) blocks, so the whole
// reduction is done with shuffle instructions and no shared memory.
type rowWarpModule(target) =
    inherit GPUModule(target)

    let grid_size = 384
    let block_size = 32

    [<Kernel;ReflectedDefinition>]
    member this.Kernel (num_rows:int) (num_cols:int) (x:deviceptr<float32>) (means:deviceptr<float32>) (stds:deviceptr<float32>) =
        // col is the column this warp is currently working on.
        let mutable col = blockIdx.x
        while col < num_cols do
            // row is the row index; each thread strides over the rows.
            let mutable row = threadIdx.x
            let mutable acc = 0.0f
            while row < num_rows do
                // idx is the absolute index in the column-major array.
                let idx = row + col * num_rows
                acc <- acc + x.[idx]
                // Increment the row index.
                row <- row + blockDim.x
            // Butterfly (xor) warp reduction: after the five shuffle steps
            // every lane holds the full warp-wide sum.
            let inline butterflyWarpReduce (value:float32) =
                let v1 = value + __shfl_xor value 16 32
                let v2 = v1 + __shfl_xor v1 8 32
                let v3 = v2 + __shfl_xor v2 4 32
                let v4 = v3 + __shfl_xor v3 2 32
                v4 + __shfl_xor v4 1 32
            let column_mean = (butterflyWarpReduce acc) / (float32 num_rows)
            row <- threadIdx.x
            acc <- 0.0f
            while row < num_rows do
                // idx is the absolute index in the column-major array.
                let idx = row + col * num_rows
                // Accumulate the squared deviations from the mean.
                acc <- acc + (x.[idx]-column_mean)*(x.[idx]-column_mean)
                // Increment the row index.
                row <- row + blockDim.x
            let variance_sum = (butterflyWarpReduce acc) / (float32 num_rows)
            if threadIdx.x = 0 then
                stds.[col] <- sqrt(variance_sum)
                means.[col] <- column_mean
            col <- col + gridDim.x

    member this.Apply((dmat: dM), (means: dM), (stds: dM)) =
        let lp = LaunchParam(grid_size, block_size)
        this.GPULaunch <@ this.Kernel @> lp dmat.num_rows dmat.num_cols dmat.dArray.Ptr means.dArray.Ptr stds.dArray.Ptr
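In case anyone wants to sanity-check the kernels, a minimal CPU reference for the same computation might look like the sketch below (plain F#, no Alea; it assumes the same column-major float32[] layout and the same population standard deviation as the kernels above, and cpuMeanStd is just an illustrative name, not part of the modules):

// CPU reference: per-column mean and population standard deviation
// over a column-major float32[] of length num_rows * num_cols.
let cpuMeanStd (num_rows: int) (num_cols: int) (x: float32[]) =
    let means = Array.zeroCreate<float32> num_cols
    let stds = Array.zeroCreate<float32> num_cols
    for col in 0 .. num_cols - 1 do
        let mutable acc = 0.0f
        for row in 0 .. num_rows - 1 do
            acc <- acc + x.[row + col * num_rows]
        let mean = acc / float32 num_rows
        let mutable var = 0.0f
        for row in 0 .. num_rows - 1 do
            let d = x.[row + col * num_rows] - mean
            var <- var + d * d
        means.[col] <- mean
        stds.[col] <- sqrt (var / float32 num_rows)
    means, stds

Expect small floating-point differences against the GPU results, since the summation order differs.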