I hadn't expected much when I posted this question, but it turns out that for large matrices, such as 128x10000 for example, there is a much better way. I wrote a warpReduce kernel with a block size of 32, which lets it do the whole reduction using shuffle xor.
For a 128x100000 matrix over 100 iterations, the first version, which used 64 blocks per grid (and 32 threads per block), took 0.5s. The CUB row reduce took 0.25s.
When I increased the blocks per grid to 256, I got a nearly 4x speedup, to about 0.15s. At 384 blocks per grid it takes 0.11s, and increasing the number of blocks further does not seem to improve performance from there.
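My guess as to why it flattens out around that point: the GTX 970 has 13 SMs and Maxwell allows up to 32 resident blocks per SM, so roughly 13 * 32 = 416 of these one-warp blocks can be resident at once. Somewhere around there the card is fully occupied, and adding more blocks just shortens the grid-stride loop over the columns per block without putting more work in flight.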
For the problem sizes that I am interested in, the improvements are not nearly as dramatic.
For 128xN matrices over 10000 iterations:

For N = 1024: 1s vs 0.82s in favor of warpReduce.
For N = 512: 0.914s vs 0.873s in favor of warpReduce.
For N = 256: 0.94s vs 0.78s in favor of warpReduce.
For N = 160: 0.88s vs 0.77s in favor of warpReduce.

For small matrices it seems that any speedup from the extra parallelism gets eaten away by kernel launch overhead.
It was tested using a GTX 970.
It is likely that the figures would have been different on Kepler and earlier Nvidia cards, as Maxwell raised the limit on resident blocks per SM from 16 to 32, which improves multiprocessor occupancy for these one-warp blocks.
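As a rough occupancy check, assuming the published per-SM limits (64 resident warps on both architectures, 16 resident blocks on Kepler vs 32 on Maxwell) and the 32-thread blocks used here:

Kepler:  16 blocks/SM * 1 warp/block = 16 warps -> 16/64 = 25% theoretical occupancy
Maxwell: 32 blocks/SM * 1 warp/block = 32 warps -> 32/64 = 50% theoretical occupancy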
I am satisfied with this, as the performance improvement is nice and I actually had not been able to write a kernel that uses shared memory correctly before reaching for the CUB block reduce. I had forgotten how painful CUDA can be sometimes. It is amazing that the version that uses no shared memory is so competitive.
Here are the two modules that I tested with:
// Column means and standard deviations with Alea's CUB-style block reduce.
// Each 128-thread block strides over the columns; within a column the
// threads stride over the rows and BlockReduce combines the partial sums.
type rowModule(target) =
    inherit GPUModule(target)

    let grid_size = 64
    let block_size = 128

    let blockReducer = BlockReduce.RakingCommutativeOnly<float32>(dim3(block_size,1,1), worker.Device.Arch)

    [<Kernel;ReflectedDefinition>]
    member this.Kernel (num_rows:int) (num_cols:int) (x:deviceptr<float32>) (means:deviceptr<float32>) (stds:deviceptr<float32>) =
        // col is the column this block is currently working on.
        let mutable col = blockIdx.x
        let temp_storage = blockReducer.TempStorage.AllocateShared()
        let column_mean_shared = __shared__.Variable()
        while col < num_cols do
            // row is the row index; each thread strides over the rows.
            let mutable row = threadIdx.x
            let mutable acc = 0.0f
            while row < num_rows do
                // idx is the absolute index in the column-major array.
                let idx = row + col * num_rows
                acc <- acc + x.[idx]
                // Increment the row index.
                row <- row + blockDim.x
            // Only thread 0 gets the full block-wide sum back from Reduce.
            let column_mean_partial = blockReducer.Reduce(temp_storage, acc, fun a b -> a + b) / (float32 num_rows)
            if threadIdx.x = 0 then
                means.[col] <- column_mean_partial
                column_mean_shared := column_mean_partial
            __syncthreads()
            // Broadcast the mean to every thread in the block.
            let column_mean = !column_mean_shared
            row <- threadIdx.x
            acc <- 0.0f
            while row < num_rows do
                // idx is the absolute index in the column-major array.
                let idx = row + col * num_rows
                // Accumulate the squared deviations from the mean.
                acc <- acc + (x.[idx]-column_mean)*(x.[idx]-column_mean)
                // Increment the row index.
                row <- row + blockDim.x
            let variance_sum = blockReducer.Reduce(temp_storage, acc, fun a b -> a + b) / (float32 num_rows)
            if threadIdx.x = 0 then stds.[col] <- sqrt(variance_sum)
            // Barrier so the next iteration's Reduce does not reuse
            // temp_storage while it is still being read.
            __syncthreads()
            col <- col + gridDim.x

    member this.Apply((dmat: dM), (means: dM), (stds: dM)) =
        let lp = LaunchParam(grid_size, block_size)
        this.GPULaunch <@ this.Kernel @> lp dmat.num_rows dmat.num_cols dmat.dArray.Ptr means.dArray.Ptr stds.dArray.Ptr
// The same computation with 32-thread (single-warp) blocks, so the whole
// reduction is done with shuffle instructions and no shared memory.
type rowWarpModule(target) =
    inherit GPUModule(target)

    let grid_size = 384
    let block_size = 32

    [<Kernel;ReflectedDefinition>]
    member this.Kernel (num_rows:int) (num_cols:int) (x:deviceptr<float32>) (means:deviceptr<float32>) (stds:deviceptr<float32>) =
        // col is the column this warp is currently working on.
        let mutable col = blockIdx.x
        while col < num_cols do
            // row is the row index; each thread strides over the rows.
            let mutable row = threadIdx.x
            let mutable acc = 0.0f
            while row < num_rows do
                // idx is the absolute index in the column-major array.
                let idx = row + col * num_rows
                acc <- acc + x.[idx]
                // Increment the row index.
                row <- row + blockDim.x
            // Butterfly (xor) warp reduction: after the five shuffle steps
            // every lane holds the full warp-wide sum.
            let inline butterflyWarpReduce (value:float32) =
                let v1 = value + __shfl_xor value 16 32
                let v2 = v1 + __shfl_xor v1 8 32
                let v3 = v2 + __shfl_xor v2 4 32
                let v4 = v3 + __shfl_xor v3 2 32
                v4 + __shfl_xor v4 1 32
            let column_mean = (butterflyWarpReduce acc) / (float32 num_rows)
            row <- threadIdx.x
            acc <- 0.0f
            while row < num_rows do
                // idx is the absolute index in the column-major array.
                let idx = row + col * num_rows
                // Accumulate the squared deviations from the mean.
                acc <- acc + (x.[idx]-column_mean)*(x.[idx]-column_mean)
                // Increment the row index.
                row <- row + blockDim.x
            let variance_sum = (butterflyWarpReduce acc) / (float32 num_rows)
            if threadIdx.x = 0 then
                stds.[col] <- sqrt(variance_sum)
                means.[col] <- column_mean
            col <- col + gridDim.x

    member this.Apply((dmat: dM), (means: dM), (stds: dM)) =
        let lp = LaunchParam(grid_size, block_size)
        this.GPULaunch <@ this.Kernel @> lp dmat.num_rows dmat.num_cols dmat.dArray.Ptr means.dArray.Ptr stds.dArray.Ptr
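In case anyone wants to sanity-check the kernels, a minimal CPU reference for the same computation might look like the sketch below (plain F#, no Alea; it assumes the same column-major float32[] layout and the same population standard deviation as the kernels above, and cpuMeanStd is just an illustrative name, not part of the modules):

// CPU reference: per-column mean and population standard deviation
// over a column-major float32[] of length num_rows * num_cols.
let cpuMeanStd (num_rows: int) (num_cols: int) (x: float32[]) =
    let means = Array.zeroCreate<float32> num_cols
    let stds = Array.zeroCreate<float32> num_cols
    for col in 0 .. num_cols - 1 do
        let mutable acc = 0.0f
        for row in 0 .. num_rows - 1 do
            acc <- acc + x.[row + col * num_rows]
        let mean = acc / float32 num_rows
        let mutable var = 0.0f
        for row in 0 .. num_rows - 1 do
            let d = x.[row + col * num_rows] - mean
            var <- var + d * d
        means.[col] <- mean
        stds.[col] <- sqrt (var / float32 num_rows)
    means, stds

Expect small floating-point differences against the GPU results, since the summation order differs.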