I am trying to code C# versions (in C# style) of the F# reduce functions found here:
https://github.com/quantalea/AleaGPUTutorial/tree/master/src/fsharp/examples/generic_reduce
More specific to my question, take this function for example:
let multiReduce (opExpr:Expr<'T -> 'T -> 'T>) numWarps =
    let warpStride = WARP_SIZE + WARP_SIZE / 2 + 1
    let sharedSize = numWarps * warpStride
    <@ fun tid (x:'T) ->
        // stuff
    @>
I'm primarily an F# guy, and I'm not quite sure how I should go about coding functions like these in C#. For the C# version, the multiReduce function will be a class member. So if I wanted to do a more direct translation of the F# code, I would return a Func from my MultiReduce member.
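To make the direct translation concrete, here is a rough sketch of what I have in mind, in plain C# with no AleaGPU types. The class name `ReduceSketch`, the `WarpSize` constant (assumed to be 32), and the pass-through body are placeholders of mine, not anything from the library:

```csharp
using System;

public class ReduceSketch<T>
{
    // Assumption: stands in for WARP_SIZE from the F# sample.
    private const int WarpSize = 32;

    // The outer call fixes op and numWarps once; the returned delegate
    // plays the role of the quoted F# lambda (fun tid x -> ...).
    public Func<int, T, T> MultiReduce(Func<T, T, T> op, int numWarps)
    {
        int warpStride = WarpSize + WarpSize / 2 + 1;
        int sharedSize = numWarps * warpStride;

        return (tid, x) =>
        {
            // stuff: the real body would use op, tid and sharedSize.
            // Placeholder: pass the input through unchanged.
            return x;
        };
    }
}
```

Whether AleaGPU can compile a delegate returned this way as a device function is exactly what I am unsure about.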
The other option would be to "flatten" the multiReduce function, so that my C# member version would have two extra parameters. So...
public T MultiReduce(Func<T,T,T> op, int numWarps, int tid, T x)
{
    // stuff
}
But I don't think this would work for AleaGPU coding in all cases because the quoted expression in the F# version is a device function. You need the nested function structure to be able to separate the assignment of certain variables from the actual invocation of the function.
Another way I see to do it would be to make a MultiReduce class and have the opExpr and numWarps as fields, then make the function in the quotation a class member.
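A sketch of that class-based variant, again in plain C# with no AleaGPU types. The names `MultiReduce<T>`, `Reduce`, `SharedSize`, and `WarpSize` (assumed to be 32) are mine, and the body is a placeholder:

```csharp
using System;

public class MultiReduce<T>
{
    // Assumption: stands in for WARP_SIZE from the F# sample.
    private const int WarpSize = 32;

    private readonly Func<T, T, T> op;

    // Computed once at construction, like the let-bindings outside the quotation.
    public int SharedSize { get; }

    public MultiReduce(Func<T, T, T> op, int numWarps)
    {
        this.op = op;
        int warpStride = WarpSize + WarpSize / 2 + 1;
        SharedSize = numWarps * warpStride;
    }

    // Corresponds to the quoted lambda (fun tid x -> ...).
    public T Reduce(int tid, T x)
    {
        // stuff: the real body would use op, tid and SharedSize.
        // Placeholder: pass the input through unchanged.
        return x;
    }
}
```

Here the constructor does the one-time setup and `Reduce` is the part that would be invoked per thread, so the two stages stay separate without returning a `Func`.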
So how are higher order functions like these generally implemented in AleaGPU-C#? I don't think it's good to return Func<..> everywhere since I don't see this done much in C# coding. Is AleaGPU a special case where this would be ok?
A basic AleaGPU C# implementation looks like this:
internal class TransformModule<T> : ILGPUModule
{
    private readonly Func<T, T> op;

    public TransformModule(GPUModuleTarget target, Func<T, T> opFunc)
        : base(target)
    {
        op = opFunc;
    }

    [Kernel]
    public void Kernel(int n, deviceptr<T> x, deviceptr<T> y)
    {
        var start = blockIdx.x * blockDim.x + threadIdx.x;
        var stride = gridDim.x * blockDim.x;
        for (var i = start; i < n; i += stride)
            y[i] = op(x[i]);
    }

    public void Apply(int n, deviceptr<T> x, deviceptr<T> y)
    {
        const int blockSize = 256;
        var numSm = this.GPUWorker.Device.Attributes.MULTIPROCESSOR_COUNT;
        var gridSize = Math.Min(16 * numSm, Common.divup(n, blockSize));
        var lp = new LaunchParam(gridSize, blockSize);
        GPULaunch(Kernel, lp, n, x, y);
    }

    public T[] Apply(T[] x)
    {
        using (var dx = GPUWorker.Malloc(x))
        using (var dy = GPUWorker.Malloc<T>(x.Length))
        {
            Apply(x.Length, dx.Ptr, dy.Ptr);
            return dy.Gather();
        }
    }
}