
I am writing an application, and it eventually comes to a well-parallelisable part:

two dimensional float initialData and result arrays
for each cell (a, b) in result array:
    for each cell (i, j) in initialData:
        result(a, b) += someComputation(initialData(i, j), a, b, i, j, some global data...);

Some more details about algorithm:

  • I'd like to make the first loop's iterations run concurrently (perhaps there is a better approach?)
  • The initial data is accessed read-only
  • someComputation is fairly simple (multiplication, addition, cosine computation), so it could be accomplished by a GPU; however, it needs the indexes of the elements it is currently working on
  • Arrays won't exceed ~4000 in any dimension
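The properties above (independent outer iterations, read-only input, index-dependent kernel) map directly onto a CPU fallback using `Parallel.For` from the Task Parallel Library (available since .NET 4). This is only a sketch of the loop shape; `SomeComputation` below is a hypothetical stand-in for the question's `someComputation`:

```csharp
using System;
using System.Threading.Tasks;

static class FallbackSketch
{
    // Hypothetical stand-in for the question's someComputation:
    // multiplication, addition and a cosine, parameterised by the indexes.
    static float SomeComputation(float value, int a, int b, int i, int j)
    {
        return value * (float)Math.Cos(a * i + b * j);
    }

    // CPU fallback. The outer iterations are independent (each result cell
    // is written by exactly one iteration, initialData is read-only),
    // so parallelising the outer loop with Parallel.For is safe.
    public static float[,] Compute(float[,] initialData, int rows, int cols)
    {
        var result = new float[rows, cols];
        Parallel.For(0, rows, a =>
        {
            for (int b = 0; b < cols; b++)
            {
                float sum = 0f;
                for (int i = 0; i < initialData.GetLength(0); i++)
                    for (int j = 0; j < initialData.GetLength(1); j++)
                        sum += SomeComputation(initialData[i, j], a, b, i, j);
                result[a, b] = sum;
            }
        });
        return result;
    }
}
```

Note that at the ~4000×4000 limit this is on the order of 4000⁴ kernel evaluations, so a GPU path (or tiling/caching on the CPU) is still worth pursuing; the sketch only shows a safe fallback shape.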

Library properties:

  • The program is going to be written in C# (with WPF), so it would be nice if the library already had easy-to-use .NET bindings
  • If no GPU is found, the algorithm should run on the CPU
  • The program is going to be Windows-only, and Windows XP support is highly preferable
  • The algorithm can be rewritten in OpenCL; however, I believe it is not as widely supported as pixel shaders. But if there are no alternatives, OpenCL would be fine. (AFAIK CUDA runs only on nVidia GPUs, while OpenCL covers both nVidia's and AMD's GPUs.)

I have tried to look at the Microsoft Accelerator library, but I haven't found a way to pass in array indexes. Any help would be appreciated, and excuse me for my English.

Robert Crovella
  • "however, I believe it is not as widely supported as pixel shaders" – well, that depends: if you want to support old hardware (pre-DirectX 10) that is true; however, you can run OpenCL on the CPU, so writing your code to fall back to the CPU is pretty simple (of course, using exactly the same code path is likely to be suboptimal, and the most critical path probably needs different implementations for AMD and NVIDIA anyway, maybe for different generations too). – Grizzly Jan 12 '12 at 15:55
  • Once you get that working, don't forget CPUs and GPUs are different. CPUs want more coarse-grained threading because they have few hardware threads, and GPUs want more fine-grained threading because they have lots of hardware threads. This means you may want to tweak your FOR loops depending on the architecture. – Bengie Jan 12 '12 at 16:00
  • @Grizzly, I would like to support as wide a range of hardware as possible (with less coding, of course, or with an easy way to determine the hardware the program is running on to choose the variation of the algorithm). –  Jan 12 '12 at 16:07
  • @Bengie, this algorithm cannot be optimized, so if run on a CPU it will simply degrade (because iterations will run sequentially), whereas on a GPU it will work faster (because some iterations can run concurrently). –  Jan 12 '12 at 16:10
  • @EdgeLuxe: I don't know how exactly your computation looks, but I would suggest staying away from GPGPU on pre-DX10 hardware, since it's rarely worth the hassle (for good performance you might end up writing a code path for every single DX9 generation and still get only very moderate speedups). The point about the tweaking was that you can't run the same code on different hardware platforms and have it be optimal on all of them. It might run on all of them, but for some the performance might be quite bad. – Grizzly Jan 12 '12 at 16:24
  • @Grizzly, guess you are right – leaving a "CPU version" for older hardware would be fine. Well, even if the code is "not optimal" for some hardware, in this case it would still run in less than or approximately the same time as if it were run on the CPU, wouldn't it? –  Jan 12 '12 at 16:37
  • @EdgeLuxe: Not necessarily; writing efficient code for older GPUs isn't easy. For a GPU, "not optimal" can easily mean orders of magnitude. Even on modern GPUs you can easily end up with your code running no faster on the GPU than on the CPU if you make some small "mistakes" ("mistake" considering the performance, not the correctness). These "mistakes" are different for different hardware generations, but older hardware generally has more restrictions (and is slower relative to the CPU), so it can easily get an order of magnitude slower than the CPU for DX9-class hardware. – Grizzly Jan 12 '12 at 16:44
  • Running GPU-optimized code on the CPU is typically not that bad, but you often need to do redundant operations on the GPU (initialization and such) and often have to use less optimal algorithms to allow for massive parallelization, so it can still be much slower than a CPU-specific codepath. – Grizzly Jan 12 '12 at 16:47
  • I have made a little test, and it looks like the OpenCL version works faster on my 8600GTS than on the CPU. Hopefully, I will be able to write the same code that suits both CPU and GPU. –  Jan 12 '12 at 17:21
  • For ease of use from within C#, these two look promising (to me ;-) – Cudafy and Tidepowrd GPU.NET – IvoTops Aug 14 '12 at 11:58

1 Answer


There are low-level OpenCL bindings: OpenCL.NET (http://openclnet.codeplex.com/). There are also OpenCL.NET-based bindings for F#: https://github.com/YaccConstructor/Brahma.FSharp

The latter allows you to write "native" F# code and run it on the GPU via OpenCL. For example, here is code for matrix multiplication (without provider configuration):

//Code for run on GPU
let command = 
    <@
        fun (r:_2D) columns (a:array<_>) (b:array<_>) (c:array<_>) -> 
            let tx = r.GlobalID0
            let ty = r.GlobalID1
            let mutable buf = c.[ty * columns + tx]
            for k in 0 .. columns - 1 do
                buf <- buf + (a.[ty * columns + k] * b.[k * columns + tx])
            c.[ty * columns + tx] <- buf
    @>

//compile code and configure kernel
let kernel, kernelPrepare, kernelRun = provider.Compile command
let d = new _2D(rows, columns, localWorkSize, localWorkSize)
kernelPrepare d columns aValues bValues cParallel
//run computations on GPU
let _ = commandQueue.Add(kernelRun()).Finish()            

//read result back
let _ = commandQueue.Add(cParallel.ToHost(kernel)).Finish()
gsv