I took your example and it runs fine.
Code:
    var gpu = Gpu.Default;
    var blas = Blas.Get(Gpu.Default);

    var hostA = new float[,]
    {
        {1, 2, 3},
        {4, 5, 6},
        {7, 8, 9},
    };
    var hostB = new float[,]
    {
        {10, 20, 30},
        {40, 50, 60},
        {70, 80, 90},
    };
    PrintArray(hostA);
    PrintArray(hostB);

    // Copy both matrices to the GPU.
    var deviceA = gpu.AllocateDevice(hostA);
    var deviceB = gpu.AllocateDevice(hostB);

    // axpy: deviceB = 1 * deviceA + deviceB, element by element.
    blas.Axpy(deviceA.Length, 1f, deviceA.Ptr, 1, deviceB.Ptr, 1);

    // Copy the result back to the host and print it.
    var hostC = Gpu.Copy2DToHost(deviceB);
    PrintArray(hostC);
Print helper:
    private static void PrintArray(float[,] array)
    {
        for (var i = 0; i < array.GetLength(0); i++)
        {
            for (var k = 0; k < array.GetLength(1); k++)
            {
                Console.Write("{0} ", array[i, k]);
            }
            Console.WriteLine();
        }
        Console.WriteLine(new string('-', 10));
    }
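For comparison, here is a minimal CPU-only sketch of what the Axpy call should produce, so the GPU result can be checked without Alea. It assumes axpy means y = alpha * x + y applied element by element and that both arrays have the same shape. `AxpyReference` and its `Axpy` method are hypothetical names made up for this illustration; they are not part of Alea.

```csharp
using System;

static class AxpyReference
{
    // Hypothetical CPU reference (not an Alea API):
    // returns alpha * x + y, computed element by element.
    public static float[,] Axpy(float alpha, float[,] x, float[,] y)
    {
        var rows = y.GetLength(0);
        var cols = y.GetLength(1);
        if (x.GetLength(0) != rows || x.GetLength(1) != cols)
        {
            throw new ArgumentException("x and y must have the same shape.");
        }

        var result = new float[rows, cols];
        for (var i = 0; i < rows; i++)
        {
            for (var k = 0; k < cols; k++)
            {
                result[i, k] = alpha * x[i, k] + y[i, k];
            }
        }
        return result;
    }

    static void Main()
    {
        var hostA = new float[,] { { 1, 2, 3 }, { 4, 5, 6 }, { 7, 8, 9 } };
        var hostB = new float[,] { { 10, 20, 30 }, { 40, 50, 60 }, { 70, 80, 90 } };

        var expected = Axpy(1f, hostA, hostB);

        // Print the expected matrix row by row.
        for (var i = 0; i < expected.GetLength(0); i++)
        {
            for (var k = 0; k < expected.GetLength(1); k++)
            {
                Console.Write("{0} ", expected[i, k]);
            }
            Console.WriteLine();
        }
    }
}
```

With the inputs above this prints the rows 11 22 33, 44 55 66 and 77 88 99. If the GPU version prints something else, the difference is in the GPU path rather than in the arithmetic.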
This is what I get:

Two questions:
- What version of Alea GPU are you using?
- What version of the CUDA Toolkit are you using?
I built my sample against Alea 3.0.4-beta2, and I have CUDA Toolkit 8.0 installed.
Just to be sure, I also tried your example in F# (I'm not fluent in F#).
Code:
    let gpu = Gpu.Default
    let blas = Blas.Get(Gpu.Default)

    // Note: float in F# is a 64-bit double, unlike float in the C# version.
    let hostA: float[,] = array2D [[ 1.0; 2.0; 3.0 ]; [ 4.0; 5.0; 6.0 ]; [ 7.0; 8.0; 9.0 ]]
    let hostB: float[,] = array2D [[ 10.0; 20.0; 30.0 ]; [ 40.0; 50.0; 60.0 ]; [ 70.0; 80.0; 90.0 ]]
    PrintArray(hostA)
    PrintArray(hostB)

    // Copy both matrices to the GPU.
    use deviceA = gpu.AllocateDevice(hostA)
    use deviceB = gpu.AllocateDevice(hostB)

    // axpy: deviceB = 1.0 * deviceA + deviceB, element by element.
    blas.Axpy(deviceA.Length, 1.0, deviceA.Ptr, 1, deviceB.Ptr, 1)

    // Copy the result back to the host and print it.
    let hostC = Gpu.Copy2DToHost(deviceB)
    PrintArray(hostC)
Print helper (in F# it has to be defined before the code that calls it):
    let PrintArray(array: float[,]): unit =
        for i in 0 .. array.GetLength(0) - 1 do
            for k in 0 .. array.GetLength(1) - 1 do
                Console.Write("{0} ", array.[i, k])
            Console.WriteLine()
        Console.WriteLine(new string('-', 10))