I use ArrayFire to speed up some C++ code with the help of the GPU (OpenCL backend). I have af::array objects of 600 MB and more which I need to flip along the column dimension and then transpose.
So far I did these operations almost in place with a C++ routine. I would now like to do them with AF, but I noticed excessive memory use by the AF library. I have two problems with this:

1) I completely fail to see why any operation (such as flip or T) on a 300 MB array should ever use much more than 900 MB of memory.

2) I would like to know how to avoid creating a copy of the array foo. I thought that by encapsulating the operations in a separate function I would get rid of any copies.
I have code like this:
void prepare_array(af::array &a) {
    af::array b = af::flip(a, 1); // ~1400MB
    a = b.T();                    // ~3000MB
}
af::array foo = af::randn(768,16384,3,1,c64); // ~300MB
prepare_array(foo);
af::deviceGC(); // ~600MB
I need to do this operation only once, so speed is less important than memory usage, but I would have preferred to do these operations within the AF framework.
(All memory usage statistics are read out with gpustat from the NVIDIA kernel driver package on Debian.)
The memory usage is just as excessive with the CPU backend.
Thanks for the reply, umar-arshad: when I profiled the memory usage last time, I ran the code on the CPU, assuming it would behave the same. I double-checked the measurements on the GPU using both gpustat and nvidia-smi. Indeed the measurements were different, as you explained. It all makes perfect sense now - at least the GPU part.
Maybe on the CPU foo is at first only f64, because only the real part is used, and it becomes c64 through either the flip or the transposition.
The fact that "allocations trigger an implicit device synchronize on all queues on certain platforms", together with this website: http://forums.accelereyes.com/forums/viewtopic.php?f=17&t=43097&p=61730&hilit=copy+host+memory+into+an+array#p61727 and af::printMemInfo(), helped me to finally figure out most of the memory handling of AF. This sped up my program vastly.
However, for now the only alternative for doing these two operations in place (or with as little overhead as possible) is:
// Generate/store data in af::cdouble* foo_unwrap = new af::cdouble[768*16384*3*1];
// Flip/Transpose foo_unwrap in plain C/C++, like in:
// for(column = 0; column < max_num_column/2; column++)
// swap column with max_num_column-1-column
//
// http://www.geeksforgeeks.org/inplace-m-x-n-size-matrix-transpose/
// but for Column-Major Order matrices
//
// and afterwards whenever needed do ad-hoc:
af::cdouble* first_elem = &(foo_unwrap[0]); // to ensure correct type detection via AF
af::array foo = af::array(768,16384,3,1, first_elem, afHost); // afHost, since the data lives in host memory
However, this is quite cumbersome because I didn't want to bother with row-/column-major formats and the index magic, so I'm still looking for suggestions here.