I use ArrayFire to speed up some C++ code with the help of the GPU (OpenCL backend). I have af::array objects of 600 MB and more which I need to flip along the column dimension and then transpose.
So far I did these operations almost in place with a C++ routine. I would now like to do them with AF, but I noticed excessive memory use by the AF library. I have two problems with this:

1) I completely fail to see why any operation (such as flip or T) on a 300 MB array should ever use much more than 900 MB of memory.

2) I would like to know how to avoid creating a copy of the array foo. I thought that by encapsulating the operations in a separate function I would get rid of any copies.
I have code like this:
void prepare_array(af::array &a) {
    af::array b = af::flip(a, 1); // ~1400MB
    a = b.T();                    // ~3000MB
}
af::array foo = af::randn(768,16384,3,1,c64); // ~300MB
prepare_array(foo);
af::deviceGC(); // ~600MB
I need to do this operation only once, so speed is less important than memory usage, but I would have preferred to do these operations within the AF framework.
(All memory usage statistics are read out with gpustat from the NVIDIA kernel driver package on Debian.)
The memory usage is just as excessive with the CPU backend.
Thanks for the reply, umar-arshad: when I profiled the memory usage last time, I ran the code on the CPU, assuming it would behave the same. I double-checked the measurements on the GPU using both gpustat and nvidia-smi. Indeed the measurements were different, as you explained. It all makes perfect sense now - at least the GPU part.
Maybe on the CPU foo is at first only f64, because only the real part is used, and it becomes c64 through either the flip or the transposition.
The fact that "allocations trigger an implicit device synchronize on all queues on certain platforms", together with this website: http://forums.accelereyes.com/forums/viewtopic.php?f=17&t=43097&p=61730&hilit=copy+host+memory+into+an+array#p61727 and af::printMemInfo(), helped me to finally figure out most of the memory handling of AF. This sped up my program vastly.
However, for now the only alternative for doing these two operations in place (or with as little overhead as possible) is:
// Generate/store data in af::cdouble* foo_unwrap = new af::cdouble[768*16384*3*1];
// Flip/Transpose foo_unwrap in plain C/C++, like in:
// for(column = 0; column < max_num_column/2; column++)
// swap column with max_num_column-1-column
//
// http://www.geeksforgeeks.org/inplace-m-x-n-size-matrix-transpose/
// but for Column-Major Order matrices
//
// and afterwards whenever needed do ad-hoc:
af::cdouble* first_elem = &(foo_unwrap[0]); // to ensure correct type detection via AF
af::array foo = af::array(768,16384,3,1, first_elem, afHost); // afHost, since the data lives in host memory
However, this is quite cumbersome because I didn't want to bother with row-/column-major formats and the index magic, so I'm still looking for suggestions here.