
In my project I have to copy a lot of numerical data from a CUDA (GPU) device into a std::valarray (or std::vector), that is, from the memory of the video card into host memory.

So I need to resize these data structures as fast as possible, but when I call the member function vector::resize it initializes all elements of the array to the default value, with a loop.

// In a super simplified description, resize behaves like this pseudocode:
vector<T>::resize(N){
   // Setup the new size

   // allocate the new array
   this->_internal_vector = new T[N];

   // init to default
   // This loop is slow !!!!
   for ( i = 0; i < N ; ++i){
      this->_internal_vector[i] = T();
   }
}

Clearly I don't need this initialization, because I am about to copy data from the GPU and all the old data will be overwritten. The initialization takes time, so I lose performance.

To copy the data I need allocated memory, which is what resize() provides.

A very dirty and wrong solution is to use vector::reserve(), but then I lose all the features of the vector, and if I later call resize() the data are replaced with the default value.

So, is there a strategy for avoiding this pre-initialization to the default value (in valarray or vector)?

I want a resize method that behaves like this:
vector<T>::resize(N) {
    // Allocate the memory.
    this->_internal_vector = new T[N];

    // Update the size of the vector or valarray

    // !! DO NOT initialize the new values.
}

An example of the performance:

#include <ctime>
#include <iostream>
#include <valarray>
#include <vector>

int main() {

  std::vector<double> vec;
  std::valarray<double> vec2;

  double *vec_raw;

  unsigned int N = 100000000;

  std::clock_t start;
  double duration;

  start = std::clock();
  // Dirty solution!
  vec.reserve(N);

  duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
  std::cout << "duration reserve: " << duration << std::endl;

  start = std::clock();

  vec_raw = new double[N];

  duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
  std::cout << "duration new: " << duration << std::endl;

  start = std::clock();

  for (unsigned int i = 0; i < N; ++i) {
    vec_raw[i] = 0;
  }

  duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
  std::cout << "duration raw init: " << duration << std::endl;

  start = std::clock();
  // Dirty solution: undefined behavior, because size() is still 0 after reserve()
  for (unsigned int i = 0; i < vec.capacity(); ++i) {
    vec[i] = 0;
  }

  duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
  std::cout << "duration vec init dirty: " << duration << std::endl;

  start = std::clock();

  vec2.resize(N);

  duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
  std::cout << "duration valarray resize: " << duration << std::endl;

  return 0;
}

Output:

duration reserve: 1.1e-05
duration new: 1e-05
duration raw init: 0.222263
duration vec init dirty: 0.214459
duration valarray resize: 0.215735

Note: replacing std::allocator did not seem to work for me, because the loop appears to be run by resize() itself.
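One caveat on the note above: since C++11, std::vector::resize constructs new elements through std::allocator_traits<Alloc>::construct, so an allocator that overrides the zero-argument construct() to default-initialize (rather than value-initialize) can skip the zeroing for fundamental types. The following is a minimal sketch of that commonly used adaptor, not something taken from this thread; the names default_init_allocator, uninit_vector and make_buffer are hypothetical, and it does not apply to std::valarray, which takes no allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// Hypothetical allocator adaptor: forwards everything to A, except that
// construct(p) with no arguments default-initializes instead of
// value-initializing, so fundamental types are left uninitialized.
template <typename T, typename A = std::allocator<T>>
class default_init_allocator : public A {
  using traits = std::allocator_traits<A>;

public:
  template <typename U>
  struct rebind {
    using other =
        default_init_allocator<U, typename traits::template rebind_alloc<U>>;
  };

  using A::A;  // inherit the base allocator's constructors

  // Called by resize(n): "new U" without parentheses is default-initialization.
  template <typename U>
  void construct(U* p) {
    ::new (static_cast<void*>(p)) U;
  }

  // Every other form of construction is forwarded unchanged.
  template <typename U, typename... Args>
  void construct(U* p, Args&&... args) {
    traits::construct(static_cast<A&>(*this), p, std::forward<Args>(args)...);
  }
};

// Hypothetical alias for a vector whose resize() does not zero its elements.
template <typename T>
using uninit_vector = std::vector<T, default_init_allocator<T>>;

// Returns a buffer of n doubles with indeterminate contents, ready to be
// overwritten (for example by a device-to-host copy into buf.data()).
uninit_vector<double> make_buffer(std::size_t n) {
  uninit_vector<double> buf;
  buf.resize(n);  // allocates and sets size(); no per-element zeroing
  return buf;
}
```

The elements must be written before they are read, because their values are indeterminate. Note also that uninit_vector<double> is a different type from std::vector<double>, so code that expects exactly std::vector<double> would need to be templated on the allocator or take iterators or a pointer and a size.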

Giggi
  • Your initialization of the vector `vec` is *wrong*! The `reserve` function only allocates memory, but the actual size is still unchanged. That means you index *out of bounds* and have *undefined behavior*. – Some programmer dude Mar 10 '18 at 13:13
  • Also, if you want to set all elements of an array (actual or dynamically allocated) or vector to a single value, use [`std::fill`](http://en.cppreference.com/w/cpp/algorithm/fill) or [`std::fill_n`](http://en.cppreference.com/w/cpp/algorithm/fill_n) instead of explicit loops. You could also use [`std::memset`](http://en.cppreference.com/w/cpp/string/byte/memset) in both cases. – Some programmer dude Mar 10 '18 at 13:15
  • @Some programmer dude Yes it is wrong! But it is fast. – Giggi Mar 10 '18 at 13:15
  • @Some programmer dude I need a block of raw allocated memory (like old-style malloc()) but owned by a std::vector. Copying memory from a video card into a vector with the C++ standard library alone is impossible. – Giggi Mar 10 '18 at 13:21
  • It doesn't matter if it's "fast". Wrong is still wrong, and you're very lucky it seems to work for you. Another compiler, or even a new version of the one you have, might lead to your program crashing unexpectedly (and maybe not even there). – Some programmer dude Mar 10 '18 at 13:23
  • @Giggi, if you have at least read-only access to the memory, you can copy it directly wherever you need. And the fastest solution on windows is still `memcpy`. – Smit Ycyken Mar 10 '18 at 13:26
  • @Smit Ycyken memcpy doesn't work for copying memory from a video card to RAM, and it does not change the problem: getting an uninitialized vector. – Giggi Mar 10 '18 at 13:33
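
The two points raised in the comments above (reserve() changes only the capacity, and std::fill or std::fill_n can replace the explicit loops) can be illustrated with a short sketch. The helper names zero_raw and make_filled are hypothetical and are not part of the original benchmark.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Fills a raw buffer of n doubles with zeros, without an explicit loop.
void zero_raw(double* p, std::size_t n) {
  std::fill_n(p, n, 0.0);
}

// Builds a vector of n copies of value.
std::vector<double> make_filled(std::size_t n, double value) {
  std::vector<double> v;
  v.reserve(n);            // capacity() >= n, but size() is still 0
  assert(v.size() == 0);   // so v[i] would be out of bounds at this point
  v.assign(n, value);      // sets size() to n and writes every element
  return v;
}
```

After make_filled returns, indexing v[0] through v[n - 1] is well defined, which is not the case for the loop over vec.capacity() in the benchmark.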

2 Answers

3

Let's say you have an array (or some collection) with the data called data and you want to copy it to a vector vec. Then the idiomatic way to do this would be to use std::vector::reserve and then std::vector::push_back. std::vector::reserve will allocate memory for the std::vector but it will not initialize the memory, or set the internal counter etc. std::vector::push_back will insert the data and update the vector's size. Optionally, use std::vector::insert that takes two iterators, to avoid looping and pushing back every element individually.

std::vector<double> vec;
vec.reserve(std::size(data)); // Allocate all data in one call.
vec.insert(std::begin(vec), std::begin(data), std::end(data)); // Insert the data elements.

Alternatively you can use std::vector's ctor overload that takes two iterators:

std::vector<double> vec{std::begin(data), std::end(data)};

This will also allocate all data in a single call, and then add the elements.

Update

If you know the data size in advance, you could simply use std::array, e.g.:

constexpr const std::size_t N = 10'000;
std::array<double, N> arr;

arr[5432] = 2.5; // Perfectly valid.
// Or e.g. for CUDA.
cudaMemcpy(std::data(arr), gpu_arr, std::size(arr) * sizeof(double), cudaMemcpyDeviceToHost); // The count is in bytes.

All data will be allocated at once, and no value initialization will be performed (the elements are default-initialized, which for fundamental types means nothing is done and the values are indeterminate).

std::array has all the advantages of C++ collections, such as std::size, std::begin, std::end, std::data etc.

Felix Glas
  • They are good solutions for copying data from RAM to RAM. But my problem is copying data from a video card to RAM. So to maximize performance, avoiding the initialization is the best solution! – Giggi Mar 10 '18 at 13:27
  • @Giggi Well, without knowledge of how your GPU data is represented it's hard to reason about how to effectively copy it. If you want to use higher level C++ abstractions then take a look at _e.g._ [Thrust library](http://docs.nvidia.com/cuda/thrust/index.html) (for CUDA). – Felix Glas Mar 10 '18 at 13:57
  • I know the Thrust library well. But in my specific case I have to pass the data to code that works with the C++ containers (that's all). – Giggi Mar 10 '18 at 14:07
1

If you are working with plain old data (no pointers or references, just integers and floats), it may be best to just use a plain old array. Combine that with correct use of memcpy(), and you are pretty much guaranteed to get much better performance than any native C++ implementation.

The point is, that C++ cannot really handle swaths of data as swaths of data. It has to handle individual objects of unknown type. It does not know whether these objects may be copied by copying their bits, it must call the adequate default, copy, or move constructors, (move) assignment operators, and destructor for each individual element. While good C++ compilers are able to remove much of the resulting garbage code, the result generally cannot compete with the carefully hand-optimized implementations of memcpy() that can just copy in chunks of 16 or more bytes, blissfully ignorant of whether these are actually eight shorts, two doubles, or 1.33 instances of struct { float x,y,z; }.
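
As a minimal sketch of the plain-array approach described above: new T[n] default-initializes its elements, which for fundamental types means no zeroing pass, and std::memcpy then overwrites the whole block. The function name copy_to_uninitialized is hypothetical, and the static_assert restricts it to types that may legally be copied bit by bit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <type_traits>

// Allocates an uninitialized array of n elements and fills it from src with
// one memcpy. std::unique_ptr<T[]> releases the memory with delete[].
template <typename T>
std::unique_ptr<T[]> copy_to_uninitialized(const T* src, std::size_t n) {
  static_assert(std::is_trivially_copyable<T>::value,
                "memcpy is only valid for trivially copyable types");
  std::unique_ptr<T[]> dst(new T[n]);  // default-init: no zeroing for double, int, ...
  if (n != 0) {
    std::memcpy(dst.get(), src, n * sizeof(T));  // size is given in bytes
  }
  return dst;
}
```

For the GPU case in the question, the memcpy call would be replaced by the device-to-host copy into dst.get(); the allocation part stays the same.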

cmaster - reinstate monica
  • From a performance point of view you are right, and from a programming-discipline point of view the interface of the standard library is correct. But you must also consider that std::vector has a very good interface and a complete library around it (iterators, algorithms ...), so using std::vector or std::valarray is a good idea for developing a good and scalable application. – Giggi Mar 12 '18 at 10:45
  • True. In the end, it all comes down to how much you value performance. Either, you are willing to pay execution time for easier programming, or you are willing to pay programming effort for faster execution. Both options are before you, what will you choose? – cmaster - reinstate monica Mar 12 '18 at 10:50