I have a std::vector<std::uint8_t> that needs to be duplicated. This is done simply by calling the copy constructor.

My profiling results show that the Microsoft Visual C++ (msvc100) implementation uses std::uninitialized_copy internally, which copies the elements one by one. In this case, a more optimised copy could be made by copying entire blocks of memory at once (as memcpy may do).

In other words, this could be a significant optimization. Is there a way to force the vector to use such an optimised method?

Note: I have tried using std::basic_string<std::uint8_t>, and it does perform better, but it has other issues.

Ruud

2 Answers

This answer is not specific to msvc100.

If you use the copy constructor like in

std::vector<uint8_t> newVect(otherVect);

the allocator object of otherVect has to be copied (and used) as well, which makes it harder for the STL implementation to keep the copy fast.

If you just want to copy the contents of otherVect, use

std::vector<uint8_t> newVect(otherVect.begin(), otherVect.end());

which uses the default allocator for newVect.

Another possibility is

std::vector<uint8_t> newVect;
newVect.assign(otherVect.begin(), otherVect.end());

All of them (including the copy constructor when otherVect uses the default allocator) should boil down to a memmove/memcpy in a good STL implementation in this case. Take care that otherVect has exactly the same element type as newVect (not, e.g., 'char' or 'int8_t').
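To illustrate the element-type caveat, here is a minimal sketch. The function name `convert_copy` is hypothetical and not part of any library. It only shows that a range constructor with a mismatched source type still compiles and has to convert every element, so a library cannot treat it as a plain block copy.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>
#include <vector>

// int8_t and uint8_t are distinct types, so copying between them is a
// conversion, not a byte-for-byte duplicate of the same element type.
static_assert(!std::is_same<std::int8_t, std::uint8_t>::value,
              "source and destination element types differ");

// Hypothetical helper: builds a vector<uint8_t> from a vector<int8_t>.
// Each element goes through an int8_t -> uint8_t conversion.
std::vector<std::uint8_t> convert_copy(const std::vector<std::int8_t>& src)
{
  std::vector<std::uint8_t> out(src.begin(), src.end());
  assert(out.size() == src.size());
  return out;
}
```

With matching element types on both sides, the same range constructor is at least eligible for the memmove/memcpy path; whether it is actually taken depends on the library.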

Using the container's method is generally more performant than using generic algorithms, so a combination of vector::resize() and std::copy() or even memmove()/memcpy() would be a work-around, if the vendor didn't optimize the container sufficiently.
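The work-around could look roughly like the sketch below. The helper names `copy_with_std_copy` and `copy_with_memcpy` are made up for illustration, and the memcpy variant assumes a trivially copyable element type such as uint8_t.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: resize the destination, then fill it with std::copy.
void copy_with_std_copy(const std::vector<std::uint8_t>& src,
                        std::vector<std::uint8_t>& dst)
{
  assert(&src != &dst);  // source and destination must be different vectors
  dst.resize(src.size());
  std::copy(src.begin(), src.end(), dst.begin());
}

// Hypothetical helper: same idea, but with a raw block copy.
void copy_with_memcpy(const std::vector<std::uint8_t>& src,
                      std::vector<std::uint8_t>& dst)
{
  assert(&src != &dst);  // memcpy requires non-overlapping buffers
  dst.resize(src.size());
  // &v[0] on an empty vector is invalid, so skip the call in that case.
  if (!src.empty()) std::memcpy(&dst[0], &src[0], src.size());
}
```

Note that resize() value-initialises any newly added elements before they are overwritten, so both variants write that part of the destination twice.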

Jacob
  • `memmove`?! I guess you mean `memcpy`. I would hate for a copy of a vector (which is not an rvalue reference) to cause the loss of the initial data. – MvG Apr 12 '13 at 08:37
  • Why do you think memmove would cause the loss of the initial data? – jcoder Apr 12 '13 at 08:41
  • @jcoder: I thought that there were no guarantees about the original data being preserved. I also thought that memmove might move page-sized blocks by manipulating the address translation tables. But the man page does speak of a copy, so it seems I was wrong. Still, [`memmove`](http://sourceware.org/git/?p=glibc.git;a=blob;f=string/memmove.c;h=9dcd2f1f680b8b166af65b1a954f19a480758257;hb=HEAD) has to ensure operation in the right direction, which `memcpy` does not, so the latter should be faster. – MvG Apr 12 '13 at 08:50
  • Yes, the only real difference is that memmove has more predictable characteristics when the from and to ranges overlap. Now that I think about it, the name is rather misleading :) – jcoder Apr 12 '13 at 08:55
  • When copying to uninitialised data areas, a memcpy() can be used as well, but when there is the slightest chance that source and destination areas overlap, memmove() must be used instead. – Jacob Apr 12 '13 at 08:56
  • I originally used, `std::vector newVect(otherVect.begin(), otherVect.end());`, but it is as fast as `std::vector newVect(otherVect);`. `std::copy` is significantly faster (about 40% on my machine). – Ruud Apr 12 '13 at 16:23
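The overlap point raised in the comments can be sketched as follows. The helper `shift_right` is hypothetical. It moves data within a single buffer, which is the case where memmove is needed; when copying into a freshly allocated vector the ranges never overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: shifts the contents of buf right by `shift` bytes
// inside the same buffer. The first `shift` bytes keep their old values.
void shift_right(std::vector<std::uint8_t>& buf, std::size_t shift)
{
  assert(shift <= buf.size());
  if (shift == 0 || shift == buf.size()) return;  // nothing to move
  // Source [0, size - shift) and destination [shift, size) may overlap,
  // so memmove is used; memcpy has undefined behaviour on overlap.
  std::memmove(&buf[shift], &buf[0], buf.size() - shift);
}
```

The source bytes are left in place except where the destination range overwrites them, which is why the original vector survives a memmove-based copy into a separate buffer.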

Based on the suggested solutions, I decided to put together a small benchmark.

#include <cstdint>
#include <cstring>
#include <ctime>
#include <iostream>
#include <random>
#include <vector>

using namespace std;

int main()
{
  random_device seed;
  mt19937 rnd(seed());
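  // uniform_int_distribution is not specified for 8-bit types,
  // so draw ints in [0, 255] and narrow them explicitly
  uniform_int_distribution<int> random_byte(0x00, 0xff);

  const size_t n = 512 * 512;

  vector<uint8_t> source;
  source.reserve(n);
  for (size_t i = 0; i < n; i++) source.push_back(static_cast<uint8_t>(random_byte(rnd)));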

  clock_t start;
  clock_t t_constructor1 = 0; uint8_t c_constructor1 = 0;
  clock_t t_constructor2 = 0; uint8_t c_constructor2 = 0;
  clock_t t_assign = 0;       uint8_t c_assign = 0;
  clock_t t_copy = 0;         uint8_t c_copy = 0;
  clock_t t_memcpy = 0;       uint8_t c_memcpy = 0;

  for (size_t k = 0; k < 4; k++)
  {
    start = clock();
    for (size_t i = 0; i < n/32; i++)
    {
      vector<uint8_t> destination(source);
      c_constructor1 += destination[i];
    }
    t_constructor1 += clock() - start;

    start = clock();
    for (size_t i = 0; i < n/32; i++)
    {
      vector<uint8_t> destination(source.begin(), source.end());
      c_constructor2 += destination[i];
    }
    t_constructor2 += clock() - start;

    start = clock();
    for (size_t i = 0; i < n/32; i++)
    {
      vector<uint8_t> destination;
      destination.assign(source.begin(), source.end());
      c_assign += destination[i];
    }
    t_assign += clock() - start;

    start = clock();
    for (size_t i = 0; i < n/32; i++)
    {
      vector<uint8_t> destination(source.size());
      copy(source.begin(), source.end(), destination.begin());
      c_copy += destination[i];
    }
    t_copy += clock() - start;

    start = clock();
    for (size_t i = 0; i < n/32; i++)
    {
      vector<uint8_t> destination(source.size());
      memcpy(&destination[0], &source[0], n);
      c_memcpy += destination[i];
    }
    t_memcpy += clock() - start;
  }

  // Verify that all copies are correct, but also prevent the compiler
  // from optimising away the loops
  uint8_t diff = (c_constructor1 - c_constructor2) +
                 (c_assign - c_copy) +
                 (c_memcpy - c_constructor1);

  if (diff != 0) cout << "one of the methods produces invalid copies" << endl;

  cout << "constructor (1): "    << t_constructor1 << endl;
  cout << "constructor (2): "    << t_constructor2 << endl;
  cout << "assign:          "    << t_assign << endl;
  cout << "copy             "    << t_copy << endl;
  cout << "memcpy           "    << t_memcpy << endl;

  return 0;
}

On my PC, compiled for x64 with msvc100, fully optimised, this produces the following output:

constructor (1): 22388
constructor (2): 22333
assign:          22381
copy             2142
memcpy           2146

The results are quite clear: std::copy performs as well as std::memcpy, whereas both constructors and assign are an order of magnitude slower. Of course the exact numbers and ratios depend on the vector size, but the conclusion for msvc100 is obvious: as suggested by Rapptz, use std::copy.

Edit: the conclusion is not obvious for other compilers. I tested on 64-bit Linux as well, with the following outcome for Clang 3.2:

constructor (1): 530000
constructor (2): 560000
assign:          560000
copy             840000
memcpy           860000

GCC 4.8 gives similar output. For GCC on Windows, memcpy and copy were slightly slower than the constructors and assign, although the difference was smaller. However, my experience is that GCC does not optimise very well on Windows. I tested msvc110 as well, and the results were similar to msvc100.
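The numbers above are raw clock() ticks, whose unit depends on the platform, so the absolute values should not be compared between the Windows and Linux runs. A small sketch of how they could be normalised is shown below; the helper names `ticks_to_ms` and `speedup` are hypothetical.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical helper: converts a clock() tick count to milliseconds
// using the platform's CLOCKS_PER_SEC.
double ticks_to_ms(std::clock_t ticks)
{
  return 1000.0 * static_cast<double>(ticks) / CLOCKS_PER_SEC;
}

// Hypothetical helper: ratio of two timings from the same run.
// The ratio is unit-independent, so it can be compared across platforms.
double speedup(std::clock_t slow, std::clock_t fast)
{
  assert(fast > 0);
  return static_cast<double>(slow) / static_cast<double>(fast);
}
```

For the msvc100 run, speedup(22388, 2142) comes out at roughly 10.5, which is the "order of magnitude" mentioned above.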

Ruud
  • I measured with gcc 4.6.3 under Linux/64bit and got constructor (1): 530000, constructor (2): 530000, assign: 550000, copy 830000, memcpy 840000 (don't mind the bigger values, CLOCKS_PER_SEC is probably different). So it's completely the other way around. If your code is _not meant to be portable_ , using copy is surely a fine workaround. – Jacob Apr 13 '13 at 23:56
  • Amazing! I checked this with VS2012Express and it's essentially the same. Somehow I'd call that an implementation bug. – Martin Ba Oct 25 '13 at 20:56
  • Right, it's not optimized! A long time ago, I watched the following video at 1:02:50 and believed that all was well optimized... https://channel9.msdn.com/Series/C9-Lectures-Stephan-T-Lavavej-Standard-Template-Library-STL-/C9-Lectures-Introduction-to-STL-with-Stephan-T-Lavavej – Daniel Laügt Mar 14 '19 at 12:59