3

My application consists of calling dozens of functions millions of times. In each of those functions, one or a few temporary std::vector containers of POD (plain old data) types are initialized, used, and then destructed. By profiling my code, I find the allocations and deallocations lead to a huge overhead.

A lazy solution is to rewrite all the functions as functors containing those temporary buffer containers as class members. However this would blow up the memory consumption as the functions are many and the buffer sizes are not trivial.

A better way is to analyze the code, gather all the buffers, premeditate how to maximally reuse them, and feed a minimal set of shared buffer containers to the functions as arguments. But this can be too much work.

I want to solve this problem once for all my future development during which temporary POD buffers become necessary, without having to have much premeditation. My idea is to implement a container port, and take the reference to it as an argument for every function that may need temporary buffers. Inside those functions, one should be able to fetch containers of any POD type from the port, and the port should also auto-recall the containers before the functions return.

// Port of vectors of POD types.
struct PODvectorPort 
{
  std::size_t Nlent; // Number of dispatched containers.
  std::vector<std::vector<std::size_t> > X; // Container pool.
  PODvectorPort() { Nlent = 0; }
};


// Functor that manages the port.
struct PODvectorPortOffice
{
  std::size_t initialNlent; // Number of already-dispatched containers
  // when the office is set up.
  PODvectorPort *p; // Pointer to the port.
  
  
  PODvectorPortOffice(PODvectorPort &port) 
  { 
    p = &port; 
    initialNlent = p->Nlent;
  }
  
  
  template<typename X, typename Y>
  std::vector<X> & repaint(std::vector<Y> &y) // Repaint the container.
  {
    // return *((std::vector<X>*)(&y)); // UB although works
    std::vector<X> *rst = nullptr;
    std::memcpy(&rst, &y, std::min(
      sizeof(std::vector<X>*), sizeof(std::vector<Y>*)));
    return *rst; // guess it makes no difference. Should still be UB.
  }
  
  
  template<typename T>
  std::vector<T> & lend()
  {
    ++p->Nlent;
    // Ensure sufficient container pool size:
    while (p->X.size() < p->Nlent) p->X.push_back( std::vector<size_t>(0) );
    return repaint<T, std::size_t>( p->X[p->Nlent - 1] );
  }

  
  void recall() { p->Nlent = initialNlent; }
  ~PODvectorPortOffice() { recall(); }
};


struct ArbitraryPODstruct
{
  char a[11]; short b[7]; int c[5]; float d[3]; double e[2];
};


// Example f1():
// f2(), f3(), ..., f50() are similarly defined.
// All functions are called a few million times in certain 
// order in main(). 
// port is defined in main().
void f1(other arguments..., PODvectorPort &port)
{
  
  PODvectorPort portOffice(port);
  
  // Oh, I need a buffer of chars:
  std::vector<char> &tmpchar = portOffice.lend();
  tmpchar.resize(789); // Trivial if container already has sufficient capacity.
  
  // ... do things
  
  // Oh, I need a buffer of shorts:
  std::vector<short> &tmpshort = portOffice.lend();
  tmpshort.resize(456);  // Trivial if container already has sufficient capacity.
  
  // ... do things.
  
  // Oh, I need a buffer of ArbitraryPODstruct:
  std::vector<ArbitraryPODstruct> &tmpArb = portOffice.lend();
  tmpArb.resize(123);  // Trivial if container already has sufficient capacity.
  
  // ... do things.
  
  // Oh, I need a buffer of integers, but also tmpArb is no longer 
  // needed. Why waste it? Cache hot.
  std::vector<int> &tmpint = portOffice.repaint(tmpArb);
  tmpint.resize(300); // Trivial.
  
  // ... do things.
}

Although the code is compliable by both gcc-8.3 and MSVS 2019 with -O2 to -Ofast, and passes extensive tests for all options, I expect criticism due to the hacky nature of PODvectorPortOffice::repaint(), which "casts" the vector type in-place.

A set of sufficient but not necessary conditions for the correctness and efficiency of the above code are:

  1. std::vector<T> stores 3 pointers to the underlying buffer's &[0], &[0] + .size(), &[0] + .capacity().
  2. std::vector<T>'s allocator calls malloc().
  3. malloc() returns an 8-byte (or sizeof(std::size_t)) aligned address.

So, if this is unacceptable to you, what would be the modern, proper way of addressing my need? Is there a way of writing a manager that achieve what my code does only without violating the Standard?

Thanks!

Edits: A little more context of my problem. Those functions mainly compute some simple statistics of the inputs. The inputs are data streams of financial parameters of different types and sizes. To compute the statistics, those data need to be altered and re-arranged first, thus the buffers for temporary copies. Computing the statistics is cheap, thus the allocations and deallocations can become expensive, relatively. Why do I want a manger for arbitrary POD type? Because 2 weeks from now I may start receiving a data stream of a different type, which can be a bunch of primitive types zipped in a struct, or a struct of the composite types encountered so far. I, of course, would like the upper stream to just send separate flows of primitive types, but I have no control of that aspect.

More edits: after tons of reading and code experimenting regarding the strict aliasing rule, the answer should be, don't try everything I put up there --- it works, for now, but don't do it. Instead, I'll be diligent and stick to my previous code-as-you-go style, just add a vector<vector<myNewType> > into the port once a new type comes up, and manage it in a similar way. The accepted answer also offers a nice alternative.

Even more edits: conceived a stronger class that has better chance to thwart potential optimizations under the strict aliasing rule. DO NOT USE IT WITHOUT TESTING AND THOROUGH UNDERSTANDING OF THE STRICT ALIASING RULE.

// -std=c++17
#include <cstring>
#include <cstddef>
#include <iostream>
#include <vector>
#include <chrono>


// POD: plain old data.
// Idea: design a class that can let you maximally reuse temporary 
// containers during a program.
// Port of vectors of POD types.
template <std::size_t portsize = 42>
class PODvectorPort 
{
  static constexpr std::size_t Xsize = portsize;
  std::size_t signature;
  std::size_t Nlent; // Number of dispatched containers.
  std::vector<std::size_t> X[portsize]; // Container pool.
  PODvectorPort(const PODvectorPort &);
  PODvectorPort & operator=( const PODvectorPort& );
  
  
public:
  std::size_t Ndispatched() { return Nlent; }
  std::size_t showSignature() { return signature; }
  PODvectorPort() // Permuted random number generator.
  { 
    std::size_t state = std::chrono::high_resolution_clock::now().time_since_epoch().count();
    state ^= (uint64_t)(&std::memmove);
    signature = ((state >> 18) ^ state) >> 27;
    std::size_t rot = state >> 59;
    signature = (signature >> rot) | (state << ((-rot) & 31));
    Nlent = 0; 
  }
  
  
  template<typename podvecport>
  friend class PODvectorPortOffice;
};


// Functor that manages the port.
template<typename podvecport>
class PODvectorPortOffice
{
  // Number of already-dispatched containers when the office is set up.
  std::size_t initialNlent; 
  podvecport *p; // Pointer to the port.
  
  
  PODvectorPortOffice( const PODvectorPortOffice& ); // non construction-copyable
  PODvectorPortOffice& operator=( const PODvectorPortOffice& ); // non copyable
  
  
  constexpr void check()
  {
    while (__cplusplus < 201703)
    {
      std::cerr << "PODvectorPortOffice: C++ < 17, Stall." << std::endl;
    }
    
    
    // Check if allocation will be 8-byte (or more) aligned.
    // Intend it not to work on machine < 64-bit.
    constexpr std::size_t aln = alignof(std::max_align_t);
    while (aln < 8)
    {
      std::cerr << "PODvectorPortOffice: Allocation is not at least 8-byte aligned, Stall." <<
        std::endl;
    }
    
    
    while ((aln & (aln - 1)) != 0)
    {
      std::cerr << "PODvectorPortOffice: Alignment is not a power of 2 bytes. Stall." << std::endl;
    }
    
    
    // Random checks to see if sizeof(vector<S>) != sizeof(vector<T>).
    if(true)
    {
      std::size_t vecHeadSize[16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
      vecHeadSize[0] = sizeof(std::vector<char>(0));
      vecHeadSize[1] = sizeof(std::vector<short>(1));
      vecHeadSize[2] = sizeof(std::vector<int>(2));
      vecHeadSize[3] = sizeof(std::vector<long>(3));
      vecHeadSize[4] = sizeof(std::vector<std::size_t>(5));
      vecHeadSize[5] = sizeof(std::vector<float>(7));
      vecHeadSize[6] = sizeof(std::vector<double>(11));
      vecHeadSize[7] = sizeof(std::vector<std::vector<char> >(13));
      vecHeadSize[8] = sizeof(std::vector<std::vector<int> >(17));
      vecHeadSize[9] = sizeof(std::vector<std::vector<double> >(19));
      struct tmpclass1 { char a; short b; };
      struct tmpclass2 { char a; float b; };
      struct tmpclass3 { char a; double b; };
      struct tmpclass4 { int a; char b; };
      struct tmpclass5 { double a; char b; };
      struct tmpclass6 { double a[5]; char b[3]; short c[3]; };
      vecHeadSize[10] = sizeof(std::vector<tmpclass1>(23));
      vecHeadSize[11] = sizeof(std::vector<tmpclass2>(29));
      vecHeadSize[12] = sizeof(std::vector<tmpclass3>(31));
      vecHeadSize[13] = sizeof(std::vector<tmpclass4>(37));
      vecHeadSize[14] = sizeof(std::vector<tmpclass4>(41));
      vecHeadSize[15] = sizeof(std::vector<tmpclass4>(43));
      
      
      std::size_t notSame = 0;
      for(int i = 0; i < 16; ++i) 
        notSame += vecHeadSize[i] != sizeof(std::size_t) * 3;
      while (notSame)
      {
        std::cerr << "sizeof(std::vector<S>) != sizeof(std::vector<T>), \
PODvectorPortOffice cannot handle. Stall." << std::endl;
      }
    }
  }
  
  
  void recall() { p->Nlent = initialNlent; }
  
  
public:  
  PODvectorPortOffice(podvecport &port) 
  { 
    check();
    p = &port; 
    initialNlent = p->Nlent;
  }
  
  
  template<typename X, typename Y>
  std::vector<X> & repaint(std::vector<Y> &y) // Repaint the container.
    // AFTER A VECTOR IS REPAINTED, DO NOT USE THE OLD VECTOR AGAIN !!
  {
    while (std::is_same<bool, X>::value)
    {
      std::cerr << "PODvectorPortOffice: Cannot repaint the vector to \
std::vector<bool>. Stall." << std::endl;
    }
    std::vector<X> *x;
    std::vector<Y> *yp = &y;
    std::memcpy(&x, &yp, sizeof(x));
    return *x; // Not compliant with strict aliasing rule.
  }
  
  
  template<typename T>
  std::vector<T> & lend()
  {
    while (p->Nlent >= p->Xsize)
    {
      std::cerr << "PODvectorPortOffice: No more containers. Stall." << std::endl;
    }
    ++p->Nlent;
    return repaint<T, std::size_t>( p->X[p->Nlent - 1] );
  }
  
  
  ~PODvectorPortOffice() 
  { 
    // Because p->signature can only be known at runtime, an aggressive,
    // compliant compiler (ACC) will never remove this 
    // branch. Volatile might do, but trustworthiness?
    if(p->signature == 0)
    {
      constexpr std::size_t sizeofvec = sizeof(std::vector<std::size_t>);
      char dummy[sizeofvec * p->Xsize];    
      std::memcpy(dummy, p->X, p->Nlent * sizeofvec);
      std::size_t ticketNum = 0;
      char *xp = (char*)(p->X);
      for(int i = 0, iend = p->Nlent * sizeofvec; i < iend; ++i)
      {
        xp[i] &= xp[iend - i - 1] * 5;
        ticketNum += xp[i] ^ ticketNum;
      }
      std::cerr << "Congratulations! After the port office was decommissioned, \
you found a winning lottery ticket. The odds is less than 2.33e-10. Your \
ticket number is " << ticketNum << std::endl;
      std::memcpy(p->X, dummy, p->Nlent * sizeofvec);
      // According to the strict aliasing rule, a char* can point to any memory
  // block pointed by another pointer of any type T*. Thus given an ACC, 
  // the writes to that block via the char* must be fully acknowledged in 
  // time by T*, namely, for reading contents from T*, a reload instruction 
  // will be kept in the assembly code to achieve a sort of 
  // "register-cache-memory coherence" (RCMC).
  // We also do not care about the renters' (who received the reference via
  // .lend()) RCMC, because PODvectorPortOffice never accesses the contents
  // of those containers.
    }
    recall(); 
  }
};

Any adversarial test case to break it, especially on GCC>=8.3 or MSVS >= 2019, is welcomed!

user2961927
  • 1,290
  • 1
  • 14
  • 22
  • 4
    `I expect criticism due to the hacky nature` And your instinct is right on the money. That cast is a huge violation of [the strict aliasing rule](https://stackoverflow.com/questions/98650/what-is-the-strict-aliasing-rule). The moment your compiler starts optimizing your code, everything's going to go south fast. It's undefined behavior. If you want to do bit-level manipulations like that, you write Assembly code, not C++. – Silvio Mayolo Jan 19 '22 at 04:13
  • @SilvioMayolo Understood. Thanks – user2961927 Jan 19 '22 at 04:27
  • You could write your container port as template to manage recycling per POD type. Uhm... How many different POD types we're talking about? – Scheff's Cat Jan 19 '22 at 07:07
  • @Scheff'sCat Currently 18 distinct types. One of my previous solutions is to create a struct as the pool of containers of different types in a header file, and adopt a code-as-you-go style --- keep adding a container type in the pool if that type has never occurred. Are you suggesting a similar solution? If so, I want to be even lazier than that.. I guess the dream dies hard. Sometimes it is a little frustrating to know exactly when aliasing would cause problem and how to avoid it, but the standard still enforces you not to consider it at all. – user2961927 Jan 19 '22 at 16:20
  • Interesting question, are you forced to use `std::vector` for some reason, or could you feasibly swap out the container type if needed? Does your algorithm actually require the ability to dynamically resize, arbitrary push/pop/insert/etc. for each vector, or is the use case constrained in some way? – Jon Reeves Jan 19 '22 at 22:34
  • @JonReeves I am mostly forced to use std::vector. The functions do call member functions including .resize(), .push_back(), .pop() but not .insert() or .erase() etc. My intention is a little more ambitious --- to make the programming as convenient as declaring vectors inside the functions, but without the large overhead they cause.. But after reading much about the aliasing rule, it doesn't seem likely -( – user2961927 Jan 20 '22 at 06:28
  • Hmm... I once used a private buffer pool to save many allocation/deallocation in a C program. Unsure whether it is really an option here, but you could try to use a custom allocator for your vectors. – Serge Ballesta Jan 20 '22 at 06:53
  • Is there an upper bound of size that you could easily `reserve`? – Caleth Jan 20 '22 at 10:47
  • @SergeBallesta Interesting. How did you handle the aliasing issue? Or did you allocate different buffers for different types? – user2961927 Jan 20 '22 at 14:11
  • @Caleth It's hard to know, but even I can estimate an upper size to `reserve`, would that make a big difference? – user2961927 Jan 20 '22 at 14:12
  • It isn't clear if you problem is you have lots of reallocation going on, i.e. most vector change capacity multiple times, or that you have lots of vectors that get set to some initial capacity and hang around longer than you actually need them. In the former case I'd look at `reserve`ing more, in the latter case I'd find a narrower scope for these vectors. – Caleth Jan 20 '22 at 14:24
  • @user2961927: I must admit that I only used one single type... – Serge Ballesta Jan 20 '22 at 14:24
  • @Caleth The problem, as stated in the post, is close to the former in the legacy code --- buffers being allocated and deallocated too many unnecessary times. My question is that can we create something similar to the code in my post to manage the buffers only without violating the Standard – user2961927 Jan 20 '22 at 14:44
  • @user2961927 I think Serge's question about custom allocators is going to be most relevant to you. Most allocators will use a `char[]` array as backing store and cast to the template type as needed. This is also true of the default `std::allocator`, and it doesn't violate strict aliasing to do so. I probably need more space to write something up than a comment box so I'll attempt an answer sometime today and maybe it will help. – Jon Reeves Jan 20 '22 at 15:30
  • @JonReeves Looking forward to it! If there's some techniques for me to reuse a block of memory for arbitrary POD types without violating any rule, that would be exactly what I want:) – user2961927 Jan 20 '22 at 15:42
  • @SilvioMayolo When you mentioned asm, did you mean in general bit manipulation is better done in asm, or you implied a solution using asm for complying/circumventing the strict aliasing rule in type punning particularly. If the latter, could you give an example or link me to some resource? Thanks! – user2961927 Jan 20 '22 at 16:56

1 Answers1

2

Let me frame this by saying I don't think there's an "authoritative" answer to this question. That said, you've provided enough constraints that a suggested path is at least worthwhile. Let's review the requirements:

  • Solution must use std::vector. This is in my opinion the most unfortunate requirement for reasons I won't get into here.
  • Solution must be standards compliant and not resort to rule violations, like the strict aliasing rule.
  • Solution must either reduce the number of allocations performed, or reduce the overhead of allocations to the point of being negligible.

In my opinion this is definitely a job for a custom allocator. There are a couple of off-the-shelf options that come close to doing what you want, for example the Boost Pool Allocators. The one you're most interested in is boost::pool_allocator. This allocator will create a singleton "pool" for each distinct object size (note: not object type), which grows as needed, but never shrinks until you explicitly purge it.

The main difference between this and your solution is that you'll have distinct pools of memory for objects of different sizes, which means it will use more memory than your posted solution, but in my opinion this is a reasonable trade-off. To be maximally efficient, you could simply start a batch of operations by creating vectors of each needed type with an appropriate size. All subsequent vector operations which use these allocators will do trivial O(1) allocations and deallocations. Roughly in pseudo-code:

// be careful with this, probably want [[nodiscard]], this is code
// is just rough guidance:
void force_pool_sizes(void)
{
    std::vector<int, boost::pool_allocator<int>> size_int_vect;
    std::vector<SomePodSize16, boost::pool_allocator<SomePodSize16>> size_16_vect;
    ...
    
    size_int_vect.resize(100); // probably makes malloc calls
    size_16_vect.resize(200);  // probably makes malloc calls
    ...

    // on return, objects go out of scope, but singleton pools
    // with allocated blocks of memory remain for future use
    // until explicitly purged.
}

void expensive_long_running(void)
{
    force_pool_sizes();

    std::vector<int, boost::pool_allocator<int>> data1;
    ... do stuff, malloc/free will never be called...

    std::vector<SomePodSize16, boost::pool_allocator<SomePodSize16>> data2;
    ... do stuff, malloc/free will never be called...

    // free everything:
    boost::singleton_pool<boost::pool_allocator_tag, sizeof(int)>::release_memory();
}

If you want to take this a step further on being memory efficient, if you know for a fact that certain pool sizes are mutually exclusive, you could modify the boost pool_allocator to use a slightly different singleton backing store which allows you to move a memory block from one block size to another. This is probably out of scope for now, but the boost code itself is straightforward enough, if memory efficiency is critical, it's probably worthwhile.

It's worth pointing out that there's probably some confusion about the strict aliasing rule, especially when it comes to implementing your own memory allocators. There are lots and lots of SO questions about strict aliasing and what it does and doesn't mean. This one is a good place to start.

The key takeaway is that it's perfectly ordinary and acceptable in low level C++ code to take an array of memory and cast it to some object type. If this were not the case, std::allocator wouldn't exist. You also wouldn't have much use for things like std::aligned_storage. Look at the example use case for std::aligned_storage on cppreference. An STL-like static_vector class is created which keeps an array of aligned_storage objects that get recast to a concrete type. Nothing about this is "unacceptable" or "illegal", but it does require some additional knowledge and care in handling.

The reason your solution is especially going to enrage the code lawyers is that you're taking pointers of one non-char object type and casting them to different non-char object types. This is a particularly offensive violation of the strict aliasing rule, but also not really necessary given some of your other options.

Also keep in mind that it's not an error to alias memory, it's a warning. I'm not saying go crazy with aliasing, but I am saying that as with all things C and C++, there are justifiable cases to break rules, when you have very thorough knowledge and understanding of both your compiler and the machine you're running on. Just be prepared for some very long and painful debug sessions if it turns out you didn't in fact know those two things as well as you thought you did.

Jon Reeves
  • 2,426
  • 3
  • 14