5

Consider the following C++ program:

#include <deque>
#include <iostream>

using namespace std;

int main()
{
    deque<double> d(30000000);
    cout << "Done\n";
}

The memory allocation in the first line only takes a second, but after it prints Done, it takes 33 seconds (!) to exit back to the terminal. Decreasing the number of elements to 20000000 reduces that time to 22 seconds, so clearly it's linear in the number of elements.

I am compiling on Windows 10, and the same thing happened with both GCC 10.2.0 and Visual Studio 2019.

What's going on here? Am I using deque in a way it's not supposed to be used?

EDIT:

#include <deque>
#include <iostream>

using namespace std;

void test_deque()
{
    deque<double> d(30000000);
    cout << "Function done\n";
}

int main()
{
    test_deque();
    cout << "Main done\n";
}

Now it prints Function done and then there is the 33 second delay. So I assume this has to do with the destructor that gets executed when the function exits. But why does it take so long to destruct 240 MB of memory?

EDIT 2: Tried it (the second version) with GCC on Ubuntu and it only takes a fraction of a second to run! Same with some online C++ compilers. Is this a problem specific to Windows?

EDIT 3: With vector it also takes a fraction of a second to run. However, with list (and forward_list) I get a similar extremely long delay.

EDIT 4: Compiling with MSVC in Release (rather than Debug) configuration also takes a fraction of a second. I'm not sure what the GCC equivalent it, but with -O3 (max optimizations) the execution time remains 33 seconds.

rustyx
  • 80,671
  • 25
  • 200
  • 267
  • 1
    You could investigate this further by wrapping it in a second function. That way you can see the difference in how long it takes the function to return vs. the program to exit. – Brent Bradburn Nov 09 '20 at 02:33
  • 3
    Cannot reproduce on Linux with gcc 10.2.1. The whole thing takes a fraction of a second to run. – Sam Varshavchik Nov 09 '20 at 02:39
  • 1
    I also cannot reproduce with Apple clang 11.0.3 – WBuck Nov 09 '20 at 02:43
  • 1
    Another thing to try: Other container types -- notably, `std::vector`, which is guaranteed to be a single contiguous memory allocation (easy to free). – Brent Bradburn Nov 09 '20 at 02:43
  • 1
    What is the memory of your machine and what is the usage rate? – lilucpp Nov 09 '20 at 02:45
  • 1
    I also tried this, it ran and existed under a second on windows – hookenz Nov 09 '20 at 02:46
  • 2
    Tried this on MSVC in Windows 10, the version that wraps it in a function. It took its sweet time even before printing "function done", and then a long lapse before printing "done". Switched from Debug x64 to Release x64 and it goes very fast. Something with optimization levels on Windows? – Nathan Pierson Nov 09 '20 at 02:47
  • 1
    Good point Nathan. Maybe debug version with preallocation on windows and filling the memory with fixed values might take a while. – hookenz Nov 09 '20 at 02:48
  • @lilucpp: I have 32 GB (currently around 11 GB in use) and the program only seems to use 305 MB. –  Nov 09 '20 at 02:49
  • 1
    You could also try initializing the array. Allocating, initializing, or de-allocating 240MB shouldn't take that long on a healthy modern machine -- but it's useful to pinpoint where the time is taken. – Brent Bradburn Nov 09 '20 at 02:49
  • @NathanPierson: I got the same as you on MSVC (see edit 4). But max optimizations on GCC did not change anything... This is very weird... –  Nov 09 '20 at 02:55
  • @BrentBradburn: This is definitely not normal behavior, initializing a `vector` or even a C-style array of the same size takes a fraction of a second. I think this may be some kind of bug. –  Nov 09 '20 at 02:56
  • 1
    Some real spice: When I compile this with `g++ -o cpp_scratch CppScratchwork.cpp` using g++ 9.3.0 on Ubuntu 20.04 WSL v1 on this same windows 10 machine... It runs almost instantly. ETA even compiling with `-O0` set it takes a fraction of a second in WSL. Looking increasingly like some weird Win10 bug. – Nathan Pierson Nov 09 '20 at 02:59
  • I think it's a case of MSVC filling the allocated memory with a fixed value like 0xCD. It's used to check for memory buffer overruns etc. That might why the release version isn't as slow. Although I have to say, it's slower than under G++ on windows. Probably a poor implementation of deque – hookenz Nov 09 '20 at 03:02
  • Debug mode on msvc will be slow because of the heap corruption testing. – drescherjm Nov 09 '20 at 03:03
  • `std::list` and `std::deque` both have linear complexity destructors, so I'm not sure what is surprising here. Since the container is not used, it can be optimized out. Either a slow destruction or no destruction are both possible results. – François Andrieux Nov 09 '20 at 03:04
  • OK so using VS 2019 version 16.7.2 (outdated) it took 3.4 seconds to run the application. I compiled in `x64` using `O2` optimization. – WBuck Nov 09 '20 at 03:04
  • Take a look at the deque implementation. Since it's a template you'll be able to see what it's doing. – hookenz Nov 09 '20 at 03:06
  • Since only 305 MB is allocated there is something wrong with the test, try to set every 512 element to a new value and check the memory usage and timings. – Surt Nov 09 '20 at 17:03

4 Answers4

2

Fundamentally the answer isn't very interesting. Your program is a no-op, so a compiler may optimize out the deque construction. But it doesn't have to.

But first, a legal sane implementation may do any of the following:

  1. Do an allocation for 30000000 float elements and nothing else. The allocator might:

    1. Do the allocation in the most lazy way, doing essentially nothing but some bookkeeping.
    2. Eagerly allocate and page in memory, causing 30000000/page size operations.
    3. Zero-initialize or pattern-initialize (e.g. 0xdeadbeef) to help detect uninitialized usage the memory, causing 30000000 writes.
  2. Allocate (include above) and zero-initialize or pattern-initialize the memory.

  3. Run some sort of destructor over all elements (e.g. zeroing out memory).

  4. Not run a destructor on any elements since it's a built-in type.

Now all of the above are possible options. And since your program is a no-op, a legal compiler may optimize out any or none of these steps. Your system allocator might vary in capabilities, supporting lazy allocation, overcommit, automatic zeroing, etc. So the end result is that you could get any kind of behavior depending on your operating system, compiler version, compiler flags, standard library, etc.

orlp
  • 112,504
  • 36
  • 218
  • 315
  • True, but not really helpful for the OP. – Passer By Nov 09 '20 at 03:38
  • 1
    @PasserBy Why not? The OP asked 'why is this happening' (which I did my best to explain - it could be many reasons). OP did not ask 'how do I stop this from happening'. – orlp Nov 09 '20 at 03:40
  • 1
    I think one thing that is missing is that the destructors of both containers mentioned (`std::list` and `std::deque`) have linear complexity desctruction. So it is not surprising that destruction can take a long time. – François Andrieux Nov 09 '20 at 03:55
  • On second thought, you might be right. Initially I would've thought the OP is asking about the other two answers. – Passer By Nov 09 '20 at 06:24
  • Thanks, but... First of all, this example is a no-op because it's a minimal example, but when I add some operation (e.g. a `for` loop that does something with the `deque`) it still runs in less than a second in MSVC Release mode and very slow in any other MSVC or GCC mode that I tried. Second, even with all of the items on your list, it still doesn't make sense that it takes 33 seconds to destruct 240 MB of memory! believe this isn't actually the intended behavior, and may indicate some kind of bug. –  Nov 09 '20 at 13:44
0

It is really slow in debug mode.

MSDN:

Processes that the debugger creates (also known as spawned processes) behave a little differently than processes that the debugger does not create.

Instead of using the standard heap API, processes that the debugger creates use a special debug heap. You can force a spawned process to use the standard heap instead of the debug heap by using the _NO_DEBUG_HEAP environment variable.

lilucpp
  • 196
  • 1
  • 10
  • After setting this environment variable, the program still runs very slowly on MSVC debug mode. Also, this doesn't explain why it's so slow on GCC. –  Nov 09 '20 at 13:49
0

std::deque allocates data in chunks of fixed size varying with platform and type. For double that could be 4KB. So allocating 30,000,000 doubles takes 2.4GB of memory and thus 6,000 allocations/deallocations. With std::list that would be 30,000,000 allocations/deallocations and take a couple GB more memory making it all ridiculously slow.

This might even cause memory fragmentation issues depending on your hardware. And if you run without optimizations it will be even slower.

There is also the privacy issue. Deallocation might be clearing data to ensure that your program didn't leak any information to outside programs.

As mentioned by @orlp as your program is a no-op the whole allocation/deallocation can be optimized out completely which might explain why it speeds up significantly at times.

ALX23z
  • 4,456
  • 1
  • 11
  • 18
  • 1
    When I add some operations on the `deque` (so it doesn't get optimized away) the code still runs in less than a second in MSVC Release mode and very slowly in any other MSVC or GCC mode that I tried. Also, the program only uses at most 305 MB according to Task Manager. Then, during the 33 seconds, you can see that amount being VERY slowly reduced. It shouldn't take this long to destruct this little memory, especially not with 32 GB of DDR4 memory at 3,600 MHz. I believe this isn't actually the intended behavior, and may indicate some kind of bug. –  Nov 09 '20 at 13:57
  • @QuantumField there is no bug. Sometimes OS systems have weird tricks with memory mapping. For instance, memory might properly be assigned only after first access and won't appear on Task Manager. Yeah, it was a calc error there was an extra zero. The issue is with lots of small fragmented allocations which OSs memory's allocators hate. – ALX23z Nov 09 '20 at 16:21
  • the reason I posted this question is that I was writing a program which used a large `deque`, and I noticed it was taking a very long time to close. I am compiling my program with GCC because I want to compile it with the same compiler in both Windows and Linux. GCC optimizations don't seem to change anything. So essentially you're telling me my program will always take 33 seconds to close unless I compile it with MSVC release mode instead of GCC? It just seems like that's a bug in GCC (or at least the Mingw-w64 port) rather than intended behavior. –  Nov 09 '20 at 16:28
  • @QuantumField why do you need a large `deque`? Why not simply use `std::vector`? If you simply need `pop_front` you could use a contiguous deque. Surely you can find a good implementation of such a container online. – ALX23z Nov 09 '20 at 16:38
0

MSVC has a built-in profiler. We can run it (press Alt-F2) to see that the majority of CPU time is spent inside the constructor and destructor, which invoke deque::resize() and deque::_Tidy() functions, respectively.

msvc profiler

If we drill down further we see that deque::emplace_back() is resulting in quite a lot of code

#define _PUSH_BACK_BEGIN                                                                                \
    if ((_Myoff() + _Mysize()) % _DEQUESIZ == 0 && _Mapsize() <= (_Mysize() + _DEQUESIZ) / _DEQUESIZ) { \
        _Growmap(1);                                                                                    \
    }                                                                                                   \
    _Myoff() &= _Mapsize() * _DEQUESIZ - 1;                                                             \
    size_type _Newoff = _Myoff() + _Mysize();                                                           \
    size_type _Block  = _Getblock(_Newoff);                                                             \
    if (_Map()[_Block] == nullptr) {                                                                    \
        _Map()[_Block] = _Getal().allocate(_DEQUESIZ);                                                  \
    }

#define _PUSH_BACK_END ++_Mysize()

    template <class... _Valty>
    decltype(auto) emplace_back(_Valty&&... _Val) {
        _Orphan_all();
        _PUSH_BACK_BEGIN;
        _Alty_traits::construct(
            _Getal(), _Unfancy(_Map()[_Block] + _Newoff % _DEQUESIZ), _STD forward<_Valty>(_Val)...);
        _PUSH_BACK_END;

#if _HAS_CXX17
        return back();
#endif // _HAS_CXX17
    }

Disassembly view:

    template <class... _Valty>
    decltype(auto) emplace_back(_Valty&&... _Val) {
00007FF674A238E0  mov         qword ptr [rsp+8],rcx  
00007FF674A238E5  push        rbp  
00007FF674A238E6  push        rdi  
00007FF674A238E7  sub         rsp,138h  
00007FF674A238EE  lea         rbp,[rsp+20h]  
00007FF674A238F3  mov         rdi,rsp  
00007FF674A238F6  mov         ecx,4Eh  
00007FF674A238FB  mov         eax,0CCCCCCCCh  
00007FF674A23900  rep stos    dword ptr [rdi]  
00007FF674A23902  mov         rcx,qword ptr [rsp+158h]  
00007FF674A2390A  lea         rcx,[__0657B1E2_deque (07FF674A3E02Fh)]  
00007FF674A23911  call        __CheckForDebuggerJustMyCode (07FF674A21159h)  
        _Orphan_all();
00007FF674A23916  mov         rcx,qword ptr [this]  
00007FF674A2391D  call        std::deque<double,std::allocator<double> >::_Orphan_all (07FF674A217FDh)  
        _PUSH_BACK_BEGIN;
00007FF674A23922  mov         rcx,qword ptr [this]  
00007FF674A23929  call        std::deque<double,std::allocator<double> >::_Myoff (07FF674A2139Dh)  
00007FF674A2392E  mov         qword ptr [rbp+0F8h],rax  
00007FF674A23935  mov         rcx,qword ptr [this]  
00007FF674A2393C  call        std::deque<double,std::allocator<double> >::_Mysize (07FF674A211B8h)  
00007FF674A23941  mov         rcx,qword ptr [rbp+0F8h]  
00007FF674A23948  mov         rcx,qword ptr [rcx]  
00007FF674A2394B  add         rcx,qword ptr [rax]  
00007FF674A2394E  mov         rax,rcx  
00007FF674A23951  xor         edx,edx  
00007FF674A23953  mov         ecx,2  
00007FF674A23958  div         rax,rcx  
00007FF674A2395B  mov         rax,rdx  
00007FF674A2395E  test        rax,rax  
00007FF674A23961  jne         std::deque<double,std::allocator<double> >::emplace_back<>+0D0h (07FF674A239B0h)  
00007FF674A23963  mov         rcx,qword ptr [this]  
00007FF674A2396A  call        std::deque<double,std::allocator<double> >::_Mapsize (07FF674A214BFh)  
00007FF674A2396F  mov         qword ptr [rbp+0F8h],rax  
00007FF674A23976  mov         rcx,qword ptr [this]  
00007FF674A2397D  call        std::deque<double,std::allocator<double> >::_Mysize (07FF674A211B8h)  
00007FF674A23982  mov         rax,qword ptr [rax]  
00007FF674A23985  add         rax,2  
00007FF674A23989  xor         edx,edx  
00007FF674A2398B  mov         ecx,2  
00007FF674A23990  div         rax,rcx  
00007FF674A23993  mov         rcx,qword ptr [rbp+0F8h]  
00007FF674A2399A  cmp         qword ptr [rcx],rax  
00007FF674A2399D  ja          std::deque<double,std::allocator<double> >::emplace_back<>+0D0h (07FF674A239B0h)  
00007FF674A2399F  mov         edx,1  
00007FF674A239A4  mov         rcx,qword ptr [this]  
00007FF674A239AB  call        std::deque<double,std::allocator<double> >::_Growmap (07FF674A21640h)  
00007FF674A239B0  mov         rcx,qword ptr [this]  
00007FF674A239B7  call        std::deque<double,std::allocator<double> >::_Mapsize (07FF674A214BFh)  
00007FF674A239BC  mov         rax,qword ptr [rax]  
00007FF674A239BF  lea         rax,[rax+rax-1]  
00007FF674A239C4  mov         qword ptr [rbp+0F8h],rax  
00007FF674A239CB  mov         rcx,qword ptr [this]  
00007FF674A239D2  call        std::deque<double,std::allocator<double> >::_Myoff (07FF674A2139Dh)  
00007FF674A239D7  mov         qword ptr [rbp+100h],rax  
00007FF674A239DE  mov         rax,qword ptr [rbp+100h]  
00007FF674A239E5  mov         rax,qword ptr [rax]  
00007FF674A239E8  mov         qword ptr [rbp+108h],rax  
00007FF674A239EF  mov         rax,qword ptr [rbp+0F8h]  
00007FF674A239F6  mov         rcx,qword ptr [rbp+108h]  
00007FF674A239FD  and         rcx,rax  
00007FF674A23A00  mov         rax,rcx  
00007FF674A23A03  mov         rcx,qword ptr [rbp+100h]  
00007FF674A23A0A  mov         qword ptr [rcx],rax  
00007FF674A23A0D  mov         rcx,qword ptr [this]  
00007FF674A23A14  call        std::deque<double,std::allocator<double> >::_Myoff (07FF674A2139Dh)  
00007FF674A23A19  mov         qword ptr [rbp+0F8h],rax  
00007FF674A23A20  mov         rcx,qword ptr [this]  
00007FF674A23A27  call        std::deque<double,std::allocator<double> >::_Mysize (07FF674A211B8h)  
00007FF674A23A2C  mov         rcx,qword ptr [rbp+0F8h]  
00007FF674A23A33  mov         rcx,qword ptr [rcx]  
00007FF674A23A36  add         rcx,qword ptr [rax]  
00007FF674A23A39  mov         rax,rcx  
00007FF674A23A3C  mov         qword ptr [_Newoff],rax  
00007FF674A23A40  mov         rdx,qword ptr [_Newoff]  
00007FF674A23A44  mov         rcx,qword ptr [this]  
00007FF674A23A4B  call        std::deque<double,std::allocator<double> >::_Getblock (07FF674A21334h)  
00007FF674A23A50  mov         qword ptr [_Block],rax  
00007FF674A23A54  mov         rcx,qword ptr [this]  
00007FF674A23A5B  call        std::deque<double,std::allocator<double> >::_Map (07FF674A21753h)  
00007FF674A23A60  mov         rax,qword ptr [rax]  
00007FF674A23A63  mov         rcx,qword ptr [_Block]  
00007FF674A23A67  cmp         qword ptr [rax+rcx*8],0  
00007FF674A23A6C  jne         std::deque<double,std::allocator<double> >::emplace_back<>+1D7h (07FF674A23AB7h)  
00007FF674A23A6E  mov         rcx,qword ptr [this]  
00007FF674A23A75  call        std::deque<double,std::allocator<double> >::_Getal (07FF674A216CCh)  
00007FF674A23A7A  mov         qword ptr [rbp+0F8h],rax  
00007FF674A23A81  mov         edx,2  
00007FF674A23A86  mov         rcx,qword ptr [rbp+0F8h]  
00007FF674A23A8D  call        std::allocator<double>::allocate (07FF674A216C7h)  
00007FF674A23A92  mov         qword ptr [rbp+100h],rax  
00007FF674A23A99  mov         rcx,qword ptr [this]  
00007FF674A23AA0  call        std::deque<double,std::allocator<double> >::_Map (07FF674A21753h)  
00007FF674A23AA5  mov         rax,qword ptr [rax]  
00007FF674A23AA8  mov         rcx,qword ptr [_Block]  
00007FF674A23AAC  mov         rdx,qword ptr [rbp+100h]  
00007FF674A23AB3  mov         qword ptr [rax+rcx*8],rdx  
        _Alty_traits::construct(
00007FF674A23AB7  mov         rcx,qword ptr [this]  
00007FF674A23ABE  call        std::deque<double,std::allocator<double> >::_Map (07FF674A21753h)  
00007FF674A23AC3  mov         rax,qword ptr [rax]  
00007FF674A23AC6  mov         qword ptr [rbp+0F8h],rax  
00007FF674A23ACD  xor         edx,edx  
00007FF674A23ACF  mov         rax,qword ptr [_Newoff]  
00007FF674A23AD3  mov         ecx,2  
00007FF674A23AD8  div         rax,rcx  
00007FF674A23ADB  mov         rax,rdx  
00007FF674A23ADE  mov         rcx,qword ptr [_Block]  
00007FF674A23AE2  mov         rdx,qword ptr [rbp+0F8h]  
00007FF674A23AE9  mov         rcx,qword ptr [rdx+rcx*8]  
00007FF674A23AED  lea         rax,[rcx+rax*8]  
00007FF674A23AF1  mov         rcx,rax  
00007FF674A23AF4  call        std::_Unfancy<double> (07FF674A214A6h)  
00007FF674A23AF9  mov         qword ptr [rbp+100h],rax  
00007FF674A23B00  mov         rcx,qword ptr [this]  
00007FF674A23B07  call        std::deque<double,std::allocator<double> >::_Getal (07FF674A216CCh)  
00007FF674A23B0C  mov         qword ptr [rbp+108h],rax  
00007FF674A23B13  mov         rdx,qword ptr [rbp+100h]  
00007FF674A23B1A  mov         rcx,qword ptr [rbp+108h]  
00007FF674A23B21  call        std::_Default_allocator_traits<std::allocator<double> >::construct<double> (07FF674A211E5h)  
            _Getal(), _Unfancy(_Map()[_Block] + _Newoff % _DEQUESIZ), _STD forward<_Valty>(_Val)...);
        _PUSH_BACK_END;
00007FF674A23B26  mov         rcx,qword ptr [this]  
00007FF674A23B2D  call        std::deque<double,std::allocator<double> >::_Mysize (07FF674A211B8h)  
00007FF674A23B32  mov         qword ptr [rbp+0F8h],rax  
00007FF674A23B39  mov         rax,qword ptr [rbp+0F8h]  
00007FF674A23B40  mov         rax,qword ptr [rax]  
00007FF674A23B43  inc         rax  
00007FF674A23B46  mov         rcx,qword ptr [rbp+0F8h]  
00007FF674A23B4D  mov         qword ptr [rcx],rax  

#if _HAS_CXX17
        return back();
00007FF674A23B50  mov         rcx,qword ptr [this]  
00007FF674A23B57  call        std::deque<double,std::allocator<double> >::back (07FF674A2127Bh)  
#endif // _HAS_CXX17
    }
00007FF674A23B5C  lea         rsp,[rbp+118h]  
00007FF674A23B63  pop         rdi  
00007FF674A23B64  pop         rbp  
00007FF674A23B65  ret

Apparently std::deque doesn't pre-allocate elements, and instead uses a loop to add them one-by-one. So no wonder it is slow.

You can speed up the Debug build by enabling some optimizations (e.g. /Ob1) and reducing runtime checks (e.g. remove /RTC1).

But really, std::deque is just a horrible structure from a performance point of view (a vector of tiny vectors - not cache-friendly at all).

rustyx
  • 80,671
  • 25
  • 200
  • 267
  • Thank you for the very comprehensive answer! I accepted it because it is the best answer I got so far, but what I still don't understand is why this code runs so slowly when compiling with GCC on Windows, **even with maximum optimizations**, while with GCC on Linux it only takes a fraction of a second to run even without any optimizations. My guess is that perhaps the Mingw-w64 port uses the deque implementation from MSVC, and somehow it does so in a way that is not amenable to any optimizations. In any case, my takeaway from this is that I simply should never use deque in my code... –  Nov 11 '20 at 15:30
  • 1
    The Mingw issue might be a different issue entirely. Me, like many others, couldn't reproduce your issue with Mingw, for me it takes 0.02s even without optimizations (I have gcc 7.1.0 x86_64-posix-seh). If you continue experiencing issues with Mingw, then you might want to post a new question specifically about that (for example, comparing GCC Mingw and Linux performance). – rustyx Nov 11 '20 at 16:18
  • I'm using gcc version 10.2.0 (Rev4, Built by MSYS2 project). I believe this may be a bug introduced some time between 7.1.0 and 10.2.0... However, I'm satisfied with the solution of compiling with MSVC, so for now I'll leave it be. Thanks. –  Nov 11 '20 at 23:16