
I noticed that the destructors of binary_oarchive and filtering_streambuf are extremely slow, and I don't know why. I'm running the following code:

namespace bfs = boost::filesystem;
namespace bar = boost::archive;
namespace bio = boost::iostreams;

using clock = std::chrono::steady_clock;
using time_point = std::chrono::time_point<clock>;
using duration = std::chrono::duration<double>;

// main code:

time_point start, now;
duration diff;

bfs::ofstream file(path, std::ios_base::out | std::ios_base::binary);

{
    bio::filtering_streambuf<bio::output> stream;
    stream.push(bio::gzip_compressor());
    stream.push(file);

    {
        bar::binary_oarchive oarch(stream);

        start = clock::now();
        oarch << BIG_OBJECT; // results in around 2GiB of output
        now = clock::now();
        diff = now - start;
        std::cout << "Writing took " << diff.count() << " seconds." << std::endl;
        
        start = now;
    } // 'oarch' destructor should run
    
    now = clock::now();
    diff = now - start;
    std::cout << "'oarch' destructor took " << diff.count() << " seconds." << std::endl;
    
    start = now;
} // 'stream' destructor should run

now = clock::now();
diff = now - start;
std::cout << "'stream' destructor took " << diff.count() << " seconds." << std::endl;

The output that I'm getting is:

Writing took 1709.93 seconds.             // around 28 minutes
'oarch' destructor took 2226.82 seconds.  // around 37 minutes
'stream' destructor took 2177.07 seconds. // around 36 minutes

I ran md5sum on the output file every 10 seconds to see whether anything changed:

  • after the writing has finished, the output file was around 2GiB in size,
  • while the 'oarch' destructor was running, the contents and the size of the output file didn't change (the md5 hash was the same the entire time),
  • after the 'oarch' destructor finished, the file grew only by around 2KiB,
  • after the 'stream' destructor finished, the file was the same as when the 'oarch' destructor finished (the same size and md5 hash).

So, why are those destructors so slow, even though barely anything happens to the output file after it has been written to?

A possibly important point: while running, the program uses 8GiB of RAM and ~70GiB of swap.

Ehtn8
  • Could be related: https://stackoverflow.com/questions/1753469/how-to-hook-up-boost-serialization-iostreams-to-serialize-gzip-an-object-to and https://stackoverflow.com/questions/15143149/in-boost-iostream-filtering-ostream-what-is-the-difference-between-sync-stri – Ehtn8 Oct 03 '22 at 17:09

1 Answer


Three steps:

First: Naively Reproducing

I've simplified your code and made it self-contained:

Live On Coliru

#include <boost/archive/binary_oarchive.hpp>
#include <boost/filesystem.hpp>
#include <boost/filesystem/fstream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/iostreams/filtering_streambuf.hpp>
#include <boost/serialization/vector.hpp>
#include <algorithm>
#include <chrono>
#include <functional>
#include <iostream>
#include <iterator>
#include <random>
#include <utility>
#include <vector>
namespace bfs = boost::filesystem;
namespace bar = boost::archive;
namespace bio = boost::iostreams;

using C = std::chrono::steady_clock;
using namespace std::chrono_literals;

#ifdef COLIRU
static auto constexpr NUM_ELEMENTS = 600'000;
#else
static auto constexpr NUM_ELEMENTS = 600'000'000;
#endif

int main() {
    // main code:
    auto measurement = [start = C::now()](auto msg) mutable {
        auto now_ = C::now();
        auto diff = now_ - std::exchange(start, now_);
        if (diff < 1s)
            std::cout << msg << ": " << (diff) / 1.0ms << "ms\n";
        else
            std::cout << msg << ": " << (diff) / 1.0s << "s\n";
    };

    std::vector<float>                    BIG_OBJECT;
    std::mt19937                          urbg{std::random_device{}()};
    std::uniform_real_distribution<float> dist;
    std::generate_n(back_inserter(BIG_OBJECT), NUM_ELEMENTS,
                    std::bind(dist, std::ref(urbg)));
    measurement("Generation");

    bfs::path     path = "output.bin";
    bfs::ofstream file(path, std::ios_base::out | std::ios_base::binary);

    {
        bio::filtering_streambuf<bio::output> stream;
        stream.push(bio::gzip_compressor());
        stream.push(file);

        {
            bar::binary_oarchive oarch(stream);
            measurement("Archive open");

            oarch << BIG_OBJECT; // results in around 2GiB of output
            measurement("Writing");
        }

        measurement("oarch dtor");
    }

    measurement("stream dtor");
}

Which unsurprisingly prints (online):

Generation: 26.4354ms
Archive open: 0.252127ms
Writing: 175.871ms
oarch dtor: 0.039135ms
stream dtor: 0.141639ms

And when run locally with the 2.1GiB output.bin:

Generation: 5.18017s
Archive open: 2.04688ms
Writing: 104.039s
oarch dtor: 0.040114ms
stream dtor: 0.234694ms

What gives?

Second: Object Tracking

Having some experience with the library, I suspect that your huge object is actually an object graph containing types that engage the archive's Object Tracking. This feature lets aliasing pointers (including shared pointers and the like) round-trip correctly: if your graph contains the same object in several nodes, it is only serialized once.
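To illustrate what tracking buys you, here's a minimal, self-contained sketch (with a made-up Node type, not taken from your code): the same object reachable through two shared pointers is written only once, and the aliasing is restored on load:

#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/shared_ptr.hpp>
#include <cassert>
#include <memory>
#include <sstream>

// Made-up node type, purely for illustration
struct Node {
    int value = 0;
    void serialize(auto& ar, unsigned) { ar & value; }
};

int main() {
    std::stringstream ss;
    {
        auto shared = std::make_shared<Node>();
        const std::shared_ptr<Node> a = shared, b = shared; // two handles, one object
        boost::archive::binary_oarchive oa(ss);
        oa << a << b; // tracked: the Node payload is written only once
    }
    {
        std::shared_ptr<Node> a, b;
        boost::archive::binary_iarchive ia(ss);
        ia >> a >> b;
        assert(a == b); // aliasing restored: both handles point to the same Node
    }
}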

So, replacing BIG_OBJECT with a vector of custom types:

struct MyValue {
    float wrapped_value;
    explicit MyValue(float v) : wrapped_value(v) {}
    void serialize(auto& ar, unsigned) { ar & wrapped_value; }
};

// later, generating (this also needs <boost/serialization/unique_ptr.hpp> and <memory>):
std::vector<std::unique_ptr<MyValue>> BIG_OBJECT;
std::mt19937                          urbg{std::random_device{}()};
std::uniform_real_distribution<float> dist;
std::generate_n(back_inserter(BIG_OBJECT), NUM_ELEMENTS,
                [&] { return std::make_unique<MyValue>(dist(urbg)); });

A NUM_ELEMENTS of 90 million is about the largest I can run on my 32GiB system, due to the bookkeeping overhead of object tracking in the archive. Output (compare Live On Coliru):

Generation: 3.29902s
Archive open: 3.14596ms
Writing: 180.931s
oarch dtor: 5.58313s
stream dtor: 2.04801s

As you can see, the serialization time is dominated by the allocation overhead of object tracking, and the 'oarch' destructor (which completes the archive) spends its time on the corresponding deallocation.

The stream's destructor does a close, which runs zlib_base::reset, flushing the compression; that apparently takes significant time for this 529MiB archive (859MiB uncompressed).

Third: Workaround/Fix

I'd say it's really unlikely that your object graph is so finely grained and tangled that object tracking overhead becomes the bottleneck.

This tells me that in reality you might not care about object tracking for your type. You can signal this:

BOOST_CLASS_TRACKING(MyValue, boost::serialization::track_never)
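For reference, that trait macro lives in <boost/serialization/tracking.hpp> and has to appear at namespace scope, after the type's full definition. A minimal sketch with the MyValue type from above:

#include <boost/serialization/tracking.hpp>

struct MyValue {
    float wrapped_value;
    explicit MyValue(float v) : wrapped_value(v) {}
    void serialize(auto& ar, unsigned) { ar & wrapped_value; }
};

// at namespace scope, after the complete definition of MyValue
BOOST_CLASS_TRACKING(MyValue, boost::serialization::track_never)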

With only that modification, the same code now runs in ~38% of the time:

(compare Live On Coliru)

Generation: 3.3158s
Archive open: 0.339922ms
Writing: 68.4985s
oarch dtor: 0.028332ms
stream dtor: 0.254213ms

Most importantly, there's no longer any object-tracking overhead slowing down the archive's completion.

If you want to disable object tracking wholesale, use the archive flag for that:

bar::binary_oarchive oarch(stream, bar::no_tracking);

CAUTION Only do this when you know what the implications are for your application. It is entirely possible to break your serialization semantics when using this improperly.

Again, compare Coliru.
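To make that caveat concrete, here's a sketch (with a made-up Leaf type, assuming the documented meaning of track_never) of the kind of semantics you give up: with tracking disabled the same pointee is written once per pointer, so after loading you end up with two independent copies instead of one shared object:

#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/tracking.hpp>
#include <cassert>
#include <sstream>

// Made-up type, purely for illustration
struct Leaf {
    int value = 0;
    void serialize(auto& ar, unsigned) { ar & value; }
};
BOOST_CLASS_TRACKING(Leaf, boost::serialization::track_never)

int main() {
    std::stringstream ss;
    Leaf leaf{42};
    {
        Leaf* a = &leaf;
        Leaf* b = &leaf; // same object, two pointers
        boost::archive::binary_oarchive oa(ss);
        oa << a << b;    // untracked: the payload is written twice
    }
    {
        Leaf* a = nullptr;
        Leaf* b = nullptr;
        boost::archive::binary_iarchive ia(ss);
        ia >> a >> b;
        assert(a != b);  // aliasing lost: two separate allocations
        delete a;
        delete b;
    }
}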

sehe