I am experimenting with using FlatBuffers in my company as a replacement for raw structs. The classes we need to serialize are fairly large, and I have noticed that the overhead of FlatBuffers serialization is more than we can tolerate when running debug builds.

I replicated my finding with the following simple test program (the datatype is similar to the datatype in our production code):

#include "stdafx.h"
#include <flatbuffers/flatbuffers.h>
#include "footprints_generated.h"
#include <vector>
#include <iostream>
#include <chrono>
#include <string>

using namespace Serialization::Dummy::FakeFootprints;

flatbuffers::FlatBufferBuilder builder;

flatbuffers::Offset<XYZData> GenerateXYZ()
{
    return CreateXYZData(builder,
        1.0,
        2.0,
        3.0,
        4.0,
        5.0,
        6.0,
        7.0,
        8.0,
        9.0,
        10.0,
        11.0,
        12.0,
        13.0,
        14.0,
        15.0,
        16.0,
        17.0,
        18.0,
        19.0,
        20.0);
}

flatbuffers::Offset<Fake> GenerateFake()
{
    std::vector<flatbuffers::Offset<XYZData>> vec;
    for(int i = 0; i < 512; i++)
    {
        vec.push_back(GenerateXYZ());
    }

    auto XYZVector = builder.CreateVector(vec);

    return CreateFake(builder,
        1.0,
        2.0,
        3.0,
        4.0,
        5.0,
        6.0,
        7.0,
        8.0,
        9.0,
        10.0,
        XYZVector);
}

int main()
{
    auto start = std::chrono::steady_clock::now();

    for(auto i = 0; i < 1000; i++)
    {
        auto fake = GenerateFake();
    }

    auto end = std::chrono::steady_clock::now();
    auto diff = end - start;
    std::cout << std::chrono::duration <double, std::milli>(diff).count() << " ms" << std::endl;

    std::string dummy;
    std::cin >> dummy;
}

This takes around 40 seconds to run on my PC in a debug build (approx. 400 ms in release). I'm looking for any way to improve performance in the debug build. Profiling showed that most of the time is spent in std::vector code, so I tried setting _ITERATOR_DEBUG_LEVEL to zero, but that did not result in any significant performance increase.

Bouke

2 Answers

1

Ran into the same problem again and decided to play with the compiler settings to see which ones have the most drastic effect. In case someone else ever stumbles across this post, here's what I found:

Started out with a sample application similar to that in the question. Runtime was approximately 40 seconds.

  • Enabling inline expansion for any suitable function (/Ob2): 12.5 seconds
  • Program database without 'edit & continue' (/Zi instead of /ZI): 7.6 seconds
  • Omit basic runtime checks: 4.5 seconds
  • Disable iterator debugging (_HAS_ITERATOR_DEBUGGING=0): 2.2 seconds
  • Disable minimal rebuild (/Gm-): 1.6 seconds
  • Enable optimization for speed (/O2): 400 milliseconds

Of course this practically turns the configuration into the standard release configuration, but we were able to use a subset of these options to get flatbuffer performance to a point where it was no longer the bottleneck in our application.
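
When toggling settings like these one at a time, it can help to confirm that the iterator debugging macro actually reached the translation unit being measured. The following is a minimal sketch, not part of the original answer: the helper name DescribeIteratorDebugLevel is hypothetical, and it assumes that the MSVC standard library defines _ITERATOR_DEBUG_LEVEL once any standard header has been included. On other toolchains the macro is normally absent and the fallback branch is taken.

```cpp
#include <string>
#include <vector>  // any standard header will do; assumed to define the macro on MSVC

// Hypothetical helper: reports the iterator debugging level this
// translation unit was actually compiled with.
std::string DescribeIteratorDebugLevel()
{
#if defined(_ITERATOR_DEBUG_LEVEL)
    return "_ITERATOR_DEBUG_LEVEL=" + std::to_string(_ITERATOR_DEBUG_LEVEL);
#else
    // Typical for non-MSVC standard libraries.
    return "_ITERATOR_DEBUG_LEVEL is not defined";
#endif
}
```

Printing the returned string at the start of the benchmark makes it obvious whether a timing difference between two runs can be attributed to the iterator debugging setting at all.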

Bouke
  • The fact that e.g. iterator debugging makes a difference shows you that a lot of the overhead is in the use of std::vector. It's going to spend 99% of its time in your 512x for loop, and the only FlatBuffers call in there (`GenerateXYZ`) does not make use of iterators, and is fairly basic code, surely not something that I'd expect taking a 100x hit in debug mode. It is true though, that much like the STL, FlatBuffers is optimized for speed in release mode, and it sacrifices speed in debug for clarity, by using plenty of layers of functions that normally all get inlined away. – Aardappel Nov 28 '17 at 03:10
  • "The fact that e.g. iterator debugging makes a difference shows you that a lot of the overhead is in the use of std::vector" There are no vectors in my test code anymore since I have removed them after previous comments. I now use `flatbuffers::Offset vec[512];` instead. As far as I could tell from profiling, most of the time was lost on the layering (as you said) and the push_backs in the TrackField function. – Bouke Nov 28 '17 at 07:33
  • Just in case this is helpful to you somehow, I dug into this some more and found that most of the iterator checking overhead comes from `_Orphan_range`, which is called from the push_back() (in TrackField) and from the use of iterators in EndTable. – Bouke Nov 28 '17 at 07:59
  • Ahh you're right, forgot about that. Hmm, I suppose we could potentially hard-code that vector into an array for the sake of debug-mode speed. – Aardappel Nov 28 '17 at 15:27
  • Great, thanks. I'm interested to see how that will influence our load in debug builds. – Bouke Dec 29 '17 at 15:35
  • Here's the fix, let me know if it helps: https://github.com/google/flatbuffers/commit/79b80f84df7de0618596b293566062a3ea460958 – Aardappel Jan 11 '18 at 22:37
  • that's awesome.. you were originally 100x slower though, so the remaining 33x is in the above items that we can't do much about? – Aardappel Jan 15 '18 at 15:36
  • I would expect so, yes. I expected a performance increase of around 2x, in line with what I got when disabling iterator checking. So I was pleasantly surprised. Fortunately, I was able to tweak some of the settings for our debug build as well, so we will at most end up with a 10x difference between release and debug. – Bouke Jan 15 '18 at 20:10
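
The idea raised in the comments above, replacing the std::vector that records field locations with a fixed array, can be illustrated roughly as follows. This is only a sketch of the general technique and is not the code from the linked commit. The names FieldLoc, FixedFieldTracker, Track and MaxFields are illustrative assumptions and are not guaranteed to match FlatBuffers internals. A raw array is used so that no checked STL container code runs in debug builds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative record of one field written into a table (assumed layout).
struct FieldLoc
{
    std::uint32_t offset;  // position at which the field value was written
    std::uint16_t id;      // slot of the field in the table
};

// Fixed-capacity replacement for a std::vector<FieldLoc>. Appending is a
// plain array store, so no iterator debugging machinery is involved.
template <std::size_t MaxFields>
class FixedFieldTracker
{
    static_assert(MaxFields > 0, "capacity must be positive");

public:
    void Track(std::uint16_t id, std::uint32_t offset)
    {
        assert(count_ < MaxFields && "too many fields for the fixed buffer");
        fields_[count_++] = FieldLoc{offset, id};
    }

    std::size_t size() const { return count_; }

    const FieldLoc& operator[](std::size_t i) const
    {
        assert(i < count_);
        return fields_[i];
    }

    // Forget all tracked fields so the tracker can be reused for the next table.
    void Clear() { count_ = 0; }

private:
    FieldLoc fields_[MaxFields];
    std::size_t count_ = 0;
};
```

The trade-off is a hard upper bound on the number of fields per table, which is why the capacity is a template parameter here and why the asserts guard against overflow.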
0

I notice that you are using push_back() on a vector, but I don't see a call to reserve(). Therefore, your code is probably spending a lot of time doing heap allocation. I'd suggest you put in vec.reserve(512) before you enter the loop that calls GenerateXYZ().
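
As a rough illustration of this suggestion (not taken from the original thread), the sketch below counts how often a vector reallocates while elements are appended, with and without reserve(). The function name CountReallocations is hypothetical, and int stands in for flatbuffers::Offset<XYZData> so that the example compiles without the generated header.

```cpp
#include <cstddef>
#include <vector>

// Counts how many times the capacity changes while `count` elements are
// appended, i.e. how many reallocations push_back triggers.
// `int` is a stand-in for flatbuffers::Offset<XYZData> (assumption).
std::size_t CountReallocations(std::size_t count, bool reserve_first)
{
    std::vector<int> vec;
    if (reserve_first)
    {
        vec.reserve(count);  // single allocation up front
    }

    std::size_t reallocations = 0;
    std::size_t last_capacity = vec.capacity();
    for (std::size_t i = 0; i < count; i++)
    {
        vec.push_back(static_cast<int>(i));
        if (vec.capacity() != last_capacity)
        {
            reallocations++;
            last_capacity = vec.capacity();
        }
    }
    return reallocations;
}
```

CountReallocations(512, true) returns 0, while CountReallocations(512, false) returns a small positive number that depends on the growth policy of the standard library in use.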

Logicrat
  • Thanks for the suggestion. Unfortunately, reserving did not lead to a significant change in performance. (Although you're of course right that it would be faster to prevent reallocation in the vector) – Bouke Mar 18 '16 at 13:31
  • Declare "vec" as a global variable, and then vec.clear() before you start doing the push_back(). – Sven Nilsson Mar 18 '16 at 15:21
  • yes, the STL is notoriously slow in debug. why not declare a flatbuffers::Offset vec[512] instead? Either way if the profile shows mostly std::vector, how is this a FlatBuffers problem? – Aardappel Mar 18 '16 at 17:14
  • Hi Guys, thanks for the answers. I should have been clearer on the profiling results: it pointed to std::vector operations called by flatbuffers code. Just to be sure I applied the change suggested by @Aardappel (replaced vector with a global array) and that didn't really make any difference in run time. Just to be clear: I'm not claiming that there are problems with flatbuffers, I'm just experiencing this issue (which will be a deal breaker for me using flatbuffers in production code) and looking for a way to overcome them. – Bouke Mar 22 '16 at 11:12
  • So it's still 100x slower in debug after removing the use of std::vector? That's certainly odd. Have you compared profiles of debug and release? Have you compared against another serializer (like e.g. protobuf)? – Aardappel Mar 22 '16 at 15:51
  • Yep, still approx. 100x slower. I did manage to get some significant improvement (execution time down from 40 to 6 seconds) by disabling stack frame checking code generation in debug. Setting _ITERATOR_DEBUG_LEVEL to zero brings those six seconds down to four (but that's not really feasible for me to use in production code). I would like to do the comparison to protobuf or, more likely, Cap'n Proto, but I'm not sure I can make the time for it at work right now. – Bouke Mar 23 '16 at 11:41