
While trying out snmalloc on macOS, I wondered why all the resulting binaries were larger than 256 MiB.

It turns out that zero-initialized static inline data members are lowered in an odd way on macOS, on both ARM64 and x86_64. Even this simple test produces a huge object file:

container.h

#pragma once
#include <cstdint>

class Container {
    public:
        inline static uint8_t inner[256000000];
};

main.cc

#include "container.h"

int main() {
    return Container::inner[0];
}

Compiled like this:

$ ~/clang+llvm-12.0.0-x86_64-apple-darwin/bin/clang -O3 -std=c++17 main.cc --target=x86_64-apple-darwin -c; ls -l main.o
-rw-r--r--  1 hans  staff  256000744 Jun 21 16:29 main.o
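
As a quick sanity check on those numbers, here is a small sketch. The constant and function names are mine; the two values come straight from the declaration and the `ls -l` output above. The object file is only a few hundred bytes larger than the array, which suggests the zeros are being written out literally.

```cpp
#include <cassert>
#include <cstdint>

// Values taken from the example above; the names are illustrative only.
constexpr std::uint64_t kArrayBytes  = 256000000;  // sizeof(Container::inner)
constexpr std::uint64_t kObjectBytes = 256000744;  // size of main.o from ls -l

// Bytes in main.o that are not the array itself (headers, code, symbols).
constexpr std::uint64_t overhead_bytes() {
    return kObjectBytes - kArrayBytes;
}

// Runtime version of the same check, for use from a test or main().
inline void check_listing() {
    assert(overhead_bytes() == 744);
}
```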

It is the same with open-source clang as with Apple clang. gcc behaves similarly.

On Linux (compiled with either clang or gcc) the array is placed in the .bss section and therefore takes up no space in the object file.

Why is this the case on macOS? And is this a bug or expected behavior?

HHK
  • The language doesn't specify how data is stored in executables, only the visible effect of running the program. – Barmar Jun 21 '21 at 19:04
  • @Barmar I understand that, but even if it behaves according to spec it can still be a bug, just like taking minutes for a simple `int x = y * y` would be considered a bug despite being spec-compliant. – HHK Jun 21 '21 at 19:10
  • This is often called "quality of implementation". An implementation that takes several minutes for a multiplication is not very useful, but unless they say they pass certain performance benchmarks, it's not necessarily a "bug". – Barmar Jun 21 '21 at 19:14
  • So is this a clang vs gcc thing? – Paul Sanders Jun 21 '21 at 19:23
  • @PaulSanders Not really. It works fine with clang on Linux and gcc on Linux. I haven't tested gcc on macOS. I assume it is a macOS-specific or a Mach-O vs ELF thing. – HHK Jun 21 '21 at 19:26
  • FWIW it seems you can force this optimization by adding `__attribute__((section("__DATA,__bss")))` to the definition. – Siguza Jun 21 '21 at 19:42
  • Have you tried running [Bloaty McBloatface](https://github.com/google/bloaty) to see where the bloat is coming from? Also, and this may be a silly question, but have you compared with other similar programs to verify that `static inline` is the culprit as opposed to something else, like just `static` objects in general? – Human-Compiler Jun 24 '21 at 23:04
  • @Human-Compiler I have used `nm` to confirm that the static inline data member is placed into the data section and takes up 256MB on macOS, while it is in .bss on Linux. I have also confirmed that a `static uint8_t static_arr[256000000] = { 0 };` ends up in .bss on macOS. – HHK Jun 25 '21 at 04:13
  • gcc behaves similarly, so it does not seem to be a bug in clang. – HHK Jun 26 '21 at 04:05
  • The statement that gcc and clang behave similarly on Mac OS X suggests that the problem may be in the linker. – Leon Jul 04 '21 at 18:22
  • @Leon I compiled with `-c`, so no linking involved. – HHK Jul 05 '21 at 09:20
  • @HHK are you sure that `static uint8_t static_arr[256000000] = { 0 };` is ending up in .bss and not simply being optimized away to 0? On my machine (Big Sur, clang 12), I get the same result either way (both end up in `DATA,__data`). You might want to double check the assembly output. – Jon Reeves Jul 05 '21 at 19:42

2 Answers


I'll go ahead and take a stab at answering this, though I'll be the first to admit that you can only go so far with an answer before you run into a wall that says "because someone made a decision and you're stuck with it forever."

The key to all of this is the Mach-O runtime specification for macOS, which defines the .bss section as being used for:

uninitialized static variables (for example, static int i;).

You can read about it in this archived version from version 10.3, but you can also find the same information in other Mach-O references.

The important thing to note here is that the use of bss refers to "private" symbols only. In other words, this refers to a C-style use of the static keyword, which is guaranteed to be local to the translation unit.

When you declare a C++17 member variable as static inline, despite the use of the perversely overloaded static keyword, you've created a global object, of which there is guaranteed to only ever be one instance in a program. In other words, every translation unit compiled with this declaration will instantiate it, and the linker will be expected to "coalesce" them into a single instance by picking one of them. This is obviously quite different from the C-style "uninitialized static variable."

Compilers targeting macOS, such as clang, implement this by declaring the symbol as weak DATA, similar to how default constructors would be declared (though those would of course be in TEXT).

To illustrate this point, note that you could get the same effect without C++17 at all. For example, compile each of the following examples and look at the assembly output:

#include <cstdint>

static uint8_t stuff[256000000]; // <-- goes into .bss

int main() {
    return (int)reinterpret_cast<uint64_t>(&stuff[0]);
}

Note that I have to take the address of stuff here to make sure the compiler doesn't optimize stuff away entirely.

Now try this:

#include <cstdint>

uint8_t stuff[256000000]; // <-- goes into __DATA,__common

int main() {
    return (int)reinterpret_cast<uint64_t>(&stuff[0]);
}

Getting closer. Note that stuff is not put into .bss as you might see on a Linux platform. According again to the Mach-O runtime spec, the common section is used for:

Uninitialized imported symbol definitions (for example, int i;) located in the global scope (outside of a function declaration).

Now try this:

#include <cstdint>

__attribute__((weak)) uint8_t stuff[256000000]; // <-- in __DATA,__data

int main() {
    return (int)reinterpret_cast<uint64_t>(&stuff[0]);
}

This is exactly how a static inline C++17 member variable will be defined. Deep under the hood, clang has assigned this symbol to "coalesced" data, which on x86 is just the standard data section. If you really want to dive into the sausage factory, you can see this in LLVM's SelectSectionForGlobal function:

   if (GO->isWeakForLinker()) {
     if (Kind.isReadOnly())
       return ConstTextCoalSection;
     if (Kind.isReadOnlyWithRel())
       return ConstDataCoalSection;
     return DataCoalSection;
   }

And DataCoalSection is correspondingly defined here to be identical to the ordinary data section on everything but PowerPC.
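
To summarize the three experiments above, here is a toy model of the outcome. It is not LLVM's actual logic. `GlobalInfo` and `section_for_zero_init` are hypothetical names I made up for illustration, and the model only covers zero-initialized, writable globals on Mach-O as described in this answer.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical, simplified description of a zero-initialized writable global.
struct GlobalInfo {
    bool internal_linkage;  // e.g. `static uint8_t stuff[N];`
    bool weak_for_linker;   // e.g. `__attribute__((weak))` or a static inline member
};

// Toy model only: mirrors the three cases shown above.
constexpr std::string_view section_for_zero_init(GlobalInfo g) {
    if (g.weak_for_linker)  return "__DATA,__data";  // coalesced data, stored in the file
    if (g.internal_linkage) return "__DATA,__bss";   // private uninitialized static
    return "__DATA,__common";                        // external, uninitialized
}

// Small self-check that walks through the three examples from the answer.
inline void check_examples() {
    assert(section_for_zero_init({true,  false}) == "__DATA,__bss");
    assert(section_for_zero_init({false, false}) == "__DATA,__common");
    assert(section_for_zero_init({false, true})  == "__DATA,__data");
}
```

The weak check comes first because, in the LLVM snippet above, the isWeakForLinker() branch returns before any other classification is considered.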

So from my perspective the behavior you're seeing is working as I would expect given the available specifications for the Mach-O runtime.
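
If the file size matters in practice, the workaround Siguza suggested in the comments on the question can be sketched roughly as follows. Treat this as an unverified sketch: the `ZERO_FILL_SECTION` macro and the `container_is_zeroed_at` helper are names I introduced, the `__APPLE__` guard is my own assumption so that other platforms keep their default placement, and I am assuming the attribute is accepted in this position on the member declaration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: only request the Mach-O section on Apple targets. Elsewhere the
// macro expands to nothing and the compiler picks its default section.
#if defined(__APPLE__)
#define ZERO_FILL_SECTION __attribute__((section("__DATA,__bss")))
#else
#define ZERO_FILL_SECTION
#endif

class Container {
public:
    ZERO_FILL_SECTION inline static std::uint8_t inner[256000000];
};

// The array keeps its size and is still zero-initialized at program start.
inline bool container_is_zeroed_at(std::size_t index) {
    assert(index < sizeof(Container::inner));
    return Container::inner[index] == 0;
}
```

After compiling with `-c`, comparing `ls -l main.o` with and without the attribute should show whether the array is still being written into the object file.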

Jon Reeves

Try instantiating an object of the class and accessing the member through the object:

Container obj;
std::cout << static_cast<int>(obj.inner[0]); // cast so the byte prints as a number
c3nt4ur1