Why don't we write assemblers and linkers that can handle C++ identifiers?

Question

My understanding of why we use name mangling is that assemblers and linkers can only handle C identifiers. "int foo::bar::baz<spam::eggs>(const MoreSpam&)" can't be used as a label by any existing assemblers, and existing linkers won't recognize it as a valid function signature, so it becomes something like "_ZN3foo3bar3bazIN4spam4eggsEEEiRK8MoreSpam", which is (more or less) a valid C identifier.

But this seems like a relatively trivial limitation of our tools. Is there any good reason why we can't or don't write an assembler and linker in which something like this:

int foo::bar::baz<spam::eggs>(MoreSpam const&):
    ; opcodes go here
    ret

is fine and allowed?

At the end of the day you can do whatever you like (TI and Weird Al Version will both be accepted), but somebody has to sit down and do the work. — user4581301, Jun 26 '20 at 15:03
The assembler would have to match the tools. Different systems mangle in different ways. — Thomas Jager, Jun 26 '20 at 15:04
Hm, not sure if I fully understand your question? If I declare `int foo::bar::baz(MoreSpam const&)`, I would need to be able to call it from a C-context. Which syntax would you use (in C)? — Mikael H, Jun 26 '20 at 15:07
@MikaelH in this hypothetical tool chain, the C function `int foo(int)` would be given the label `foo` in the assembly and object code, because C doesn't support scoping or function overloading. C++ functions would be treated similarly if declared in an `extern "C"` block. The point here is that the assemblers and linkers should be able to use straight C++ identifiers instead of mangled names, because why shouldn't they? — Dante Falzone, Jun 26 '20 at 15:11
Go does it like you say for example and IIRC if you escape symbol names properly, you can name your symbols whatever you like with the UNIX assembler. The main point of name mangling is that there are more than enough assemblers that don't support weirdly named symbols, so using more common symbol names makes it easier to implement C++ on platforms with less-than capable assemblers. — fuz, Jun 26 '20 at 15:16
I attempt a build to get the names from linker errors due to missing C++ mangled function names, then rename my assembly functions, such as: `?MatrixMpya@@YAXAEAVMATRIX@@00@Z proc ` . — rcgldr, Jun 26 '20 at 15:22
For an assembler, parsing is easier if you know that symbol names are strings without spaces, built from a character set that doesn't contain characters that could be confused with other syntactical elements. Consider a line like `mov reg, foo(int, int&) & 0xffff`. It's no longer the case that a comma always separates two instruction operands, and `&` may not always be the bitwise AND operator. Your assembler now needs to be able to parse arbitrary C++ type syntax, which is much more complicated than a typical assembler parser would otherwise need to be. — Nate Eldredge, Jun 26 '20 at 15:58
At the end of the day, having one assembler and linker for C and C++ is more profitable than having one assembler and linker for each language. Maintenance is not cheap. Most companies using software have found out that they can save development costs by sharing code or applications. — Thomas Matthews, Jun 26 '20 at 17:40
@ThomasMatthews I never meant to suggest that we'd have different assemblers and linkers for different languages, merely that the assemblers and linkers we *do* have shouldn't need to mangle C++ symbols in order to work. — Dante Falzone, Jun 26 '20 at 18:13

score 4 · Accepted Answer · answered Jun 26 '20 at 18:53

You can actually use int foo::bar::baz<spam::eggs>(const MoreSpam&) as an identifier with the GNU assembler; you just need to put the name in quotes:

"int foo::bar::baz<spam::eggs>(MoreSpam const&)":
        ret

$ as -o test.o test.s
$ nm test.o
0000000000000000 t int foo::bar::baz<spam::eggs>(MoreSpam const&)
$ ld test.o
ld: warning: cannot find entry symbol _start; defaulting to 0000000000401000
$ nm a.out
0000000000402000 T __bss_start
0000000000402000 T _edata
0000000000402000 T _end
0000000000401000 t int foo::bar::baz<spam::eggs>(MoreSpam const&)
                 U _start

One problem with this, aside from symbols containing spaces and punctuation being a pain to deal with in a lot of contexts, is that not all C++ mangled identifiers can be unambiguously represented as a C++ source fragment. The same C++ "symbol" can have multiple mangled representations, and some mangled symbols have no C++ representation.

For example, the Itanium C++ ABI used by the GNU C++ compiler defines five different ways of mangling the name of the same constructor, depending on which variant of the constructor the compiler generates. Similarly, there are three different ways to mangle the name of a given destructor. The symbols _ZN3fooC1Ev and _ZN3fooC2Ev both demangle as foo::foo(), and both can exist in the same program.
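If you want to see the collision for yourself, here is a minimal sketch that assumes a GCC or Clang toolchain targeting the Itanium C++ ABI, where <cxxabi.h> provides abi::__cxa_demangle. The demangle helper is a hypothetical name introduced only for this illustration; it is not part of any library.

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI only)
#include <cstdlib>   // std::free
#include <string>

// Hypothetical helper for illustration: returns the demangled form of an
// Itanium-mangled symbol, or the input unchanged if it cannot be demangled.
std::string demangle(const char* mangled) {
    int status = 0;
    // With a null output buffer, __cxa_demangle allocates the result with
    // malloc, so the caller is responsible for freeing it.
    char* text = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || text == nullptr) {
        return mangled;
    }
    std::string result(text);
    std::free(text);
    return result;
}
```

Calling demangle("_ZN3fooC1Ev") and demangle("_ZN3fooC2Ev") yields the same string for both, so the demangled text alone cannot tell a linker which of the two symbols is meant.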

Sure, you can invent new C++-like syntax to represent these things, but then you're just inventing a more verbose way of mangling symbols.

Finally, perhaps the most important reason why C++ compilers mangle names the way they do is so that they can work with all sorts of tools. While it's much less common today, the GNU C++ compiler can be used with assemblers other than GAS.
