Is it possible for the assembler to have an impact on the code's performance?

Question

I know that the compiler can have a direct impact, but can the assembler also have any impact? I saw two sources that said the assembler can optimize by rearranging instructions to reduce clock cycles, and that it can also reduce overhead (redundant instructions). If the answer is yes, could you give me an example?

No sane assembler will rearrange instructions. We use assembly to have complete control over the generated machine code. Assemblers can use shorter (in bytes) forms of an instruction (e.g. `mov rax, 1` is often assembled as `mov eax, 1`) because we don't usually care about the specific instruction form, provided the semantics are the same. But there must be a way to tell the assembler to disable such optimizations (e.g. `STRICT` for NASM). Some architectures also make use of pseudo-instructions. The programmer is supposed to know what is a real instruction and what is not. — Margaret Bloom, Mar 10 '23 at 16:30
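To make the encoding-size point concrete, here is a minimal sketch that lists three x86-64 machine-code encodings which all leave RAX equal to 1 and picks the shortest. The byte sequences are the standard opcode encodings. The dictionary and the `shortest_encoding` helper are hypothetical names used for illustration only, and they are not how NASM or any other assembler is actually implemented.

```python
# Three x86-64 encodings that all leave RAX == 1.
# Writing EAX zero-extends into RAX, so the 5-byte form is equivalent.
ENCODINGS = {
    "mov eax, imm32":   bytes.fromhex("b8 01 00 00 00"),                  # B8 id
    "mov r/m64, imm32": bytes.fromhex("48 c7 c0 01 00 00 00"),            # REX.W C7 /0 id
    "mov rax, imm64":   bytes.fromhex("48 b8 01 00 00 00 00 00 00 00"),   # REX.W B8 io
}


def shortest_encoding(encodings):
    """Return (name, code) for the smallest candidate, the way a
    size-optimizing assembler might choose (illustrative helper)."""
    return min(encodings.items(), key=lambda item: len(item[1]))


if __name__ == "__main__":
    for name, code in ENCODINGS.items():
        print(f"{name:18} {len(code):2} bytes  {code.hex(' ')}")
    print("chosen:", shortest_encoding(ENCODINGS)[0])
```

Smaller code usually helps instruction-cache and fetch density, which is why picking the short form by default tends to be a reasonable choice. A keyword such as NASM's `STRICT` is for the cases where the programmer wants one specific form instead.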
Can you please post links to your claims about assembler behaviour? — Weather Vane, Mar 10 '23 at 20:06
@MargaretBloom: MIPS assemblers will reorder instructions to fill the branch-delay slot, unless you `.set noreorder`. (IDK if any also rearrange to try to schedule dependencies for you). Even GAS does this, since classic MIPS assemblers do it. (https://sourceware.org/binutils/docs/as/MIPS-Option-Stack.html mentions reordering, but the GAS manual strangely doesn't seem to mention `.set noreorder` for MIPS; it does mention `.set reorder` for Alpha: https://sourceware.org/binutils/docs/as/Alpha-Directives.html). I haven't heard of any x86 assemblers doing that. — Peter Cordes, Mar 10 '23 at 20:32
[How to force NASM to encode \[1 + rax\*2\] as disp32 + index\*2 instead of disp8 + base + index?](https://stackoverflow.com/q/48848230) is an example where NASM's default of choosing the smallest encoding can hurt performance. Also, [Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?](https://stackoverflow.com/q/51664369) - `adc al, 0` using the short encoding that all assemblers use will be 2 uops on Broadwell/Skylake and maybe later, because Intel forgot(?) to make it use the same one uop as the 3-byte `adc r/m8, imm8` encoding. — Peter Cordes, Mar 10 '23 at 20:36
Different strategies for expanding `align 16` to NOPs will affect performance if you do it inside a function where the padding executes before a loop. AFAIK, no x86 assemblers are smart enough to lengthen existing instructions for `align` directives, only for the JCC erratum workaround: [How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646) :/ NASM defaults to single-byte `nop`, which is the worst thing possible: it can even defeat the uop cache, as well as taking more issue slots and ROB entries. You have to `%use smartalign` to get acceptable NOP padding. — Peter Cordes, Mar 10 '23 at 21:01
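The difference between padding strategies can be sketched by counting how many instructions each one emits for the same gap. The multi-byte NOP encodings below are the recommended forms from the Intel manual's NOP entry. The function names and the greedy "largest NOP first" policy are assumptions made for illustration, not the algorithm NASM's `smartalign` or any other assembler actually uses.

```python
# Recommended multi-byte NOP encodings, keyed by length in bytes.
LONG_NOPS = {
    1: "90",
    2: "66 90",
    3: "0f 1f 00",
    4: "0f 1f 40 00",
    5: "0f 1f 44 00 00",
    6: "66 0f 1f 44 00 00",
    7: "0f 1f 80 00 00 00 00",
    8: "0f 1f 84 00 00 00 00 00",
    9: "66 0f 1f 84 00 00 00 00 00",
}


def padding_needed(offset, alignment=16):
    """Bytes required to advance `offset` to the next multiple of `alignment`."""
    return -offset % alignment


def pad_single_byte(n):
    """Fill n bytes with n separate one-byte NOP instructions."""
    return [bytes.fromhex(LONG_NOPS[1])] * n


def pad_long_nops(n):
    """Fill n bytes greedily with the longest NOPs available (hypothetical policy)."""
    out = []
    while n > 0:
        size = min(n, max(LONG_NOPS))
        out.append(bytes.fromhex(LONG_NOPS[size]))
        n -= size
    return out


if __name__ == "__main__":
    gap = padding_needed(0x403)  # code currently ends 3 bytes past a 16-byte boundary
    print("gap:", gap, "bytes")
    print("single-byte NOPs:", len(pad_single_byte(gap)), "instructions")
    print("long NOPs:       ", len(pad_long_nops(gap)), "instructions")
```

Both strategies produce the same number of bytes, but the long-NOP version decodes to far fewer instructions, which is the cost the comment above refers to when the padding sits on an executed path.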
There's no formal definition of an assembler, so in the abstract an assembler can do optimization. But assembly language communicates program semantics somewhat poorly, so the potential is limited, and there's no market for it: as Margaret says, we don't write in assembly with the idea that the assembler will transform our code. — Erik Eidt, Mar 11 '23 at 01:15
@PeterCordes Good point! I didn't know MIPS assemblers did that. — Margaret Bloom, Mar 11 '23 at 12:40
