
I am learning how to create my own instruction on the x86 architecture. To do that, I first want to understand how instructions are decoded and interpreted at a low level.

Taking a simple mov instruction as an example and using the .byte notation, I wanted to understand in detail how instructions are decoded.

My simple code is as follows:

#include <stdio.h>
#include <iostream>



int main(int argc, char const *argv[])
{
    int x{5};
    int y{0};

    // mov %%eax, %0

asm (".byte 0x8b,0x45,0xf8\n\t" //mov %1, eax
    ".byte 0x89, 0xC0\n\t"
    : "=r" (y)
    : "r" (x)

   );



   printf ("dst value : %d\n", y);

    return 0;
} 

When I use objdump to see how it is translated to machine code, I get the following output:

000000000000078a <main>:
 78a:    55                       push   %ebp
 78b:    48                       dec    %eax
 78c:    89 e5                    mov    %esp,%ebp
 78e:    48                       dec    %eax
 78f:    83 ec 20                 sub    $0x20,%esp
 792:    89 7d ec                 mov    %edi,-0x14(%ebp)
 795:    48                       dec    %eax
 796:    89 75 e0                 mov    %esi,-0x20(%ebp)
 799:    c7 45 f8 05 00 00 00     movl   $0x5,-0x8(%ebp)
 7a0:    c7 45 fc 00 00 00 00     movl   $0x0,-0x4(%ebp)
 7a7:    8b 45 f8                 mov    -0x8(%ebp),%eax
 7aa:    8b 45 f8                 mov    -0x8(%ebp),%eax
 7ad:    89 c0                    mov    %eax,%eax
 7af:    89 45 fc                 mov    %eax,-0x4(%ebp)
 7b2:    8b 45 fc                 mov    -0x4(%ebp),%eax
 7b5:    89 c6                    mov    %eax,%esi
 7b7:    48                       dec    %eax
 7b8:    8d 3d f7 00 00 00        lea    0xf7,%edi
 7be:    b8 00 00 00 00           mov    $0x0,%eax
 7c3:    e8 78 fe ff ff           call   640 <printf@plt>
 7c8:    b8 00 00 00 00           mov    $0x0,%eax
 7cd:    c9                       leave  
 7ce:    c3                       ret    

Given this objdump output, why does the instruction 7aa: 8b 45 f8 mov -0x8(%ebp),%eax appear twice? Is there a reason for it, or am I doing something wrong with the .byte notation?

Peter Cordes
Weez Khan
  • You are saying `"r" (x)`, which means that before invoking your asm instructions, gcc must move the value of `x` into a register. Then it executes the contents of your asm instruction, which does it again. – David Wohlferd Mar 24 '20 at 00:55
  • Inline assembler is almost always a mistake. If you are confident in your skills as a programming language designer and implementer, fire ahead. If you have the least worry about that, consider yourself employable, and choose another solution. – mevets Mar 24 '20 at 01:28

1 Answer


One of those is compiler-generated, because you asked GCC to have the input in its choice of register for you. That's what "r"(x) means. And you compiled with optimization disabled (the default -O0) so it actually stored x to memory and then reloaded it before your asm statement.

Your code has no business assuming anything about the contents of memory or where EBP points.

Since you're using 89 c0 mov %eax,%eax, the only safe constraints for your asm statement are explicit-register "a" constraints for both input and output, which force the compiler to pick EAX. If you compile with optimization enabled, your code totally breaks because you lied to the compiler about what your code actually does.

// constraints that match your manually-encoded instruction
asm (".byte 0x89, 0xC0\n\t"
    : "=a" (y)
    : "a" (x)
   );
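As a rough sketch, the same statement can be wrapped in a small function so it can be called and checked. The function name copy_via_eax is invented for this example, and the #else branch is only an assumption-level fallback so the snippet still compiles on non-x86 targets.

```cpp
// Copy x to the result using the hand-encoded bytes 89 C0 (mov %eax,%eax).
// copy_via_eax is a hypothetical name used only for this illustration.
int copy_via_eax(int x) {
    int y;
#if defined(__i386__) || defined(__x86_64__)
    // "a" pins the input and "=a" pins the output to EAX, which is the
    // register the hard-coded ModRM byte 0xC0 reads and writes.
    asm (".byte 0x89, 0xC0\n\t"
         : "=a" (y)
         : "a" (x));
#else
    y = x;  // fallback for non-x86 targets (not part of the technique)
#endif
    return y;
}
```

Calling copy_via_eax(5) should hand back 5 at any optimization level, because the constraints now describe what the bytes really do.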

There's no constraint to force GCC to pick a certain addressing mode for a "m" source or "=m" dest operand so you need to ask for inputs/outputs in specific registers.

If you want to encode your own mov instructions differently from standard mov, see "which MOV instructions in the x86 are not used or the least used, and can be used for a custom MOV extension". You might want to use a prefix in front of regular mov opcodes so you can let the assembler encode registers and addressing modes for you, like .byte something; mov %1, %0.
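A minimal sketch of the prefix idea follows. The byte 0x3E is purely a stand-in chosen for this example, not a recommendation: it is the DS segment-override prefix, which real CPUs ignore on a register-to-register mov, so the snippet still runs on ordinary hardware. A custom decoder would use whatever byte you pick. The function name copy_with_prefix is also hypothetical.

```cpp
// Emit a placeholder prefix byte, then let the assembler encode the mov.
// copy_with_prefix and the 0x3E byte are assumptions made for illustration.
int copy_with_prefix(int x) {
    int y;
#if defined(__i386__) || defined(__x86_64__)
    asm (".byte 0x3e\n\t"   // stand-in prefix; ignored on reg-to-reg mov
         "mov %1, %0\n\t"   // assembler encodes the registers GCC chose
         : "=r" (y)
         : "r" (x));
#else
    y = x;  // fallback for non-x86 targets (not part of the technique)
#endif
    return y;
}
```

Because the template uses %0 and %1, plain "r" constraints are safe here: the opcode and ModRM byte are produced by the assembler for whichever registers the compiler selected.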


Look at the compiler-generated asm output (gcc -S, not disassembly of the .o or executable). Then you can see which instructions come from the asm statement and which are emitted by GCC.

If you don't explicitly reference some operands in the asm template but still want to see what the compiler picked, you can use them in asm comments like this:

asm (".byte 0x8b,0x45,0xf8    # 0 = %0   1 = %1  \n\t"
    ".byte 0x89, 0xC0\n\t"
    : "=r" (y)
    : "r" (x)
   );

and gcc will fill it in for you so you can see what operands it expects you to be reading and writing. (Godbolt with g++ -m32 -O3). I put your code in void foo(){} instead of main because GCC -m32 thinks it needs to re-align the stack at the top of main. This makes the code a lot harder to follow.

# gcc-9.2 -O3 -m32 -fverbose-asm
.LC0:
        .string "dst value : %d\n"
foo():
        subl    $20, %esp       #,
        movl    $5, %eax        #, tmp84
 ## Notice that GCC hasn't set up EBP at all before it runs your asm,
 ## and hasn't stored x in memory.
 ## It only put it in a register like you asked it to.
        .byte 0x8b,0x45,0xf8   # 0 = %eax   1 = %eax    # y, tmp84
        .byte 0x89, 0xC0

        pushl   %eax  # y
        pushl   $.LC0 #
        call    printf  #
        addl    $28, %esp       #,
        ret

Also note that if you were compiling as 64-bit, it would probably pick %esi as a register because printf will want its 2nd arg there. So the "a" instead of "r" constraint would actually matter.

You could get 32-bit GCC to use a different register if you were assigning to a variable that has to survive across a function call; then GCC would pick a call-preserved reg like EBX instead of EAX.
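One way to try this out, sketched under assumptions: keep the asm output live across a call and put the operands in an asm comment so the choice shows up in the gcc -S output. The function name copy_then_call is made up for this example, and which register GCC actually picks depends on the compiler version and options.

```cpp
#include <cstdio>

// y is produced before the call and used after it, so the compiler has to
// keep it somewhere that survives the call (hypothetical example function).
int copy_then_call(int x) {
    int y;
#if defined(__i386__) || defined(__x86_64__)
    // The trailing asm comment shows the chosen operands in gcc -S output.
    asm ("mov %1, %0   # 0 = %0   1 = %1\n\t"
         : "=r" (y)
         : "r" (x));
#else
    y = x;  // fallback for non-x86 targets (not part of the technique)
#endif
    std::puts("between the asm statement and the return");
    return y;
}
```

Compiling this with something like g++ -O3 -m32 -S and reading the comment line should show whether a call-preserved register was chosen for %0.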

Peter Cordes