I created a simple C++ source file with the following code:

    int main() {
        int a = 1;
        int b = 2;
        if (a < b) {
            return 1;
        }
        else if (a > b) {
            return 2;
        }
        else {
            return 3;
        }
    }

I used the `objdump` command to get the assembly code for the above source code.
The line `int b = 2;` got converted into `mov DWORD PTR [rbp-0x4], 0x2`.
Its corresponding machine code is `C7 45 FC 02 00 00 00` (hex format).

I would like to know how I can convert assembly code into binary code. I went through the Intel Reference Manual for x86-64, but I was not able to understand it, since I am new to low-level programming.

Sep Roland
Abhisheyk Deb
  • What do you mean by 'convert'? Using a program? Doing it manually? – Shiro May 24 '17 at 14:57
  • Converting it manually. – Abhisheyk Deb May 24 '17 at 14:58
  • `int b = 2;` is NOT assembly language. The difference is that C is a compiled language, so the line `int b = 2;` may be implemented in many different ways (or even removed completely by the optimizer), depending on how the compiler decides to produce machine code that gives the results defined by the C language standard. Assembly is different: an assembler is not a compiler of this kind. When you write `add rax,rbx` in assembly, it is assembled as exactly that, without an optimizer changing or removing the instruction, so it is more like a "1:1 transformation". – Ped7g May 24 '17 at 15:07

1 Answer

You should read the Intel manuals; they explain how to do that. For a simpler reference, read this. The way x86 instructions are encoded is fairly straightforward, but the number of possibilities can be a bit overwhelming.

In a nutshell, an x86 instruction comprises the following parts, where every part except the opcode may be missing:

prefix opcode operands immediate

The prefix field may modify the behaviour of the instruction, which doesn't apply to your use case. You can look up the opcode in a reference (I like this one). For example, mov r/m32, imm32 is C7 /0, which means: the opcode is C7, and the reg field of the modr/m byte, which would normally encode a second register operand, is set to 0 and acts as an opcode extension. The instruction thus has the form

C7 /0 imm32

The operand/extended opcode is encoded as a modr/m byte with an optional sib (scale index base) byte for some addressing modes and an optional 8 bit or 32 bit displacement. You can look up what value you need in the reference. So in your case, you want to encode a memory operand [rbp] with a one-byte displacement and a register field of 0. That is mod = 01, reg = 000, r/m = 101, i.e. binary 01 000 101, which gives the modr/m byte 45. So the encoding is:

C7 45 disp8 imm32
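
The table lookup for the modr/m byte can also be done by packing the three fields by hand. This is only a minimal sketch: the helper name `modrm` is hypothetical, and the only thing assumed is the usual field layout (mod in the top two bits, reg in the middle three, r/m in the low three).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: pack the three modr/m fields into one byte.
// Layout: mod in bits 7..6, reg in bits 5..3, r/m in bits 2..0.
inline std::uint8_t modrm(unsigned mod, unsigned reg, unsigned rm) {
    assert(mod < 4 && reg < 8 && rm < 8);  // each field must fit its width
    return static_cast<std::uint8_t>((mod << 6) | (reg << 3) | rm);
}

// For the instruction discussed here:
//   mod = 01  -> memory operand followed by an 8-bit displacement
//   reg = 000 -> the /0 opcode extension of C7
//   r/m = 101 -> base register rbp
// so modrm(1, 0, 5) yields 0x45.
```

Changing only mod to 10 (a 32-bit displacement) gives 0x85 with the same reg and r/m fields.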

Now we encode the 8 bit displacement in two's complement. -4 corresponds to FC, so this is

C7 45 FC imm32
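
To see where FC comes from, the two's complement step can be sketched as a conversion to an unsigned byte. The helper name `encode_disp8` is made up for illustration; it only accepts displacements that fit in a signed byte.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: encode a displacement in the range -128..127
// as a single two's complement byte.
inline std::uint8_t encode_disp8(int disp) {
    assert(disp >= -128 && disp <= 127);
    // Conversion to an unsigned 8-bit type wraps modulo 256,
    // which is exactly the two's complement bit pattern.
    return static_cast<std::uint8_t>(disp);
}

// encode_disp8(-4) is 256 - 4 = 252 = 0xFC.
```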

Lastly, we encode the 32 bit immediate, which you want to be 2. Note that it is in little endian:

C7 45 FC 02 00 00 00

And that's how the instruction is encoded.
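
Putting the steps together, here is a small sketch that emits the bytes for this one instruction form only (`mov DWORD PTR [rbp+disp8], imm32`). The function name `encode_mov_rbp_disp8_imm32` is hypothetical, and this is not a general assembler; it just mirrors the walk-through above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode `mov DWORD PTR [rbp+disp], imm` using the
// C7 /0 form with an 8-bit displacement.
inline std::vector<std::uint8_t> encode_mov_rbp_disp8_imm32(int disp, std::uint32_t imm) {
    assert(disp >= -128 && disp <= 127);  // must fit in a signed byte
    std::vector<std::uint8_t> bytes;
    bytes.push_back(0xC7);                             // opcode
    bytes.push_back(0x45);                             // modr/m: mod=01, reg=/0, r/m=101 (rbp)
    bytes.push_back(static_cast<std::uint8_t>(disp));  // disp8 in two's complement
    for (int shift = 0; shift < 32; shift += 8) {
        // imm32 in little endian: least significant byte first
        bytes.push_back(static_cast<std::uint8_t>(imm >> shift));
    }
    return bytes;
}
```

Calling `encode_mov_rbp_disp8_imm32(-4, 2)` returns the seven bytes C7 45 FC 02 00 00 00.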

fuz
  • So in the [link](http://ref.x86asm.net/geek64.html) you provided, I went to the C7 one-byte opcode, and it is for the **MOV** instruction. But what do the two operands Evqp and Ivds mean? Do they correspond to r/m32 addressing and an immediate? Thank you for the help – Abhisheyk Deb May 24 '17 at 15:41
  • See [this page](http://ref.x86asm.net/#column_op) for the meaning of the fields. The reference I linked is highly condensed but more difficult to read. – fuz May 24 '17 at 17:52
  • Okay great. Another question was that you said "I wanted to encode rbp register with 1 byte displacement(8 bits)[**DWORD PTR [rbp-0x4]**]" , when I was seeing the table for MOD r/m in the [link](http://ref.x86asm.net/geek64.html#modrm_byte_32_64) I am also seeing a version of 32 bits displacements, can you give me an example of that? – Abhisheyk Deb May 24 '17 at 18:13
  • @AbhisheykDeb With a 32 bit displacement, the encoding would be `C7 85 FC FF FF FF 02 00 00 00` where the modr/m byte is `85` instead of `45` and the displacement is `FC FF FF FF`. – fuz May 24 '17 at 18:36
  • So if **mov rbp,rsp** gets converted to **48 89 e5**, I get that since both operands are registers the opcode is 89, and checking the table in the [link](http://ref.x86asm.net/geek64.html#modrm_byte_32_64) the mod r/m byte is e5, but I do not get why 48 is prefixed to the code. How did we get that? – Abhisheyk Deb May 24 '17 at 18:38
  • @AbhisheykDeb The `48` prefix is a REX.W prefix. It indicates that the operand size is 64 bit instead of 32 bit. – fuz May 24 '17 at 18:52
  • Thank you for the help. Another question (I know I am asking too many, it's just that I am very new to this): for example, I have the instruction **mov edi, [ebx-0x4]**, which gets converted to **67 8b 7b fc**. I read in the manual and it said "Address override, 67h. Changes size of address expected by the instruction. 32-bit address could switch to 16-bit and vice versa." Is that a valid understanding? – Abhisheyk Deb May 24 '17 at 19:25
  • @AbhisheykDeb Seems correct. In 16 bit mode and 64 bit mode, `67` changes to 32 bit addressing mode. In 32 bit mode, it changes to 16 bit mode. In 64 bit mode, you should rarely need the `67` prefix. – fuz May 24 '17 at 19:40
  • Okay thank you. If I convert **mov [ebp+0x79], ecx** to machine code I get **67 89 4d 79**, which uses an 8 bit displacement, but when I convert **mov [ebp+0x80], ecx** to machine code I get **67 89 8d 80 00 00 00**. Why does the second instruction use a 32 bit displacement when **80** can be represented in 8 bits? – Abhisheyk Deb May 24 '17 at 20:29
  • @AbhisheykDeb Note that the displacement is a signed integer, so `67 89 4d 80` is actually `mov [ebp-0x80], ecx`. `+0x80` is the first displacement that can't be encoded in a single byte. – fuz May 24 '17 at 20:53
  • Okay thank you. So if I have the instruction **mov [0x5], edx**, it gets converted to **89 14 25 05 00 00 00**. I am able to understand everything except why **25** is there. I mean, according to the mod r/m table I do get **14** when I check [sib] and edx. And the same for **mov [esp+4], ebp**: I get **67 89 6c 24 04**, but why is **24** there? – Abhisheyk Deb May 24 '17 at 21:32
  • @AbhisheykDeb `25` is the SIB byte indicating a 32 bit absolute address. For your second example, `6c` indicates “[sib] + disp8, ebp” and 24 is the SIB byte, indicating `esp` as a base register and no index register. – fuz May 24 '17 at 21:54
  • Another query: **mov eax, 0x55** translates to **b8 55 00 00 00**, **mov ebx, 0x55** translates to **bb 55 00 00 00**, and **mov ecx, 0x55** translates to **b9 55 00 00 00**. All of them have a different opcode, even though they all conform to the form **MOV r32, imm32**, which has the opcode **B8+ rd id** according to the Intel manual. – Abhisheyk Deb May 24 '17 at 22:27
  • @AbhisheykDeb Look into the manual. There is a separate encoding for `mov reg32, imm32` which is `b8+reg`. – fuz May 24 '17 at 22:40
  • Okay, got it. I was not able to understand the **+reg** part, which means we have to add the 3-bit number of the respective register, e.g. **001** for **ecx** – Abhisheyk Deb May 25 '17 at 19:16
  • Instructions such as **jmp 40050b** show up in objdump as **eb 14** and **eb 05**. Why do they have different machine code when they are both the same instruction? – Abhisheyk Deb May 25 '17 at 19:34
  • @AbhisheykDeb That's because the jump target is encoded relative to where we are right now (more precisely, relative to the end of the jump instruction). EB 14 is “jump 0x14 bytes ahead,” not “jump to 40050b.” Of course, to reach the same target, you have to jump by a different amount from each source. – fuz May 25 '17 at 21:50
  • Okay. So I was reading somewhere that **the x86 will not allow instructions greater than 15 bytes in length.** Is it the same for **x86-64**, and how do these instructions fit into memory? – Abhisheyk Deb May 26 '17 at 15:39
  • @AbhisheykDeb Can you please not ask hundreds of questions in comments? This is the last follow-up I am going to answer. Yes, the 15 byte limit applies to 64 bit mode, too. If an instruction encoding would be longer than 15 bytes, that instruction encoding is invalid. – fuz May 26 '17 at 16:07