How is machine code generated from assembly?

Question

I am trying to understand how machine code is formed from assembly code.

I am using the NASM assembler.

Say, for example, I have assembly code like this:

BITS 64;
mov rbx, 0x0123456789abcdef; 
mov rax, rbx;
add rax, rax;
ret;

I run nasm example.S

Then I disassemble the output with ndisasm -b64 example (for a 64-bit little-endian machine).

I get machine code like this:

00000000  48BBEFCDAB896745  mov rbx,0x123456789abcdef
     -2301
0000000A  4889D8            mov rax,rbx
0000000D  4801C0            add rax,rax
00000010  C3                ret

Can someone explain to me the relation between the given machine code and the assembly code? How do I figure out the opcode for each instruction and the encoding of each register?

Let me be more clear, from the generated code, how can I figure out codes for mov, add, rax , rbx etc — Narasimha Prasanna HN, Sep 13 '19 at 08:54
Yes, I misunderstood the question first. You have two options. 1) Reverse engineer it yourself or 2) Read the documentation for your cpu. — klutt, Sep 13 '19 at 08:55
Basically nasm uses a bunch of tables built from information that can be found in [Intel® 64 and IA-32 Architectures Software Developer Manuals](https://software.intel.com/en-us/articles/intel-sdm). — 500 - Internal Server Error, Sep 13 '19 at 08:56
Another option you have is to look at the source code for nasm. — klutt, Sep 13 '19 at 09:16
Assembly _is_ machine code. The textual assembler instructions are replaced by their equivalent op code numbers and there you have it. To know the op code for a certain assembler instruction, check the CPU manual. One instruction may have different op codes depending on the parameters expected. — Lundin, Sep 13 '19 at 09:17

springborn · Answer 1 · 2019-09-13T11:25:02.907

If you have the machine code and want to understand how it came from the assembly:

Step 1: Find the Instruction Set summary for your processor architecture.

Step 2: Look up which machine code bits in each instruction contain the opcode. It is useful to have the machine code in binary here, unless you're fluent in hex-to-binary conversion. You should also check the endianness at this point.

Step 3: Look up which instruction corresponds to the opcode.

Step 4: Look at the instruction description and find out which bits belong to which instruction field (destination register, addresses, immediates, etc).

Step 5: Write out the instruction according to the numbers in each field. You may need to look up which registers correspond to which numbers.

Now you have disassembled your machine code.
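
As an illustration of steps 2 to 5 on the question's own bytes, here is a minimal Python sketch that decodes the two register-to-register instructions from the listing. It assumes the instruction is exactly three bytes (a REX prefix with W=1, a one-byte opcode, and a ModRM byte in register-direct mode) and it knows only the two opcodes that appear in the question. decode_reg_reg and the tables are names made up for this example, not part of any real tool.

```python
# Register numbers: low 3 bits come from the ModRM byte, bit 3 from the REX prefix.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
         "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]

# Only the two opcodes seen in the question's listing (r/m64, r64 forms).
OPCODES = {0x89: "mov", 0x01: "add"}


def decode_reg_reg(code: bytes) -> str:
    """Decode a 3-byte 'REX.W, opcode, ModRM' register-to-register instruction."""
    rex, opcode, modrm = code
    # REX prefix layout is 0100WRXB; W=1 selects 64-bit operands.
    if rex & 0xF0 != 0x40 or not rex & 0x08:
        raise ValueError("expected a REX prefix with W=1")
    # ModRM layout is mod(2) reg(3) rm(3); mod=11 means both operands are registers.
    if modrm >> 6 != 0b11:
        raise ValueError("only register-direct operands are handled")
    rex_r = (rex >> 2) & 1
    rex_b = rex & 1
    src = ((modrm >> 3) & 7) | (rex_r << 3)   # 'reg' field: source for these opcodes
    dst = (modrm & 7) | (rex_b << 3)          # 'rm' field: destination
    return f"{OPCODES[opcode]} {REG64[dst]},{REG64[src]}"


print(decode_reg_reg(bytes.fromhex("4889D8")))  # bytes at offset 0xA in the question
print(decode_reg_reg(bytes.fromhex("4801C0")))  # bytes at offset 0xD in the question
```

A real x86 decoder has to handle optional prefixes, multi-byte opcodes, memory operands and immediates, so treat this as a sketch of the idea only.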

To practise this, it might be prudent to start with something like AVR assembly, where most instructions are a single 16-bit word.

If you have the assembly and want to assemble it by hand into machine code:

Step 1: Find the Instruction Set summary for your processor architecture.

Step 2: Find the instruction you want to assemble.

Step 3: Fill the relevant bit fields with the values the instruction format demands.

@EricPostpischil disassemble, from wiktionary: "To convert machine code to a human-readable, mnemonic form." — springborn, Sep 13 '19 at 10:59
And that is not what is done when one converts assembly to machine code. One converts the human-readable assembly mnemonics to machine code. — Eric Postpischil, Sep 13 '19 at 11:18
The question asks how to generate machine code from assembly. This answer attempts to state how to generate assembly from machine code. Even at that, it is inadequate. In common architectures, the bits for the opcode are not in fixed positions. There may be prefix bytes that have to be processed first. Then some bits have to be examined to determine what other bits take part in determining the operation. Then the operands may have a variety of forms. Some of them are encoded in various ways. — Eric Postpischil, Sep 13 '19 at 11:26

score 0 · Answer 2 · answered Sep 13 '19 at 18:47

If you want to reverse engineer the meaning of individual bits of each machine instruction, instead of just reading the Intel manuals that were linked from the comments, you need to do it systematically: vary one thing at a time in the input assembly, and see how the machine code changes. For example: assemble

mov rax, rax
mov rax, rbx
mov rax, rcx
mov rax, rdx
mov rax, rsi
mov rax, rdi
mov rax, rbp
mov rax, rsp
mov rax, r8
mov rax, r9
mov rax, r10
mov rax, r11
mov rax, r12
mov rax, r13
mov rax, r14
mov rax, r15

mov eax, eax
mov eax, ebx
mov eax, ecx
mov eax, edx
mov eax, esi
mov eax, edi
mov eax, ebp
mov eax, esp
mov eax, r8d
mov eax, r9d
mov eax, r10d
mov eax, r11d
mov eax, r12d
mov eax, r13d
mov eax, r14d
mov eax, r15d

and you get

   0:   48 89 c0                mov    rax,rax
   3:   48 89 d8                mov    rax,rbx
   6:   48 89 c8                mov    rax,rcx
   9:   48 89 d0                mov    rax,rdx
   c:   48 89 f0                mov    rax,rsi
   f:   48 89 f8                mov    rax,rdi
  12:   48 89 e8                mov    rax,rbp
  15:   48 89 e0                mov    rax,rsp
  18:   4c 89 c0                mov    rax,r8
  1b:   4c 89 c8                mov    rax,r9
  1e:   4c 89 d0                mov    rax,r10
  21:   4c 89 d8                mov    rax,r11
  24:   4c 89 e0                mov    rax,r12
  27:   4c 89 e8                mov    rax,r13
  2a:   4c 89 f0                mov    rax,r14
  2d:   4c 89 f8                mov    rax,r15
  30:      89 c0                mov    eax,eax
  32:      89 d8                mov    eax,ebx
  34:      89 c8                mov    eax,ecx
  36:      89 d0                mov    eax,edx
  38:      89 f0                mov    eax,esi
  3a:      89 f8                mov    eax,edi
  3c:      89 e8                mov    eax,ebp
  3e:      89 e0                mov    eax,esp
  40:   44 89 c0                mov    eax,r8d
  43:   44 89 c8                mov    eax,r9d
  46:   44 89 d0                mov    eax,r10d
  49:   44 89 d8                mov    eax,r11d
  4c:   44 89 e0                mov    eax,r12d
  4f:   44 89 e8                mov    eax,r13d
  52:   44 89 f0                mov    eax,r14d
  55:   44 89 f8                mov    eax,r15d

and from this you can work out which bits of each instruction signify the source register and its width. Then you do the same thing holding the source register fixed and varying the destination register, and then you change mov to add and see what happens, and so on.
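
To make the "vary one thing, see what changes" step concrete, here is a small Python sketch over the 64-bit half of the listing. It XORs every encoding against the first one to find the bits that ever change, then reads those bits back as a register number. The helper names are made up for this example, and the byte strings are copied from the listing.

```python
# (source register, bytes) pairs copied from the first 16 lines of the listing.
SAMPLES = [
    ("rax", "48 89 c0"), ("rbx", "48 89 d8"), ("rcx", "48 89 c8"),
    ("rdx", "48 89 d0"), ("rsi", "48 89 f0"), ("rdi", "48 89 f8"),
    ("rbp", "48 89 e8"), ("rsp", "48 89 e0"), ("r8", "4c 89 c0"),
    ("r9", "4c 89 c8"), ("r10", "4c 89 d0"), ("r11", "4c 89 d8"),
    ("r12", "4c 89 e0"), ("r13", "4c 89 e8"), ("r14", "4c 89 f0"),
    ("r15", "4c 89 f8"),
]


def changed_bits(samples):
    """Per byte position, OR together the XOR of each sample with the first."""
    base = bytes.fromhex(samples[0][1])
    mask = [0] * len(base)
    for _, hexbytes in samples:
        for i, (a, b) in enumerate(zip(base, bytes.fromhex(hexbytes))):
            mask[i] |= a ^ b
    return mask


def source_number(hexbytes):
    """Combine the one changing prefix bit with the three changing ModRM bits."""
    prefix, _opcode, modrm = bytes.fromhex(hexbytes)
    return (((prefix >> 2) & 1) << 3) | ((modrm >> 3) & 7)


print([f"{m:08b}" for m in changed_bits(SAMPLES)])
for name, hexbytes in SAMPLES:
    print(name, source_number(hexbytes))
```

Only bit 2 of the first byte and bits 3 to 5 of the third byte vary, so together they form a 4-bit source register number. It also shows that the numbering is not alphabetical: rbx comes out as 3 and rsp as 4.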

It's going to be a lot more work to do this with x86 than with a more uniformly structured CPU architecture, e.g., um, almost anything else.

old_timer · Answer 3 · 2019-09-14T05:06:47.443

You read the manual for the processor in question. It will include the machine code and the assembly language for a tool related to the author of the manual. Assembly language is specific to the assembler, the tool that reads it; it does not have to conform to the processor vendor's manual so long as it generates working machine code for that target.

To make an assembler you work from the documentation forward. Suppose you see that there are multiple variants of an add instruction: an add with registers only, an add with the program counter, and an add with the stack pointer (assuming these are not also reachable as GPRs for this target). The assembler will need to parse the word add after some optional whitespace, with some whitespace after it, and then the operands. Say the assembler has to parse these lines:

add r1,r2,r3
add r1,r1,r2
add r1,r2
add sp,r1
add r1,pc,r2

Assume the assembler has a three-register add, so r1 = r2 + r3 in the first case. The documentation will indicate the machine code for the three-register add, the GPRs that can be used for each operand, and how to encode those into the machine code. Some assemblers may let you abbreviate add r1,r1,r2 as add r1,r2, implying a three-register add, or an instruction set might have a two-register add that can use a wider range of registers (maybe the instruction set has 32 registers but the three-register add is limited to r0-r15 for each operand, while the two-register add can use any of the 32). The assembler may still choose to use the three-register instruction where there is an overlap. Some assembly language designers choose to avoid any overlap, so that the assembler has no way to encode an assembly language instruction in more than one way.

As the parser works through the lines above, it will come across the stack pointer sp or the program counter pc. In my hypothetical instruction set those are not mapped as GPRs, so the presence of the sp or pc syntax tells the assembler to use the sp-specific or pc-specific add variant from the instruction set.

There really is zero magic to this: you see an instruction in the instruction set, you craft an assembly language syntax for it, and then you write code that parses it. If you have overlaps in the parsing, like more than one add, more than one and, or more than one mov, then you need to (or at least should) make it possible to uniquely create each of the possible machine code instructions (with all of their options and modifiers) and design the syntax so that it is possible to parse.
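
A toy Python version of such a parser for the five add lines above might look like this. Everything about the encoding is invented for illustration: the 16-bit layout, the opcode values and the function names do not belong to any real instruction set, and registers are limited to r0-r15.

```python
import re

# Invented 16-bit layout: [15:12] opcode, [11:8] rd, [7:4] rn, [3:0] rm.
# The opcode values are made up for this sketch.
OP_ADD_REG, OP_ADD_SP, OP_ADD_PC = 0x1, 0x2, 0x3


def parse_operand(token: str):
    """Return 'sp', 'pc', or a general-purpose register number 0-15."""
    token = token.strip().lower()
    if token in ("sp", "pc"):
        return token
    match = re.fullmatch(r"r(\d+)", token)
    if not match or int(match.group(1)) > 15:
        raise ValueError(f"bad operand: {token!r}")
    return int(match.group(1))


def assemble_add(line: str) -> int:
    """Pick the add variant from the operand syntax and encode it."""
    match = re.fullmatch(r"\s*add\s+(.+)", line)
    if not match:
        raise ValueError(f"not an add: {line!r}")
    ops = [parse_operand(tok) for tok in match.group(1).split(",")]
    is_gpr = [isinstance(op, int) for op in ops]

    if len(ops) == 2 and ops[0] == "sp" and is_gpr[1]:
        return (OP_ADD_SP << 12) | ops[1]                    # add sp,rM
    if len(ops) == 3 and is_gpr[0] and ops[1] == "pc" and is_gpr[2]:
        return (OP_ADD_PC << 12) | (ops[0] << 8) | ops[2]    # add rD,pc,rM
    if len(ops) == 2 and all(is_gpr):
        ops = [ops[0], ops[0], ops[1]]                       # add rD,rM is short for add rD,rD,rM
        is_gpr = [True, True, True]
    if len(ops) == 3 and all(is_gpr):
        return (OP_ADD_REG << 12) | (ops[0] << 8) | (ops[1] << 4) | ops[2]
    raise ValueError(f"no add variant matches: {line!r}")


for text in ["add r1,r2,r3", "add r1,r1,r2", "add r1,r2", "add sp,r1", "add r1,pc,r2"]:
    print(f"{text:14} -> {assemble_add(text):04x}")
```

In this sketch the two-operand form is treated as an abbreviation, so add r1,r1,r2 and add r1,r2 produce the same machine word. That is one of the design choices described above, not the only possible one.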

For the specific x86 instructions you are asking about, you would of course get to know the instruction set from the documentation before starting to learn the assembly language. That documentation will list the opcodes in some form. The instruction set is byte based: the first 8 bits of the instruction begin to tell the processor what is going on. Some instructions are completely described by those 8 bits; for others, the first 8 bits narrow things down to a list of possible instructions, and the second byte, or some combination of the following bytes, further reduces the possible choices until the unique instruction has been identified along with all of its operands and options.
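
Here is a rough Python sketch of that byte-by-byte narrowing, applied to the 17 bytes from the question. It only knows the handful of first bytes that occur there (a REX prefix, B8 to BF, 89, 01 and C3) and raises an error on anything else. split_instructions is a name made up for this example.

```python
def split_instructions(code: bytes):
    """Return (offset, hex bytes) for each instruction in a tiny x86-64 subset."""
    result = []
    i = 0
    while i < len(code):
        start = i
        rex_w = False
        if code[i] & 0xF0 == 0x40:          # 0x40-0x4F: REX prefix in 64-bit mode
            rex_w = bool(code[i] & 0x08)    # W bit selects 64-bit operand size
            i += 1
        opcode = code[i]
        i += 1
        if opcode == 0xC3:                  # ret: fully described by one byte
            pass
        elif opcode in (0x01, 0x89):        # add/mov r/m, r: a ModRM byte follows
            if code[i] >> 6 != 0b11:
                raise NotImplementedError("memory operands are not handled")
            i += 1
        elif 0xB8 <= opcode <= 0xBF:        # mov reg, imm: register is in the opcode
            i += 8 if rex_w else 4          # imm64 with REX.W, otherwise imm32
        else:
            raise NotImplementedError(f"unknown opcode {opcode:#04x}")
        result.append((start, code[start:i].hex()))
    return result


program = bytes.fromhex("48bbefcdab89674523014889d84801c0c3")
for offset, hexbytes in split_instructions(program):
    print(f"{offset:08X}  {hexbytes}")
```

The offsets it finds (0, 0xA, 0xD and 0x10) are the same instruction boundaries ndisasm reports in the question.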

Unfortunately, owing to the age of the original documentation and the choices of the individuals who wrote it, parts of which still live on, the assembly language as documented heavily overloads a number of mnemonics, so that a single mnemonic, mov for example, covers very many possible machine instructions. This is why x86, more than other instruction sets, has seen so many different assembly languages that try to resolve the confusion using less syntax. Instruction sets that came later learned from the ones before, with respect to both the design of the instruction set and the design of the assembly language.

Another thing that happens, and the longer lived the instruction set the more likely it is to happen, is that an assembler author may prefer a certain syntax, such as 5(r1) instead of [r1,5] (an addressing mode where the address is the contents of r1 plus the constant 5 decimal), basically making a non-MIPS instruction set MIPS-like (this seems to be a trend: learn MIPS in college, then try to make non-MIPS processors look and feel like your first instruction set rather than honor their history and diversity). Or using %r1 instead of r1, and so on, taking other syntax items and repurposing them for a different instruction set.

Assembly language authors are only limited by the odds of someone using their tool: if they go too far with their syntax, they may not end up with any users, and nobody will know that the tool exists. If the tool is part of an otherwise popular high-level language compiler (the output of the compiler is asm, which is assembled by this assembler and then linked by a linker: a "toolchain"), then you may be forced into this assembler if you want to use that compiler, warts and all. Ideally you would be writing very little or no assembly language if programming in the high-level language.
