How does a disassembler extract opcodes from memory?

Question

I'm trying to figure out how a disassembler works. Specifically, how the content of memory maps to the corresponding assembly language opcodes.

Below is the memory content; the first column is the address:

773eed5c  50 ff 15 0c 17 3a 77 90-90 90 90 90 8b ff 55 8b  P....:w.......U.
773eed6c  ec 51 51 83 7d 10 00 74-38 ff 75 10 8d 45 f8 50  .QQ.}..t8.u..E.P
773eed7c  e8 14 d7 ff ff 85 c0 74-24 56 ff 75 fc ff 75 0c  .......t$V.u..u.
773eed8c  ff 75 08 e8 ab ff ff ff-83 7d 10 00 8b f0 74 0a  .u.......}....t.
773eed9c  8d 45 f8 50 ff 15 9c 15-3a 77 8b c6 5e c9 c2 0c  .E.P....:w..^...
773eedac  00 83 65 fc 00 eb d2 90-90 90 90 90 8b ff 55 8b  ..e...........U.
773eedbc  ec 57 e8 c7 d6 ff ff 8b-4d 0c 6a 34 5f 03 c7 0f  .W......M.j4_...
773eedcc  b7 00 40 40 03 c9 3b c8-0f 82 07 e5 00 00 56 e8  ..@@..;.......V.

And here is the corresponding disassembly; the first column is the memory address, the second column is the instruction's bytes, and the remaining columns are the assembly instruction:

0x773eed5c 50 push    eax
0x773eed63 90 nop
0x773eed65 90 nop
0x773eed67 90 nop
0x773eed6a 55 push    ebp
0x773eed6d 51 push    ecx
0x773eed6f 837d1000 cmp     dword ptr [ebp+10h],0 ss:0023:056cfa8c=778237eb
0x773eed7c e814d7ffff call    kernel32!Basep8BitStringToDynamicUnicodeString (773ec495)

Now I can see that the bytes of the instruction e814d7ffff appear literally in memory (e8 14 d7 ff ff).

But how should I interpret the content at memory address 0x773eed5c? How do the opcodes for push eax and the consecutive nops map to the memory content 0c15ff50 90773a17 90909090 8b55ff8b (the same bytes viewed as DWORDs)?

UPDATE:

The disassembly I gave above is incorrect. The correct result, shown below, fits the content in memory nicely:

0x773eed5c 50 push    eax
0x773eed5d ff150c173a77 call    dword ptr [kernel32+0x170c (773a170c)] ds:0023:773a170c={ntdll!RtlExitUserThread (777ef608)}
0x773eed63 90 nop
0x773eed64 90 nop
0x773eed65 90 nop
0x773eed66 90 nop
0x773eed67 90 nop
0x773eed68 8bff mov     edi,edi
0x773eed6a 55 push    ebp
0x773eed6b 8bec mov     ebp,esp
0x773eed6d 51 push    ecx
0x773eed6e 51 push    ecx
0x773eed6f 837d1000 cmp     dword ptr [ebp+10h],0 ss:0023:0447fc24=778237eb
0x773eed73 7438 je      kernel32!OpenFileMappingA+0x45 (773eedad) [br=1]
0x773eed75 ff7510 push    dword ptr [ebp+10h]  ss:0023:0447fc24=778237eb
0x773eed78 8d45f8 lea     eax,[ebp-8]
0x773eed7b 50 push    eax
0x773eed7c e814d7ffff call    kernel32!Basep8BitStringToDynamicUnicodeString (773ec495)

For details about my mistake: I'm using pykd to develop a tool around WinDbg. The documentation for its disasm module doesn't cover the details, so I passed the wrong parameter to the disasm.jumprel function, which resulted in the incomplete disassembly.

Read your architecture's instruction manual; each instruction corresponds to a numerical value. — Kerrek SB, Mar 23 '14 at 19:03
The x86 instruction set is documented (see Intel's Software Developer's Manuals), and so is the PE (exe/dll) format. If you want to write a disassembler that produces output like the one you've shown in your question, you'd need a good grasp of both those sources of information. — Michael, Mar 23 '14 at 19:04
In your code, `push eax` and `nop`s are not consecutive, their addresses differ by quite a few bytes. What disassembler/memory viewer are you using (at first glance, it looks like WinDBG), and why are you dumping memory as DWORDs rather than single BYTEs? — DCoder, Mar 23 '14 at 19:04
@KerrekSB Yes, I get this part. What confuses me is how these numbers are placed in memory, and how the disassembler knows the placement and extracts the opcodes from memory. – yegle, Mar 23 '14 at 19:05
@yegle: Because that information is typically available in the headers of the executable file (which in this case would be PE). — Michael, Mar 23 '14 at 19:07
@DCoder Yes this is WinDbg output and I've updated the question to use BYTEs instead of DWORDs. — yegle, Mar 23 '14 at 19:08
@Michael I'm not asking how the module was placed in memory; that's a different question. What I want to know is, given a long hex number (the memory content), how does the disassembler extract the right opcodes as shown in my question? – yegle, Mar 23 '14 at 19:11
@yegle: The disassembler can just parse the executable file to get the sections that contain code, and then start disassembling those sections instruction by instruction. — Michael, Mar 23 '14 at 19:13
This question might be better suited for the [Reverse Engineering SE](http://reverseengineering.stackexchange.com/). — DCoder, Mar 23 '14 at 19:14
Ok so assuming this all makes sense, why is there so much missing from the disassembly? — harold, Mar 23 '14 at 19:14
Harold's question is very good. Like I said, there's a lot of bytes/instructions missing between the lines of disassembly you have provided, how did you get this listing? — DCoder, Mar 23 '14 at 19:19

score 2 · Answer 1 · edited Mar 23 '14 at 19:25

It's pretty simple actually.

Look at: http://www.mathemainzel.info/files/x86asmref.html

It's an x86 Instruction Set Reference.

If you look for "PUSH AX", you'll see that the opcode is 50 (in 32-bit code the same byte is shown as PUSH EAX). If you look for "NOP", you'll see that its opcode is 90.

So the disassembler has a table of what every opcode looks like (50 == PUSH AX, 90 == NOP, etc.). Some opcodes take more operand bytes than others. The CALL instruction has 4 encodings; the first one, E8, is a near call, followed by a displacement relative to the address of the next instruction.

Now, the x86 has different operating modes (16-bit, 32-bit, 64-bit) that reuse the same opcodes but change the size of the operands. The disassembler needs to know the mode in advance, because a "near pointer", for example, can take a different number of bytes depending on the mode.
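As a worked example of the E8 form, take the call at 0x773eed7c from the question. The four bytes after E8 are a signed little-endian displacement that is added to the address of the instruction that follows the call. A minimal Python sketch, assuming 32-bit code (the function name `rel32_target` is made up for illustration):

```python
import struct


def rel32_target(insn_addr, insn_bytes):
    """Return the target of an E8 (call rel32) instruction located at insn_addr."""
    if len(insn_bytes) != 5 or insn_bytes[0] != 0xE8:
        raise ValueError("expected a 5-byte E8 call")
    # Bytes 1..4 hold a signed 32-bit displacement, least significant byte first.
    (rel,) = struct.unpack("<i", insn_bytes[1:5])
    # The displacement is relative to the address of the *next* instruction.
    return (insn_addr + len(insn_bytes) + rel) & 0xFFFFFFFF


target = rel32_target(0x773EED7C, bytes.fromhex("e814d7ffff"))
print(hex(target))  # 0x773ec495
```

The result matches the address WinDbg prints next to the call in the question (773ec495).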

But in the end, a simple disassembler looks up its current opcode, consumes as many bytes as are required based on the opcode, and then creates the appropriate assembler instruction for that piece of memory.
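The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: 32-bit mode, no prefixes, and a hand-written table covering just the opcodes that occur in the first 37 bytes of the dump in the question. The names `OPCODES`, `modrm_extra`, and `linear_sweep` are invented here and do not come from any real disassembler.

```python
# Bytes copied from the dump in the question, starting at 0x773eed5c.
CODE = bytes.fromhex(
    "50 ff 15 0c 17 3a 77 90 90 90 90 90 8b ff 55 8b"
    "ec 51 51 83 7d 10 00 74 38 ff 75 10 8d 45 f8 50"
    "e8 14 d7 ff ff"
)

# opcode -> (description, has a ModRM byte, immediate size in bytes)
OPCODES = {
    0x50: ("push eax", False, 0),
    0x51: ("push ecx", False, 0),
    0x55: ("push ebp", False, 0),
    0x74: ("je rel8", False, 1),
    0x83: ("group1 r/m32, imm8", True, 1),
    0x8B: ("mov r32, r/m32", True, 0),
    0x8D: ("lea r32, m", True, 0),
    0x90: ("nop", False, 0),
    0xE8: ("call rel32", False, 4),
    0xFF: ("group5 r/m32", True, 0),
}


def modrm_extra(code, pos):
    """Number of SIB/displacement bytes after the ModRM byte at code[pos]."""
    mod, rm = code[pos] >> 6, code[pos] & 7
    if mod == 3:  # register operand, nothing follows
        return 0
    extra = 0
    if rm == 4:  # a SIB byte follows
        extra += 1
        if mod == 0 and (code[pos + 1] & 7) == 5:
            extra += 4  # SIB with no base register: disp32
    if mod == 0 and rm == 5:
        extra += 4  # absolute disp32
    elif mod == 1:
        extra += 1  # disp8
    elif mod == 2:
        extra += 4  # disp32
    return extra


def linear_sweep(code, base):
    """Decode instructions back to back; return (address, hex bytes, description)."""
    out, pos = [], 0
    while pos < len(code):
        name, has_modrm, imm = OPCODES[code[pos]]
        length = 1 + imm
        if has_modrm:
            length += 1 + modrm_extra(code, pos + 1)
        out.append((base + pos, code[pos:pos + length].hex(), name))
        pos += length
    return out


for addr, raw, name in linear_sweep(CODE, 0x773EED5C):
    print(f"0x{addr:08x} {raw:<14} {name}")
```

The addresses and byte groups it prints line up with the corrected listing in the question, including the six-byte ff150c173a77 call that the first listing skipped. A real disassembler also decodes the operands and handles prefixes and the full opcode map.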

More sophisticated disassemblers understand higher-level language constructs and can point out areas that are never reached by code (for example, by tracking jumps, calls, and branches, they know which bytes they never disassembled).
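A rough sketch of that idea, in Python, follows control flow from a function entry point instead of decoding every byte in order. It runs on the 80 bytes at 0x773eed68 from the dump in the question and makes several simplifying assumptions: 32-bit mode, a table limited to the opcodes in those bytes, call targets are not followed, and every FF instruction is treated as falling through (true for the `call`/`push` forms here, not for an indirect `jmp`). All names are hypothetical.

```python
BASE = 0x773EED68
CODE = bytes.fromhex(
    "8b ff 55 8b ec 51 51 83 7d 10 00 74 38 ff 75 10 8d 45 f8 50"
    "e8 14 d7 ff ff 85 c0 74 24 56 ff 75 fc ff 75 0c"
    "ff 75 08 e8 ab ff ff ff 83 7d 10 00 8b f0 74 0a"
    "8d 45 f8 50 ff 15 9c 15 3a 77 8b c6 5e c9 c2 0c"
    "00 83 65 fc 00 eb d2 90 90 90 90 90"
)

# opcode -> (has ModRM, immediate size, control-flow kind)
OPCODES = {op: (False, 0, "next") for op in (0x50, 0x51, 0x55, 0x56, 0x5E, 0x90, 0xC9)}
OPCODES.update({op: (True, 0, "next") for op in (0x85, 0x8B, 0x8D, 0xFF)})
OPCODES.update({
    0x83: (True, 1, "next"),
    0xE8: (False, 4, "next"),  # call: execution resumes after it
    0x74: (False, 1, "jcc"),   # je rel8: two successors
    0xEB: (False, 1, "jmp"),   # jmp rel8: one successor
    0xC2: (False, 2, "ret"),   # ret imm16: no successor
})


def modrm_extra(code, pos):
    """Number of SIB/displacement bytes after the ModRM byte at code[pos]."""
    mod, rm = code[pos] >> 6, code[pos] & 7
    if mod == 3:
        return 0
    extra = 0
    if rm == 4:
        extra += 1
        if mod == 0 and (code[pos + 1] & 7) == 5:
            extra += 4
    if mod == 0 and rm == 5:
        extra += 4
    elif mod == 1:
        extra += 1
    elif mod == 2:
        extra += 4
    return extra


def reachable(code, base, entry):
    """Map each reachable instruction address to its length."""
    seen, work = {}, [entry]
    while work:
        addr = work.pop()
        pos = addr - base
        if addr in seen or not 0 <= pos < len(code):
            continue
        has_modrm, imm, kind = OPCODES[code[pos]]
        length = 1 + imm
        if has_modrm:
            length += 1 + modrm_extra(code, pos + 1)
        seen[addr] = length
        nxt = addr + length
        if kind in ("jcc", "jmp"):  # signed 8-bit displacement from next insn
            rel = int.from_bytes(code[pos + 1:pos + 2], "little", signed=True)
            work.append(nxt + rel)
        if kind in ("next", "jcc"):
            work.append(nxt)
    return seen


insns = reachable(CODE, BASE, BASE)
covered = {a + i for a, n in insns.items() for i in range(n)}
unreached = [a for a in range(BASE, BASE + len(CODE)) if a not in covered]
print([hex(a) for a in unreached])
```

With this input the only bytes never visited are the five 90 bytes at 0x773eedb3 through 0x773eedb7, which sit after the final `jmp` and before the next function: padding rather than executed code.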

Disassemblers can get quite sophisticated, but a simple one is simple.

Hi Will, but why, in my example, does the disassembler _consume_ `50 ff 15 0c 17 3a 77` and give `50` as the final opcode? – yegle, Mar 23 '14 at 19:16
That's a good question, I can't say why it's skipping that large chunk of code. There's an online disassembler here: http://www.onlinedisassembler.com/odaweb/#view/tab-assembly/offset/00000000 It found the missing CALL mentioned above. — Will Hartung, Mar 23 '14 at 19:23

score 2 · Accepted Answer · answered Mar 23 '14 at 19:19

There does seem to be some stuff missing.

50                push eax
ff 15 0c 17 3a 77 call [0x773a170c] ; where did this thing go?
90                nop
90                nop
90                nop
90                nop
90                nop
8b ff             mov edi, edi  ; wut?
55                push ebp      ; this looks like the beginning of a function
8b ec             mov ebp, esp
51                push ecx
51                push ecx
83 7d 10 00       cmp dword ptr [ebp + 0x10], 0

I disassembled this manually, I may have made mistakes. This code is weird. Your disassembly of it is even weirder and I have no idea how it happened.

harold

Almost like the OP omitted every second instruction in his listing... [Here's an explanation for the `mov edi, edi` wut.](http://blogs.msdn.com/b/oldnewthing/archive/2011/09/21/10214405.aspx) – DCoder Mar 23 '14 at 19:21
@DCoder ah thanks, there's that mystery solved at least – harold Mar 23 '14 at 19:22
@harold you are absolutely right. This is a bug in my script. I'll update the question. – yegle Mar 23 '14 at 19:46
