
I'm just in the process of writing a PE file parser and I've reached the point where I'd like to parse and interpret the actual code within PE files, which I'm assuming are stored as x86 opcodes.

As an example, each of the exports within a DLL points to an RVA (Relative Virtual Address) of where the function will be stored within memory, and I've written a function to convert these RVAs to physical file offsets.
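
A conversion like that can be sketched roughly as follows. This is only a minimal illustration, not the actual function from my parser: the `Section` tuple, its field names (modelled on the PE section header fields), and the sample section values are all assumptions made up for the example.

```python
from typing import NamedTuple, Optional, Sequence


class Section(NamedTuple):
    """Hypothetical subset of a PE section header (names are illustrative)."""
    name: str
    virtual_address: int      # RVA of the section start once loaded
    virtual_size: int         # size of the section in memory
    pointer_to_raw_data: int  # file offset of the section's data
    size_of_raw_data: int     # number of bytes stored in the file


def rva_to_file_offset(rva: int, sections: Sequence[Section]) -> Optional[int]:
    """Map an RVA to a file offset, or return None if it has no file backing."""
    for sec in sections:
        # Some linkers leave virtual_size at 0; fall back to the raw size.
        size = sec.virtual_size or sec.size_of_raw_data
        if sec.virtual_address <= rva < sec.virtual_address + size:
            delta = rva - sec.virtual_address
            if delta >= sec.size_of_raw_data:
                return None  # zero-filled tail that exists only in memory
            return sec.pointer_to_raw_data + delta
    return None


# Made-up section layout, chosen so the result is the file offset 0x245D.
sections = [Section(".text", 0x1000, 0x3000, 0x400, 0x3000)]
print(hex(rva_to_file_offset(0x305D, sections)))
```

The idea is simply to find the section whose virtual range contains the RVA and add the distance from the section start to that section's raw data pointer.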

The question is, are these really opcodes, or are they something else?

Does it depend on the compiler/linker as to how the functions are stored within the file, or are they one or two byte x86 opcodes?

As an example, the Windows 7 DLL 'BWContextHandler.dll' contains four functions that are loaded into memory, making them available within the system. The first exported function is 'DllCanUnloadNow', and it is located at offset 0x245D within the file. The first four bytes of this data are: 0xA1 0x5C 0xF1 0xF2

So are these one or two byte opcodes, or are they something else entirely?

If anyone can provide any information on how to examine these, it would be appreciated.

Thanks!

After a bit of further reading, and running the file through the demo version of IDA, I think I'm correct in saying that the first byte 0xA1, is a one byte opcode, meaning mov eax. I got that from here: http://ref.x86asm.net/geek32.html#xA1 and I'm assuming it is correct for the time being.

However, I'm a bit confused as to how the bytes following comprise the rest of the instruction. From the x86 assembler that I know, a move instruction requires two parameters, the destination and the source, so the instruction is to move (something) into the eax register, and I'm assuming that the something comes in the following bytes. However I don't know how to read that information yet :)

Tony
  • This related posting http://stackoverflow.com/questions/2170843/va-virtual-adress-rva-relative-virtual-address contains a lot of info that could help you. – fvu Dec 07 '12 at 13:29
  • Thanks fvu, I'll have a read! – Tony Dec 07 '12 at 13:30
  • The `.text` section can contain both code and read-only data (but mostly code). You can use a disassembler to make sure what corresponds to what. – Sedat Kapanoglu Dec 07 '12 at 13:33
  • Thanks ssg, the code I mentioned at the top is within the .text section, so it's either code or data, I'm not sure which though. Is it possible to disassemble Windows DLLs? And if so, can you recommend a Windows disassembler? I know of IDA but I also know it's not free. Thanks for the comment! – Tony Dec 07 '12 at 13:35
  • Run dumpbin.exe on your executable with the /disasm option to see how it is done. Do note that you are reinventing the wheel. Always compare what dumpbin.exe tells you with what you display to ensure it isn't a square one. – Hans Passant Dec 07 '12 at 14:50
  • Thanks Hans, I know I am reinventing it to some extent, but I often find the best way for me to gain a better understanding of something is to implement some of the code myself. The PE parser I'm writing is designed to help me analyse PE files with regards to my PhD, so the more I can understand the better, thanks for your comment! – Tony Dec 07 '12 at 14:55
  • Also see the intel instruction set reference which gives all the encoding details. – Jester Dec 07 '12 at 15:48
  • It's a `mov eax, 0xXXf2f15c` where `XX` is a missing fifth byte. – szx Dec 07 '12 at 15:48
  • BTW I've just found this wonderful website: http://www.onlinedisassembler.com/odaweb/run_hex <-- much faster than running IDA for small amounts of code – szx Dec 07 '12 at 15:51

2 Answers


x86 uses a complex, variable-length, multi-byte encoding, so you can't decode an instruction by looking up a single row in an opcode table, as you can for RISC architectures (MIPS, SPARC, DLX). A single instruction can be up to 15 bytes long: optional prefixes (including the multi-byte VEX prefix), a 1-3 byte opcode, and then several fields that encode an immediate or a memory address, offset, and scaling (imm, ModR/M and SIB; moffs). A single mnemonic sometimes has dozens of opcodes. In some cases the same assembly line even has two possible encodings (in 32-bit mode, "inc eax" can be either 0x40 or 0xff 0xc0).

one byte opcode, meaning mov eax. I got that from here: http://ref.x86asm.net/geek32.html#xA1 and I'm assuming it is correct for the time being.

Let's look at the table:

po ; flds ; mnemonic ; op1 ; op2 ; grp1 ; grp2 ; Description

A1 ; W ; MOV ; eAX ; Ov ; gen ; datamov ; Move ;

(HINT: don't use the geek32 table; switch to http://ref.x86asm.net/coder32.html#xA1 instead. It has fewer fields and more of the decoding done for you, e.g. "A1 MOV eAX moffs16/32 Move")

The columns op1 and op2 (http://ref.x86asm.net/#column_op) describe the operands. For the A1 opcode the first operand is always eAX, and the second (op2) is Ov. According to the table at http://ref.x86asm.net/#Instruction-Operand-Codes:

O / moffs Original The instruction has no ModR/M byte; the offset of the operand is coded as a word, double word or quad word (depending on address size attribute) in the instruction. No base register, index register, or scaling factor can be applied (only MOV (A0, A1, A2, A3)).

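So the A1 opcode is followed directly by an encoded memory offset. In 32-bit mode the default address size is 32 bits, so the offset occupies the next four bytes, stored little-endian; the instruction loads EAX from that memory address rather than from an immediate value.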
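
As a rough sketch of what that means for the bytes in the question (the fifth byte, 0x05, is taken from the longer dump quoted at the end of this answer), the operand can be read with Python's `struct` module. The function name and the output format are assumptions for illustration only, and the code handles just opcode A1 in 32-bit mode with no prefixes.

```python
import struct
from typing import Tuple


def decode_mov_eax_moffs32(code: bytes) -> Tuple[str, int]:
    """Decode 'A1 moffs32' (32-bit mode, no prefixes); return (text, length)."""
    if len(code) < 5 or code[0] != 0xA1:
        raise ValueError("not a complete A1 moffs32 instruction")
    # The four bytes after the opcode are a little-endian absolute address.
    (address,) = struct.unpack_from("<I", code, 1)
    return f"mov eax, dword ptr [0x{address:08X}]", 5


text, length = decode_mov_eax_moffs32(bytes([0xA1, 0x5C, 0xF1, 0xF2, 0x05]))
print(text, length)
```

With those five bytes the address comes out as 0x05F2F15C, so the instruction is five bytes long and the next instruction starts at the sixth byte.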

PS: If your task is to parse PE files and not to write a disassembler from scratch, use an existing x86 disassembly library such as libdisasm or libudis86.

PPS: For original question:

The question is, are these really opcodes, or are they something else?

Yes, "A1 5C F1 F2 05 B9 5C F1 F2 05 FF 50 0C F7 D8 1B C0 F7 D8 C3 CC CC CC CC CC" is x86 machine code.

osgx
  • Thanks very much osgx, that has answered my question. I'll have a look through the pages and see how much I can work out. Thanks again!! – Tony Dec 07 '12 at 19:25

Disassembly is difficult, particularly for code generated by the Visual Studio compiler, and particularly for x86 programs. There are several issues:

  1. Instructions are variable length, and can start at any offset. Some architectures require instruction alignment; x86 does not. If you start reading at offset 0, you will get different results than if you start reading at offset 1. You have to know what the valid "starting locations" (function entry points) are.

  2. Not all addresses in the text section of an executable are code. Some are data. Visual Studio will place "jump tables" (arrays used to implement switch statements) in the text section underneath the procedure that reads them. Misinterpreting data as code will lead you to produce incorrect disassembly.

  3. You can't have perfect disassembly that will work with all possible programs. Programs can modify themselves. In those cases you have to run the program to know what it does, and that ends up leading to the "halting problem". The best you can hope for is disassembly that works on "most" programs.

The algorithm typically used to try to address these issues is called "recursive descent" disassembly. It works similarly to a recursive descent parser, in that it starts with a known "entry point" (either the "main" method of an exe, or all the exports of a DLL) and then starts disassembling. Other entry points are discovered during disassembly. For example, given a "call" instruction, the target will be assumed to be an entry point. The disassembler will iteratively disassemble discovered entry points until no more are found.
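
To make the idea concrete, here is a minimal sketch of that worklist loop. It is not how any real disassembler is implemented: the `decode` helper is a hypothetical toy that understands only a handful of one-byte opcodes (ret, nop, call rel32, jmp rel32, jmp rel8), addresses are plain offsets into the byte buffer, and the sample bytes are made up for the example.

```python
import struct
from typing import Dict, Iterable, List, Tuple


def decode(code: bytes, off: int) -> Tuple[str, int, List[int], bool]:
    """Toy decoder: returns (text, length, branch targets, falls_through)."""
    op = code[off]
    if op == 0xC3:
        return "ret", 1, [], False
    if op == 0x90:
        return "nop", 1, [], True
    if op in (0xE8, 0xE9):  # call rel32 / jmp rel32
        (rel,) = struct.unpack_from("<i", code, off + 1)
        target = off + 5 + rel  # relative to the end of the instruction
        name = "call" if op == 0xE8 else "jmp"
        return f"{name} 0x{target:X}", 5, [target], op == 0xE8
    if op == 0xEB:  # jmp rel8
        (rel,) = struct.unpack_from("<b", code, off + 1)
        target = off + 2 + rel
        return f"jmp 0x{target:X}", 2, [target], False
    raise ValueError(f"opcode 0x{op:02X} not handled by this toy decoder")


def recursive_descent(code: bytes, entry_points: Iterable[int]) -> Dict[int, str]:
    """Disassemble only the bytes reachable from the given entry points."""
    listing: Dict[int, str] = {}
    worklist = list(entry_points)
    while worklist:
        off = worklist.pop()
        # Follow straight-line code until it ends or was already visited.
        while 0 <= off < len(code) and off not in listing:
            try:
                text, length, targets, falls_through = decode(code, off)
            except (ValueError, struct.error):
                break  # unknown or truncated instruction: stop this path
            listing[off] = text
            worklist.extend(targets)  # newly discovered entry points
            if not falls_through:
                break
            off += length
    return listing


# call 0x7; ret; one unreachable data byte (0xFF); nop; ret
code = bytes([0xE8, 0x02, 0x00, 0x00, 0x00, 0xC3, 0xFF, 0x90, 0xC3])
listing = recursive_descent(code, [0])
for off in sorted(listing):
    print(f"{off:04X}: {listing[off]}")
```

In this example the byte at offset 6 is never decoded, because no control flow reaches it. That is the benefit over a linear sweep that decodes every byte in order, and it is also why code reached only through indirection is missed.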

That technique, however, has some problems. It won't find code that is only ever executed through indirection. On Windows, a good example is handlers for SEH exceptions. The code that dispatches to them is actually inside the operating system, so recursive descent disassembly will not find them, and won't disassemble them. However, they can often be detected by augmenting recursive descent with pattern recognition (heuristic matching).

Machine learning can be used to automatically identify patterns, but many disassemblers (like IDA Pro) use hand-written patterns with a good deal of success.

In any case, if you want to disassemble x86 code, you need to read the Intel Manual. There are a lot of scenarios that need to be supported. The same bit patterns in an instruction can be interpreted in various different ways depending on modifiers, prefixes, the implicit state of the processor, etc. That's all covered in the manual. Start by reading through the first few sections of Volume I. That will walk through the basic execution environment. Most of the rest of the stuff you need is in Volume II.

Scott Wisniewski
  • Thanks Scott. I think I mostly wanted to just gain a rough understanding of how the code within a PE file was structured. The main task I'm trying to accomplish is the parsing of the PE file, so I think I'll leave the disassembler work to one of the existing libraries. Thanks for your post though, very informative! – Tony Dec 09 '12 at 13:55
  • Scott, is "recursive descent" the method of choice in IDA? – osgx Dec 10 '12 at 00:19
  • Yes. IDA uses recursive descent. – Scott Wisniewski Dec 10 '12 at 03:00