11

Why is the process of disassembling a native Win32 image (e.g., one built in C/C++) miles more difficult than disassembling a .NET app?

What is the main reason for this?

Lieven Keersmaekers
Secret
  • 6
    .NET, because of the CLI and CLR, has a lot more metadata baked in so that the languages (VB, C#, etc.) can talk to each other. This means it's easier to disassemble, as the object code has to be more verbose to allow this. – Jesus Ramos Jan 11 '13 at 19:11
  • 1
    By disassemble do you mean reverse-engineer an application into its component parts? If so then maybe because .NET applications are all about components, and these are higher-level components than, say, low-level assembly language modules that make up a C++ program. – Darth Continent Jan 11 '13 at 19:11
  • 1
    Metadata information; .NET provides more information than other platforms do. – Paritosh Jan 11 '13 at 19:12
  • 1
    Disassembling means, in my book, to go from a binary to an equivalent assembly language program. It's trivial to disassemble native code, though you obviously will get native assembly code rather than CIL. Decompilation (going from binary to source code in a high-level language) is a different matter. –  Jan 11 '13 at 19:31
  • Neither is that difficult. Clearly CIL is easier to figure out how to read than straight x86-64 assembly code. – Security Hound Jan 11 '13 at 19:35
  • Faster assembly code = more obscure code = more difficult to decompile. – user541686 Jan 11 '13 at 21:58

5 Answers

16

A .NET assembly is built into Common Intermediate Language (CIL). It is not compiled to machine code until it is about to be executed, when the CLR compiles it to run on the appropriate system. The CIL carries a lot of metadata so that it can be compiled for different processor architectures and different operating systems (on Linux, using Mono). The classes and methods remain largely intact.

.NET also allows for reflection, which requires metadata to be stored in the binaries.

C and C++ code is compiled for the selected processor architecture and operating system at build time. An executable compiled for Windows will not work on Linux and vice versa. The output of the C or C++ compiler is machine instructions. Functions in the source code might not exist as functions in the binary; they may be inlined or otherwise optimized away. Compilers can also have quite aggressive optimizers that take logically structured code and make it look very different. The code will be more efficient (in time or space), but this can make it much harder to reverse.
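
As a small illustration of that last point (a sketch; the exact output depends on the compiler and flags), consider how an optimizer can erase the source structure entirely:

```cpp
#include <cstdio>

// A helper the optimizer is free to erase entirely.
static int sum_to(int n) {
    int total = 0;
    for (int i = 1; i <= n; ++i)
        total += i;
    return total;
}

int main() {
    // With a typical optimizing build (e.g. -O2), GCC and Clang
    // constant-fold the whole call: the binary contains no loop and
    // no sum_to() at all, just the equivalent of printf("%d\n", 5050).
    std::printf("%d\n", sum_to(100));
}
```

A disassembler then sees only the folded constant; the loop, the call, and the function name are simply gone.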

Steve
  • It is easy to compile C++ code on Windows to run on Linux on a different architecture. That's how all ARM compilers work; there is no ARM compiler written on ARM. A big chunk of C++ optimization happens before the code is transformed to assembly, so those patterns are common and can be inverted by a disassembler on any platform. – Dennis Jan 11 '13 at 19:37
  • It is also possible to compile C++ code into an object library on Windows and then link it on Linux. The only differences between a C++ application on Linux and on Windows are the OS API that it uses and the format of the executable file. x86 instructions are the same on any OS. – Dennis Jan 11 '13 at 19:39
  • @Dennis What I meant to write was "An executable compiled for Windows...". I appreciate you can compile on one platform and target another. The compilation process doesn't make it impossible to disassemble the code, but it can make it harder. I've made some tweaks that hopefully clarify. – Steve Jan 11 '13 at 19:41
  • @Dennis I would argue that the different API of the OS is another reason it is harder to reverse a natively built application. With .net the library calls will be common. Although I appreciate the question is about disassembly rather than reversing. – Steve Jan 11 '13 at 20:38
  • What I meant, guys, is that you are right, but only to a degree. All these arguments are noise in comparison with the lack of symbolic information in C++. – Dennis Jan 11 '13 at 20:41
  • Try to remove all symbolic information from disassembled .NET code, and all those benefits of .NET that you listed will not help you understand the resulting code. – Dennis Jan 11 '13 at 20:42
  • On the other hand, try to add symbolic information to badly decompiled C++ code, and it is not as bad as you would think. – Dennis Jan 11 '13 at 20:43
  • +1 for being the only person to mention both metadata *and* insane compiler optimizations! – BlueRaja - Danny Pflughoeft Jan 12 '13 at 00:13
14

Because .NET's design allows for interoperability between languages such as C#, VB, and even C/C++ through the CLI and CLR, extra metadata has to be put into the object files to correctly describe class and object properties. This makes them easier to disassemble, since the binary objects still contain that information, whereas C/C++ can throw that information away, since it is not necessary (at least for the execution of the code; the information is still required at compile time, of course).

This information is typically limited to class related fields and objects. Variables allocated on the stack will probably not have annotations in a release build since their information is not needed for interoperability.
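
To sketch what "throwing that information away" looks like from the C++ side (hypothetical names; a typical field layout is assumed):

```cpp
// The names below exist only in the source. In a release build the
// member accesses are reduced to plain offset arithmetic, and nothing
// obliges the compiler to keep "Account", "balance", or "fee" around
// (unless they are exported, or kept in a separate PDB/symbol file).
struct Account {
    int    id;       // typically becomes "offset 0"
    double balance;  // typically becomes "offset 8" (after padding)
};

double fee(const Account& a) {
    // Compiles to roughly: load the double at [a + 8], multiply, return.
    return a.balance * 0.01;
}
```

A .NET assembly, by contrast, must keep the type name, field names, and method signature in its metadata tables so that other languages can bind against them.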

Jesus Ramos
  • +1. As mentioned in the answer, if you have equivalent metadata for native code, which is normally not available along with native binaries (i.e. a complete PDB for VS-built C/C++ code), reverse-engineering of the native code is significantly easier. – Alexei Levenkov Jan 11 '13 at 19:18
  • @AlexeiLevenkov Especially if someone decides to ship a debug build with all the symbol tables as well :P, and especially annotated code generated by some compilers, which put plaintext source snippets in the object file for context during debugging. – Jesus Ramos Jan 11 '13 at 19:20
6

One more reason: the optimizations that most C++ compilers perform when producing the final binary are not performed at the IL level for managed code.

As a result, something like iterating over a container looks like a couple of inc/jnc assembly instructions in native code, compared with function calls with meaningful names in IL. The code that actually executes may be the same (or at least close), since the JIT compiler will inline some calls much as a native compiler would, but the code one can look at is much more readable in CLR land.
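
A hedged sketch of that contrast in C++ (the assembly in the comments is illustrative only; real output varies by compiler, flags, and target):

```cpp
#include <vector>

// In IL, the equivalent loop is expressed through calls with meaningful
// names (e.g. MoveNext / get_Current on an enumerator). In an optimized
// native build it usually shrinks to an anonymous pointer walk:
int sum(const std::vector<int>& v) {
    int total = 0;
    for (int x : v)     //   mov  rax, <begin>    ; illustrative only
        total += x;     //   add  ecx, [rax]
                        //   add  rax, 4
                        //   cmp  rax, <end>
                        //   jne  <loop>
    return total;
}
```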

Alexei Levenkov
  • This is not as important as it sounds. The compiler optimization techniques are known. – Dennis Jan 11 '13 at 19:33
  • .NET also does optimization of IL code. It unrolls loops, simplifies if statements, and things like that. Tools like dotPeek know how to recognize these patterns. The same is applicable to C++; of course, in the case of C++ it is harder. But if you had the names of all parameters, methods, and class fields, the resulting code would be good enough even considering all the optimization. – Dennis Jan 11 '13 at 19:35
  • This is precisely the correct answer. It's not so much about metadata as it is about the lack of optimizations. – user541686 Jan 11 '13 at 21:57
  • 1
    @Dennis: It *is* that important. In optimized compiled code, where stack setup can get shuffled around or removed altogether, it can often be difficult for a **human** *(never mind a computer)* to determine something as simple as where a function begins or ends; and when compilers [routinely do things like](http://www.blueraja.com/blog/285) translating `(a == 4 ? 54 : 2)` into `-(a-4 != 0) & -52`, it's easy to see why optimized code would be seriously more difficult to reverse engineer. But, yes, I have to disagree with Mehrdad and say the metadata is quite important too. – BlueRaja - Danny Pflughoeft Jan 12 '13 at 00:09
4

People have mentioned some of the reasons; I'll mention another one, assuming we're talking about disassembling rather than decompiling.

The trouble with x86 code is that distinguishing between code and data is very difficult and error-prone. Disassemblers have to rely on guessing in order to get it right, and they almost always miss something; by contrast, intermediate languages are designed to be "disassembled" (so that the JIT compiler can turn the "disassembly" into machine code), so they don't contain ambiguities like you would find in machine code. The end result is that disassembly of IL code is quite trivial.
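
Here is a minimal sketch of that ambiguity (x86, assuming a naive linear-sweep disassembler); the same bytes support two readings:

```cpp
// A jump hops over one byte of embedded data, but a linear-sweep
// disassembler falls into that byte and misreads everything after it.
unsigned char blob[] = {
    0xEB, 0x01,       // jmp +1        : skip the data byte
    0xB8,             // a data byte (also the opcode of "mov eax, imm32")
    0x31, 0xC0,       // xor eax, eax  : the real code
    0xC3,             // ret
    0x90              // nop (padding)
};
// Correct reading (follow the jump):  jmp +1 ; xor eax, eax ; ret
// Linear-sweep reading:               jmp +1 ; mov eax, 0x90C3C031
// The xor and ret disappear because 0xB8 swallows the next four bytes
// as its immediate operand.
```

A recursive-traversal disassembler follows the jump and gets this case right, but indirect jumps and jump tables reintroduce the same guessing problem.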

If you're talking about decompiling, that's a different matter; it has to do with the (mostly) lack of optimizations in .NET applications. Most optimizations are done by the JIT compiler rather than the C#/VB.NET/etc. compiler, so the IL is almost a 1:1 match of the source code, and figuring out the original is quite possible. But for native code, there are a million different ways to translate a handful of source lines (heck, even no-ops can be written in a gazillion different ways, with different performance characteristics!), so it's quite difficult to figure out what the original was.

user541686
  • +1 based on the context, I assumed 'disassemble' was being used as a misnomer for 'decompile'. But you are correct. And since you didn't mention it explicitly: *"Disassembling"* refers to translating the machine code to assembly code, while *"decompiling"* refers to translating assembled code *(machine or assembly code for C++; IL for .Net)* back into readable, preferably idiomatic code for a higher language like C++ or C#. Disassembling for some architectures is trivial, but for x86 it's technically unsolvable! Fortunately, disassemblers are able to get it right 99.9% of the time.. – BlueRaja - Danny Pflughoeft Jan 12 '13 at 00:22
1

In the general case there is not much difference between disassembling C++ and .NET code. Of course C++ is harder to disassemble because it does more optimizations and the like, but that's not the main issue.

The main issue is names. Disassembled C++ code will have everything named A, B, C, D, ... A1, and so on. Unless you can recognize an algorithm in that form, there is not much information you can extract from the disassembled C++ binary.

.NET binaries, on the other hand, contain the names of methods, method parameters, classes, and class fields. That makes understanding the disassembled code much easier. Everything else is secondary.
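
A sketch of the difference, using hypothetical names: the first version is roughly what a native decompiler gives you, the second is what .NET metadata would have handed you for free.

```cpp
// Decompiled native code: structure survives, names do not.
int sub_401A30(const int* a1, int a2) {
    int v1 = 0;
    for (int i = 0; i < a2; ++i)
        if (a1[i] > v1) v1 = a1[i];
    return v1;
}

// The same routine with its metadata intact:
int FindLargestScore(const int* scores, int count) {
    int largest = 0;
    for (int i = 0; i < count; ++i)
        if (scores[i] > largest) largest = scores[i];
    return largest;
}
```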

Dennis