
I would like to determine whether two functions in two executables were compiled from the same (C) source code, and I would like to do so even if they were compiled by different compiler versions or with different compilation options. Currently, I'm considering implementing some kind of assembler-level function fingerprinting. The fingerprint of a function should have the following properties:

  1. two functions compiled from the same source under different circumstances are likely to have the same fingerprint (or a similar one),
  2. two functions compiled from different C source are likely to have different fingerprints,
  3. (bonus) if the two source functions were similar, the fingerprints are also similar (for some reasonable definition of similar).

What I'm looking for right now is a set of properties of compiled functions that individually satisfy (1.) and, taken together, hopefully also (2.).
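To make this concrete, here is a rough sketch of how I imagine combining per-property fingerprints into a single comparable record; the property names, weights, and the Jaccard-based scoring are placeholders, not a settled design:

```python
# Sketch: a fingerprint as a bundle of per-property values, compared with
# a weighted similarity score in [0, 1]. All names and weights below are
# placeholder assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Fingerprint:
    instruction_kinds: frozenset   # e.g. frozenset({"fpu", "sync"})
    library_calls: tuple           # ordered names of called library functions
    cfg_shape: tuple               # e.g. (num_basic_blocks, num_edges)

def jaccard(a: frozenset, b: frozenset) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def similarity(f1: Fingerprint, f2: Fingerprint) -> float:
    s_kinds = jaccard(f1.instruction_kinds, f2.instruction_kinds)
    s_calls = jaccard(frozenset(f1.library_calls), frozenset(f2.library_calls))
    s_cfg   = 1.0 if f1.cfg_shape == f2.cfg_shape else 0.0
    return 0.3 * s_kinds + 0.5 * s_calls + 0.2 * s_cfg
```

A similarity score rather than exact equality would also cover the bonus property (3.).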

Assumptions

Of course, this is generally impossible, but there might be something that works in most cases. Here are some assumptions that could make it easier:

  • Linux ELF binaries (without debugging information available, though),
  • not obfuscated in any way,
  • compiled by gcc,
  • on x86 Linux (an approach that can be implemented on other architectures would be nice).

Ideas

Unfortunately, I have little to no experience with assembly. Here are some ideas for the above-mentioned properties:

  • types of instructions contained in the function (e.g. floating-point instructions, memory barriers); a sketch of this one follows the list
  • memory accesses made by the function (does it read from / write to the heap? the stack?)
  • library functions called (their names should be available in the ELF, and their order shouldn't usually change)
  • shape of the control flow graph (I guess this will be highly dependent on the compiler)
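
For the first property, here is a rough sketch using the Capstone disassembler (pip install capstone); it assumes the function's raw bytes have already been located in the ELF (e.g. via the symbol table), and the category table is only an illustration:

```python
# Sketch: classify a function's instructions into broad kinds with the
# Capstone disassembler. `code` is assumed to hold the function's raw
# bytes, already extracted from the ELF; the categories are illustrative.
from capstone import Cs, CS_ARCH_X86, CS_MODE_32

CATEGORIES = {
    "fpu":  {"fld", "fstp", "fadd", "fmul", "fdiv"},
    "sse":  {"movss", "addss", "mulss", "movaps"},
    "sync": {"mfence", "lfence", "sfence"},
}

def instruction_kinds(code: bytes, base_addr: int = 0) -> frozenset:
    md = Cs(CS_ARCH_X86, CS_MODE_32)
    kinds = set()
    for insn in md.disasm(code, base_addr):
        for kind, mnemonics in CATEGORIES.items():
            if insn.mnemonic in mnemonics:
                kinds.add(kind)
    return frozenset(kinds)
```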

Existing work

I was able to find only tangentially related work:


Do you have any suggestions regarding the function properties? Or a different idea which also accomplishes my goal? Or was something similar already implemented and I completely missed it?

b42
  • Shape of the control flow graph might well vary by optimization (and version) - introducing equivalent branchless code, duplicating or de-duplicating basic blocks in different places, inlining. Similarly you could have trouble with calls to library functions that gcc has intrinsics for - maybe it sometimes inlines and sometimes makes a call, you get that a lot with `memcpy` for example. LTO will ruin your day. – Steve Jessop Sep 02 '11 at 13:07
  • This is a very hard problem. As for why it's useful, identifying stolen GPL'd code is one of the most useful applications... – R.. GitHub STOP HELPING ICE Sep 02 '11 at 13:12
  • @Skizz: Right, I guess I should have mentioned it in the post. I would like to compare two stack traces from two core dumps in order to determine whether they are the result of the same bug. Using the above-mentioned approach, I'd like to see if this is possible for, say, two slightly different versions of the same program. – b42 Sep 05 '11 at 13:22

4 Answers


FLIRT uses byte-level pattern matching, so it breaks down with any change in the instruction encoding (e.g. different register allocation or reordered instructions).
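
To make the brittleness concrete, here is a toy matcher in the spirit of FLIRT signatures; the pattern format is invented for this example:

```python
# Toy byte-pattern matcher in the spirit of FLIRT signatures. Fixed bytes
# must match exactly; None marks a wildcarded (variant) byte such as a
# relocated call target. Any other re-encoding, e.g. a different register
# allocation, breaks the match.
def matches(code: bytes, pattern: list) -> bool:
    return len(code) >= len(pattern) and all(
        p is None or b == p for b, p in zip(code, pattern))

# 55                push ebp
# 8b ec             mov  ebp, esp
# e8 ?? ?? ?? ??    call <relocated target>
PROLOGUE = [0x55, 0x8B, 0xEC, 0xE8, None, None, None, None]
```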

For graph matching, see BinDiff. While it's closed source, Halvar has described some of the approaches on his blog. They have even open-sourced some of the algorithms they use to generate fingerprints, in the form of the BinCrowd plugin.

Igor Skochinsky
  • Looks interesting & relevant. Even though I cannot use it directly, I'll go through the blog. Thanks. – b42 Sep 05 '11 at 13:25
  • @b42 There's also a poor man's alternative called PatchDiff, which [is opensource](http://code.google.com/p/patchdiff2/). It's mainly geared towards diffing patches (= mostly similar binaries), so might not work very well on files that differ a lot. – Igor Skochinsky Sep 06 '11 at 23:48

In my opinion, the easiest way to do something like this would be to decompose the function's assembly back into some higher-level form where constructs (like for, while, function calls, etc.) exist, and then match the structure of these higher-level constructs.

This would prevent instruction reordering, loop hoisting, loop unrolling, and any other optimizations from messing with the comparison. You can even (de)optimize these higher-level structures to their maximum on both ends to ensure they arrive at the same point, so comparisons between unoptimized debug code and -O3 won't fail due to missing temporaries, lack of register spills, etc.
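
As a sketch of what the comparison could look like once both functions are lifted (the (kind, children) tree format here is an invented stand-in for whatever the decompilation front end produces):

```python
# Sketch: compare two functions after lifting to construct trees. The
# (kind, children) tuples are an invented stand-in for a real decompiler
# IR; folding every loop form into one "loop" node is the "(de)optimize
# both ends" idea described above.
def normalize(node):
    kind, children = node
    if kind in ("while", "do-while", "for"):
        kind = "loop"
    return (kind, tuple(normalize(c) for c in children))

def same_structure(tree_a, tree_b) -> bool:
    return normalize(tree_a) == normalize(tree_b)

f1 = ("func", (("while",    (("call", ()),)),))
f2 = ("func", (("do-while", (("call", ()),)),))
assert same_structure(f1, f2)   # different loop forms, same structure
```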

You can use something like Boomerang as a basis for the decompilation (except you wouldn't spit out C code).

Necrolis
  • How do you propose to reverse-engineer optimised code back to the original high-level control constructs? Or more specifically, how do you propose to reverse-engineer code in such a way that two *differently-optimised* binaries end up with the same decompiled structure? Haven't you just shifted the problem? – Oliver Charlesworth Sep 02 '11 at 16:15
  • @Oli: I never said you go back to the *original*, but rather an intermediate form of it. I don't propose going to extremes. As an example: say you have only two loop constructs, `do-while` and `while`. An unoptimized version will keep that structure; the optimized version might do some folding, CSE, and hoisting, but it can still be composed as a [do-]while loop. From there you optimize both, which should have them arrive at the same higher-level variant. Or, more simply, churning the cream to butter to see if it matches the other butter you got :p – Necrolis Sep 02 '11 at 16:55
  • @Necrolis: I wasn't thinking about going back to the original. I was thinking that if you have two differently-optimised binaries, the basic-block structure could be completely different, depending on how conditional statements, etc. have been re-structured. – Oliver Charlesworth Sep 02 '11 at 17:24
  • @Necrolis: even loops can look totally different. E.g. if you have a for loop from 0 to 10 but don't actually use the index value, the optimizer will likely rewrite it to count from 9 down to 0 (comparison with zero is usually "free" when decrementing). While loops often have the first iteration check separate from the main body loop, and so on. – Igor Skochinsky Sep 02 '11 at 17:42
  • @Igor: that's fine; if the loops that you are comparing can both be optimized to the same form (it may be that only one is optimized, and it may happen that they no longer remain loops), then it doesn't matter: it's all in the heuristics that you employ to create the higher representation. – Necrolis Sep 02 '11 at 19:53
  • This seems like a lot of work with uncertain results. But thanks for the boomerang tip, I'll give it a try. – b42 Sep 05 '11 at 13:40
  • Emscripten (a C-to-JavaScript compiler using LLVM) contains an algorithm called "relooping" which sounds pretty similar to this, although it's working with some LLVM IR and not the actual assembly output: https://github.com/kripken/emscripten/wiki (check the Technical Paper link there for exact details) – Ted Mielczarek Feb 02 '12 at 16:11

I suggest you approach this problem from the standpoint of the language the code was written in and what constraints that code puts on compiler optimization.

I'm not really familiar with the C standard, but C++ has the concept of "observable" behavior. The standard carefully defines this, and compilers are given great latitude to optimize as long as the result exhibits the same observable behavior. My recommendation for trying to determine whether two functions are the same would be to try to determine what their observable behavior is (what I/O they do, how they interact with other areas of memory, and in what order).
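
As a rough first cut, observable behavior could be approximated by the ordered sequence of external calls a function makes; extracting those sequences from the binaries (e.g. via relocations against the PLT) is assumed to happen elsewhere:

```python
# Sketch: approximate "observable behavior" by the ordered sequence of
# external (library/system) calls, then compare the two orderings.
# Obtaining the call sequences from the binaries is assumed to be done
# elsewhere (e.g. from PLT relocations).
from difflib import SequenceMatcher

def behavior_similarity(calls_a: list, calls_b: list) -> float:
    return SequenceMatcher(None, calls_a, calls_b).ratio()

print(behavior_similarity(["malloc", "memcpy", "free"],
                          ["malloc", "memmove", "free"]))   # ~0.67
```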

SoapBox

If the problem set can be reduced to a small set of known C or C++ source functions compiled by n different compilers, each with m[n] different sets of compiler options, then a straightforward, if tedious, solution would be to compile the code with every combination of compiler and options and catalog the resulting instruction bytes (or, more efficiently, their hash signatures) in a database.
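
A sketch of that cataloging step might look like the following; the compiler list and option sets are illustrative, and hashing whole object files is a simplification (extracting each function's .text bytes, e.g. with objcopy, would be the precise variant):

```python
# Sketch: compile one translation unit under a matrix of compilers and
# option sets, recording a hash per combination. The compiler and option
# lists are illustrative; whole-object hashing is a simplification.
import hashlib, itertools, os, subprocess, tempfile

COMPILERS   = ["gcc"]                            # extend: "gcc-4.4", "clang", ...
OPTION_SETS = [["-O0", "-g"], ["-O2"], ["-O3"]]

def catalog(source_path: str) -> dict:
    entries = {}
    for cc, opts in itertools.product(COMPILERS, OPTION_SETS):
        fd, obj = tempfile.mkstemp(suffix=".o")
        os.close(fd)
        subprocess.run([cc, "-c", source_path, "-o", obj] + opts, check=True)
        with open(obj, "rb") as f:
            entries[(cc, tuple(opts))] = hashlib.sha256(f.read()).hexdigest()
        os.unlink(obj)
    return entries
```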

The set of compiler options likely to have been used is potentially large, but in practice engineers typically use a small, standard set: usually just minimally optimized for debugging and fully optimized for release. Surveying many project configurations might reveal only two or three more in any given engineering culture, reflecting beliefs (accurate or not) about how compilers work.

I suspect this approach is closest to what you actually want: a way of investigating suspected misappropriated source code. All the suggested techniques for reconstructing the compiler's parse tree might bear fruit, but they have great potential for overlooked symmetric solutions or ambiguous, unsolvable cases.

wallyk
  • Unfortunately, I need it to work on the binaries alone, so I can't do this. And I guess the set of functions is too large anyway -- say, a typical Linux distro. But thanks for the suggestion. – b42 Sep 05 '11 at 13:36