I would like to determine, whether two functions in two executables were compiled from the same (C) source code, and would like to do so even if they were compiled by different compiler versions or with different compilation options. Currently, I'm considering implementing some kind of assembler-level function fingerprinting. The fingerprint of a function should have the properties that:
- two functions compiled from the same source under different circumstances are likely to have the same fingerprint (or similar one),
- two functions compiled from different C source are likely to have different fingerprints,
- (bonus) if the two source functions were similar, the fingerprints are also similar (for some reasonable definition of similar).
What I'm looking for right now is a set of properties of compiled functions that individually satisfy (1.) and taken together hopefully also (2.).
Assumptions
Of course that this is generally impossible, but there might exist something that will work in most of the cases. Here are some assumptions that could make it easier:
- linux ELF binaries (without debugging information available, though),
- not obfuscated in any way,
- compiled by gcc,
- on x86 linux (approach that can be implemented on other architectures would be nice).
Ideas
Unfortunately, I have little to no experience with assembly. Here are some ideas for the abovementioned properties:
- types of instructions contained in the function (i.e. floating point instructions, memory barriers)
- memory accesses from the function (does it read/writes from/to heap? stack?)
- library functions called (their names should be available in the ELF; also their order shouldn't usually change)
- shape of the control flow graph (I guess this will be highly dependent on the compiler)
Existing work
I was able to find only tangentially related work:
- Automated approach which can identify crypto algorithms in compiled code: http://www.emma.rub.de/research/publications/automated-identification-cryptographic-primitives/
- Fast Library Identification and Recognition Technology in IDA disassembler; identifies concrete instruction sequences, but still contains some possibly useful ideas: http://www.hex-rays.com/idapro/flirt.htm
Do you have any suggestions regarding the function properties? Or a different idea which also accomplishes my goal? Or was something similar already implemented and I completely missed it?