
I've been going through some assembly programming videos to get a better understanding of how to manually optimize the *.s files left after compiling with gcc/g++ -S ... One of the topics covered was refactoring redundant code, which demonstrated how to move redundant code into its own labeled block ending with a ret and replace each occurrence with a call.

The example given in the video is two blocks containing:

mov eax,power
mul ebx
mov power,eax
inc count

which it replaces with call CalculateNextPower, where CalculateNextPower looks like:

CalculateNextPower:
mov eax,power
mul ebx
mov power,eax
inc count
ret

Out of curiosity, trying to reduce compiled size, I compiled some C and C++ projects with -S and various optimizations including -Os, -O2, -O3, -pipe, -combine and -fwhole-program, and analyzed the resulting *.s files for redundancy using a lightly patched (for .s files) version of duplo. Only -fwhole-program (now deprecated, IIRC) had a significant effect toward eliminating duplicate code across files. I assume its replacement, -flto, would behave similarly at link time (roughly equivalent to compiling with -ffunction-sections -fdata-sections and linking with --gc-sections), but it still misses significantly large blocks of code.

Manual optimization using the duplo output resulted in a ~10% size reduction in a random C project and almost 30% in a random C++ project, deduplicating only contiguous blocks of assembly with at least 5 contiguous duplicate instructions.

Am I missing a compiler option (or even a standalone tool) that eliminates redundant assembly automatically when compiling for size (including other compilers: clang, icc, etc.), or is this functionality absent (for a reason)?

If it doesn't exist, it could be possible to modify duplo to ignore lines starting with a '.' or ';' (and others?) and to replace duplicate code blocks with calls to functions containing the duplicated code, but I am open to other suggestions that would work directly with the compiler's internal representation (preferably clang's or gcc's).

Edit: I patched duplo to identify blocks of duplicate assembly here, but it still requires manual refactoring at the moment. As long as the same compiler is used to generate the code, it may be possible (but probably slow) to identify the largest blocks of duplicate code, put them in their own "function" block and replace the code with a CALL to that block.
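The matching step described above can be sketched in a few lines of Python. This is a simplified illustration, not duplo itself; the normalization rules (skipping directives, comments, and labels) and the 5-instruction threshold are assumptions carried over from the experiment above:

```python
from collections import defaultdict

MIN_RUN = 5  # same threshold as the ~10% / ~30% experiment above

def instructions(asm_text):
    """Strip directives, labels, and comments; keep instruction lines only."""
    out = []
    for line in asm_text.splitlines():
        line = line.strip()
        if not line or line.startswith(('.', ';', '#')) or line.endswith(':'):
            continue
        out.append(line)
    return out

def duplicate_runs(asm_text, n=MIN_RUN):
    """Map each n-instruction window to the instruction indices where it occurs,
    keeping only windows that occur more than once."""
    ins = instructions(asm_text)
    seen = defaultdict(list)
    for i in range(len(ins) - n + 1):
        seen[tuple(ins[i:i + n])].append(i)
    return {run: pos for run, pos in seen.items() if len(pos) > 1}
```

A real tool would additionally have to merge overlapping windows into maximal runs, and verify that a refactored block is safe to reach via call (no clobbered flags, no fall-through into a label) before replacing each occurrence.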

technosaurus
  • That reflects what I learned for embedded programming: compilers are stupid; developers are smart. So when really required *(less than 8 bits memory)* I always used optimisations which may strongly impact code maintainability. In your case, you can try to look at --param values. For some of them, there are warnings about RAM or time usage in the case of really big values. The first one to set is --param lto-partitions=1, so you compile the whole source as a single c file *(instead of 32)*. – user2284570 Jun 15 '14 at 23:54
  • I was wondering how to get the full functionality of -pipe -combine -fwhole-program back. Thanks @user2284570. – technosaurus Jun 16 '14 at 05:27
  • I read the entire man page to proceed, and **you should do the same**... The result is that each time I launch gcc, the printed command takes half of my full HD screen due to the number of compiler flags and optimisations. Don't forget to use `-march=native -mtune=native`. But don't forget the rule: a single man-made optimization is often more efficient than many compiler ones. In some embedded environments, programs are made of *gotos and labels*. That's why code maintainability/readability is often opposed to code speed and memory usage. – user2284570 Jun 16 '14 at 13:03

2 Answers


What you want is a clone detector tool.

These exist in a variety of implementations that vary depending on the granularity of elements of the document being processed, and how much structure is available.

Those that match raw lines won't work for you: you want to parameterize your subroutines by differing constants (both data and index offsets) and/or named locations or other named subroutines. Token-based detectors might work, in that they will identify single-point differences (e.g., constants or identifiers). But what you really want is a structural matcher that can pick out variant addressing modes, or even variant code in the middle of a block (see AST-based clone detectors, which I happen to build).

To detect with structure, you have to have structure. Fortunately, even assembly language code has structure, in the form of a grammar and of blocks of code delimited by subroutine entries and exits (the latter are a bit more problematic to detect in assembly, because there may be more than one of each).

When you detect using structures, you have at least the potential to use the structure to modify the code. But if you have the program source represented as a tree, you have structure (subtrees and sequences of subtrees) over which to detect clones, and one can abstract clone matches by modifying the trees at the match points. (Early versions of my clone detector for COBOL abstracted clones into COPY libs. We stopped doing that mostly because you don't want to abstract every clone that way).

Ira Baxter
    +1 for the "clone detector" hint. It didn't help me find a full fledged solution, but could be a good starting point. – technosaurus Jan 21 '14 at 04:55
  • Check my bio for a language-parameterized structural clone detector which could easily be parameterized by an assembler description. The machinery behind the clone detector can do the abstraction step. – Ira Baxter Jan 21 '14 at 05:22

What you are proposing is called procedural abstraction and has been implemented by more than one group as research projects. Here is one. Here's another. And another.

Clone detection is normally used in the context of source code, though its function is similar. Since procedural abstraction occurs at a lower level, it can accomplish more. For example, suppose there are two calls to different functions, but with exactly the same complicated argument computations. A procedural abstractor can pull the argument calculation into a procedure, but a clone detector would have a hard time doing so.

I don't believe either gcc or llvm currently has a supported implementation of PA. I searched both sets of documents and didn't find one. In at least two of the cases above, the optimizer runs on assembly code produced by gcc rather than as a gcc-internal optimization, which probably explains why these techniques were not built into the compiler. You might try contacting the authors to see where their implementations are.

Gene
  • Good clone detectors are not limited to statements. Having built clone detectors, esp. the kind that handle detection of structural similarities using grammars, I can tell you that if there are two (very similar) sets of expressions used to compute arguments, the clone detector can detect them. Whether it reports them (likely to be pretty small) is a matter of setting the threshold for reporting small clones. What you describe as "PA" used to be called "common subexpression elimination", which was the inspiration for my particular clone detector. – Ira Baxter Jan 24 '14 at 07:41