My goal is to take source code in different languages (mostly C, C++, Obj-C and Haskell) and compute all kinds of statistics about it (e.g. number of variables, functions, memory allocations, complexity).

LLVM seemed to be a perfect tool for this, because I can generate bitcode for these languages, and with LLVM's customizable passes I can do almost anything. For the C family it works fine. Take a C program (test.c), for example:

#include <stdio.h>
int main( )
{
    int num1, num2, sum;
    printf("Enter two integers: ");
    scanf("%d %d", &num1, &num2); 
    sum = num1 + num2;
    printf("Sum: %d",sum);
    return 0;
}

Then I run:

clang -emit-llvm test.c -c -o test.bc
opt -load [MY AWESOME PASS] [ARGS]

Voila, I have almost everything I need:

 1 instcount - Number of Add insts
 4 instcount - Number of Alloca insts
 3 instcount - Number of Call insts
 3 instcount - Number of Load insts
 1 instcount - Number of Ret insts
 2 instcount - Number of Store insts
 1 instcount - Number of basic blocks
14 instcount - Number of instructions (of all types)
12 instcount - Number of memory instructions
 1 instcount - Number of non-external functions

I would like to achieve the same with Haskell programs. Take test.hs:

module Test where

quicksort [] = []
quicksort (p:xs) = (quicksort lesser) ++ [p] ++ (quicksort greater)
    where
        lesser = filter (< p) xs
        greater = filter (>= p) xs

However, when I run

ghc -fllvm -keep-llvm-files -fforce-recomp test.hs
opt -load [MY AWESOME PASS] [ARGS]

I get the following results, which seem to be completely useless for my purposes (mentioned at the beginning of this post), because they are obviously not true for these few lines of code. I guess it has something to do with GHC, because the newly created .ll file alone is 52 KB, while the .ll file for the C program is only 2 KB.

  31 instcount - Number of Add insts
  92 instcount - Number of Alloca insts
   2 instcount - Number of And insts
  30 instcount - Number of BitCast insts
  24 instcount - Number of Br insts
  22 instcount - Number of Call insts
 109 instcount - Number of GetElementPtr insts
  17 instcount - Number of ICmp insts
  54 instcount - Number of IntToPtr insts
 326 instcount - Number of Load insts
  65 instcount - Number of PtrToInt insts
  22 instcount - Number of Ret insts
 206 instcount - Number of Store insts
   8 instcount - Number of Sub insts
  46 instcount - Number of basic blocks
1008 instcount - Number of instructions (of all types)
 755 instcount - Number of memory instructions
  10 instcount - Number of non-external functions

My question is: how should I proceed so that I can compare Haskell code with the other languages without getting these huge numbers? Is it even possible? Should I continue using GHC to generate LLVM IR? What other tools should I use?

Erik Kaplun
Adam
  • Haskell is a much higher-level language than C, so the IR generated from C code is going to look much, much more like the original source than any Haskell code is. C's execution model is well-adapted to the sequential model low-level languages like LLVM IR and ASM use, but Haskell has lots more to consider: laziness, passing typeclass dictionaries, performing pattern-matching, handling runtime errors for partial functions, etc. Comparing the IR output might tell you *something*, but it won't say much about the subjective metric of "complexity". – Alexis King Oct 27 '15 at 23:57
  • "I get the following results, which seem to be completely useless... because they are obviously not true for this few lines of code." What makes this so obvious? Perhaps these results indicate that you should rethink what you "know". – Daniel Wagner Oct 28 '15 at 00:11
  • I think you should at least compare quick sort in C and quick sort in Haskell. – ymonad Oct 28 '15 at 02:10
  • First of all, thanks for your answers. My goal is still to have some analysis done on the source code (mainly for Haskell, but if there is something better for the others I'm open to anything). If LLVM is not the right tool for that, I can go with something else. What do you suggest? I would like to get results like: number of functions, number of function calls, number of variables (can be skipped for Haskell, I guess), etc. Obviously I don't want to write a text parser from scratch; that is the reason I started playing with compilers. – Adam Oct 28 '15 at 22:14
  • LLVM is a compiler "back-end", so I think it is not the right tool for comparing source code. The LLVM files generated by GHC contain many auxiliary functions, so they add several function names that are not present in the Haskell source file. For processing Haskell source you will be better off with a Haskell library like haskell-src-exts. – Lemming Aug 02 '16 at 08:21
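
Following ymonad's suggestion above, a like-for-like C counterpart to test.hs could look roughly like the sketch below. It is only an illustration, not code from the question: the helper name swap_ints is made up here, and the sort partitions the array in place instead of building the two filtered lists that the Haskell version builds. Like module Test, it defines no main, so it is meant to be compiled with -c.

```c
#include <stddef.h>

/* Hypothetical helper (not from the question): swap two ints in place. */
static void swap_ints(int *a, int *b)
{
    int tmp = *a;
    *a = *b;
    *b = tmp;
}

/*
 * Sort xs[0..n-1] in ascending order.
 * As in test.hs, the first element is the pivot p: elements < p end up
 * to its left ("lesser") and elements >= p to its right ("greater").
 * Assumption: unlike the Haskell version, this works in place.
 */
void quicksort(int *xs, size_t n)
{
    if (n < 2)
        return;                 /* empty or single-element input is sorted */

    int p = xs[0];
    size_t last_lesser = 0;     /* xs[1..last_lesser] holds elements < p */

    for (size_t i = 1; i < n; i++) {
        if (xs[i] < p) {
            last_lesser++;
            swap_ints(&xs[last_lesser], &xs[i]);
        }
    }

    /* Put the pivot between the two partitions. */
    swap_ints(&xs[0], &xs[last_lesser]);

    quicksort(xs, last_lesser);                            /* lesser  */
    quicksort(xs + last_lesser + 1, n - last_lesser - 1);  /* greater */
}
```

Saved under a hypothetical name such as qsort.c, it can be compiled with the same command as in the question (clang -emit-llvm qsort.c -c -o qsort.bc) and run through the same pass, so that the Haskell numbers are compared against the same algorithm instead of against the much smaller test.c.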

0 Answers