
I am trying to compile a large C file (specifically for MATLAB mexing). The C file is around 20 MB (available from the GCC bug tracker if you want to play around with it).

Here is the command I am running and the output to screen, below. This has been running for hours, and as you can see, optimization is already disabled (-O0). Why is this so slow? Is there a way I can make this faster?

(For reference: Ubuntu 12.04 (Precise Pangolin) 64 bit and GCC 4.7.3)

/usr/bin/gcc -c -DMX_COMPAT_32   -D_GNU_SOURCE -DMATLAB_MEX_FILE  -I"/usr/local/MATLAB/R2015a/extern/include" -I"/usr/local/MATLAB/R2015a/simulink/include" -ansi -fexceptions -fPIC -fno-omit-frame-pointer -pthread -O0 -DNDEBUG path/to/test4.c -o /tmp/mex_198714460457975_3922/test4.o -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.7.3-2ubuntu1~12.04' --with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs --enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.7 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --with-system-zlib --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.7.3 (Ubuntu/Linaro 4.7.3-2ubuntu1~12.04)
COLLECT_GCC_OPTIONS='-c' '-D' 'MX_COMPAT_32' '-D' '_GNU_SOURCE' '-D' 'MATLAB_MEX_FILE' '-I' '/usr/local/MATLAB/R2015a/extern/include' '-I' '/usr/local/MATLAB/R2015a/simulink/include' '-ansi' '-fexceptions' '-fPIC' '-fno-omit-frame-pointer' '-pthread' '-O0' '-D' 'NDEBUG' '-o' '/tmp/mex_198714460457975_3922/test4.o' '-v' '-mtune=generic' '-march=x86-64'
 /usr/lib/gcc/x86_64-linux-gnu/4.7/cc1 -quiet -v -I /usr/local/MATLAB/R2015a/extern/include -I /usr/local/MATLAB/R2015a/simulink/include -imultilib . -imultiarch x86_64-linux-gnu -D_REENTRANT -D MX_COMPAT_32 -D _GNU_SOURCE -D MATLAB_MEX_FILE -D NDEBUG path/to/test4.c -quiet -dumpbase test4.c -mtune=generic -march=x86-64 -auxbase-strip /tmp/mex_198714460457975_3922/test4.o -O0 -ansi -version -fexceptions -fPIC -fno-omit-frame-pointer -fstack-protector -o /tmp/ccxDOA5f.s
GNU C (Ubuntu/Linaro 4.7.3-2ubuntu1~12.04) version 4.7.3 (x86_64-linux-gnu)
    compiled by GNU C version 4.7.3, GMP version 5.0.2, MPFR version 3.1.0-p3, MPC version 0.9
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
ignoring nonexistent directory "/usr/local/include/x86_64-linux-gnu"
ignoring nonexistent directory "/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../../x86_64-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/MATLAB/R2015a/extern/include
 /usr/local/MATLAB/R2015a/simulink/include
 /usr/lib/gcc/x86_64-linux-gnu/4.7/include
 /usr/local/include
 /usr/lib/gcc/x86_64-linux-gnu/4.7/include-fixed
 /usr/include/x86_64-linux-gnu
 /usr/include
End of search list.
GNU C (Ubuntu/Linaro 4.7.3-2ubuntu1~12.04) version 4.7.3 (x86_64-linux-gnu)
    compiled by GNU C version 4.7.3, GMP version 5.0.2, MPFR version 3.1.0-p3, MPC version 0.9
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: c119948b394d79ea05b6b3986ab084cf

EDIT: A follow-on: I followed chqrlie's advice, and tcc compiled my function in under 5 seconds (I only had to remove the -ansi flag and change "gcc" to "tcc"), which is pretty remarkable, really. I can only imagine the complexity of GCC.

When I then try to mex it, however, there is one other command that mex needs. That second command is typically:

/usr/bin/gcc -pthread -Wl,--no-undefined -Wl,-rpath-link,/usr/local/MATLAB/R2015a/bin/glnxa64 -shared  -O -Wl,--version-script,"/usr/local/MATLAB/R2015a/extern/lib/glnxa64/mexFunction.map" /tmp/mex_61853296369424_4031/test4.o   -L"/usr/local/MATLAB/R2015a/bin/glnxa64" -lmx -lmex -lmat -lm -lstdc++ -o test4.mexa64

I cannot run this with tcc, as some of these flags are not compatible. If I try to run this second (linking) step with GCC, I get:

/usr/bin/ld: test4.o: relocation R_X86_64_PC32 against undefined symbol `mxGetPr' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Bad value
collect2: error: ld returned 1 exit status

EDIT: The solution appears to be Clang. tcc can compile the file, but the arguments in the second step of mexing are incompatible with tcc's argument options. Clang is very fast and produces a nice, small, optimized file.
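
For anyone following along, the approach that worked boils down to swapping clang in for gcc in the compile step and keeping the original gcc link step. A sketch (mirroring the flags above; the -O2 level and the test4.o output path are my choices, not the exact mex-generated command):

# Compile the generated C with clang instead of gcc
clang -c -DMX_COMPAT_32 -D_GNU_SOURCE -DMATLAB_MEX_FILE \
      -I"/usr/local/MATLAB/R2015a/extern/include" \
      -I"/usr/local/MATLAB/R2015a/simulink/include" \
      -fexceptions -fPIC -fno-omit-frame-pointer -pthread -O2 -DNDEBUG \
      path/to/test4.c -o test4.o

# Link the MEX file with gcc exactly as in the second command above
/usr/bin/gcc -pthread -Wl,--no-undefined -Wl,-rpath-link,/usr/local/MATLAB/R2015a/bin/glnxa64 -shared -O \
    -Wl,--version-script,"/usr/local/MATLAB/R2015a/extern/lib/glnxa64/mexFunction.map" \
    test4.o -L"/usr/local/MATLAB/R2015a/bin/glnxa64" -lmx -lmex -lmat -lm -lstdc++ -o test4.mexa64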

user650261
  • This C code is really weird. Could you consider generating different C? – fuz Oct 30 '15 at 19:39
  • Perhaps such a large source is generating massive intermediate files, which are being swapped to and from disc. – Weather Vane Oct 30 '15 at 19:39
  • @WeatherVane, but this is "only" 20MB. I recognize that's nontrivial, but surely large companies like Microsoft compile large files, or files larger than 20 MB, no? Surely they do not wait 10 hours for compilations. – user650261 Oct 30 '15 at 19:44
  • Could you provide us with the content of `mex.h` for our own fiddling? – fuz Oct 30 '15 at 19:44
  • @FUZxxl, I'm going to ask you to either provide constructive+chatty comments or not comment at all. mex.h is included with MATLAB. If you have MATLAB, you have mex.h. Do not ask me to "generate different C" because it's "weird." I don't know what C being "weird" means, and asking me to change my project (especially without specifics) does nothing to help me solve my problem. – user650261 Oct 30 '15 at 19:47
  • @user650261 I'm not going to buy a $2000 software package just to answer a question to someone on Stack Overflow. If you want help, the convention is to make a self-contained example. Your code is not self-contained as it requires a proprietary `mex.h` header. A 20 MiB C expression is definitely weird. Perhaps it's possible to express the same concept differently, e.g. as a loop with parameters drawn from an array. – fuz Oct 30 '15 at 19:54
  • @user650261 please quote your source for "large companies compile large files". I'll wager a coffee they split them into smaller manageable modules, as they are certain to be in the know about "received wisdom". – Weather Vane Oct 30 '15 at 19:54
  • @user650261 Code like the one you generated is *not* normal. It's highly atypical to have code that large in a single function. Data, yes, but not code. I don't know how you generate this code, but you should really find a way to generate less code. – fuz Oct 30 '15 at 19:57
  • I will not download source code from an external and potentially obscure source. However, I second FUZxxl and WeatherVane. Such a large file is **definitely** not normal and a nightmare not only to maintain, but also to edit and (possibly) debug. And @FUZxxl's comment is clearly justified; it is not your place to tell him to shut up. It is **you** who asks for help and apparently has a problem with the size of the file. Try breaking it down to smaller units. – too honest for this site Oct 30 '15 at 20:03
  • @WeatherVane Do you think that if this were split into more smaller files, that this would work well? For instance, if each file were only, say, 1/10th of this file (so, 2MB), are you saying that should then compile? Is there some limitation on file size itself? Why would that be? – user650261 Oct 30 '15 at 20:05
  • To those asking: this file is large because it is automatically generated. I can break it up if people think that would help, but I need to understand why that would help before I go through the trouble of doing that. – user650261 Oct 30 '15 at 20:06
  • @user650261 The reason the compiler chokes is *not* related to Mex, it's related to the fact that you have a 20 MiB expression and compiling such a large expression is something gcc is not designed for. I hypothesize that some algorithms gcc uses aren't quite linear in expression size. – fuz Oct 30 '15 at 20:08
  • @user650261 your question has been answered now. I am probably misquoting something I was told, but it's along the lines of *"If the statement has more than [some number of] expressions, it needs to be a function. If the file has more than [some number of] functions it needs to be a module."* If your source code was generated programmatically, surely it's a cinch to try different ways of generating it? – Weather Vane Oct 30 '15 at 20:28
  • Can you provide the header file `mex.h` or at least the type `mxArray`? I cannot compile your code. 20MB should not pose a problem, but a single 20MB line might break many editors and a single 20MB expression is probably beyond the minimum complexity a compiler needs to handle to comply with the Standard. – chqrlie Oct 30 '15 at 20:55
  • @WeatherVane It can only be generated one way. I can break up the f statement into several separate statements or several separate files, but those are the only options. Both are nontrivial time commitments but if I have sufficient evidence that that will speed things up then I will do that. – user650261 Oct 30 '15 at 21:04
  • OK, I buy that the size of the file is the inherent issue here, and that others might have a similar issue, so I've reopened this. However, I am concerned that the linked file will go away at some point in the future, rendering the question less useful. That's why we tend to recommend people minimize the issue and place it within the question itself. This is a bit of an edge case. All I ask is that everyone keep their comments civil and on topic. – Brad Larson Oct 30 '15 at 21:36
  • Hi @BradLarson thank you. I would like to provide the file in some sort of longer-standing way. Do you have a suggestion of how I can provide larger code snippets on StackOverflow? – user650261 Oct 30 '15 at 22:06
  • I suggest you try clang as well. With clang (shimming out mex.h) I was able to compile the entire thing with optimizations into an 80kB object file. – fuz Oct 31 '15 at 13:07
  • @FUZxxl, Clang works. If you put that down as an answer I can accept it as the solution. – user650261 Oct 31 '15 at 20:25
  • @user650261 To your question above, I took the liberty to ask it [on meta](http://meta.stackoverflow.com/questions/309256/how-are-we-supposed-to-provide-large-source-files-that-cannot-be-reduced) as I have no idea what to do in such a case either. – fuz Oct 31 '15 at 20:34
  • Interestingly, it also works if you compile with `gcc` with optimizations turned on. Turning off optimizations seems to cause you to run into a gcc bug. Can I use your source code file as an example for the bug report against gcc I'm going to file right now? – fuz Oct 31 '15 at 21:00
  • Sure. I'll try that later, that's an interesting catch. – user650261 Oct 31 '15 at 21:09
  • [Here](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68173) is my bug report against gcc. – fuz Nov 01 '15 at 12:35
  • @FUZxxl and OP, FYI (not that I expect you to be interested in continuing to futz about with this), Octave is essentially a GNU version of Matlab, and [includes mex files](https://www.gnu.org/software/octave/doc/interpreter/Getting-Started-with-Mex_002dFiles.html). Here's their [mex.h](http://octave.org/doxygen/4.1/de/d84/mex_8h_source.html). I don't know how similar it is to the MATLAB version. OP, this might be helpful for writing future MATLAB questions. And I wonder if there's an Octave way to do the code-generation you're doing, and if it behaves more sanely... – Kyle Strand Nov 02 '15 at 19:25
  • @user650261 BTW, after being convinced by the reactions to my bug report, I let the compiler run for longer. `gcc -O0 -c` terminated in roughly 40 minutes after consuming 7 GiB of RAM. – fuz Nov 02 '15 at 19:43
  • @FUZxxl: how large was the resulting object file? – chqrlie Nov 03 '15 at 21:25

3 Answers

Nearly the entire file is one expression: the initialization of double f[24] = .... That's going to generate a gigantic abstract syntax tree. I'd be surprised if anything but a specialized compiler could handle that efficiently.

The 20 megabyte file itself may be fine, but the one giant expression may be what is causing the issue. As a preliminary step, try splitting the line into double f[24] = {0}; followed by 24 separate assignments, f[0] = ...; f[1] = ...;, and see what happens. Worst case, you can split the 24 assignments into 24 functions, each in its own .c file, and compile them separately. This won't reduce the size of the AST; it will just reorganize it, but GCC is probably much better at handling many statements that together add up to a lot of code than one huge expression.
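
As a rough sketch of that restructuring (the names s4..s6 and c4..c6 come from the question's generated code, but the function name and the right-hand sides here are made up for illustration):

void fill_f(double f[24],
            double s4, double s5, double s6,
            double c4, double c5, double c6)
{
    /* Instead of one enormous initializer expression, give each
       element its own small statement. */
    f[0] = c5*s6*(c4*c6 + s4*s5*s6);
    f[1] = c4*c6 + s4*s5*s6;
    /* ... f[2] through f[22] written the same way ... */
    f[23] = s4*s5*s6;
}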

The ultimate approach would be to generate the code in a more optimized manner. If I search for s4*s5*s6, for example, I get 77,783 hits. These s[4-6] variables don't change. You should generate a temporary variable, double _tmp1 = s4*s5*s6; and then use that instead of repeating the expression. You've just eliminated 311,132 nodes from your abstract syntax tree (assuming s4*s5*s6 is 5 nodes and _tmp1 is one node). That's that much less processing GCC has to do. This should also generate faster code (you won't repeat the same multiplication 77,783 times).

If you do this in a smart way in a recursive manner (e.g. s4*s5*s6 --> _tmp1, (c4*c6+s4*s5*s6) --> (c4*c6+_tmp1) --> _tmp2, c5*s6*(c4*c6+s4*s5*s6) --> c5*s6*_tmp2 -> _tmp3, etc...), you can probably eliminate most of the size of the generated code.
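
A hand-rolled version of that factoring might look roughly like this (again, the temporaries and the exact right-hand sides are illustrative, not taken from the real generated file):

void fill_f_factored(double f[24],
                     double s4, double s5, double s6,
                     double c4, double c5, double c6)
{
    /* Compute each repeated subexpression exactly once. */
    const double _tmp1 = s4*s5*s6;        /* repeated ~77,783 times in the original */
    const double _tmp2 = c4*c6 + _tmp1;   /* (c4*c6 + s4*s5*s6) */
    const double _tmp3 = c5*s6*_tmp2;     /* c5*s6*(c4*c6 + s4*s5*s6) */

    f[0] = _tmp3;
    f[1] = _tmp2 - c4*_tmp1;
    /* ... every element rewritten in terms of the temporaries ... */
    f[23] = _tmp1;
}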

Claudiu
  • I'll just point out C11 draft standard, `5.2.4.1 Translation limits, Section 1 [...] 4095 characters in a logical source line [...]`. If this is a *single 20MB expression*, the compiler isn't required to be able to compile it at all. – EOF Oct 30 '15 at 20:08
  • @EOF Op could break the expression up into multiple lines to cirumvent this limit but there are other limits his code violates. – fuz Oct 30 '15 at 20:09
  • Thank you for this helpful reply. I have thought about trying to simplify it by precomputing stuff - that is a later step I was planning to take, but I may do it now if that will solve the problem. I am wondering if you can talk more about the parse tree mechanism - why would parsing take so long? I've noticed that compilation can detect syntax errors very quickly, so why would parsing for compilation take so long in this case? Sorry for the questions - I am just trying to understand what would be causing the problem so I don't go down a rabbit hole of changes only to hit dead ends. – user650261 Oct 30 '15 at 20:10
  • @user650261 gcc is doing some transformations on the parse tree (notably, register allocation). Some of these transformations may have a runtime worse than O(n), making them extremely slow for large trees. As expressions are almost never larger than about 1000 characters this doesn't matter too much, but then you come and serve an expression that is 20000 times larger than this upper boundary for realistic code. – fuz Oct 30 '15 at 20:14
  • @user650261: Hmm I may be wrong that it's the parsing that takes that long. I stuck a parse error at the end of the file and it caught it within a few seconds. Unless it somehow optimizes for catching these errors without parsing it first, which I doubt. – Claudiu Oct 30 '15 at 20:17
  • From my experiments, gcc's parse phase takes about 10 seconds; the parse tree analysis and code generation must use algorithms with higher complexity than O(n), maybe O(n^2) or possibly worse, and it literally takes hours to complete. `tcc` does not build a parse tree; it generates code on the fly in a single pass. The output is very large (42MB of code+data) but it does it very quickly, and even 38MB of iterative code should execute in decent time, much less than 1 second. – chqrlie Oct 30 '15 at 21:42

Upon testing, I found that the Clang compiler seems to have fewer problems compiling large files. Although Clang consumed almost a gigabyte of memory during compilation, it successfully turned the OP's source code into a 70 kB object file. This works for all optimization levels I tested.

gcc was also able to compile this file quickly and without consuming too much memory if optimization is turned on. This bug in gcc comes from the large expression in the OP's code, which places a huge burden on the register allocator. With optimizations turned on, the compiler performs an optimization called common subexpression elimination, which is able to remove a lot of redundancy from the OP's code, reducing both compilation time and object file size to manageable values.

Here are some tests with the testcase from the aforementioned bug report:

$ time gcc5 -O3 -c -o testcase.gcc5-O3.o testcase.c
real    0m39,30s
user    0m37,85s
sys     0m1,42s
$ time gcc5 -O0 -c -o testcase.gcc5-O0.o testcase.c
real    23m33,34s
user    23m27,07s
sys     0m5,92s
$ time tcc -c -o testcase.tcc.o testcase.c
real    0m2,60s
user    0m2,42s
sys     0m0,17s
$ time clang -O3 -c -o testcase.clang-O3.o testcase.c
real    0m13,71s
user    0m12,55s
sys     0m1,16s
$ time clang -O0 -c -o testcase.clang-O0.o testcase.c
real    0m17,63s
user    0m16,14s
sys     0m1,49s
$ time clang -Os -c -o testcase.clang-Os.o testcase.c
real    0m14,88s
user    0m13,73s
sys     0m1,11s
$ time clang -Oz -c -o testcase.clang-Oz.o testcase.c
real    0m13,56s
user    0m12,45s
sys     0m1,09s

These are the resulting object file sizes:

    text       data     bss      dec        hex filename
39101286          0       0 39101286    254a366 testcase.clang-O0.o
   72161          0       0    72161      119e1 testcase.clang-O3.o
   72087          0       0    72087      11997 testcase.clang-Os.o
   72087          0       0    72087      11997 testcase.clang-Oz.o
38683240          0       0 38683240    24e4268 testcase.gcc5-O0.o
   87500          0       0    87500      155cc testcase.gcc5-O3.o
   78239          0       0    78239      1319f testcase.gcc5-Os.o
69210504    3170616       0 72381120    45072c0 testcase.tcc.o
fuz

Try Fabrice Bellard's Tiny C Compiler (tcc) from http://tinycc.org:

chqrlie$ time tcc -c test4.c

real    0m1.336s
user    0m1.248s
sys     0m0.084s

chqrlie$ size test4.o
   text    data     bss     dec     hex filename
38953877        3170632       0 42124509        282c4dd test4.o

Yes, that's 1.336 seconds on a pretty basic PC!

Of course I cannot test the resulting executable, but the object file should be linkable with the rest of your program and libraries.

For this test, I used a dummy version of file mex.h:

typedef struct mxArray mxArray;
double *mxGetPr(const mxArray*);
enum { mxREAL = 0 };
mxArray *mxCreateDoubleMatrix(int nx, int ny, int type);

gcc still has not finished compiling...

EDIT: gcc managed to hog my Linux box so badly that I cannot connect to it anymore. :(

chqrlie
  • If I can ask a quick follow-up question: Since you seem to have experience with tcc, do you know the proper syntax for adding an external mex.h file? I read the docs and tried: 'tcc test4.c -Idir"/usr/local/MATLAB/R2015a/extern/include" -Idir"/usr/local/MATLAB/R2015a/simulink/include"' but unfortunately this returned "test4.c:1: error: include file 'mex.h' not found" even though it was in that directory. – user650261 Oct 30 '15 at 22:05
  • remove `dir` in the options: `tcc test4.c -I"/usr/local/MATLAB/R2015a/extern/include" -I"/usr/local/MATLAB/R2015a/simulink/include"`. The options are the same as `gcc`'s. – chqrlie Oct 30 '15 at 22:12
  • @user650261 the command line syntax of `tcc` is mostly the same as that of gcc. – fuz Oct 30 '15 at 22:15