
I have automatically generated a huge, but very simple .cpp file. It defines a class:

#include <QString>
#include <map>

class CTrigramFrequencyTable_English 
{
public:
    CTrigramFrequencyTable_English();

private:
    std::map<QString /*trigram*/, quint64 /*count*/> _trigramFrequencyTable;
    const quint64 _totalTrigramCount;
};

and puts 10k lines of the following kind in the constructor:

_trigramFrequencyTable[QString("and")] = 48760ull;

I started compiling this .cpp about 10 minutes ago, and it's still going. Is there any way to achieve what I want and reduce the compilation time? Why is it even taking so long? I've seen quite a few libraries with 3k-5k lines of regular code, even with templates, and they compiled very fast.

Bottom line: I don't want to put my data into a resource file and parse that file; I want to compile the data directly into the binary.

P.S. The 10k-line file compiles in about 30 seconds in the debug configuration; in release I waited for 10 minutes and then terminated the process.

Violet Giraffe
  • @MrEricSir Why do you think this would overflow the stack? There's nothing indicating that the code is called _recursively_. – Captain Obvlious Dec 13 '14 at 19:54
  • Nope, there certainly won't be stack overflow. If anything, it will just run slowly. – Violet Giraffe Dec 13 '14 at 20:01
  • Which compiler are you using ([GCC](http://gcc.gnu.org/) or [Clang/LLVM](http://clang.llvm.org/)), which version (`g++ 4.9`?), and which optimization flags? – Basile Starynkevitch Dec 13 '14 at 20:12
  • @BasileStarynkevitch: MSVC 2012, default debug configuration (/Od). /O2 in release. – Violet Giraffe Dec 13 '14 at 20:16
  • Out of curiosity, what program is it (I can guess a little bit)? Is it free software? Is it a research project (do you have some papers, I could be interested!) – Basile Starynkevitch Dec 13 '14 at 20:31
  • @BasileStarynkevitch: no research, just a standalone library for use in my personal project. What I'm trying to do is guess what encoding a piece of text is encoded with, and I intend to do that by comparing statistical characteristics of a given set of symbols decoded with different encodings to a reference distribution. Sort of a dictionary-based encoding brute force to determine how to decode text properly. It's open source. I haven't committed it yet, but here's the repo: https://github.com/VioletGiraffe/text-encoding-detector The piece of code in my question is the reference distribution. – Violet Giraffe Dec 13 '14 at 21:15
  • Then you really should put your huge data in a standalone C (not C++) file. – Basile Starynkevitch Dec 13 '14 at 21:18

1 Answer


In my experience (in MELT, with recent GCC, e.g. 4.8 or 4.9) with generated C++ code (written in a C-like style), the compilation time of a routine is quadratic in the size (number of lines) of that routine as soon as you ask the compiler to optimize.

Register allocation and instruction scheduling algorithms inside any optimizing compiler are hard and complex!

In your particular case, you should consider changing your C++ code generating script to emit something like:

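#include <stddef.h>  /* for NULL */

struct my_trigram_pair_st {
    const char *name;          /* UTF-8 encoded trigram */
    unsigned long long freq;   /* occurrence count */
};

const struct my_trigram_pair_st my_trigrams[] = {
  { "and", 48760ull },
  /* zillions of similar lines */
  { NULL, 0 }                  /* sentinel: marks the end of the table */
};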

Preferably, emit that as C (not C++) code. It can be C code because const char* is a plain C string (for literal strings like "and") and freq is a plain number. Also change your generator to emit legal C99 string literals, so don't emit Ô directly, but \303\224 or preferably \xc3\x94.
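The escaping step in the generator could look roughly like the sketch below. The function name `escapeForCLiteral` is hypothetical and not part of the original script. It emits three-digit octal escapes rather than hex ones, because a `\x` escape in C absorbs every hex digit that follows it (so `\x94` directly followed by a letter such as `a` would be misread), whereas an octal escape stops after three digits.

```cpp
#include <cstdio>
#include <string>

// Hypothetical generator helper: turns a UTF-8 byte string into the body of
// a C string literal that contains only printable ASCII characters.
std::string escapeForCLiteral(const std::string& utf8)
{
    std::string out;
    for (unsigned char byte : utf8) {
        if (byte == '"' || byte == '\\' || byte == '?') {
            // Escape quote and backslash; escaping '?' avoids accidental trigraphs.
            out += '\\';
            out += static_cast<char>(byte);
        } else if (byte >= 0x20 && byte < 0x7f) {
            // Printable ASCII is copied unchanged.
            out += static_cast<char>(byte);
        } else {
            // Everything else (including UTF-8 multi-byte sequences) becomes
            // a three-digit octal escape, one per byte.
            char buf[5];
            std::snprintf(buf, sizeof buf, "\\%03o", static_cast<unsigned>(byte));
            out += buf;
        }
    }
    return out;
}
```

With this, the two UTF-8 bytes of Ô come out as `\303\224`, matching the octal form mentioned above.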

Then, adjust your C++ program to use that:

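// Declaration of the table defined in the generated C file
// (struct my_trigram_pair_st must be declared before this line).
extern "C" const struct my_trigram_pair_st my_trigrams[];

// In the constructor: walk the table up to the sentinel entry.
for (int i = 0; my_trigrams[i].name != nullptr; ++i)
    _trigramFrequencyTable[QString::fromUtf8(my_trigrams[i].name)] = my_trigrams[i].freq;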

Here you are converting UTF-8 const char* strings to QStrings at run time.
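To try the table-plus-loop idea without Qt, a minimal self-contained sketch might look like this. It assumes `std::string` keys in place of `QString`; the names `sample_trigrams`, `TrigramTable` and `loadTrigrams` are hypothetical, and only the "and" entry comes from the question (the second entry is made-up sample data).

```cpp
#include <cstdint>
#include <map>
#include <string>

// Same layout as the generated table.
struct my_trigram_pair_st {
    const char* name;
    unsigned long long freq;
};

// Stand-in for the generated file.
static const my_trigram_pair_st sample_trigrams[] = {
    { "and", 48760ull },
    { "\303\224te", 12ull },   // made-up entry; starts with the UTF-8 bytes of 'Ô'
    { nullptr, 0 }             // sentinel: marks the end of the table
};

struct TrigramTable {
    std::map<std::string, std::uint64_t> counts;
    std::uint64_t total = 0;
};

// Walks the table up to the sentinel, filling the map and summing the counts.
TrigramTable loadTrigrams(const my_trigram_pair_st* table)
{
    TrigramTable result;
    for (const my_trigram_pair_st* p = table; p->name != nullptr; ++p) {
        result.counts[p->name] = p->freq;
        result.total += p->freq;
    }
    return result;
}
```

The compiler only sees one static array and one short loop, so there is no huge function body for the optimizer to analyze. With Qt, the key would be built with `QString::fromUtf8(p->name)` instead.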

If you need your script to generate functions, make it split them into smaller functions (e.g. at most a thousand lines each).
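As a rough sketch of that splitting step, assuming the generator already holds the statements as strings: the names `emitChunkedFunctions`, `fillChunkN`, `fillAll` and the `Table` type in the emitted code are all hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical generator step: wraps the statements in functions named
// fillChunk0, fillChunk1, ... with at most maxPerFunction statements each,
// then emits fillAll(), which calls every chunk in order.
std::string emitChunkedFunctions(const std::vector<std::string>& statements,
                                 std::size_t maxPerFunction)
{
    if (maxPerFunction == 0)
        maxPerFunction = 1;  // guard against an endless loop below

    std::ostringstream out;
    std::size_t chunkCount = 0;
    for (std::size_t i = 0; i < statements.size(); i += maxPerFunction) {
        out << "static void fillChunk" << chunkCount++ << "(Table& t)\n{\n";
        const std::size_t end = std::min(i + maxPerFunction, statements.size());
        for (std::size_t j = i; j < end; ++j)
            out << "    " << statements[j] << "\n";
        out << "}\n\n";
    }

    out << "void fillAll(Table& t)\n{\n";
    for (std::size_t c = 0; c < chunkCount; ++c)
        out << "    fillChunk" << c << "(t);\n";
    out << "}\n";
    return out.str();
}
```

Calling it with 10,000 statements and `maxPerFunction` set to 1000 would emit ten chunk functions, so the optimizer works on ten small routines instead of one huge one.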

Alternatively, put your huge data in, e.g., an SQLite and/or JSON file (you could even have an SQLite file with JSON inside).

You could also disable optimizations in your compiler when compiling that particular file. Or you could wait much longer (hours).
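Per-file compiler options in the build system are the usual way to do this. If changing the build settings is inconvenient, the optimizer can also be switched off from inside the generated file with compiler-specific pragmas. The following is only a sketch: `fillTable` is a hypothetical stand-in for the generated constructor, and how much compile time this saves has not been measured here.

```cpp
#include <map>
#include <string>

// Turn the optimizer off for the functions defined below.
#if defined(__clang__)
#  pragma clang optimize off
#elif defined(_MSC_VER)
#  pragma optimize("", off)
#elif defined(__GNUC__)
#  pragma GCC push_options
#  pragma GCC optimize("O0")
#endif

// Stand-in for the huge generated function.
void fillTable(std::map<std::string, unsigned long long>& table)
{
    table["and"] = 48760ull;
    // ... thousands of similar generated lines ...
}

// Restore the optimization settings given on the command line.
#if defined(__clang__)
#  pragma clang optimize on
#elif defined(_MSC_VER)
#  pragma optimize("", on)
#elif defined(__GNUC__)
#  pragma GCC pop_options
#endif
```

The pragmas sit after the includes so that the rest of the translation unit keeps its normal optimization level.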

Basile Starynkevitch
  • This can't be C because strings must be UTF-8, there can be all sorts of non-ASCII symbols there, which is why I have to depend on Qt for string processing. But I can still try your suggestions and use `QString` instead of `char*`. – Violet Giraffe Dec 13 '14 at 20:23
  • And, of course, there's a question of how portable my code is since I'm still passing non-ASCII characters as `char*`, but it works in Visual Studio in a UTF-8 encoded source. – Violet Giraffe Dec 13 '14 at 20:24