
I am trying to code a multicore Markov chain in C++, and while I am trying to take advantage of the many CPUs (up to 24) to run a different chain on each one, I have a problem picking the right container to gather the results of the numerical evaluations from each CPU. What I am trying to measure is basically the average value of an array of boolean variables. I have tried coding a wrapper around a `std::vector` object looking like this:

#include <fstream>
#include <vector>
using std::vector;
using std::ofstream;

struct densityStack {
    vector<int> density; // will store the sum of the boolean variables
    int card; // will store the number of elements we summed over, for normalizing at the end

    densityStack(int size){ // constructor taking as only parameter the size of the array, usually size = 30
        density = vector<int> (size, 0);
        card = 0;
    }

    void push_back(vector<int> & toBeAdded){ // method summing a new array (of measurements) into our stack
        for(auto valStack = density.begin(), newVal = toBeAdded.begin(); valStack != density.end(); ++valStack, ++newVal)
            *valStack += *newVal;
        card++;
    }

    void savef(const char * fname){ // method outputting into a file
        ofstream out(fname);
        out.precision(10);
        out << card << "\n"; // saving the cardinal in the first line
        for(auto val = density.begin(); val != density.end(); ++val)
            out << (double) *val/card << "\n";
        out.close();
    }
};

Then, in my code I use a single densityStack object, and every time a CPU core has data (which can happen about 100 times per second) it calls push_back to send the data back to the densityStack.

My issue is that this seems to be slower than the first raw approach, where each core stored each array of measurements in a file and I then used a Python script to average and clean up (I was unhappy with that because it stored too much information and put too much useless stress on the hard drives).

Do you see where I could be losing a lot of performance? I mean, is there an obvious source of overhead? Because to me, copying back the vector even at frequencies of 1000 Hz should not be too much.

  • If the code works fine, this might be more suitable for [Code Review](http://codereview.stackexchange.com/) – Sami Kuhmonen Aug 26 '15 at 18:11
  • "every time a CPU core has data it will call push_back" - where's the mutex guarding this? Forget efficient, start with correct code. The correct way to do multi-core accumulation is to accumulate subsums per thread, and then combine those. – MSalters Aug 26 '15 at 20:05
  • I went ahead and deleted my comment as it was more of a 'try this' suggestion than an answer. However, to @MSalters on adding a mutex: I do not see the benefit of a mutex for a vector that is going to remain a single size, never be moved, and only have addition operations invoked on it. I don't see anything thread-unsafe about it. For my own learning's sake: what could get corrupted/go wrong in this scenario? (I'm intentionally ignoring the savef method for this thought) – Diniden Aug 26 '15 at 20:25
  • @Diniden - The vector is not a fixed size, it is having elements appended via push_back. push_back is not atomic or threadsafe, so it needs to be wrapped in a mutex. – patros Aug 26 '15 at 20:33
  • @patros - His push_back method does not call push_back on the vector. His vector is fully allocated in his ctor. – Diniden Aug 26 '15 at 20:35
  • @Diniden - Didn't notice his push_back implementation. You're right that it won't change the vector size, but it is vulnerable to race conditions so it still needs to be wrapped in a mutex. – patros Aug 26 '15 at 20:45
  • @patros - That is what I am trying to understand: what race conditions? If you ignore the savef method, simply adding to the vector does not have anything that could mess up the operations as far as I can tell. – Diniden Aug 26 '15 at 20:49
  • 2
    @Diniden - The line *valStack += *newVal is not threadsafe. It is possible for a thread to get interrupted after computing (the implicit) *valStack + *newVal but before writing it back to *valStack. In this case the += syntax somewhat obscures that this is not an atomic operation. – patros Aug 26 '15 at 20:53
  • @patros: Even `*valStack = *newVal` would be unsafe. `std::atomic` is atomic, regular `int` is not. And if you think x86 guarantees it regardless, think again: C++ optimizers may assume that a non-atomic, non-mutexed int is not accessed by another thread. x86 can only guarantee that 32-bit writes to memory are atomic, not that those writes happen in the first place. – MSalters Aug 26 '15 at 23:32
  • Thanks for ideas guys, I decided to allocate one density stack per core, added a `std::mutex` for each one and then a global function that could average them before outputting the result in a file. It gets the job done, neater, maybe even a bit faster! Many thanks again. – Learning is a mess Aug 27 '15 at 15:44
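
For illustration, the per-core arrangement described in the last comment could look roughly like this. It is only a sketch built on assumptions: the struct name, the merge logic, and the exact locking are guesses, not the poster's actual code.

#include <fstream>
#include <mutex>
#include <vector>

// Hypothetical aggregator: one densityStack and one std::mutex per core,
// merged and averaged into a single file at the end of the run.
struct perCoreStacks {
    std::vector<densityStack> stacks; // one accumulator per core
    std::vector<std::mutex> mutexes;  // one lock per accumulator

    perCoreStacks(int cores, int size)
        : stacks(cores, densityStack(size)), mutexes(cores) {}

    // Called by the worker running on `core`; each worker takes only its own
    // lock, so the chains never contend with each other during the run.
    void push_back(int core, std::vector<int> & toBeAdded){
        std::lock_guard<std::mutex> lock(mutexes[core]);
        stacks[core].push_back(toBeAdded);
    }

    // Combine all per-core sums and write the normalized averages to a file.
    void savef(const char * fname){
        std::ofstream out(fname);
        out.precision(10);
        std::vector<long> total(stacks[0].density.size(), 0);
        long card = 0;
        for(std::size_t c = 0; c < stacks.size(); ++c){
            std::lock_guard<std::mutex> lock(mutexes[c]); // do not read while core c is still writing
            card += stacks[c].card;
            for(std::size_t i = 0; i < total.size(); ++i)
                total[i] += stacks[c].density[i];
        }
        out << card << "\n";
        for(long v : total)
            out << (double) v / card << "\n";
    }
};

The per-core lock is only there so the final merge can safely read a stack that might still be written to; during the simulation itself each chain touches nothing but its own accumulator.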

1 Answer


How are you synchronizing your shared densityStack instance?

From the limited info here my guess is that the CPUs are blocked waiting to write data every time they have a tiny chunk of data. If that is the issue, a simple technique to improve performance would be to reduce the number of writes. Keep a buffer of data for each CPU and write to the densityStack less frequently.
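
One way that buffering could look is sketched below. This is my own illustration of the idea, not code from the question or from this answer; the struct name and the flush threshold are made up, and it assumes the shared densityStack is guarded by a std::mutex.

#include <mutex>
#include <vector>

// Hypothetical per-thread buffer: measurements are accumulated locally and the
// shared stack (and its lock) is only touched once every `flushEvery` samples.
struct bufferedWorker {
    densityStack local;        // thread-private accumulator, no locking needed
    densityStack & shared;     // the single shared instance
    std::mutex & sharedMutex;  // guards the shared instance
    int flushEvery;            // how many samples to batch before flushing

    bufferedWorker(int size, densityStack & sharedStack, std::mutex & m, int flushEvery = 1000)
        : local(size), shared(sharedStack), sharedMutex(m), flushEvery(flushEvery) {}

    void push_back(std::vector<int> & toBeAdded){
        local.push_back(toBeAdded);  // cheap, no lock taken
        if(local.card >= flushEvery)
            flush();                 // rare, takes the shared lock
    }

    void flush(){
        std::lock_guard<std::mutex> lock(sharedMutex);
        for(std::size_t i = 0; i < local.density.size(); ++i)
            shared.density[i] += local.density[i];
        shared.card += local.card;
        local.density.assign(local.density.size(), 0); // reset the local buffer
        local.card = 0;
    }
};

Each worker would call flush() one last time when its chain finishes so the tail of the buffer is not lost.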
