
This question is about the best strategy for implementing the following simulation in C++.

I'm trying to make a simulation as part of a physics research project, which basically tracks the dynamics of a chain of nodes in space. Each node contains a position together with certain parameters (local curvature, velocity, distance to neighbors, etc.) which all evolve through time.

Each time step can be broken down to these four parts:

  • Calculate local parameters. These values depend on the nearest neighbors in the chain.
  • Calculate global parameters.
  • Evolve. The position of each node is moved a small amount, depending on global and local parameters and on some random force fields.
  • Pad. New nodes are inserted whenever the distance between two consecutive nodes reaches a critical value.

(In addition, nodes can get stuck, which makes them inactive for the rest of the simulation. The local parameters of inactive nodes with inactive neighbors will not change and need no further calculation.)

Each node takes ~60 bytes, I have ~100 000 nodes in the chain, and I need to evolve the chain for ~1 000 000 time steps. I would, however, like to maximize these numbers, as that would increase the accuracy of my simulation, under the restriction that the simulation finishes in reasonable time (~hours). (~30% of the nodes will be inactive.)

I have started to implement this simulation as a doubly linked list in C++. This seems natural, as I need to insert new nodes between existing ones, and because the local parameters depend on the nearest neighbors. (I added an extra pointer to the next active node, to avoid unnecessary calculation whenever I loop over the whole chain.)

I'm no expert when it comes to parallelization (or coding, for that matter), but I have played around with OpenMP, and I really like how I can speed up loops of independent operations with two lines of code. I don't know how to make my linked list do things in parallel, or whether that even works. So I had the idea of working with an STL vector instead: rather than storing pointers to the nearest neighbors, I could store their indices and access them by ordinary lookup. I could also sort the vector by position in the chain (every x-th time step) to get better locality in memory. This approach would allow looping the OpenMP way.

I'm quite taken with the idea, as I wouldn't have to deal with memory management myself, and I guess the STL vector implementation is way better than my simple 'new' and 'delete' way of dealing with Nodes in the list. I know I could have done the same with an STL list, but I don't like having to access the nearest neighbors through iterators.

So I ask you, 1337 h4x0rs and skilled programmers: what would be a better design for my simulation? Is the vector approach sketched above a good idea? Are there tricks to make a linked list work with OpenMP? Or should I consider a totally different approach?

The simulation is going to run on a computer with 8 cores and 48 GB of RAM, so I guess I can trade a lot of memory for speed.

Thanks in advance

Edit: I need to add 1-2% new nodes each time step, so storing them in a vector without indices to their nearest neighbors won't work unless I sort the vector every time step.

jonalm
  • How local are your local parameters, and how are the global properties calculated? – Jonathan Dursi Jul 12 '11 at 10:54
  • @Jonathan Dursi: The local parameters are local in the sense that they only depend on the nearest neighbors in the chain. Example: the local curvature is approximated by fitting three points (a node and its neighbors) to a circle. An example of a global parameter is the chain length, the sum of the distances between consecutive nodes. Did that answer your question? – jonalm Jul 12 '11 at 11:14
  • 1
    I thought I would mention - in case you hadn't come across it - the gnu_parallel extension (MSVC has similar). Basically std::library parallelised with openmp behind the scenes. If you write your loops using stl (i.e. for_each, accumulate, inner_product) etc - then you can write a serial version of your code first (to get the simulation right) and then parallelise it almost trivially thereafter (almost, because you still need to make your containers - however you implement them- threadsafe). I found it helpful. http://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html – Tom Jul 12 '11 at 14:32
  • Thanks Tom. Didn't know. Sounds awesome. – jonalm Jul 13 '11 at 08:43

3 Answers


This is a classic tradeoff question. Using an array or std::vector will make the calculations faster and the insertions slower; using a doubly linked list or std::list will make the insertions faster and the calculations slower.

The only way to judge tradeoff questions is empirically: which will work faster for your particular application? All you can really do is try it both ways and see. The more intense the computation and the shorter the stencil (that is, the higher the computational intensity -- how many flops you have to do per amount of memory you have to bring in), the less the choice of container will matter. But basically you should mock up an implementation of your basic computation both ways and see if it matters. I've hacked together a very crude go at something with both std::vector and std::list; it is probably wrong in any of a number of ways, but you can give it a go and play with some of the parameters and see which wins for you. On my system, for the sizes and amount of computation given, list is faster, but it can go either way pretty easily.

W.r.t. OpenMP: yes, if that's the way you're going to go, your hands are somewhat tied; you'll almost certainly have to go with the vector structure. But first you should make sure that the extra cost of the insertions won't blow away any benefit of multiple cores.

#include <iostream>
#include <list>
#include <vector>
#include <cmath>
#include <sys/time.h>
using namespace std;

struct node {
    bool stuck;
    double x[2];
    double loccurve;
    double disttoprev;
};

void tick(struct timeval *t) {
    gettimeofday(t, NULL);
}

/* returns time in seconds from now to time described by t */
double tock(struct timeval *t) {
    struct timeval now;
    gettimeofday(&now, NULL);
    return (double)(now.tv_sec - t->tv_sec) +
        ((double)(now.tv_usec - t->tv_usec)/1000000.);
}

int main()
{
    const int nstart = 100;
    const int niters = 100;
    const int nevery = 30;
    const bool doPrint = false;
    list<struct node>   nodelist;
    vector<struct node> nodevect;

    // Note - vector is *much* faster if you know ahead of time 
    //  maximum size of vector
    nodevect.reserve(nstart*30);

    // Initialize
    for (int i = 0; i < nstart; i++) {
        struct node mynode;
        mynode.stuck = false;
        mynode.x[0] = i; mynode.x[1] = 2.*i;
        mynode.loccurve = -1;
        mynode.disttoprev = -1;

        nodelist.push_back( mynode );
        nodevect.push_back( mynode );
    }

    const double EPSILON = 1.e-6;
    struct timeval listclock;
    double listtime;

    tick(&listclock);
    for (int i=0; i<niters; i++) {
        // Calculate local curvature, distance

        list<struct node>::iterator prev, next, cur;
        double dx1, dx2, dy1, dy2;

        next = cur = prev = nodelist.begin();
        cur++;
        next++; next++;
        dx1 = prev->x[0]-cur->x[0];
        dy1 = prev->x[1]-cur->x[1];

        while (next != nodelist.end()) {
            dx2 = cur->x[0]-next->x[0];
            dy2 = cur->x[1]-next->x[1];

            double slope1 = (dy1/(dx1+EPSILON));
            double slope2 = (dy2/(dx2+EPSILON));

            cur->disttoprev = sqrt(dx1*dx1 + dy1*dy1);

            cur->loccurve = ( slope1*slope2*(dy1+dy2) +
                              slope2*(prev->x[0]+cur->x[0]) -
                              slope1*(cur->x[0] +next->x[0]) ) /
                            (2.*(slope2-slope1) + EPSILON);

            // carry the current segment forward as the previous one
            dx1 = dx2; dy1 = dy2;

            next++;
            cur++;
            prev++;
        }

        // Insert interpolated pt every neveryth pt
        int count = 1;
        next = cur = nodelist.begin();
        next++;
        while (next != nodelist.end()) {
            if (count % nevery == 0) {
                struct node mynode;
                mynode.x[0] = (cur->x[0]+next->x[0])/2.;
                mynode.x[1] = (cur->x[1]+next->x[1])/2.;
                mynode.stuck = false;
                mynode.loccurve = -1;
                mynode.disttoprev = -1;
                nodelist.insert(next, mynode);
            }
            next++;
            cur++;
            count++;
        }
    }
    listtime = tock(&listclock);

    struct timeval vectclock;
    double vecttime;

    tick(&vectclock);
    for (int i=0; i<niters; i++) {
        int nelem = nodevect.size();
        double dx1, dy1, dx2, dy2;
        dx1 = nodevect[0].x[0]-nodevect[1].x[0];
        dy1 = nodevect[0].x[1]-nodevect[1].x[1];

        for (int elem=1; elem<nelem-1; elem++) {
            dx2 = nodevect[elem].x[0]-nodevect[elem+1].x[0];
            dy2 = nodevect[elem].x[1]-nodevect[elem+1].x[1];

            double slope1 = (dy1/(dx1+EPSILON));
            double slope2 = (dy2/(dx2+EPSILON));

            nodevect[elem].disttoprev = sqrt(dx1*dx1 + dy1*dy1);

            nodevect[elem].loccurve = ( slope1*slope2*(dy1+dy2) +
                              slope2*(nodevect[elem-1].x[0] +
                                      nodevect[elem].x[0])  -
                              slope1*(nodevect[elem].x[0] +
                                      nodevect[elem+1].x[0]) ) /
                            (2.*(slope2-slope1) + EPSILON);

            // carry the current segment forward as the previous one
            dx1 = dx2; dy1 = dy2;
        }

        // Insert interpolated pt every neveryth pt
        int count = 1;
        vector<struct node>::iterator next, cur;
        next = cur = nodevect.begin();
        next++;
        while (next != nodevect.end()) {
            if (count % nevery == 0) {
                struct node mynode;
                mynode.x[0] = (cur->x[0]+next->x[0])/2.;
                mynode.x[1] = (cur->x[1]+next->x[1])/2.;
                mynode.stuck = false;
                mynode.loccurve = -1;
                mynode.disttoprev = -1;
                // vector::insert invalidates iterators; re-anchor both
                // from the returned iterator to the inserted element
                next = nodevect.insert(next, mynode);
                cur = next - 1;
            }
            next++;
            cur++;
            count++;
        }
    }
    vecttime = tock(&vectclock);

    cout << "Time for list: " << listtime << endl;
    cout << "Time for vect: " << vecttime << endl;

    vector<struct node>::iterator v;
    list  <struct node>::iterator l;

    if (doPrint) {
        cout << "Vector: " << endl;
        for (v=nodevect.begin(); v!=nodevect.end(); ++v) {
             cout << "[ (" << v->x[0] << "," << v->x[1] << "), " << v->disttoprev << ", " << v->loccurve << "] " << endl;
        }

        cout << endl << "List: " << endl;
        for (l=nodelist.begin(); l!=nodelist.end(); ++l) {
             cout << "[ (" << l->x[0] << "," << l->x[1] << "), " << l->disttoprev << ", " << l->loccurve << "] " << endl;
        }

    }

    cout << "List size is " << nodelist.size() << endl;
}
Jonathan Dursi
  • Thanks for the answers and input, everyone. Accepted Jonathan's answer because he took the effort to write up a test for me. Thanks! Think I'll go for the vector version, because I can then easily parallelize. – jonalm Jul 13 '11 at 08:47

Assuming that creation of new elements happens relatively infrequently, I would take the sorted vector approach, for all the reasons you've listed:

  • No wasting time following pointers/indices around
  • Take advantage of spatial locality
  • Much easier to parallelise

Of course, for this to work, you'd have to make sure that the vector was always sorted, not simply every k-th timestep.

Oliver Charlesworth

This looks like a nice exercise for parallel programming students.

You seem to have a data structure that lends itself naturally to distribution: a chain. You can do quite a bit of work over subchains that are (semi-)statically assigned to different threads. You might want to deal with the N-1 boundary cases between subchains separately, but if the subchain lengths are > 3, those are isolated from each other.

Sure, between steps you'll have to update global variables, but variables such as the chain length are simple parallel additions: just calculate the length of each subchain and then add those up. If your subchains are 100000/8 nodes long, the single-threaded piece of work between steps is the addition of those 8 subchain lengths.

If the growth of nodes is highly non-uniform, you might want to rebalance the subchain lengths every so often.

MSalters