False Sharing in Hogwild! Algorithms

Question

I am trying to implement the Hogwild! Linear SVM algorithm, but I am running into false sharing problems with my implementation.

My code is below, but the background is that I am trying to compute which samples fail my test and make and update which is given by that set of vectors. Hogwild! (as far as I understand) simply makes the update on the same memory totally asynchronously. This would create "noise" in a mathematical sense due to the improperly times updates.

Sadly, as I try to do these async updates, the L1 cache is invalidated and has to be re-fetched. Below is my code.

Is there a good way to fix this false sharing without losing the asynchronous? (I am more of a mathematician than a computer scientist). This mentions that using different optimization levels can fix this.

void update(size_t epoch, const double *X_data, const int *X_indices,
            const int *X_indptr, const int *Y, double *W,
            double reg, double step_size, size_t nodes,
            size_t X_height, size_t X_width) {

  size_t i, j;
  double step = step_size/(1 + epoch);
  double c;

#pragma omp parallel shared(W, X_data, X_indices, X_indptr, Y) private(i, j, c)
  {
#pragma for schedule(static)
    for (i=0;i<X_height;i++) {
      c = 0.0;
      for (j=X_indptr[i];j<X_indptr[i+1];j++)
        c += X_data[j]*W[X_indices[j]]; // Scaled to discount the MPI scaling                                                

      if (Y[i]*c > 1)
        continue;

      for (j=X_indptr[i];j<X_indptr[i+1];j++)
        W[X_indices[j]] += step*Y[i]*X_data[j]/(X_height*nodes);
    } // END FOR OMP PARALLELIZED                                                                                            

#pragma for schedule(static) // Might not do much                                                                            
    for (i=0;i<X_width;i++) // (1 - self.reg*step)*self.W/self.nodes +                                                       
      W[i] *= (1 - reg*step)/nodes;
  }
}

Are you sure you don't have race conditions? Are the values of `X_indptr[i]` unique? If they are not it appears to me that two threads could try to write to memory using the same index. — Z boson, Nov 26 '15 at 10:04
Added the tag. No, the indexes are not unique, but the race conditions are "desired" in the sense that mathematically the "noise" created by them decays out. The paper shows this. They are not desired in the sense of the actual computation is causing the caches to invalidate and creating a memory bound — , Nov 28 '15 at 04:00
I vote for this as "most absurd-looking, yet actually meaningful, title of the year" :) — Tom Zych, Nov 28 '15 at 11:08

Gilles · Accepted Answer · 2015-11-28T10:54:51.763

I don't know much about the algorithm you mention, but it looks to me it globally is much more memory bound than compute bound. To convince you, here is a quick rewrite of your code:

void update( size_t epoch, const double *X_data, const int *X_indices,
             const int *X_indptr, const int *Y, double *W,
             double reg, double step_size, size_t nodes,
             size_t X_height, size_t X_width ) {

    const double step = step_size / ( 1 + epoch );
    const double ratio = step / ( X_height * nodes );
    const double tapper = ( 1 - reg * step ) / nodes;

    #pragma omp parallel 
    {
        #pragma omp for schedule( static )
        for ( size_t i = 0; i < X_height; i++ ) {
            double c = 0;
            for ( int j = X_indptr[i]; j < X_indptr[i+1]; j++ ) {
                c += X_data[j] * W[X_indices[j]]; // Scaled to discount the MPI scaling
            }
            if ( Y[i] * c <= 1 ) {
                double ratioYi = Y[i] * ratio;
                for ( int j = X_indptr[i]; j < X_indptr[i+1]; j++ ) {
                    // ATTENTION: this will collide across threads and have undefined result BY DESIGN
                    W[X_indices[j]] += ratioYi * X_data[j];
                }
            }
        } // END FOR OMP PARALLELIZED

        #pragma omp for schedule( static ) // Might not do much
        for ( size_t i = 0; i < X_width; i++ ) { // (1 - self.reg*step)*self.W/self.nodes +
            W[i] *= tapper;
        }
    }
}

As you can see, I did a few changes. Most of them are purely stylistic (like indentation, spacing, variable declaration location, etc), but some are truly important. For example, by defining ratio, and ratioYi as shallow in loops as possible, I remove (or help the compiler to remove, would it have done it) most the of computations from the code. It becomes all of a sudden obvious that the code almost only accesses data and computes very little. Therefore, unless you have a multi-socket (or multi memory controller) shared memory machine, you won't see much speed-up (if any) out of this OpenMP parallelisation.

Moreover, the "by design" race conditions the algorithm accepts while updating W in parallel, even if justified in the paper you pointed, keep on puzzling me. I still wouldn't like to rely on an undefined behaviour for a computational code (or any code for that matter).

Anyway, assuming the code does what you want, scales and is indeed only limited by L1 cache invalidations due to false sharing (or indeed true sharing here, since you authorise data collisions), a possible "solution" would be to increase the size of your W array, by for example doubling its size, and to only store meaningful data every second index. In your algorithm, this wouldn't change anything. Simply, you would have to multiply by 2 X_indices. By doing that, you would even more limit the likelihood of false sharing by mechanically dividing by two the number of useful data stored in a single cache line. However, again, for a memory-bound code, increasing the memory consumption might not be the best idea ever... But since it is super straightforward a test, just try it and see if it gives you any benefit.

A final note also to say that your code had a bug in the OpenMP parallelisation, whereby you had #pragma for instead of #pragma omp for. Not sure if this was a typo while copying here, but better mentioning it just in case.

Nice code cleanup. I guess a private `W_local` vector won't work (easily) because of `c += X_data[j] * W[X_indices[j]];` I mean `c` depends on all the full sum I think and so partial sums won't work. `#pragma atomic` before writing to `W` would fix the problem but I don't know about performance. If `if ( Y[i] * c <= 1 ) ` is not true often then this should be okay. — Z boson, Nov 26 '15 at 12:02
@Zboson Yeah, I thought abut the atomic too. But the bulk of the paper the OP was referring to was about not caring about race conditions in the update of `W` and that the algorithm would converge all the same, but only with better speed-up... So I simply added a comment inside the code telling the race condition was "by design", although I wouldn't do that myself in one of my codes. — Gilles, Nov 26 '15 at 12:10
Oh, good point (already up voted). I find the claim a bit surprising. Race conditions at best still cause slow downs even if the result turns out okay. — Z boson, Nov 26 '15 at 12:16
Thank you. Makes a lot of sense. Memory bounds have been my problem most of this time. My next idea was to try a vector of locks to lock computation for elements or blocks of the W vector. I will try this later, but can you think of any way to remove the memory access bound of my computation? — , Nov 28 '15 at 04:09
@aidan.plenert.macdonald, what about sorting `X_indices` and then acting on the sorted array? — Z boson, Nov 28 '15 at 10:40
@aidan.plenert.macdonald memory bound problems are the most complicated ones to optimise. The typical approach is to tile your problem to better fit in cache and improve data reuse. But here, I hardly see it an option since you have a very random access pattern. Doubling the length of the `W` array as suggested would improve the false sharing issue, but worsen the data fitting in cache. Finally, a note to say that I'm not convince the algorithm is right, even with the "per design" collision, since the flushing of `W` is very unpredictable. — Gilles, Nov 28 '15 at 11:02
@Gilles You are right. I think something is up with the algorithm. I will spend more time looking at the paper. I'll report back if there was something I missed. I wasn't thinking about cache flushes previously. — , Nov 28 '15 at 15:29
http://arxiv.org/pdf/1506.06438v1.pdf provides a further analysis. They mention cache synchronization here and prove convergence under this system as well. Furthermore they introduce fixed point arithmetic as a improvement to the computation. — , Nov 28 '15 at 15:45
I made all the changes you did, and this cut the computation time in half! The biggest player was adding the omp after the pragma ... amazing what writing code properly does! — , Nov 28 '15 at 16:20

False Sharing in Hogwild! Algorithms

1 Answers1