
According to the available documentation for Chapel, (whole-)array statements like

A = B + alpha * C;   // with A, B, and C being arrays, and alpha some scalar

are implemented in the language as the following forall iteration:

forall (a,b,c) in zip(A,B,C) do
   a = b + alpha * c;

Thus, array statements are by default executed by a team of parallel threads. Unfortunately, this also seems to preclude the (partial or complete) vectorization of such statements, which can lead to performance surprises for programmers used to languages like Fortran or Python/NumPy, where array statements are typically vectorized but not multithreaded by default.

For codes that use (whole-)array statements on arrays of small to moderate size, the loss of vectorization (confirmed via Linux hardware performance counters) and the overhead inherent in parallel threads (which are ill-suited to exploiting the fine-grained data parallelism available in such problems) can cause a substantial loss of performance. As an example, consider the following versions of a Jacobi iteration, all of which solve the same problem on a domain of 300 x 300 zones:

Jacobi_1 employs (whole-)array statements, as follows:

/*
 *  Jacobi_1
 *
 *  This program (adapted from the Chapel distribution) performs
 *  niter iterations of the Jacobi method for the Laplace equation
 *  using (whole-)array statements.
 *
 */

config var n = 300;                  // size of n x n grid
config var niter = 10000;            // number of iterations to perform

proc main() {

  const Domain = {0..n+1,0..n+1};    // domain including boundary points

  var iteration = 0;                 // iteration counter
  var X, XNew: [Domain] real = 0.0;  // declare arrays: 
                                     //   X stores approximate solution
                                     //   XNew stores the next solution  
  X[n+1,1..n] = 1.0;                 // Set south boundary values to 1.0

  do {

    // compute next approximation
    XNew[1..n,1..n] =
      ( X[0..n-1,1..n] + X[2..n+1,1..n] +
        X[1..n,2..n+1] + X[1..n,0..n-1] ) / 4.0;

    // update X with next approximation
    X[1..n,1..n] = XNew[1..n,1..n];

    // advance iteration counter
    iteration += 1;

  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);

} // main

Jacobi_2 employs serial for-loops throughout (i.e., only (auto-)vectorization by the back-end C compiler is possible):

/*
 *  Jacobi_2
 *
 *  This program (adapted from the Chapel distribution) performs
 *  niter iterations of the Jacobi method for the Laplace equation
 *  using (serial) for-loops.
 *
 */

config var n = 300;                  // size of n x n grid
config var niter = 10000;            // number of iterations to perform

proc main() {

  const Domain = {0..n+1,0..n+1};    // domain including boundary points

  var iteration = 0;                 // iteration counter
  var X, XNew: [Domain] real = 0.0;  // declare arrays: 
                                     //   X stores approximate solution
                                     //   XNew stores the next solution  
  for j in 1..n do
    X[n+1,j] = 1.0;                  // Set south boundary values to 1.0

  do {

    // compute next approximation
    for i in 1..n do
      for j in 1..n do  
        XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
                      X[i,j+1] + X[i,j-1] ) / 4.0;

    // update X with next approximation
    for i in 1..n do
      for j in 1..n do
        X[i,j] = XNew[i,j];

    // advance iteration counter
    iteration += 1;

  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);

} // main

Jacobi_3, finally, has the innermost loops vectorized and only the outermost loops threaded:

/*
 *  Jacobi_3
 *
 *  This program (adapted from the Chapel distribution) performs
 *  niter iterations of the Jacobi method for the Laplace equation
 *  using both parallel and serial (vectorized) loops.
 *
 */

config var n = 300;                  // size of n x n grid
config var niter = 10000;            // number of iterations to perform

proc main() {

  const Domain = {0..n+1,0..n+1};    // domain including boundary points

  var iteration = 0;                 // iteration counter
  var X, XNew: [Domain] real = 0.0;  // declare arrays: 
                                     //   X stores approximate solution
                                     //   XNew stores the next solution  
  for j in vectorizeOnly(1..n) do
    X[n+1,j] = 1.0;                  // Set south boundary values to 1.0

  do {

    // compute next approximation
    forall i in 1..n do
      for j in vectorizeOnly(1..n) do
        XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
                      X[i,j+1] + X[i,j-1] ) / 4.0;

    // update X with next approximation
    forall i in 1..n do
      for j in vectorizeOnly(1..n) do
        X[i,j] = XNew[i,j];

    // advance iteration counter
    iteration += 1;

  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);

} // main

Running these codes on a laptop with 2 processor cores and two parallel threads, one finds that Jacobi_1 is (surprisingly) more than ten times slower than Jacobi_2, which in turn is (as expected) about 1.6 times slower than Jacobi_3.

Unfortunately, this default behavior makes array statements completely unattractive for my use cases, even for algorithms that would benefit enormously from the more concise notation and readability that (whole-)array statements can provide.

Are there ways for the user to change this default behavior in Chapel? That is, can a user customize the default parallelization of whole-array statements so that the array statements in Jacobi_1 behave either like the code in Jacobi_2 (useful for development and debugging) or like the code in Jacobi_3 (which, of the three, would be the method of choice for production calculations)?

I have tried to achieve this by plugging calls to "vectorizeOnly()" into the definition of "Domain" above, but to no avail.

1 Answer


Chapel's intent is to support vectorization automatically within the per-task serial loops that are used to implement forall loops (for cases that are legally vectorizable). Yet, as you note, that capability is not well-supported today (even the vectorizeOnly() iterator that you're using is considered only a prototype).

I'll mention that we tend to see better vectorization results with Chapel's LLVM back-end than with the (default) C back-end, and even better results when using Simon Moll's LLVM-based Region Vectorizer (Saarland University). That said, we've also seen cases where the LLVM back-end underperforms the C back-end, so your mileage may vary; but if you care about vectorization, it's worth a try.

To your specific question:

Are there ways for the user in Chapel to change this default behavior?

There are. For explicit forall loops, you can write your own parallel iterator, which can be used to give a forall loop a different implementation strategy than our default iterators use. If you implement one that you like, you can then write (or clone and modify) a domain map (background here) to govern how loops over a given array are implemented by default (i.e., when no iterator is explicitly invoked). This permits end-users to specify implementation policies for a Chapel array other than the ones we support by default.
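
For instance, here is a minimal sketch of a user-defined parallel iterator; the name myChunkedIter and its naive chunking policy are purely illustrative, not part of Chapel's standard library:

// Purely illustrative: a chunked parallel iterator; the name and the
// chunking policy are not part of Chapel's standard library.

config const n = 300;
var A: [1..n] real;

// serial overload, used by for-loops and as a fallback
iter myChunkedIter(r: range) {
  for i in r do yield i;
}

// standalone parallel overload: one task per available core, each of
// which yields a contiguous chunk of the range serially (these per-task
// serial loops are where vectorization could, in principle, occur)
iter myChunkedIter(param tag: iterKind, r: range)
       where tag == iterKind.standalone {
  const numTasks = here.maxTaskPar;
  coforall t in 0..#numTasks {
    const lo = r.low + (t * r.size) / numTasks,
          hi = r.low + ((t+1) * r.size) / numTasks - 1;
    for i in lo..hi do yield i;       // empty chunk if lo > hi
  }
}

// a non-zippered forall over the iterator picks up the standalone overload
forall i in myChunkedIter(1..n) do
  A[i] = i;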

With respect to your three code variants, I note that the first uses multidimensional zippering, which is known to have significant performance problems today. That is the likely main cause of the performance differences between it and the others. For example, I suspect that if you rewrote it using the form forall (i,j) in Domain ... and then used +/-1 indexing per dimension, you'd see a significant improvement (and, I'd guess, performance much more comparable to that of the third case).
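
To make that concrete, here is a sketch of what the rewritten update step might look like (assuming the declarations from Jacobi_1, and iterating over the interior sub-domain {1..n, 1..n} rather than the full Domain, since Domain also contains the boundary points):

// forall over index tuples with +/-1 indexing per dimension,
// instead of zippered multidimensional array slices
forall (i,j) in {1..n, 1..n} do
  XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
                X[i,j+1] + X[i,j-1] ) / 4.0;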

For the third, I'd be curious whether the benefits you're seeing are due to vectorization or simply to multitasking, since you've avoided both the performance problem of the first variant and the serial execution of the second. E.g., have you checked whether using the vectorizeOnly() iterator adds any performance improvement over the same code without it (or inspected the generated binaries to see whether vectorization is actually occurring)?

In any Chapel performance study, make sure to throw the --fast compiler flag. And again, for best vectorization results, you might try the LLVM back-end.
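
For reference, a possible compile line might look as follows (a sketch only: the source file name is made up, and flag spellings can vary between Chapel versions):

chpl --fast jacobi_3.chpl            # default C back-end
chpl --fast --llvm jacobi_3.chpl     # LLVM back-end (requires a Chapel build with LLVM support)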

Brad

  • Thanks a lot for your answer. I will go through the links you provided to see whether I can come up with a customized solution (this will have to wait until tomorrow, though). To answer your question regarding my tests: what I see in the first test (according to Linux perf tools) is a complete lack of vectorization, quite possibly exacerbated by the multidimensional zippering problem that you mentioned (I will do some more testing on that). Tests 2 and 3 both led to the same number of fully vectorized instructions being executed, so the speed-up of Test 3 is due to multitasking. –  Aug 06 '18 at 22:00
  • Great, thanks for the additional notes. Feel free to check in on gitter if you have specific questions about user-defined iterators or domain maps that we can help with. But if you're seeing the vectorization from case 3 that you expected, I suspect you'll also see it if you write case 1 using a forall loop and indexing to avoid the zippered multidimensional loop. – Brad Aug 06 '18 at 22:06
  • What you suspected is (almost) correct. After rewriting Jacobi_1 to use "forall ij in Domain do" statements (and +1/-1 indexing per dimension) instead of array operations, I get almost the same performance as from Jacobi_3 (the vectorization is slightly worse, though). –  Aug 13 '18 at 13:07
  • As it turned out after more testing, whole-array statements in Chapel perform very badly not just for small to modest-size arrays (see above) but also for large arrays. For arrays of size 600^3, array statements are still about a factor of two slower than *serial* for loops. I now consider this clearly a bug (or actually the choice of an unsuitable algorithm for implementing whole-array operations) rather than a "feature" and will file an issue on GitHub. –  Aug 13 '18 at 13:24
  • For completeness: in all my tests, gcc was employed as the back-end compiler, with the following compiler flags: --fast --ccflags '-Ofast'. I have also tried out the LLVM back-end in a few cases, but it gave worse vectorization for these benchmarks than gcc. In any case, on the old hardware (SSE2/3 instructions) that I ran all of this on, a lack of proper vectorization can at worst cost a factor of two in performance; there is no way it could explain a performance penalty of (more than) an order of magnitude. Hence this is clearly an implementation/algorithmic issue. –  Aug 13 '18 at 15:47
  • I've commented on the GitHub issue filed by @user10172900 above. – Brad Aug 29 '18 at 22:23