According to the available documentation for Chapel, (whole-)array statements like
A = B + alpha * C; // with A, B, and C being arrays, and alpha some scalar
are implemented in the language as the following forall iteration:
forall (a, b, c) in zip(A, B, C) do
  a = b + alpha * c;
Thus, array statements are by default executed by a team of parallel threads. Unfortunately, this also seems to preclude any (partial or complete) vectorization of such statements, which can lead to performance surprises for programmers used to languages like Fortran or Python/NumPy, where the default behavior is typically for array statements to be vectorized only, not threaded.
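To make this concrete, here is a minimal, self-contained sketch (the array and scalar names are mine and purely illustrative) that places a whole-array statement next to its documented zippered-forall equivalent:

// Illustrative only: whole-array statement vs. its zippered-forall form
config const m = 8;
const D = {1..m};
var A1, A2, B, C: [D] real;
const alpha = 2.0;

B = 1.0;
C = 3.0;

// whole-array statement
A1 = B + alpha * C;

// the equivalent zippered forall, per the documentation quoted above
forall (a, b, c) in zip(A2, B, C) do
  a = b + alpha * c;

writeln(A1 == A2);   // elementwise comparison: prints all 'true'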
For codes that use (whole-)array statements on arrays of small to moderate size, the loss of vectorization (confirmed with Linux hardware performance counters) and the significant overhead of parallel threads (which are ill-suited to exploiting the fine-grained data parallelism available in such problems) can cause a substantial loss of performance. As an example, consider the following versions of a Jacobi iteration that all solve the same problem on a domain of 300 x 300 zones:
Jacobi_1 employs (whole-)array statements, as follows:
/*
 * Jacobi_1
 *
 * This program (adapted from the Chapel distribution) performs
 * niter iterations of the Jacobi method for the Laplace equation
 * using (whole-)array statements.
 *
 */

config var n = 300;       // size of n x n grid
config var niter = 10000; // number of iterations to perform

proc main() {
  const Domain = {0..n+1, 0..n+1};  // domain including boundary points
  var iteration = 0;                // iteration counter
  var X, XNew: [Domain] real = 0.0; // declare arrays:
                                    //   X stores approximate solution
                                    //   XNew stores the next solution

  X[n+1, 1..n] = 1.0;               // set south boundary values to 1.0

  do {
    // compute next approximation
    XNew[1..n, 1..n] =
      ( X[0..n-1, 1..n] + X[2..n+1, 1..n] +
        X[1..n, 2..n+1] + X[1..n, 0..n-1] ) / 4.0;

    // update X with next approximation
    X[1..n, 1..n] = XNew[1..n, 1..n];

    // advance iteration counter
    iteration += 1;
  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);
} // main
Jacobi_2 employs serial for-loops throughout (i.e., only (auto-)vectorization by the back-end C compiler is allowed):
/*
 * Jacobi_2
 *
 * This program (adapted from the Chapel distribution) performs
 * niter iterations of the Jacobi method for the Laplace equation
 * using (serial) for-loops.
 *
 */

config var n = 300;       // size of n x n grid
config var niter = 10000; // number of iterations to perform

proc main() {
  const Domain = {0..n+1, 0..n+1};  // domain including boundary points
  var iteration = 0;                // iteration counter
  var X, XNew: [Domain] real = 0.0; // declare arrays:
                                    //   X stores approximate solution
                                    //   XNew stores the next solution

  for j in 1..n do
    X[n+1, j] = 1.0;                // set south boundary values to 1.0

  do {
    // compute next approximation
    for i in 1..n do
      for j in 1..n do
        XNew[i, j] = ( X[i-1, j] + X[i+1, j] +
                       X[i, j+1] + X[i, j-1] ) / 4.0;

    // update X with next approximation
    for i in 1..n do
      for j in 1..n do
        X[i, j] = XNew[i, j];

    // advance iteration counter
    iteration += 1;
  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);
} // main
Jacobi_3, finally, has the innermost loops vectorized and only the outermost loops threaded:
/*
 * Jacobi_3
 *
 * This program (adapted from the Chapel distribution) performs
 * niter iterations of the Jacobi method for the Laplace equation
 * using both parallel and serial (vectorized) loops.
 *
 */

config var n = 300;       // size of n x n grid
config var niter = 10000; // number of iterations to perform

proc main() {
  const Domain = {0..n+1, 0..n+1};  // domain including boundary points
  var iteration = 0;                // iteration counter
  var X, XNew: [Domain] real = 0.0; // declare arrays:
                                    //   X stores approximate solution
                                    //   XNew stores the next solution

  for j in vectorizeOnly(1..n) do
    X[n+1, j] = 1.0;                // set south boundary values to 1.0

  do {
    // compute next approximation
    forall i in 1..n do
      for j in vectorizeOnly(1..n) do
        XNew[i, j] = ( X[i-1, j] + X[i+1, j] +
                       X[i, j+1] + X[i, j-1] ) / 4.0;

    // update X with next approximation
    forall i in 1..n do
      for j in vectorizeOnly(1..n) do
        X[i, j] = XNew[i, j];

    // advance iteration counter
    iteration += 1;
  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);
} // main
Running these codes on a laptop with 2 processor cores and using two parallel threads, one finds that Jacobi_1 is (surprisingly) more than ten times slower than Jacobi_2, which itself is (as expected) about 1.6 times slower than Jacobi_3.
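A minimal sketch of how such timings can be taken with Chapel's standard Time module (illustrative only; not necessarily the harness used for the numbers above):

use Time;

proc main() {
  var t: Timer;
  t.start();
  // ... run one of the Jacobi kernels (Jacobi_1/2/3) here ...
  t.stop();
  writeln("elapsed time: ", t.elapsed(), " s");
}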
Unfortunately, this default behavior makes array statements completely unattractive for my use cases, even for algorithms that would benefit enormously from the more concise notation and improved readability that (whole-)array statements can provide.
Are there ways for the user in Chapel to change this default behavior? That is, can a user customize the default parallelization of whole-array statements so that the array statements in Jacobi_1 behave either like the code in Jacobi_2 (which would be useful for code development and debugging) or like the code in Jacobi_3 (which, of the three, would be the method of choice for production calculations)?
I have tried to achieve this by plugging calls to "vectorizeOnly()" into the definition of "Domain" above, but to no avail.