Calculate rowSums in Chapel for a matrix

Question

Continuing my Chapel adventures...

I have a matrix A.

var idx = {1..n};
var adom = {idx, idx};
var A: [adom] int;
//populate A;

var rowsums: [idx] int;

What is the most efficient way to populate rowsums?

SO won't let me make edits of less than six characters, but note that the curly brackets in the declaration of `idx` are probably unintended (they suggest that `adom` is an associative domain whose indices are themselves 1D domains rather than the 2D domain that I think you intended). — Brad, Aug 16 '17 at 23:14
Also, as a general comment—when you declare domains whose index sets you don't plan to change, making them `const` rather than `var` provides helpful semantic information to the compiler's optimizations (e.g., "you will never have to reallocate arrays declared in terms of this domain"). Chapel doesn't do a ton of optimizations for const domains at present, but will do more and more as time goes on. — Brad, Aug 16 '17 at 23:15

ben-albrecht · Accepted Answer · 2017-08-17T19:13:58.033

The most efficient solution is hard to define. However, here is one way to compute rowsums that is both parallel and elegant:

config const n = 8;                // n must be declared; a "naked" n would fail to compile with:
const indices = 1..n;              //   tio.chpl:1: error: 'n' undeclared (first use this function)
const adom = {indices, indices};
var A: [adom] int;

// Populate A
[(i,j) in adom] A[i, j] = i*j;

var rowsums: [indices] int;


forall i in indices {
  rowsums[i] = + reduce(A[i, ..]);
}

writeln(rowsums);

This uses a `+` reduction over the row slices of `A`.

Note that both the `forall` and the `+ reduce` introduce parallelism into the program above. If `indices` is sufficiently small, it may be more efficient to use only a `for` loop and avoid the task-spawning overhead.
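As a rough sketch of the serial alternative mentioned above: because `+ reduce` is itself a parallel construct, a version with no task creation at all would replace both the `forall` and the reduction with plain `for` loops. The default `n = 8` and the `i*j` fill are carried over from the example above; the accumulator name `total` is only illustrative.

```chapel
config const n = 8;
const indices = 1..n;
const adom = {indices, indices};
var A: [adom] int;

// Populate A (same fill as above)
[(i,j) in adom] A[i, j] = i*j;

var rowsums: [indices] int;

// Fully serial: plain for loops accumulate each row without creating tasks
for i in indices {
  var total = 0;                 // illustrative accumulator for row i
  for j in indices do
    total += A[i, j];
  rowsums[i] = total;
}

writeln(rowsums);                // 36 72 108 144 180 216 252 288
```

Whether this beats the parallel version for a given `n` depends on the machine and compiler settings, so it is worth timing both.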

edited Aug 17 '17 at 19:13

answered Aug 16 '17 at 23:12


Can you explain this line? `[(i,j) in adom] A[i, j] = i*j;` And what if indices is really big? What then? – Brian Dolan Aug 16 '17 at 23:13
There's a [SO post](https://stackoverflow.com/questions/43728540/is-var-in-distributed-variable-equivalent-to-forall) for that! It's short-hand for `forall (i,j) in adom do A[i, j] = i*j;` – ben-albrecht Aug 16 '17 at 23:14
I'll mention that there's a longstanding intention to add "partial reductions" to the language (as in prior languages like ZPL) which have the effect of reducing a subset of a rectangular domain's dimensions. This is likely to result both in a more succinct description of the computation as well as a more efficient implementation, particularly for sparse or distributed arrays. I believe the current proposed syntax is something like `+ reduce (resultShape=[1..n, 1]) A` – Brad Aug 16 '17 at 23:20
For large arrays, you'd definitely want the `forall` instead of `for`. If the array needed to be distributed over nodes, we could distribute `A` and `rowsums` with the `Block` distribution (row-wise). – ben-albrecht Aug 16 '17 at 23:21
It would be great to benchmark the distributed-mode test as well. Btw, have you thought about a Cray-hosted replica of a hosted [Chapel] IDE? The IDE environment can be freely downloaded and deployed on such a "home-base" kind of infrastructure. The administrative restrictions could then be removed, which would provide a **distributed-computing** infrastructure without the 60-second limit after which a tested piece of code gets administratively `kill`-ed. What does the Cray Chapel team think about moving the REPL-like IDE efforts onto a fully working system this way? – user3666197 Aug 17 '17 at 16:21
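The row-wise `Block` distribution of `A` and `rowsums` mentioned in the comments above might look roughly like the sketch below. It rests on several assumptions: it uses the Chapel 2.x spelling `blockDist` with its `createDomain` factory (2017-era releases spelled the distribution `Block` and used `dmapped`), and the names `localeGrid` and `rdom` are made up for illustration. Treat it as a starting point to check against the documentation for your Chapel version, not as the answerer's implementation.

```chapel
use BlockDist;

config const n = 8;

// Assumption: view the locales as a numLocales x 1 grid so that
// the 2D domain is split across locales by rows only.
const localeGrid = reshape(Locales, {0..#numLocales, 0..0});

// Assumption: Chapel 2.x factory methods on blockDist.
const adom = blockDist.createDomain({1..n, 1..n}, targetLocales=localeGrid);
const rdom = blockDist.createDomain({1..n});

var A: [adom] int;
var rowsums: [rdom] int;

// Populate A; each locale fills the rows it owns
forall (i, j) in adom do
  A[i, j] = i*j;

// Each iteration runs on the locale that owns rowsums[i]
forall i in rdom do
  rowsums[i] = + reduce A[i, ..];

writeln(rowsums);
```

With a single locale this behaves like the non-distributed version; the row-wise layout only matters once the program runs on several locales.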

user3666197 · Answer 2 · 2017-08-17T06:12:03.583

A few hints to make the code actually run live in both `SEQ` and `PAR` mode:

Besides a few implementation details, @bencray's assumption stated above, that the overhead costs of a PAR setup may favor purely serial processing in a SEQ setup, was not experimentally confirmed. It is also fair to note that a distributed mode was not tested on the live <TiO>-IDE, for obvious reasons; a small-scale (if not tiny-scale) distributed implementation would be more of an oxymoron than a scientifically meaningful experiment.

Facts matter

Even at the smallest possible scale of 2x2, the rowsums[] processing in SEQ mode (8973 [us]) was slower than the same processing for 256x256 in PAR mode (8692 [us]).

Good job, Chapel team, these are indeed cool results on harnessing the compact silicon resources to the max in PAR!

For the exact run-time figures, see the self-documented tables below, or do not hesitate to visit the live IDE run (referenced above) and experiment on your own.

Readers may also recognise extrinsic noise in the small-scale experiments, as O/S- and hosted-IDE-related processes compete for resources and influence the <SECTION-UNDER-TEST> run-time performance through adverse CPU / Lx-CACHE / memIO / process conflicts. This excludes these measurements from being used for generalised interpretations.

Hope all will enjoy Chapel's lovely `[TIME]` results, demonstrated across the growing `[EXPSPACE]`-scaled computing landscapes

/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ use Time;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_SEQ: Timer;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_PAR: Timer;

// const max_idx = 123456;                   // seems too large for <TiO>-IDE to allocate:              /wrappers/chapel: line 6: 24467 Killed
const max_idx =      4096;
// const max_idx =   8192;                   // seems too long for <TiO>-IDE to run the [SEQ] part:     The request exceeded the 60 second time limit and was terminated
// const max_idx =  16384;                   // seems too long for <TiO>-IDE to run the [PAR] part too: /wrappers/chapel: line 6: 12043 Killed
const indices = 1..max_idx;

const   adom  = {indices, indices};
var A: [adom] int;

[(i,j) in adom] A[i, j] = i*j;               // Populate A[,]

var rowsums: [indices] int;

/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.start();
for       i in indices {                     // SECTION-UNDER-TEST--
  rowsums[i] = + reduce(A[i, ..]);           // SECTION-UNDER-TEST--
}                                            // SECTION-UNDER-TEST--
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.stop();

/* 
                                               <SECTION-UNDER-TEST> took     8973 [us] to run in [SEQ] mode for    2 elements on <TiO>-IDE
                                               <SECTION-UNDER-TEST> took    28611 [us] to run in [SEQ] mode for    4 elements on <TiO>-IDE
                                               <SECTION-UNDER-TEST> took    58824 [us] to run in [SEQ] mode for    8 elements on <TiO>-IDE
                                               <SECTION-UNDER-TEST> took   486786 [us] to run in [SEQ] mode for   64 elements on <TiO>-IDE
                                               <SECTION-UNDER-TEST> took  1019990 [us] to run in [SEQ] mode for  128 elements on <TiO>-IDE
                                               <SECTION-UNDER-TEST> took  2010680 [us] to run in [SEQ] mode for  256 elements on <TiO>-IDE
                                               <SECTION-UNDER-TEST> took  4154970 [us] to run in [SEQ] mode for  512 elements on <TiO>-IDE
                                               <SECTION-UNDER-TEST> took  8260960 [us] to run in [SEQ] mode for 1024 elements on <TiO>-IDE
                                               <SECTION-UNDER-TEST> took 15853000 [us] to run in [SEQ] mode for 2048 elements on <TiO>-IDE
                                               <SECTION-UNDER-TEST> took 33126800 [us] to run in [SEQ] mode for 4096 elements on <TiO>-IDE
                                               <SECTION-UNDER-TEST> took      n/a [us] to run in [SEQ] mode for 8192 elements on <TiO>-IDE

   ============================================ */


/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_PAR.start();
forall    i in indices {                     // SECTION-UNDER-TEST--
  rowsums[i] = + reduce(A[i, ..]);           // SECTION-UNDER-TEST--
}                                            // SECTION-UNDER-TEST--
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_PAR.stop();
/*
                                               <SECTION-UNDER-TEST> took  12131 [us] to run in [PAR] mode for    2 elements on <TiO>-IDE
                                               <SECTION-UNDER-TEST> took   8095 [us] to run in [PAR] mode for    4 elements on <TiO>-IDE
                                               <SECTION-UNDER-TEST> took   8023 [us] to run in [PAR] mode for    8 elements on <TiO>-IDE
                                               <SECTION-UNDER-TEST> took   8156 [us] to run in [PAR] mode for   64 elements on <TiO>-IDE
                                               <SECTION-UNDER-TEST> took   7990 [us] to run in [PAR] mode for  128 elements on <TiO>-IDE
                                               <SECTION-UNDER-TEST> took   8692 [us] to run in [PAR] mode for  256 elements on <TiO>-IDE
                                               <SECTION-UNDER-TEST> took  15134 [us] to run in [PAR] mode for  512 elements on <TiO>-IDE
                                               <SECTION-UNDER-TEST> took  16926 [us] to run in [PAR] mode for 1024 elements on <TiO>-IDE
                                               <SECTION-UNDER-TEST> took  30671 [us] to run in [PAR] mode for 2048 elements on <TiO>-IDE
                                               <SECTION-UNDER-TEST> took 105323 [us] to run in [PAR] mode for 4096 elements on <TiO>-IDE
                                               <SECTION-UNDER-TEST> took 292232 [us] to run in [PAR] mode for 8192 elements on <TiO>-IDE

   ============================================ */



writeln( rowsums,
        "\n <SECTION-UNDER-TEST> took ", aStopWATCH_SEQ.elapsed( Time.TimeUnits.microseconds ), " [us] to run in [SEQ] mode for ", max_idx, " elements on <TiO>-IDE",
        "\n <SECTION-UNDER-TEST> took ", aStopWATCH_PAR.elapsed( Time.TimeUnits.microseconds ), " [us] to run in [PAR] mode for ", max_idx, " elements on <TiO>-IDE"
         );
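Given the run-to-run noise noted above, one common way to get steadier numbers is to repeat the section under test and keep the fastest run. The sketch below is only an illustration of that idea, not the measurement setup used for the tables: it assumes a recent Chapel release in which the `Time` module provides `stopwatch` (the successor to `Timer`), and the `trials` and `best` names and the default sizes are hypothetical.

```chapel
use Time;

config const n = 512;        // hypothetical problem size
config const trials = 5;     // hypothetical number of repetitions

const indices = 1..n;
const adom = {indices, indices};
var A: [adom] int;
[(i,j) in adom] A[i, j] = i*j;

var rowsums: [indices] int;
var best = -1.0;             // fastest run so far, in seconds (-1.0 = none yet)
var t: stopwatch;

for trial in 1..trials {
  t.clear();                 // reset accumulated time before each run
  t.start();
  forall i in indices do
    rowsums[i] = + reduce A[i, ..];
  t.stop();

  const e = t.elapsed();     // elapsed time in seconds
  if best < 0.0 || e < best then best = e;
}

writeln("best of ", trials, " runs: ", best * 1e6, " [us] for n = ", n);
```

Timings are usually only meaningful when the program is compiled with `chpl --fast`; without it, run-time checks remain enabled and can dominate the measurement.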

This is what makes Chapel so great

Thanks for developing and improving such a great computing tool for HPC.

You have pieced together quite a few bits across SO; you should blog these results somewhere and be the first voice of the Chapel community. – Brian Dolan, Aug 17 '17 at 16:40
Thanks for the compliment, Brian. IMHO, Brad & his Cray-[Chapel]-Team have done an immense amount of work over roughly the past decade and a half to get us to the point where we see [Chapel] and its performance today. My hat is raised to that accumulated work & the unique HPC expertise created. **Keep Walking!** – user3666197, Aug 17 '17 at 16:53
