
I am currently working on a project in which a large dataset has to be created using HDF5. The naive implementation is all nice and dandy, but very slow. The slow part is the calculation (about 10x slower than the write), which I cannot speed up any further, but maybe parallelization is possible.

I guess I could use a simple `#pragma omp parallel for`, but the `dataspace.write(..)` method should be sequential for speed reasons (maybe it doesn't matter). See the diagram below for an example.

It should be noted that, because of the dimensionality, the write function uses a chunked layout with chunks of the same size as the buffer (in reality around 1 MB).
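For reference, the chunked layout is fixed when the dataset is created, via the dataset creation property list. Below is a minimal sketch using the HDF5 C++ API; the file name, dataset name, and extents are made up for illustration:

#include "H5Cpp.h"

int main() {
  // 6-D dataset whose chunks match the size of the write buffer
  hsize_t dims[6] = {24, 24, 24, 24, 24, 24}; // dataset extents (made up)
  hsize_t chunk[6] = {4, 4, 4, 4, 4, 4};      // == buffer extents

  H5::H5File file("out.h5", H5F_ACC_TRUNC);
  H5::DataSpace space(6, dims);

  H5::DSetCreatPropList plist;
  plist.setChunk(6, chunk); // chunk size == buffer size, as described above

  H5::DataSet dset =
      file.createDataSet("data", H5::PredType::NATIVE_FLOAT, space, plist);
  return 0;
}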

/*
------------NAIVE IMPLEMENTATION-----------------
|T:<calc0><W0><calc1><W1><calc2><W2>............|
|-----------------------------------------------|
|----------PARALLEL IMPLEMENTATION--------------|
|-----------------------------------------------|
|T0:<calc0----><W0><calc4>.....<W4>.............|
|T1:<calc1---->....<W1><calc5->....<W5>.........|
|T2:<calc2--->.........<W2><calc6-->...<W6>.....|
|T3:<calc3----->...........<W3><calc7-->...<W7>.|
------------DIFFERENT IMPLEMENTATION-------------
i.e.: queue size = 4
T0:.......<W0><W1><W2><W3><W4><W5><W6>..........|
T1:<calc0><calc3>.....<calc6>...................|
T2:<calc1>....<calc4>.....<calc7>...............|
T3:<calc2>........<calc5>.....<calc8>...........|


T          Thread
<calcn---> Calculation time
<Wn>       Write data n. Order *important*
.          Waiting
*/

Code example:

#include <chrono>
#include <cmath>
#include <iostream>
#include <memory>

double calculate(float *buf, const struct options *opts) {
  // dummy function just to get a time reference
  double res = 0;
  for (size_t i = 0; i < 10000; i++)
    res += std::sin(i);
  return 1 / (1 + res);
}

struct options {
  size_t idx[6];
};

class Dataspace {
public:
  void selectHyperslab(){}; // selects region in disk space
  void write(float *buf){}; // write buf to selected disk space
};

int main() {
  size_t N = 6;
  size_t dims[6] = {4 * N, 4 * N, 4 * N, 4 * N, 4 * N, 4 * N},
         buf_offs[6] = {4, 4, 4, 4, 4, 4};
  // dims: size of each dimension, multiple of 4
  // buf_offs: size of buffer in each dimension

  // Calculate the buffer size and allocate.
  // The buffer is usually around 1 MB in size
  // and holds a compound datatype, not plain floats.
  size_t buf_size = 1; // the loop below multiplies in every dimension once
  for (auto off : buf_offs)
    buf_size *= off;
  std::unique_ptr<float[]> buf{new float[buf_size]};

  struct options opts;        // options parameters, passed to calculation fun
  struct Dataspace dataspace; // dummy Dataspace. Supplied by HDF5

  size_t i = 0;
  size_t idx0, idx1, idx2, idx3, idx4, idx5;
  auto t_start = std::chrono::high_resolution_clock::now();
  std::cout << "[START]" << std::endl;
  for (idx0 = 0; idx0 < dims[0]; idx0 += buf_offs[0])
    for (idx1 = 0; idx1 < dims[1]; idx1 += buf_offs[1])
      for (idx2 = 0; idx2 < dims[2]; idx2 += buf_offs[2])
        for (idx3 = 0; idx3 < dims[3]; idx3 += buf_offs[3])
          for (idx4 = 0; idx4 < dims[4]; idx4 += buf_offs[4])
            for (idx5 = 0; idx5 < dims[5]; idx5 += buf_offs[5]) {
              i++;
              opts.idx[0] = idx0;
              opts.idx[1] = idx1;
              opts.idx[2] = idx2;
              opts.idx[3] = idx3;
              opts.idx[4] = idx4;
              opts.idx[5] = idx5;

              dataspace.selectHyperslab(/**/); // function from HDF5
              calculate(buf.get(), &opts);     // populate buf with data
              dataspace.write(buf.get());      // has to be sequential
            }
  std::cout << "[DONE] " << i << " calls" << std::endl;
  std::chrono::duration<double> diff =
      std::chrono::high_resolution_clock::now() - t_start;
  std::cout << "Time: " << diff.count() << std::endl;
  return 0;
}

The code should work right out of the box.

I already took a quick look at OpenMP, but I can't wrap my head around it yet. Can anyone give me a hint or a working example? I am not good with parallelization, but wouldn't a writer thread with a buffer queue work? Or is OpenMP overkill anyway, and would pthreads suffice? Any help is kindly appreciated,

cheers

Jonas K
  • I suspect your example is not representative enough to give a good answer. Does each loop iteration work on different parts of `buf` (as `opts.idx` indicates)? Do you write the entire contents of `buf` in each iteration, or only the specific non-overlapping part that was modified in this loop iteration? – Zulan May 29 '17 at 16:26
  • yeah, I tried to keep it minimized. But yes, during iteration each calculation completely fills up the buffer. It then gets passed to the write call thereby writing the full contents of _buf_ to disk at the specified location (uses chunked layout in dataspace for slicing) – Jonas K May 29 '17 at 16:36

1 Answer


Your first parallel implementation idea is by far the simplest to implement. A queue with a dedicated I/O thread might perform better, but it is significantly more difficult to implement using OpenMP (a plain-threads sketch of that idea follows at the end of this answer).

Below is a simple example of how a parallel version could look. The most important aspects are:

  1. Shared data: Make sure that there is no race condition on any data that is shared among threads. For example, each thread must have its own `buf` and `opts`, since they are clearly modified concurrently with no restriction. The simplest way is to define those variables locally within the parallel region. The loop variables `idxn`, at least for the inner loops, and `i` must also be defined locally. You cannot compute `i` by incrementing a shared counter like you did; that would create a dependency between loop iterations and prevent parallelization.
  2. Apply `#pragma omp for` worksharing to the loop. Because each dimension has only a small number of iterations, it is advisable to apply `collapse`, which distributes the work of multiple nested loops. The optimal value for `collapse` exposes enough parallel work for the number of threads available to your program without creating too much overhead or hindering single-thread optimization of the inner loops. You might want to try different values.
  3. Protect writing the data with a critical section, which only one thread at a time may enter. This is most likely necessary for correctness (depending on how it is implemented in HDF5). Since `selectHyperslab` apparently controls how `write` operates, it must be inside the same critical section.

Put together, it could look like this:

#pragma omp parallel
{
  // define EVERYTHING that is modified locally to each thread!
  std::unique_ptr<float[]> buf{new float[buf_size]};
  struct options opts;
  // Try different values for collapse if performance is not satisfactory
  #pragma omp for collapse(3)
  for (size_t idx0 = 0; idx0 < dims[0]; idx0 += buf_offs[0])
    for (size_t idx1 = 0; idx1 < dims[1]; idx1 += buf_offs[1])
      for (size_t idx2 = 0; idx2 < dims[2]; idx2 += buf_offs[2])
        for (size_t idx3 = 0; idx3 < dims[3]; idx3 += buf_offs[3])
          for (size_t idx4 = 0; idx4 < dims[4]; idx4 += buf_offs[4])
            for (size_t idx5 = 0; idx5 < dims[5]; idx5 += buf_offs[5]) {
              // derive i from the loop indices instead of a shared counter
              size_t i = idx5 + idx4 * dims[5] + ...;
              opts.idx[0] = idx0;
              opts.idx[1] = idx1;
              opts.idx[2] = idx2;
              opts.idx[3] = idx3;
              opts.idx[4] = idx4;
              opts.idx[5] = idx5;

              calculate(buf.get(), &opts);     // populate buf with data
              #pragma omp critical
              {
                // I assume that this function selects where/how the data
                // will be written, so you *must* protect it as well.
                // Only one thread at a time can do this.
                dataspace.selectHyperslab(/**/); // function from HDF5
                dataspace.write(buf.get());      // has to be sequential
              }
            }
}
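
For completeness: the dedicated writer thread from the "DIFFERENT IMPLEMENTATION" diagram is awkward to express in OpenMP, but it is straightforward with plain C++11 threads. Below is a minimal sketch of that idea; `compute_chunk` and `select_and_write` are hypothetical stand-ins for your `calculate` and the `selectHyperslab`/`write` pair, and an unbounded reorder map replaces the fixed-size queue for simplicity.

#include <condition_variable>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the calculation and HDF5 calls in the question.
void compute_chunk(size_t chunk_idx, float *buf) { /* expensive work */ }
void select_and_write(size_t chunk_idx, const float *buf) { /* HDF5 I/O */ }

int main() {
  const size_t n_chunks = 64, buf_size = 4096, n_workers = 4;

  std::mutex m;
  std::condition_variable cv;
  // Completed chunks, keyed by index so the writer can keep the order.
  std::map<size_t, std::unique_ptr<float[]>> done;
  size_t next_to_compute = 0;

  auto worker = [&] {
    for (;;) {
      size_t idx;
      {
        std::lock_guard<std::mutex> lock(m);
        if (next_to_compute >= n_chunks) return; // no work left
        idx = next_to_compute++;
      }
      auto buf = std::unique_ptr<float[]>(new float[buf_size]);
      compute_chunk(idx, buf.get()); // runs in parallel across workers
      {
        std::lock_guard<std::mutex> lock(m);
        done[idx] = std::move(buf);  // hand the finished chunk to the writer
      }
      cv.notify_one();
    }
  };

  std::thread writer([&] {
    for (size_t next = 0; next < n_chunks; ++next) {
      std::unique_lock<std::mutex> lock(m);
      cv.wait(lock, [&] { return done.count(next) > 0; });
      auto buf = std::move(done[next]);
      done.erase(next);
      lock.unlock();
      select_and_write(next, buf.get()); // sequential, strictly in order
    }
  });

  std::vector<std::thread> pool;
  for (size_t t = 0; t < n_workers; ++t) pool.emplace_back(worker);
  for (auto &t : pool) t.join();
  writer.join();
  return 0;
}

The map keeps completed chunks keyed by their index so the single writer can emit them strictly in order; bounding its size (your queue size of 4) would simply mean making workers wait while the map is full.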
Zulan
  • Cool, thank you, I can work with that! I totally forgot to protect the chunk selection/write. Just as a follow-up: this does not work in an ordered manner, though, right? – Jonas K May 29 '17 at 21:36
  • This is not ordered; you can, however, use `omp for ordered` / `omp ordered` to enforce that. (https://stackoverflow.com/questions/13224155/how-does-the-omp-ordered-clause-work) – Zulan May 29 '17 at 22:05
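
To make that last comment concrete, here is a minimal sketch of the `ordered` variant, reusing the variables from the snippets above. The six loops are flattened into a single loop so that the `ordered` clause applies to one canonical loop:

// one flat loop over all chunks so that "ordered" sees a single loop
size_t n[6], total = 1;
for (int d = 0; d < 6; d++) {
  n[d] = dims[d] / buf_offs[d]; // number of chunks in each dimension
  total *= n[d];
}

#pragma omp parallel
{
  std::unique_ptr<float[]> buf{new float[buf_size]};
  struct options opts;
  #pragma omp for ordered
  for (size_t k = 0; k < total; k++) {
    // unflatten k back into the six chunk indices (idx5 fastest, as before)
    size_t rem = k;
    for (int d = 5; d >= 0; d--) {
      opts.idx[d] = (rem % n[d]) * buf_offs[d];
      rem /= n[d];
    }
    calculate(buf.get(), &opts);       // still runs in parallel
    #pragma omp ordered
    {
      // executed by one thread at a time, in loop-iteration order
      dataspace.selectHyperslab(/**/);
      dataspace.write(buf.get());
    }
  }
}

Note that `ordered` forces each thread to wait for its turn to write, so when calculation times vary it can cost noticeably more than the unordered `critical` version.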