0

Suppose I have some tasks (Monte Carlo simulations) that I want to run in parallel. I want to complete a given number of tasks, but the tasks take different amounts of time, so it is not easy to divide the work evenly over the threads. Also, I need the results of all simulations in a single vector (or array) at the end.

So I came up with the approach below:

 #include <future>
 #include <vector>

 int Max{1000000};
 // SimResult is some struct with a well-defined default value.
 std::vector<SimResult> vec(/*length*/ Max); // initialized with default values of SimResult
 int LastAdded{0};

 void fill(int RandSeed)
 {
      Simulator sim{RandSeed};
      while (LastAdded < Max)
      {
           // Do some work to bring the simulator to the desired state.
           // The duration of this work is subject to randomness.
           vec[LastAdded++] = sim.GetResult(); // Produces a SimResult.
      }
 }

 int main()
 {
      // launch a bunch of std::async tasks that start filling vec
      auto fut1 = std::async(fill, 1);
      auto fut2 = std::async(fill, 2);
      // maybe some more tasks

      fut1.get();
      fut2.get();
      // do something with the results in vec
 }

The above code will give race conditions, I guess. I am looking for a performant approach to avoid that. Requirements: no race conditions (the entire array gets filled, with no skipped slots); the final result ends up directly in the array; good performance.

Reading up on various approaches, it seems std::atomic is a good candidate, but I am not sure which settings will be most performant in my case. And I am not even sure whether an atomic will cut it; maybe a mutex guarding LastAdded is needed?

  • You should implement a helper function which can be called by the threads to lock the vector and ensure that threads cannot modify the same location at the same time. – SPlatten Mar 12 '20 at 10:39
  • 1
    Generally you let each thread modify only a limited range of the array, and make sure that the ranges don't overlap for the threads. – Some programmer dude Mar 12 '20 at 10:40
  • It probably doesn't help, but std::generate can also be executed in parallel in C++17. The downside is that you have no control over the threads or thread pool used (since that is external). – gast128 Mar 12 '20 at 11:03
  • Otherwise when the tasks are long I would suggest that threads pick up the free spots themselves when they have completed the previous task. (thx stackoverflow for the 5 minute edit limit) – gast128 Mar 12 '20 at 11:11
  • Intel's *"Threading Building Blocks"* offers a concurrent vector... https://software.intel.com/en-us/node/506079 – Mark Setchell Mar 12 '20 at 11:36

2 Answers

3

One thing I would say is that you need to be very careful with the standard library random number functions. If your 'Simulator' class creates an instance of a generator, you should not run Monte Carlo simulations in parallel using the same object, because you will likely get repeated patterns of random numbers between the runs, which will give you inaccurate results.

The best practice in this area would be to create N Simulator objects with the same properties and give each one a different random seed. Then you can farm these objects out over multiple threads using OpenMP, which is a common parallel programming model for scientific software development.

std::vector<SimResult> generateResults(size_t N_runs, int seed)
{
    std::vector<SimResult> results(N_runs);

    #pragma omp parallel for
    for (long long i = 0; i < static_cast<long long>(N_runs); i++)
    {
        Simulator sim(seed + static_cast<int>(i)); // one generator per run, each with its own seed
        results[i] = sim.GetResult();
    }
    return results;
}
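For context, a minimal usage sketch (the run count and seed are illustrative values, not from the answer; with GCC or Clang the pragma only takes effect if you compile with -fopenmp):

    int main()
    {
        // 1,000,000 independent runs with base seed 42 (example values)
        std::vector<SimResult> results = generateResults(1000000, 42);
        // ... analyse results ...
    }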

Edit: With OpenMP you can choose different scheduling models, which allow you, for example, to split the work dynamically between the threads. You can do this with:

#pragma omp parallel for schedule(dynamic, 16)

which would give each thread chunks of 16 items to work on at a time.
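For reference, a sketch of the complete function with that clause applied (same assumptions about Simulator and SimResult as above; the chunk size of 16 is just an example value to tune):

    #include <vector>

    std::vector<SimResult> generateResults(size_t N_runs, int seed)
    {
        std::vector<SimResult> results(N_runs);

        // Threads grab chunks of 16 iterations at a time, so slow and fast
        // simulations balance out without any explicit locking.
        #pragma omp parallel for schedule(dynamic, 16)
        for (long long i = 0; i < static_cast<long long>(N_runs); i++)
        {
            Simulator sim(seed + static_cast<int>(i));
            results[i] = sim.GetResult();
        }
        return results;
    }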

  • Thanks for the suggestion to be careful with the random generators. Regarding OMP parallel: behind the scenes, wouldn't that divide the list into a fixed number of subsets? Then I would have the same problem as with the answer of churill: some threads might be finished, but others are still going. – willem Mar 12 '20 at 10:59
  • Sorry, I subsequently saw your second comment - I've just added more information about scheduling. – Ryan Pepper Mar 12 '20 at 11:00
  • Ok, that would work. Basically, schedule(dynamic,1) is an automated approach for what I want to accomplish above. – willem Mar 12 '20 at 11:27
  • Yes, exactly. OpenMP is very powerful; you can also spawn tasks to a pool of threads, but for your use case this is not really necessary. It's also built in to pretty much every modern compiler, with the exception of Apple's version of Clang on Macs. – Ryan Pepper Mar 12 '20 at 13:36
  • @willem: OpenMP's `schedule(dynamic)` seems to address the problem you have, of different iterations maybe costing different time. http://jakascorner.com/blog/2016/06/omp-for-scheduling.html came up in a google search for openmp schedule dynamic, and looks good. You definitely want to avoid fine-grained locking or having different threads write their results into alternating array elements, though; that will lead to false sharing of cache lines. What you want is each thread working mostly in its own region of the array, and only near the end have other threads jump in nearby to help. – Peter Cordes Mar 12 '20 at 20:07
2

Since you already know how many elements you are going to work with and never change the size of the vector, the easiest solution is to let each thread work on its own part of the vector (see the Simple Version at the end).

Update

To accommodate vastly varying calculation times, you can keep your current code but avoid the race condition with a std::lock_guard. You will need a std::mutex that is shared by all threads, for example a global variable, or you can pass a reference to the mutex to each thread.

void fill(int RandSeed, std::mutex &nextItemMutex)
{ 
      Simulator sim{RandSeed};
      size_t workingIndex;
      while(true)
      {
          {
               // enter critical area
               std::lock_guard<std::mutex> nextItemLock(nextItemMutex);

               // Acquire next item
               if(LastAdded < Max)
               {
                   workingIndex = LastAdded;
                   LastAdded++;
               } 
               else 
               {
                   break;
               }
               // lock is released when nextItemLock goes out of scope
          }

           // Do some work to bring the simulator to the desired state.
           // The duration of this work is subject to randomness.
           vec[workingIndex] = sim.GetResult(); // Produces a SimResult.
      }
 }

The problem with this is that synchronisation is quite expensive. But it is probably cheap compared to the simulations you run, so it shouldn't be too bad.

Version 2:

To reduce the amount of synchronisation required, you can acquire whole blocks of indices to work on instead of single items:

void fill(int RandSeed, std::mutex &nextItemMutex, size_t blockSize)
{ 
      Simulator sim{RandSeed};
      size_t workingIndex;
      while(true)
      {
          {
               std::lock_guard<std::mutex> nextItemLock(nextItemMutex);

               if(LastAdded < Max)
               {
                   workingIndex = LastAdded;
                   LastAdded += blockSize;
               } 
               else 
               {
                   break;
               }
          }
          
          // Work through the claimed block, but never run past the end of vec.
          for(size_t i = workingIndex; i < workingIndex + blockSize && i < static_cast<size_t>(Max); i++)
              vec[i] = sim.GetResult(); // Produces a SimResult.
      }
 }
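A brief sketch of how this block version might be launched (not part of the original answer; note that std::async copies its arguments, so the shared mutex must be wrapped in std::ref):

    #include <future>
    #include <mutex>

    int main()
    {
        std::mutex nextItemMutex;

        // Each task gets its own seed, a reference to the shared mutex,
        // and a block size (64 is just an illustrative value).
        auto fut1 = std::async(std::launch::async, fill, 1, std::ref(nextItemMutex), size_t{64});
        auto fut2 = std::async(std::launch::async, fill, 2, std::ref(nextItemMutex), size_t{64});

        fut1.get();
        fut2.get();
        // results are now in vec
    }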

Simple Version

void fill(int RandSeed, size_t partitionStart, size_t partitionEnd)
{
     Simulator sim{RandSeed};
     for(size_t i = partitionStart; i < partitionEnd; i++)
     {
          // Do some work to bring the simulator to the desired state.
          // The duration of this work is subject to randomness.
          vec[i] = sim.GetResult(); // Produces a SimResult.
     }
}

int main()
{
    // launch a bunch of std::async tasks, each covering its own partition of vec
    auto fut1 = std::async(fill, 1, 0, Max / 2);
    auto fut2 = std::async(fill, 2, Max / 2, Max);

    // ...
}
  • Thanks for this suggestion. Problem is, like I said, that "tasks take a different amount of time that is hard to predict, so it is not easy to divide the work evenly over the threads." If I do it like this, then likely there will be a lot of waiting for the last thread to finish. – willem Mar 12 '20 at 10:52
  • @willem Sorry, I forgot about that and adjusted my answer. – Lukas-T Mar 12 '20 at 11:25
  • Thanks. The mutex approach would work, but, like you, I worry about performance. (It indeed depends on how much time each simulation costs; in my case, this is actually not so much.) To avoid a performance hit, I looked into std::atomic, but I couldn't figure out how to do it. – willem Mar 12 '20 at 11:31
  • 1
    You could modify this approach to acquire a whole block of indices, instead of only one. For example increase `LastAdded` by up to 10 each time, then work on those 10 items. So you require 10 times less synchronisation. – Lukas-T Mar 12 '20 at 11:39
  • 1
    All threads contending for a mutex lock/unlock on *every* iteration sounds horrible. And then they store to adjacent array elements, leading to false sharing of that cache line as well? Yuck. Each simulation iteration would have to be very expensive to amortize that into being negligible, especially if you have lots of threads (like a big Xeon or EPYC with 32 or 64 logical cores). – Peter Cordes Mar 12 '20 at 20:13
  • @PeterCordes Sorry, but I don't get what you want to say. I mentioned the problem of frequent synchronisation and that's why I proposed version 2. – Lukas-T Mar 12 '20 at 20:24
  • 2
    Oh I see. That got buried in the middle of your answer. If that's the version you really recommend, put a larger title on it, smaller headings on the others, and mention it up front. Maybe don't include the naive version at all. It does still have threads writing to alternating blocks, though, which isn't ideal unless they're 128-byte aligned chunks of 128 bytes. (L2 prefetchers try to complete a pair of 64-byte lines on Intel CPUs.) Or if the block size is large enough, the false sharing between writes from different cores to the same cache line at block boundaries will be a minor factor. – Peter Cordes Mar 12 '20 at 20:35
  • Yes, I thought about putting the second version on top, but since it builds on the first one and, I think, is not better in every case, I'll leave it there. But thanks anyway, the aspects of low-level optimization for better caching sound quite interesting. – Lukas-T Mar 13 '20 at 10:07
  • @PeterCordes thanks for your comments. A lot of the things you say worry me now: I am actually performing these simulations on a big Xeon processor. Churill: I like Version 2 a lot, also because of the reasons Peter mentions. I think I will implement and test that one, and see how it compares to the simple version. Question: What about implementing LastAdded as Atomic? Is it possible or am I misunderstanding the possibilities of atomics? Other question to Peter: Where can I read more on L2 prefetchers and false sharing of cache lines? – willem Mar 13 '20 at 12:07
  • 2
    @willem: yes, instead of a mutex I'd use `.fetch_add(256)` or larger on an atomic position counter. (If I was going to use this interleaving method at all.) It's still a full memory barrier and an almost guaranteed cache miss, but we can make it cheaper and infrequent. Hyperthreading can probably hide most of the throughput penalty if you have that, assuming both logical cores don't stall at the same time. The batch size is something you can tune, and maybe reduce as you approach the end. Or leave it to OpenMP schedule(dynamic). It probably does something like this. – Peter Cordes Mar 13 '20 at 12:13
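Following up on the last comment, a minimal sketch of the atomic block-claiming variant (it assumes the question's global vec, Max and Simulator; fillAtomic and the batch size of 256 are illustrative, not code from this thread):

    #include <algorithm>
    #include <atomic>
    #include <future>

    std::atomic<size_t> nextIndex{0}; // replaces the plain int LastAdded

    void fillAtomic(int RandSeed, size_t batchSize)
    {
        Simulator sim{RandSeed};
        while (true)
        {
            // Claim a whole batch of indices with a single atomic increment.
            // Relaxed ordering suffices: index uniqueness comes from the
            // atomicity of fetch_add, and the results are published to the
            // launching thread by future::get().
            size_t start = nextIndex.fetch_add(batchSize, std::memory_order_relaxed);
            if (start >= static_cast<size_t>(Max))
                break;
            size_t end = std::min(start + batchSize, static_cast<size_t>(Max));
            for (size_t i = start; i < end; ++i)
                vec[i] = sim.GetResult();
        }
    }

    int main()
    {
        auto fut1 = std::async(std::launch::async, fillAtomic, 1, size_t{256});
        auto fut2 = std::async(std::launch::async, fillAtomic, 2, size_t{256});
        fut1.get();
        fut2.get();
        // all of vec is filled, each slot exactly once
    }

As the comment points out, the fetch_add is still an almost guaranteed cache miss, so a reasonably large batch size keeps it infrequent and also limits false sharing at block boundaries.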