
I have code like this:

#include <omp.h>
#include <vector>
#include <cstdint>

// CustomAllocator is an arena allocator type defined elsewhere
thread_local CustomAllocator* ts_alloc = nullptr;

struct AllocatorSetup
{
    AllocatorSetup( int threadNum )
    {
        static std::vector<CustomAllocator> vec( (size_t)omp_get_max_threads() );
        ts_alloc = &vec.at( threadNum );
    }
    ~AllocatorSetup()
    {
        ts_alloc->resetArena();
        ts_alloc = nullptr;
    }
    AllocatorSetup() = delete;
    AllocatorSetup( const AllocatorSetup& ) = delete;
    void operator=( const AllocatorSetup& ) = delete;
};

template<class E>
inline E* allocateBuffer( size_t count )
{
    return (E*)ts_alloc->allocate( count * sizeof( E ), alignof( E ) );
}

void computeThings()
{
#pragma omp parallel for
    for( int64_t i = 0; i < 100000; i++ )
    {
        AllocatorSetup allocSetup{ omp_get_thread_num() };
        float* const buffer = allocateBuffer<float>( 1024 * 256 );
        // ..some computations here
    }
}

Is it going to break when multiple threads call computeThings() concurrently?

In other words, at any given time, is the relation between omp_get_thread_num() index and native threads one-to-one or one-to-many?

Soonts

1 Answer


The documentation spells this out.

> The omp_get_thread_num routine returns the thread number, **within the current team**, of the calling thread.
>
> The binding thread set for an omp_get_thread_num region is the **current team**. The binding region for an omp_get_thread_num region is the innermost enclosing parallel region.
>
> The omp_get_thread_num routine returns the thread number of the calling thread, **within the team that is executing the parallel region** to which the routine region binds. The thread number is an integer between 0 and one less than the value returned by omp_get_num_threads, inclusive. The thread number of the master thread of the team is 0. The routine returns 0 if it is called from the sequential part of a program.

Emphasis mine.

This means that if you have multiple teams executing concurrently, the threads in each team are numbered 0 .. nbr_threads_in_team - 1. Multiple threads will therefore get the same thread number if they are in different, concurrently running teams.

While OpenMP certainly re-uses threads, if it ever actually runs two parallel sections concurrently, you will have two teams of threads running at the same time. Since a single thread can't do two things at once, the threads backing the two teams must be distinct, yet the threads in each team are numbered 0 .. nbr_threads_in_team - 1.
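
You can observe the collision directly. Here is a minimal, untested sketch (the `worker` helper, the team size, and the printout are illustrative, not part of your code) in which two host threads each open their own parallel region, so two teams run concurrently and both report thread numbers starting at 0:

#include <omp.h>
#include <cstdio>
#include <functional>
#include <thread>

void worker(int team)
{
    // Each host thread entering a parallel region gets its own team
#pragma omp parallel num_threads(2)
    {
        // omp_get_thread_num() is 0 or 1 in *both* teams,
        // even though the native threads behind them differ
        std::printf("team %d, omp thread %d, native thread hash %zu\n",
            team, omp_get_thread_num(),
            std::hash<std::thread::id>{}(std::this_thread::get_id()));
    }
}

int main()
{
    std::thread a(worker, 0), b(worker, 1);
    a.join();
    b.join();
}

Compile with OpenMP enabled (e.g. -fopenmp); the same OpenMP thread numbers appear under both team labels.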

This means that your computeThings() function is not thread-safe as written. You could instead build a thread-safe pool of CustomAllocator objects from which a thread can "lend" an object, returning it to the pool when done.

Something like this:

void computeThings() {
#pragma omp parallel
    {
        //Since we're in a parallel block, each thread
        //lends its own CustomAllocator from the pool
        CustomAllocator& theAllocator = allocatorPool.lend();

#pragma omp for
        for (int64_t i = 0; i < 100000; i++) {
            /* Do stuff with the allocator */
        }

        allocatorPool.unLend(theAllocator);
    }
}

In production code, lend() and unLend() should of course be implemented using RAII to avoid leaks in case of an exception. Note that this version acquires resources only once, before the entire parallel for loop, rather than in each iteration, so you pay for the cache misses only once per call. (It's unlikely that your data stays in L1 in between calls to computeThings unless you're calling that function itself in a loop.)

If you do need to be nice to the L1 even across multiple invocations of the function, you could associate each resource with a thread ID and preferentially return resources to the same thread that requested them previously, if the "preferred resource" is still available.

Here's an example of such a cache that tries to give each requesting thread its preferred resource instance:

//Warning, untested!
#include <cassert>
#include <memory>
#include <mutex>
#include <thread>
#include <unordered_map>

template<typename T>
class ThreadAffineCache {
    std::unordered_map<std::thread::id, std::unique_ptr<T>> m_cache = {};
    std::mutex m_mtx{};

public:
    std::unique_ptr<T> lend() {
        std::unique_lock lk{m_mtx};
        auto tid = std::this_thread::get_id();

        // Prefer the object this thread returned last time.
        // operator[] inserts an empty unique_ptr if the key is absent.
        std::unique_ptr<T> result = std::move(m_cache[tid]);
        m_cache.erase(tid);

        if (!result) {
            if (m_cache.empty()) {
                // No idle object available, create a fresh one
                result = std::make_unique<T>();
            } else {
                // Take an idle object that belonged to another thread
                auto iter = m_cache.begin();
                result = std::move(iter->second);
                m_cache.erase(iter);
            }
        }
        assert(result);
        return result;
    }

    void unLend(std::unique_ptr<T> obj) {
        assert(obj);
        std::unique_lock lk{m_mtx};
        m_cache[std::this_thread::get_id()] = std::move(obj);
    }
};

Alternatively, you could use a map indexed by the value returned by pthread_self or gettid. I'm not sure if std::thread::id returns something sensible for OpenMP worker threads. Cleaning up when threads terminate might also be a problem.
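
A minimal, untested sketch of that variant, assuming Linux with glibc 2.30+ for the gettid() wrapper (on Windows, GetCurrentThreadId() would play the same role); only the key type changes relative to the class above:

//Untested sketch: same cache, keyed by the kernel thread id
#include <cassert>
#include <memory>
#include <mutex>
#include <unistd.h>      // gettid(), Linux-specific, glibc 2.30+
#include <unordered_map>

template<typename T>
class TidAffineCache {
    std::unordered_map<pid_t, std::unique_ptr<T>> m_cache;
    std::mutex m_mtx;

public:
    std::unique_ptr<T> lend() {
        std::unique_lock lk{m_mtx};
        const pid_t tid = gettid();
        std::unique_ptr<T> result = std::move(m_cache[tid]);
        m_cache.erase(tid);
        if (!result) {
            if (m_cache.empty()) {
                result = std::make_unique<T>();
            } else {
                auto iter = m_cache.begin();
                result = std::move(iter->second);
                m_cache.erase(iter);
            }
        }
        assert(result);
        return result;
    }

    void unLend(std::unique_ptr<T> obj) {
        assert(obj);
        std::unique_lock lk{m_mtx};
        m_cache[gettid()] = std::move(obj);
    }
};

This sidesteps std::thread::id for threads not created through std::thread, but it still doesn't solve cleanup when worker threads terminate.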

If you just need an easy way out, you could also protect computeThings with a mutex.
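
A minimal sketch of that easy way out (the mutex name is illustrative): serializing the calls guarantees at most one team exists at a time, so the omp_get_thread_num() indices can no longer collide.

#include <mutex>

std::mutex g_computeThingsMutex;

void computeThingsSerialized()
{
    // Only one caller at a time opens a parallel region
    std::lock_guard<std::mutex> lock{g_computeThingsMutex};
    computeThings();
}

You lose the concurrency between callers, of course, but the per-team parallelism inside computeThings() is untouched.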

Jonathan S.
  • “The documentation spells this out” The documentation says nothing on how openmp threads are mapped to native ones. Just because there’re two OMP teams doesn’t mean there’re 2 sets of native threads. I’d expect OpenMP to reuse native threads across teams, threads are relatively expensive. – Soonts Jul 22 '21 at 12:14
  • “a thread-safe pool of CustomAllocator objects” I don’t want to do that because L1 and L2 caches are per CPU core. In the current code allocators are pinned to threads, and the data stays in these local caches of the cores. With a pool, the mapping becomes random. – Soonts Jul 22 '21 at 12:14
  • I've edited my answer to make it more clear what the documentation says about this. – Jonathan S. Jul 22 '21 at 12:16
  • I've also just added an example implementation of an object cache that doesn't need a fixed number of threads and tries to give each thread its preferred object when possible. It's untested, but if you have any problems with it, let me know! – Jonathan S. Jul 22 '21 at 12:52
  • Thanks, it makes sense. As for the workaround, I’m on Windows and fairly confident in the `GetCurrentThreadId()` function. Very cheap too, a single read from the thread information block. Replaced vector with a hash map that maps `uint32_t` keys to `CustomAllocator` values. – Soonts Jul 22 '21 at 12:58
  • About the cleanup, I don’t use a function-static variable in production, I have a global collection variable protected with an `std::mutex`. And I have ways to figure out when all the threads did their things, the allocators are no longer needed, and the RAM should be released to the OS. – Soonts Jul 22 '21 at 12:59