
I wrote code to test the performance of OpenMP on Windows (Win7 x64, Core i7 3.4 GHz) and on Mac (macOS 10.12.3, Core i7 2.7 GHz).

  1. In Xcode I created a console application with the default compiler settings. I use LLVM 3.7 and OpenMP 5 (in omp.h I found #define KMP_VERSION_MAJOR 5, #define KMP_VERSION_MINOR 0 and KMP_VERSION_BUILD 20150701; the runtime library is libiomp5) on macOS 10.12.3 (CPU: Core i7, 2.7 GHz).
  2. For Windows I use VS2010 SP1. Additionally I set C/C++ -> Optimization -> Optimization = Maximize Speed (/O2) and C/C++ -> Optimization -> Favor Size Or Speed = Favor Fast Code (/Ot).

If I run the application in a single thread, the time difference roughly matches the ratio of the processor frequencies. But if I run 4 threads, the difference becomes dramatic: the Windows program is ~70 times faster than the Mac program.

#include <cmath>
#include <mutex>
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <omp.h>
#include <boost/chrono/chrono.hpp>

static double ActionWithNumber(double number)
{
    double sum = 0.0;
    for (std::uint32_t i = 0; i < 50; i++)
    {
        double coeff = sqrt(pow(std::abs(number), 0.1));
        double res = number*(1.0-coeff)*number*(1.0-coeff) * 3.0;
        sum += sqrt(res);
    }
    return sum;
}

static double TestOpenMP(void)
{
    const std::uint32_t len = 4000000;
    double *a;
    double *b;
    double *c;
    double sum = 0.0;

    std::mutex _mutex;
    a = new double[len];
    b = new double[len];
    c = new double[len];

    for (std::uint32_t i = 0; i < len; i++)
    {
        c[i] = 0.0;
        a[i] = sin((double)i);
        b[i] = cos((double)i);
    }
    boost::chrono::time_point<boost::chrono::system_clock> start, end;
    start = boost::chrono::system_clock::now();
    double k = 2.0;
    omp_set_num_threads(4);
#pragma omp parallel for 
    for (int i = 0; i < len; i++)
    {
        c[i] = k*a[i] + b[i] + k;
        if (c[i] > 0.0)
        {
            c[i] += ActionWithNumber(c[i]);
        }
        else
        {
            c[i] -= ActionWithNumber(c[i]);
        }
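    // the guard serializes the accumulation; it is released at the end of each loop iteration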
        std::lock_guard<std::mutex> scoped(_mutex);
        sum += c[i];
    }
    end = boost::chrono::system_clock::now();
    boost::chrono::duration<double> elapsed_time = end - start;
    double sum2 = 0.0;
    for (std::uint32_t i = 0; i < len; i++)
    {
        sum2 += c[i];
        c[i] /= sum2;
    }
    if (std::abs(sum - sum2) > 0.01) printf("Incorrect result.\n");
    delete[] a;
    delete[] b;
    delete[] c;
    return elapsed_time.count();
}

int main()
{

    double sum = 0.0;
    const std::uint32_t steps = 5;
    for (std::uint32_t i = 0; i < steps; i++)
    {
        sum += TestOpenMP();
    }
    sum /= (double)steps;
    std::cout << "Elapsed time = " <<  sum;
    return 0;
}

I specifically use a mutex here to compare the performance of OpenMP on the Mac and on Windows. On Windows the function returns a time of 0.39 seconds. On the Mac the function returns a time of 25 seconds, i.e. ~70 times slower.

What is the cause of this difference?

First of all, thanks for editing my post (I use a translator to write the text). In the real application, I update the values in a huge matrix (20000×20000) in random order. Each thread computes a new value and writes it to a particular cell. I create a mutex for each row, since in most cases different threads write to different rows. But apparently when two threads write to the same row there is a long lock. At the moment I cannot divide the rows between threads, since the order of the writes is determined by the FEM elements. A single critical section is not an option either, since it would block writes to the entire matrix.

I wrote code that mimics the real application.

static double ActionWithNumber(double number)
{
    const unsigned int steps = 5000;
    double sum = 0.0;
    for (unsigned int i = 0; i < steps; i++)
    {
        double coeff = sqrt(pow(std::abs(number), 0.1));
        double res = number*(1.0-coeff)*number*(1.0-coeff) * 3.0;
        sum += sqrt(res);
    }
    sum /= (double)steps;
    return sum;
}

static double RealAppTest(void)
{
    const unsigned int elementsNum = 10000;
    double* matrix;
    unsigned int* elements;
    boost::mutex* mutexes;

    elements = new unsigned int[elementsNum*3];
    matrix = new double[elementsNum*elementsNum];
    mutexes = new boost::mutex[elementsNum];
    for (unsigned int i = 0; i < elementsNum; i++)
        for (unsigned int j = 0; j < elementsNum; j++)
            matrix[i*elementsNum + j] = (double)(rand() % 100);
    for (unsigned int i = 0; i < elementsNum; i++) //build FEM element like Triangle
    {
        elements[3*i] = rand()%(elementsNum-1);
        elements[3*i+1] = rand()%(elementsNum-1);
        elements[3*i+2] = rand()%(elementsNum-1);
    }
    boost::chrono::time_point<boost::chrono::system_clock> start, end;
    start = boost::chrono::system_clock::now();
    omp_set_num_threads(4);
#pragma omp parallel for
    for (int i = 0; i < elementsNum; i++)
    {
        unsigned int* elems = &elements[3*i];
        for (unsigned int j = 0; j < 3; j++)
        {
            // lock the mutex for the row with index elems[j]
            boost::lock_guard<boost::mutex> lockup(mutexes[elems[j]]);
            double res = 0.0;
            for (unsigned int k = 0; k < 3; k++)
            {
                res += ActionWithNumber(matrix[elems[j]*elementsNum + elems[k]]);
            }
            for (unsigned int k = 0; k < 3; k++)
            {
                matrix[elems[j]*elementsNum + elems[k]] = res;
            }
        }
    }
    end = boost::chrono::system_clock::now();
    boost::chrono::duration<double> elapsed_time = end - start;

    delete[] elements;
    delete[] matrix;
    delete[] mutexes;
    return elapsed_time.count();
}

int main()
{
    double sum = 0.0;
    const unsigned int steps = 5;
    for (unsigned int i = 0; i < steps; i++)
    {
        sum += RealAppTest();
    }
    sum /= (double)steps;
    std::cout << "Elapsed time = " << sum;
    return 0; 
}
– D.Sedov
  • Does the program really run for 25 secs? (when you look at your watch) – Kami Kaze Feb 22 '17 at 15:49
  • Yes. I use boost::chrono::system_clock to measure the time. – D.Sedov Feb 22 '17 at 16:26
  • For such performance considerations, you have to specify more details that allow others to understand and reproduce the issue, such as compiler options and the hardware configuration of both systems, as well as a [mcve]. Also, clang 3.7 only supports OpenMP 3.1; OpenMP 5.0 is not even final yet, and VS2010 certainly does not implement it. – Zulan Feb 22 '17 at 16:53
  • I've updated the description. – D.Sedov Feb 22 '17 at 18:58
  • It seems that under the hood a lightweight mutex is used on Windows. On the other hand, your implementation of the parallel sum is very bad. Use a local accumulator for every thread and add them together at the end. – knivil Feb 22 '17 at 19:03
  • Btw, a mutex is the wrong thing to use in this case. You solve this efficiently with a reduction (a minimal sketch follows this comment thread); see e.g. http://stackoverflow.com/questions/42395568/openmp-why-does-the-number-of-comparisons-decrease/42398834#42398834 – Zulan Feb 22 '17 at 19:19
  • What about the performance gap when you use 2 threads? – simpel01 Feb 22 '17 at 19:22
  • I said that I use the mutex only for the performance test. In my real application I use a mutex from time to time. When I began porting the app from Windows to Mac I saw the performance drop. – D.Sedov Feb 22 '17 at 19:29
  • And I tried boost::mutex, and nothing changed in the result. – D.Sedov Feb 22 '17 at 19:32
  • Before blaming OpenMP, I suggest that you time the mutex performance in code that simply uses pthreads. My guess (since I haven't timed it) is that the performance there will also be horrible! – Jim Cownie Feb 23 '17 at 11:45
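
To make the local-accumulator/reduction suggestion from the comments above concrete, here is a minimal sketch of the question's timed loop rewritten with an OpenMP reduction, so that no lock is taken inside the loop at all. The names (a, b, c, k, len, sum, ActionWithNumber) come from the question's code; the rewrite itself is only an illustration of the suggested technique, not code from the thread.

#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < (int)len; i++)
{
    c[i] = k*a[i] + b[i] + k;
    if (c[i] > 0.0)
        c[i] += ActionWithNumber(c[i]);
    else
        c[i] -= ActionWithNumber(c[i]);
    // each thread accumulates into its own private copy of sum;
    // the runtime combines the private copies once, after the loop
    sum += c[i];
}

With the reduction, the cost of the synchronization primitive (the very thing that differs between the two platforms) drops out of the measurement, so this fixes the real workload rather than the platform comparison the question deliberately sets up.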

1 Answer


You're combining two different sets of threading/synchronization primitives: OpenMP, which is built into the compiler and has its own runtime system, and a manually created POSIX-style mutex via std::mutex. It's probably not surprising that there are some interoperability hiccups with some compiler/OS combinations.

My guess here is that in the slow case, the OpenMP runtime is going overboard to make sure that there are no interactions between its higher-level, ongoing OpenMP threading tasks and the manual mutex, and that doing so inside a tight loop causes the dramatic slowdown.

For mutex-like behaviour in the OpenMP framework, we can use critical sections:

#pragma omp parallel for 
for (int i = 0; i < len; i++)
{
    //...
    // replacing this: std::lock_guard<std::mutex> scoped(_mutex);
    #pragma omp critical
    sum += c[i];
}

or explicit locks:

omp_lock_t sumlock;
omp_init_lock(&sumlock);
#pragma omp parallel for 
for (int i = 0; i < len; i++)
{
    //...
    // replacing this: std::lock_guard<std::mutex> scoped(_mutex);
    omp_set_lock(&sumlock);
    sum += c[i];
    omp_unset_lock(&sumlock);
}
omp_destroy_lock(&sumlock);

We get much more reasonable timings:

$ time ./openmp-original
real    1m41.119s
user    1m15.961s
sys 1m53.919s

$ time ./openmp-critical
real    0m16.470s
user    1m2.313s
sys 0m0.599s

$ time ./openmp-locks
real    0m15.819s
user    1m0.820s
sys 0m0.276s

Updated: There's no problem with using an array of OpenMP locks in exactly the same way as the mutexes:

omp_lock_t sumlocks[elementsNum];
for (unsigned idx=0; idx<elementsNum; idx++) 
    omp_init_lock(&(sumlocks[idx]));

//...
#pragma omp parallel for
for (int i = 0; i < elementsNum; i++)
{
    unsigned int* elems = &elements[3*i];
    for (unsigned int j = 0; j < 3; j++)
    {
        // lock the per-row lock for the row with index elems[j]
        double res = 0.0;
        for (unsigned int k = 0; k < 3; k++)
        {
            res += ActionWithNumber(matrix[elems[j]*elementsNum + elems[k]]);
        }
        omp_set_lock(&(sumlocks[elems[j]]));
        for (unsigned int k = 0; k < 3; k++)
        {
            matrix[elems[j]*elementsNum + elems[k]] = res;
        }
        omp_unset_lock(&(sumlocks[elems[j]]));
    }
}
for (unsigned idx=0; idx<elementsNum; idx++) 
    omp_destroy_lock(&(sumlocks[idx]));
– Jonathan Dursi
  • I answered you in the next post. – D.Sedov Feb 23 '17 at 07:42
  • What is the point in explicit locks? I have never used them. I just use `critical`. Can you give me an example where explicit locks are the better choice? – Z boson Feb 23 '17 at 07:52
  • I described my use case at the end of the post. I don't understand how to use a critical section in this case. In the test I used critical too, but it was slower than the mutex (on the Windows machine). – D.Sedov Feb 23 '17 at 08:26
  • @Zboson The OP has just given you such an example. A critical section would serialize all updates to his array, whereas by using locks he can serialize only the accesses to individual rows. – Jim Cownie Feb 23 '17 at 11:47
  • @D.Sedov "I don't understand how use critical section in this case", so use OpenMP locks! – Jim Cownie Feb 23 '17 at 11:48
  • @JimCownie, wouldn't a reduction be more appropriate in the first place (I want a useful example)? But I don't see how the lock is any different than critical in this case(I see in the answer that the lock is a bit faster but that could be within the timing uncertainty). Each thread cannot update `sum` while another one is writing to it. There is no implicit barrier with `critical` so if one thread is very slow another thread could update `sum` and the array multiple times before the slow thread got to `sum`. What are you referring to by row? – Z boson Feb 23 '17 at 12:43
  • I added code like in the real application. The code is simplified, but it shows how I use mutexes to fill the matrix. – D.Sedov Feb 23 '17 at 12:55
  • If in the last example I replace the array of mutexes with a single mutex or a critical section, then the entire algorithm effectively runs in only one thread. – D.Sedov Feb 23 '17 at 13:02
  • @D.Sedov - it seems like the solution is to use an array of omp_locks (just as with mutexes). – Jonathan Dursi Feb 23 '17 at 13:30
  • @JimCownie, I only read the code, not the text. I see the OP discusses rows and a real example in the text. – Z boson Feb 23 '17 at 13:38
  • @Jonathan Dursi Yes, it is. I traditionally use mutexes and have only recently used OpenMP features, so out of old habit I sometimes write the old solution. But the main thing is still not understood: why does the Mac have such bad performance when using mutexes? – D.Sedov Feb 23 '17 at 14:59
  • @D.Sedov - You'd have to ask the clang developers - but there's no reason to expect that mixing threading and synchronization systems like this should ever perform well. I have no idea why it happens to work ok with VS and Windows. – Jonathan Dursi Feb 23 '17 at 15:07
  • @Jonathan Dursi I tried omp_lock on the Mac; it is still slower than the mutex on Windows, but omp_lock on the Mac is much faster than the mutex on the Mac (ratio = 20). – D.Sedov Feb 23 '17 at 17:35
  • @D.Sedov How does omp_lock on Windows compare with the mutex on Windows? Are you sure the Windows machine isn't just faster than the Mac? – Jonathan Dursi Feb 23 '17 at 17:42
  • @Jonathan Dursi On Windows, omp_lock and boost::mutex have equal performance. I wrote that the Windows machine is a Core i7 (3.4 GHz) and the Mac a Core i7 (2.7 GHz). – D.Sedov Feb 23 '17 at 19:03
  • @Jonathan Dursi On Windows there is no such (relative) performance degradation as on the Mac when using a mutex. I'm not talking about the absolute performance difference between the two systems, but about the severe drop on the Mac. – D.Sedov Feb 23 '17 at 19:06
  • @Jonathan Dursi If I use omp_lock on the Mac, there isn't the critical performance degradation that I get when using a mutex. I wrote about that above. – D.Sedov Feb 23 '17 at 19:09
  • @Jonathan Dursi I'm sorry, I still don't understand why the mutex is so slow on the Mac, but the solution with omp_lock will let me fix the performance situation in a better way. – D.Sedov Feb 23 '17 at 19:12
  • @D.Sedov Since the slowness here is with std::mutex, there is no point in discussing it with the LLVM OpenMP developers, as it has nothing to do with OpenMP. Note, too, that in OpenMP you have some tuning options for locks, so that you can hint to the runtime whether you want a fast, unfair spinlock or a slower, fair queueing lock (or, on appropriate Intel HW, to use TSX to speculate the critical region). See the description of omp_init_lock_with_hint on page 273 of the OpenMP 4.5 specification available from http://openmp.org (a minimal sketch follows at the end of this thread). – Jim Cownie Feb 24 '17 at 09:25
  • @Jim Cownie I agree, but then I have to raise the question of the slowness of std::mutex in LLVM on the Mac. It should be noted that omp_lock_* is also not a quick thing on the Mac (although it is much faster than a mutex). On Windows I don't see any difference between std::mutex and omp_lock, but on the Mac the difference is catastrophic. – D.Sedov Feb 24 '17 at 12:55
  • @D.Sedov The implementations of omp_lock_t in the LLVM OpenMP runtime are identical on Linux, Windows and Mac (modulo how they sleep under high contention); you can see them at http://openmp.llvm.org. So they should perform the same on each of the platforms. AFAICT the implementation of std::mutex comes from system libraries, not LLVM (see http://info.prelert.com/blog/cpp11-mutex-implementations for instance). Therefore all of this seems to be "the implementation of std::mutex on macOS is slower than that on Windows", which has no OpenMP or LLVM component to it. – Jim Cownie Feb 28 '17 at 09:44
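
To make the lock-hint option mentioned above concrete, here is a minimal sketch, assuming an OpenMP 4.5 runtime: omp_init_lock_with_hint and the omp_lock_hint_* constants come from the OpenMP 4.5 specification, the hint is purely advisory, and the runtime may fall back to an ordinary lock.

#include <omp.h>

omp_lock_t rowlock;
// Request a speculative lock (e.g. backed by Intel TSX where available);
// without hardware/runtime support this behaves like a plain omp_init_lock.
omp_init_lock_with_hint(&rowlock, omp_lock_hint_speculative);

omp_set_lock(&rowlock);
// ... update one row of the matrix ...
omp_unset_lock(&rowlock);

omp_destroy_lock(&rowlock);

For the per-row array in the answer above, the same call would simply replace omp_init_lock in the initialization loop.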