Why is PPL significantly slower than sequential loop and OpenMP in this case

Question

Further to my question on CodeReview, I am wondering why the PPL implementation of a simple element-wise addition of two vectors (using std::plus<int>) is so much slower than both the sequential std::transform and a for loop parallelized with OpenMP. The timings were: sequential (with vectorization) 25 ms, sequential (without vectorization) 28 ms, C++ AMP 131 ms, PPL 51 ms, OpenMP 24 ms.

I used the following code for profiling and compiled with full optimizations in Visual Studio 2013:

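#include <amp.h>        // C++ AMP: accelerator, array_view, parallel_for_each
#include <ppl.h>        // PPL: concurrency::parallel_transform
#include <algorithm>    // std::transform
#include <iostream>
#include <numeric>
#include <random>
#include <vector>       // std::vector
#include <assert.h>
#include <functional>   // std::plus
#include <chrono>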

using namespace concurrency;

const std::size_t size = 30737418;

//----------------------------------------------------------------------------
// Program entry point.
//----------------------------------------------------------------------------
int main( )
{
    accelerator default_device;
    std::wcout << "Using device : " << default_device.get_description( ) << std::endl;
    if( default_device == accelerator( accelerator::direct3d_ref ) )
        std::cout << "WARNING!! Running on very slow emulator! Only use this accelerator for debugging." << std::endl;

    std::mt19937 engine;
    std::uniform_int_distribution<int> dist( 0, 10000 );

    std::vector<int> vecTest( size );
    std::vector<int> vecTest2( size );
    std::vector<int> vecResult( size );

    for( int i = 0; i < size; ++i )
    {
        vecTest[i] = dist( engine );
        vecTest2[i] = dist( engine );
    }

    std::vector<int> vecCorrectResult( size );

    std::chrono::high_resolution_clock clock;
    auto beginTime = clock.now();

    std::transform( std::begin( vecTest ), std::end( vecTest ), std::begin( vecTest2 ), std::begin( vecCorrectResult ), std::plus<int>() );

    auto endTime = clock.now();
    auto timeTaken = endTime - beginTime;

    std::cout << "The time taken for the sequential function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;

    beginTime = clock.now();

#pragma loop(no_vector)
    for( int i = 0; i < size; ++i )
    {
        vecResult[i] = vecTest[i] + vecTest2[i];
    }

    endTime = clock.now();
    timeTaken = endTime - beginTime;

    std::cout << "The time taken for the sequential function (with auto-vectorization disabled) to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;

    beginTime = clock.now();

    concurrency::array_view<const int, 1> av1( vecTest );
    concurrency::array_view<const int, 1> av2( vecTest2 );
    concurrency::array_view<int, 1> avResult( vecResult );
    avResult.discard_data();

    concurrency::parallel_for_each( avResult.extent, [=]( concurrency::index<1> index ) restrict(amp) {
        avResult[index] = av1[index] + av2[index];
    } );

    avResult.synchronize();
    endTime = clock.now();
    timeTaken = endTime - beginTime;

    std::cout << "The time taken for the AMP function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;
    std::cout << std::boolalpha << "The AMP function generated the correct answer: " << (vecResult == vecCorrectResult) << std::endl;

    beginTime = clock.now();

    concurrency::parallel_transform( std::begin( vecTest ), std::end( vecTest ), std::begin( vecTest2 ), std::begin( vecResult ), std::plus<int>() );

    endTime = clock.now();
    timeTaken = endTime - beginTime;

    std::cout << "The time taken for the PPL function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;
    std::cout << "The PPL function generated the correct answer: " << (vecResult == vecCorrectResult) << std::endl;

    beginTime = clock.now();

#pragma omp parallel
#pragma omp for
    for( int i = 0; i < size; ++i )
    {
        vecResult[i] = vecTest[i] + vecTest2[i];
    }

    endTime = clock.now();
    timeTaken = endTime - beginTime;

    std::cout << "The time taken for the OpenMP function to execute was: " << std::chrono::duration_cast<std::chrono::milliseconds>(timeTaken).count() << "ms" << std::endl;
    std::cout << "The OpenMP function generated the correct answer: " << (vecResult == vecCorrectResult) << std::endl;

    return 0;
}

score 5 · Answer 1 · answered Jul 07 '14 at 10:59

According to MSDN, the default partitioner for concurrency::parallel_transform is concurrency::auto_partitioner, which MSDN describes as follows:

This method of partitioning employs range stealing for load balancing as well as per-iterate cancellation.

Using this partitioner is overkill for a simple, memory-bound operation such as summing two arrays, because the cost of range stealing and per-iterate cancellation checks is large relative to the tiny amount of work done per element. You should instead use concurrency::static_partitioner. Static partitioning is exactly what most OpenMP implementations use by default when the schedule clause is missing from the for construct.
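As a minimal sketch of that suggestion: the binary overload of concurrency::parallel_transform accepts a partitioner as its last argument, so a static_partitioner can be passed there. The function name add_static is hypothetical, and the non-MSVC branch is only a sequential fallback so that the snippet compiles where PPL is not available; it is not part of the suggestion itself.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

#ifdef _MSC_VER
#include <ppl.h>   // PPL ships with Visual C++ only
#endif

// Element-wise sum of a and b into out (hypothetical helper name).
// All three vectors must have the same size.
void add_static(const std::vector<int>& a,
                const std::vector<int>& b,
                std::vector<int>& out)
{
    assert(a.size() == b.size() && a.size() == out.size());
#ifdef _MSC_VER
    // static_partitioner divides the range into fixed chunks up front,
    // without range stealing or per-iterate cancellation checks.
    concurrency::parallel_transform(a.begin(), a.end(), b.begin(), out.begin(),
                                    std::plus<int>(),
                                    concurrency::static_partitioner());
#else
    // Sequential fallback (assumption: no PPL on this toolchain).
    std::transform(a.begin(), a.end(), b.begin(), out.begin(), std::plus<int>());
#endif
}
```

In the question's benchmark this would replace the parallel_transform call that currently relies on the default partitioner; whether it closes the gap to OpenMP on a given machine would need to be measured.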

As already mentioned on Code Review, this is very memory-bound code. It is also the SUM kernel of the STREAM benchmark, which was specifically designed to measure the memory bandwidth of the system it's run on. The a[i] = b[i] + c[i] operation has a very low operational intensity (measured in OPS/byte): one addition for every three array elements moved to or from memory. Its speed is therefore determined almost entirely by the bandwidth of the main memory bus. That's why the OpenMP code and the vectorised serial code deliver basically the same performance, which is not that much higher than the performance of the non-vectorised serial code.
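To make the bandwidth argument concrete, here is a rough back-of-the-envelope helper. It assumes 4-byte ints and counts three memory streams per element (two reads, one write); it ignores any extra write-allocate traffic, so it is an estimate rather than a measurement. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Estimated effective memory bandwidth in GB/s for a[i] = b[i] + c[i].
// Each element moves three values: b[i] and c[i] are read, a[i] is written.
double effective_bandwidth_gbs(std::size_t elements,
                               std::size_t bytes_per_element,
                               double milliseconds)
{
    assert(milliseconds > 0.0);
    const double bytes_moved = 3.0 * static_cast<double>(elements)
                                   * static_cast<double>(bytes_per_element);
    const double seconds = milliseconds * 1e-3;
    return bytes_moved / seconds / 1e9;
}
```

With the question's 30737418 elements, the 25 ms vectorised serial run works out to roughly 14.8 GB/s and the 24 ms OpenMP run to roughly 15.4 GB/s. Both figures are nearly identical, which is what one would expect if the memory bus rather than the CPU is the limit.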

The way to get higher parallel performance is to run the code on a modern multi-socket system and have the data in each array distributed evenly across the sockets. Then you can get a speed-up that almost equals the number of CPU sockets.
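One common way to get such a distribution is parallel first-touch initialization, sketched below with OpenMP. This assumes the operating system places a memory page on the NUMA node of the thread that first writes to it, and that threads stay pinned to their cores; neither is guaranteed on every system. The function name and the fill values are purely illustrative.

```cpp
#include <cassert>
#include <memory>

// Allocates three arrays of n ints, initializes them in parallel and
// returns a[i] = b[i] + c[i] (hypothetical helper for illustration).
std::unique_ptr<int[]> first_touch_sum(int n)
{
    assert(n > 0);
    // new int[n] leaves the elements uninitialized, so for large arrays
    // the pages are typically not touched until the loop below writes them.
    // (std::vector<int>(n) would zero-fill everything from one thread.)
    std::unique_ptr<int[]> a(new int[n]);
    std::unique_ptr<int[]> b(new int[n]);
    std::unique_ptr<int[]> c(new int[n]);
    int* pa = a.get();
    int* pb = b.get();
    int* pc = c.get();

    // First touch: each thread writes the chunk it will later work on.
#pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
    {
        pa[i] = 0;
        pb[i] = i;
        pc[i] = 2 * i;
    }

    // Same static schedule, so each thread mostly reads and writes
    // pages that are local to its own socket.
#pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
    {
        pa[i] = pb[i] + pc[i];
    }
    return a;
}
```

The key design point is that both loops use the same static schedule, so the mapping of iterations to threads is identical in the initialization and the compute phase. Without OpenMP enabled the pragmas are ignored and the code simply runs sequentially.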

Good answer (+1), I found this by you https://stackoverflow.com/questions/11576670/in-an-openmp-parallel-code-would-there-be-any-benefit-for-memset-to-be-run-in-p/11579987#11579987 which actually shows the scaling per core. — Z boson, Jul 09 '14 at 12:07
