Why is the STL priority_queue not much faster than multiset in this case?

Question

I am comparing the performance of the STL (g++) priority_queue against multiset and found that push and pop are not as fast as I would expect. See the following code:

#include <cstdlib>   // srand, rand
#include <set>
#include <queue>

using namespace std;

typedef multiset<int> IntSet;

void testMap()
{
    srand( 0 );

    IntSet iSet;

    for ( size_t i = 0; i < 1000; ++i )
    {
        iSet.insert(rand());
    }

    for ( size_t i = 0; i < 100000; ++i )
    {
        int v = *(iSet.begin());
        iSet.erase( iSet.begin() );
        v = rand();
        iSet.insert(v);
    }
}

typedef priority_queue<int> IntQueue;

void testPriorityQueue()
{
    srand(0);
    IntQueue q;

    for ( size_t i = 0; i < 1000; ++i )
    {
        q.push(rand());
    }

    for ( size_t i = 0; i < 100000; ++i )
    {
        int v = q.top();
        q.pop();
        v = rand();
        q.push(v);
    }
}

int main(int,char**)
{
   testMap();
   testPriorityQueue();
}

I compiled this with -O3 and then ran it under valgrind --tool=callgrind. According to KCachegrind, testMap takes 54% of total CPU time and testPriorityQueue takes 44%.

(Without -O3, testMap is a lot faster than testPriorityQueue.) The function that seems to take most of the time in testPriorityQueue is:

void std::__adjust_heap<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, long, int, std::less<int> >

That function seems to be called from the pop() call.

What does this function do exactly? Is there a way to avoid it by using a different container or allocator?

Aren't heaps cache-unfriendly? At least that's been my general impression. — user541686, Aug 03 '12 at 17:37
And I think they branch a lot in unpredictable ways. That function looks like it's what's responsible for heap "bubbling" which is the log(n) operation that has to be performed on the heap every time an element is removed to maintain its order. — Wug, Aug 03 '12 at 17:46
CPU% is not a useful way to test performance or speed. `__adjust_heap` "rebalances" the priority queue, and is the only slow operation when dealing with priority queues. It's intrinsic to priority queues; the only alternative I can think of is `std::set`, which has to balance in a similar way. — Mooing Duck, Aug 03 '12 at 17:50
I have done a simple priority_queue template this afternoon and it is almost twice as fast as std::priority_queue when compiled -O3 in Linux 64 bit. — Jeroen Dirks, Aug 03 '12 at 19:56

Useless · Answer 1 · 2012-08-03T18:31:00.940

The priority queue is implemented as a heap, which has to be "rebalanced" every time you remove the head element. In the linked description, delete-min is an O(log n) operation because the min (or head) element is the root of the flattened binary tree: removing it leaves a hole at the root that has to be refilled by sifting an element down through up to log n levels.
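As a rough illustration of that delete-min step, the sketch below removes the head of a heap stored in a `std::vector` with `std::pop_heap`, which is what `std::priority_queue::pop()` is specified to call before `pop_back()` on its container. The helper name `pop_head` is made up for this example, and the default comparator makes this a max-heap, as in the question. In libstdc++ this path appears to be where `__adjust_heap` is reached, which would be consistent with the profile above.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: remove and return the head of a heap stored in v.
// std::pop_heap swaps the head with the last element and then sifts the new
// root down to restore the heap property, using O(log n) comparisons.
int pop_head(std::vector<int>& v)
{
    assert(!v.empty());                 // popping an empty heap is undefined
    std::pop_heap(v.begin(), v.end());  // old head is now at v.back()
    int head = v.back();
    v.pop_back();                       // shrink the container by one
    return head;
}
```

After `std::make_heap(v.begin(), v.end())` on a vector, repeated calls to `pop_head(v)` return the elements in descending order, each call paying for one sift-down.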

The set is usually implemented as a red-black tree, and the min element will be the leftmost node (so either a leaf, or having at most a right child). Therefore it has at most 1 child to be moved, and rebalancing can be amortized over multiple pop calls, based on the allowable degree of un-balanced-ness.

Note that if the heap has any advantage, it's likely to be in locality-of-reference (since it is contiguous rather than node-based). This is exactly the sort of advantage that may be harder for callgrind to measure accurately, so I'd suggest running some elapsed-real-time benchmark as well before accepting this result.
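One possible way to take such an elapsed-real-time measurement is sketched below, assuming a C++11 compiler so that `std::chrono::steady_clock` is available (the g++ version used in this thread may predate it). The names `elapsed_ms`, `run_multiset` and `run_priority_queue` are hypothetical; the workloads mirror the pop-then-push loops from the question and return a checksum so the optimizer cannot discard the work.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <queue>
#include <set>

// Hypothetical helper: run f once and return the elapsed wall-clock time in ms.
// steady_clock is monotonic, so system clock adjustments do not affect it.
template <class F>
double elapsed_ms(F f)
{
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Workload from the question: remove the smallest element, insert a random one.
long long run_multiset(std::size_t initial, std::size_t iterations)
{
    assert(initial > 0);  // the loop below dereferences begin()
    std::srand(0);
    std::multiset<int> s;
    for (std::size_t i = 0; i < initial; ++i) s.insert(std::rand());
    long long checksum = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        checksum += *s.begin();
        s.erase(s.begin());
        s.insert(std::rand());
    }
    return checksum;
}

// Same shape, but std::priority_queue<int> removes the largest element.
long long run_priority_queue(std::size_t initial, std::size_t iterations)
{
    assert(initial > 0);  // the loop below calls top()
    std::srand(0);
    std::priority_queue<int> q;
    for (std::size_t i = 0; i < initial; ++i) q.push(std::rand());
    long long checksum = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        checksum += q.top();
        q.pop();
        q.push(std::rand());
    }
    return checksum;
}
```

A call such as `elapsed_ms([] { run_multiset(1000, 10000000); })` would then give a wall-clock figure to set beside the callgrind percentages. Repeating each measurement several times and comparing the spread is usually worthwhile before drawing conclusions.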

The min element does not have to be a leaf - it may have a right child. — Ivan Vergiliev, Aug 03 '12 at 18:21

score 2 · Answer 2 · answered Aug 03 '12 at 20:00

I have implemented a priority queue that seems to run faster when compiled with -O3. Maybe just because the compiler was able to inline more than in the STL case?

#include <cstdlib>   // srand, rand
#include <ctime>     // clock, clock_t
#include <set>
#include <queue>
#include <vector>
#include <iostream>

using namespace std;

typedef multiset<int> IntSet;

#define TIMES 10000000

void testMap()
{
    srand( 0 );

    IntSet iSet;

    for ( size_t i = 0; i < 1000; ++i ) {
        iSet.insert(rand());
    }

    for ( size_t i = 0; i < TIMES; ++i ) {
        int v = *(iSet.begin());
        iSet.erase( iSet.begin() );
        v = rand();
        iSet.insert(v);
    }
}

typedef priority_queue<int> IntQueue;

void testPriorityQueue()
{
    srand(0);
    IntQueue q;

    for ( size_t i = 0; i < 1000; ++i ) {
        q.push( rand() );
    }

    for ( size_t i = 0; i < TIMES; ++i ) {
        int v = q.top();
        q.pop();
        v = rand();
        q.push(v);
    }
}


template <class T>
class fast_priority_queue
{
public:
    fast_priority_queue()
        :size(1) {
        mVec.resize(1); // first element never used
    }
    void push( const T& rT ) {
        mVec.push_back( rT );
        size_t s = size++;
        while ( s > 1 ) {
            T* pTr = &mVec[s];
            s = s / 2;
            if ( mVec[s] > *pTr ) {
                T tmp = mVec[s];
                mVec[s] = *pTr;
                *pTr = tmp;
            } else break;
        }
    }
    const T& top() const {
        return mVec[1];
    }
    void pop() {
        mVec[1] = mVec.back();
        mVec.pop_back();
        --size;
        size_t s = 1;
        size_t n = s*2;
        T& rT = mVec[s];
        while ( n < size ) {
            if ( mVec[n] < rT ) {
                T tmp = mVec[n];
                mVec[n] = rT;
                rT = tmp;
                s = n;
                n = 2 * s;
                continue;
            }
            ++n;
            if ( mVec[n] < rT ) {
                T tmp = mVec[n];
                mVec[n] = rT;
                rT = tmp;
                s = n;
                n = 2 * s;
                continue;
            }
            break;
        }
    }
    size_t size;
    vector<T> mVec;
};

typedef fast_priority_queue<int> MyQueue;

void testMyPriorityQueue()
{
    srand(0);
    MyQueue q;

    for ( size_t i = 0; i < 1000; ++i ) {
        q.push( rand() );
    }

    for ( size_t i = 0; i < TIMES; ++i ) {
        int v = q.top();
        q.pop();
        v = rand();
        q.push(v);
    }
}


int main(int,char**)
{
    clock_t t1 = clock();
    testMyPriorityQueue();
    clock_t t2 = clock();
    testMap();
    clock_t t3 = clock();
    testPriorityQueue();
    clock_t t4 = clock();

    cout << "fast_priority_queue: " << t2 - t1 << endl;
    cout << "std::multiset: " << t3 - t2 << endl;
    cout << "std::priority_queue: " << t4 - t3 << endl;
}

When compiled with g++ 4.1.2 and -O3 on 64-bit Linux, this gives me (values are clock() ticks):

fast_priority_queue: 260000
std::multiset: 620000
std::priority_queue: 490000

Unfortunately, your `pop()` method is not correct: When moving the new head node downwards, it has to be swapped with its **smallest** child. Otherwise the heap property will be violated immediately. — ph4nt0m, Aug 20 '15 at 22:46
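The fix described in this comment could look roughly like the sketch below: a 1-based binary min-heap whose `pop()` swaps the moved element with its smaller child at each level and stops once neither child is smaller. This is not the answer's original code. The class name `min_heap_sketch` and the `empty()` member are assumptions added for illustration, and the timings above do not apply to it.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch of a 1-based binary min-heap (hypothetical name, not the answer's class).
template <class T>
class min_heap_sketch
{
public:
    min_heap_sketch() : mVec(1) {}  // index 0 is never used

    bool empty() const { return mVec.size() == 1; }

    void push(const T& value)
    {
        mVec.push_back(value);
        std::size_t s = mVec.size() - 1;
        // Sift up: swap with the parent while the parent is larger.
        while (s > 1 && mVec[s] < mVec[s / 2]) {
            std::swap(mVec[s], mVec[s / 2]);
            s /= 2;
        }
    }

    const T& top() const
    {
        assert(!empty());
        return mVec[1];
    }

    void pop()
    {
        assert(!empty());
        mVec[1] = mVec.back();  // move the last element to the root
        mVec.pop_back();
        const std::size_t end = mVec.size();  // one past the last valid index
        std::size_t s = 1;
        for (;;) {
            std::size_t child = 2 * s;
            if (child >= end) break;  // s has no children
            // Choose the smaller child; the right child may not exist.
            if (child + 1 < end && mVec[child + 1] < mVec[child]) ++child;
            if (!(mVec[child] < mVec[s])) break;  // heap property restored
            std::swap(mVec[s], mVec[child]);
            s = child;
        }
    }

private:
    std::vector<T> mVec;
};
```

Pushing a handful of values and popping until `empty()` returns true should yield them in ascending order, which is a quick sanity check that the heap property survives each `pop()`.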
