
I have a piece of code that I use to test various containers (e.g. a deque and a circular buffer) when passing data from a producer (thread 1) to a consumer (thread 2). Each data item is represented by a struct with a pair of timestamps: the first is taken just before the push in the producer, and the second when the item is popped by the consumer. The container is protected with a pthread spinlock.

The machine runs Red Hat 5.5 with a 2.6.18 kernel (old!); it is a 4-core system with hyperthreading disabled. gcc 4.7 with the -std=c++11 flag was used in all tests.

The producer acquires the lock, timestamps the data, pushes it into the queue, unlocks, and then sleeps in a busy loop for 2 microseconds (the only reliable way I found to sleep for precisely 2 microseconds on that system).

The consumer locks, pops the data, timestamps it and updates some statistics (running mean delay and standard deviation). The stats are printed every 5 seconds (M is the mean, M2 is the standard deviation) and then reset. I used gettimeofday() to obtain the timestamps, which means that the mean delay can be read roughly as the fraction of delays that exceed 1 microsecond.

Most of the time the output looks like this:

    CNT=2500000 M=0.00935 M2=0.910238
    CNT=2500000 M=0.0204112 M2=1.57601
    CNT=2500000 M=0.0045016 M2=0.372065

but sometimes (probably 1 trial out of 20) like this:

    CNT=2500000 M=0.523413 M2=4.83898
    CNT=2500000 M=0.558525 M2=4.98872
    CNT=2500000 M=0.581157 M2=5.05889

(Note that the mean is much worse than in the first case, and it never recovers as the program runs.)

I would appreciate thoughts on why this could happen. Thanks.

    #include <iostream>
    #include <string.h>
    #include <stdexcept>
    #include <sys/time.h>
    #include <deque>
    #include <thread>
    #include <cstdint>
    #include <cstdlib>     // exit()
    #include <cmath>
    #include <unistd.h>
    #include <pthread.h>   // pthread_spinlock_t
    #include <xmmintrin.h> // _mm_pause()

    int64_t timestamp() {
        struct timeval tv;
        gettimeofday(&tv, 0);
        return 1000000L * tv.tv_sec + tv.tv_usec;
    }

    // running mean and second moment
    struct StatsM2 {
        StatsM2() {}
        double m = 0;
        double m2 = 0;
        long count = 0;
        inline void update(long x, long c) {
            count = c;
            double delta = x - m;
            m += delta / count;
            m2 += delta * (x - m);
        }
        inline void reset() {
            m = m2 = 0;
            count = 0;
        }
        inline double getM2() { // sample variance
            return (count > 1) ? m2 / (count - 1) : 0.;
        }
        inline double getDeviation() {
            return std::sqrt(getM2());
        }
        inline double getM() { // running mean
            return m;
        }
    };

    // pause for usec microseconds using a busy loop
    int64_t busyloop_microsec_sleep(unsigned long usec) {
        int64_t t, tend;
        tend = t = timestamp();
        tend += usec;
        while (t < tend) {
            t = timestamp();
        }
        return t;
    }

    struct Data {
        Data() : time_produced(timestamp()) {}
        int64_t time_produced;
        int64_t time_consumed;
    };

    int64_t sleep_interval = 2;
    StatsM2 statsm2;
    std::deque<Data> queue;
    bool producer_running = true;
    bool consumer_running = true;
    pthread_spinlock_t spin;

    void producer() {
        producer_running = true;
        while (producer_running) {
            pthread_spin_lock(&spin);
            queue.push_back(Data());
            pthread_spin_unlock(&spin);
            busyloop_microsec_sleep(sleep_interval);
        }
    }

    void consumer() {
        int64_t count = 0;
        int64_t print_at = 1000000 / sleep_interval * 5;
        Data data;
        consumer_running = true;
        while (consumer_running) {
            pthread_spin_lock(&spin);
            if (queue.empty()) {
                pthread_spin_unlock(&spin);
                // _mm_pause();
                continue;
            }
            data = queue.front();
            queue.pop_front();
            pthread_spin_unlock(&spin);
            ++count;
            data.time_consumed = timestamp();
            statsm2.update(data.time_consumed - data.time_produced, count);
            if (count >= print_at) {
                std::cerr << "CNT=" << count << " M=" << statsm2.getM() << " M2=" << statsm2.getDeviation() << "\n";
                statsm2.reset();
                count = 0;
            }
        }
    }

    int main(void) {
        if (pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE) != 0)
            exit(2);
        std::thread consumer_thread(consumer);
        std::thread producer_thread(producer);
        sleep(40);
        consumer_running = false;
        producer_running = false;
        consumer_thread.join();
        producer_thread.join();
        return 0;
    }
Cattus
  • Try to record the full data set that makes up the stats to track this down. It is conceivable that you get these because of garbage in the data. While the deque is not thread safe, you shouldn't need `volatile` on the face of it, because the lock is supposed to issue the barriers. Nevertheless, maybe the compiler is still moving the variable around – hard to say without looking at the disassembly. Mark the deque `volatile` to see if it helps. – mockinterface Feb 03 '14 at 08:58
  • I recorded the data, it is consistent, and unfortunately volatile didn't help. Thank you for your suggestions. – Cattus Feb 04 '14 at 16:47
  • I would add that the throughput suffers as well; when things run ok, I can pump millions of messages through the queue, when timings are bad, the count is in the tens of thousands. – Cattus Feb 04 '14 at 21:17

2 Answers


EDIT:
I believe that item 5 below is the only thing that can explain latencies this large. When both threads land on the same core, each runs for a long stretch and only then switches to the other; the rest of the items on the list are too small to cause delays of that magnitude.
You can use pthread_setaffinity_np to pin your threads to specific cores. Try different combinations and see how the performance changes.
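
For reference, a minimal sketch of such pinning, assuming Linux/glibc where a std::thread's native_handle() is a pthread_t (the core numbers 0 and 1 are arbitrary placeholders to experiment with):

    // Minimal sketch: pin a std::thread's underlying pthread to one core.
    // Assumes Linux/glibc; g++ defines _GNU_SOURCE by default, which
    // pthread_setaffinity_np and the CPU_* macros need.
    #include <pthread.h>
    #include <sched.h>
    #include <thread>

    bool pin_to_core(std::thread& t, int core) {
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(core, &cpuset);
        return pthread_setaffinity_np(t.native_handle(),
                                      sizeof(cpu_set_t), &cpuset) == 0;
    }

    // e.g. in main(), right after the threads are created:
    //   pin_to_core(consumer_thread, 0);
    //   pin_to_core(producer_thread, 1);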

EDIT #2:
More things you should take care of (who said testing was simple...):
1. Make sure the consumer is already running when the producer starts producing. This is not too important in your case, since the producer is not really producing in a tight loop.
2. This is very important: you divide by count on every update, which is not the right thing to do for your stats. It means the first measurement in every stats window weighs a lot more than the last. To measure the median you would have to collect all the values; measuring the average plus min/max, without collecting every sample, should give you a good enough picture of the latency (see the sketch after this list).
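
A small sketch of such an accumulator; the struct and member names here are made up for illustration, not taken from the code in the question:

    // Tracks mean plus min/max without storing every sample.
    #include <algorithm>
    #include <cstdint>
    #include <limits>

    struct StatsMinMaxMean {
        double  sum   = 0;
        long    count = 0;
        int64_t min   = std::numeric_limits<int64_t>::max();
        int64_t max   = std::numeric_limits<int64_t>::min();

        void update(int64_t x) {
            sum += x;
            ++count;
            min = std::min(min, x);
            max = std::max(max, x);
        }
        double mean() const { return count ? sum / count : 0.0; }
        void reset() { *this = StatsMinMaxMean(); }
    };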


It's not surprising, really.
1. The timestamp is taken in Data(), but then the container spends time calling malloc.
2. Are you running 64-bit or 32-bit? In 32-bit, gettimeofday is a system call, while in 64-bit it goes through the vDSO and never enters the kernel... you may want to time gettimeofday itself and record the variance, or roll your own timestamp using rdtsc (see the sketch after this list).
It would be best to use cycles instead of microseconds, because microseconds are really too coarse for this scenario; the rounding to microseconds skews things badly at such a small scale.
3. Are you guaranteed not to get preempted between the producer and the consumer? I guess not, but this should not happen very frequently on a box dedicated to testing...
4. Are the 4 cores on a single socket or on 2? If it's a 2-socket box, you want the two threads on the same socket, or you pay (at least) double for data transfer.
5. Make sure the threads are not running on the same core.
6. If the Data you transfer and the additional data (the container node) share cache lines (quite likely) with other Data+node pairs, the producer is delayed by the consumer when the consumer writes the consumed timestamp. This is called false sharing. You can eliminate it by padding/aligning to 64 bytes and using an intrusive container (see the sketch after this list).
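
A rough sketch of points 2 and 6, assuming an x86 CPU and gcc: a cycle counter read with __rdtsc() from <x86intrin.h>, and a Data variant padded to a full cache line (using the gcc attribute spelling, since gcc 4.7 predates full alignas support). The cycle counts are only meaningful if the TSC is constant/invariant across cores:

    #include <cstdint>
    #include <x86intrin.h> // __rdtsc()

    // Timestamp in CPU cycles instead of microseconds.
    inline uint64_t cycles() {
        return __rdtsc();
    }

    // One cache line per element, so the consumer's write to time_consumed
    // does not invalidate a line the producer is still writing to.
    // gcc rounds sizeof(PaddedData) up to 64 bytes because of the alignment.
    struct __attribute__((aligned(64))) PaddedData {
        uint64_t time_produced;
        uint64_t time_consumed;
    };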

BitWhistler

gettimeofday is not a good way to profile computation overhead. It is a wall clock, and your computer is multiprocessing: even if you think you are not running anything else, the OS scheduler always has some other activity to keep the system running. To profile your process, you should at least raise the priority of the process you are profiling, and use a high-resolution timer or CPU ticks for the timing measurements.
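
A hedged sketch of both suggestions, assuming Linux (clock_gettime may need -lrt on a glibc as old as RHEL 5's, and the real-time priority requires root or CAP_SYS_NICE):

    #include <cstdint>
    #include <sched.h>  // sched_setscheduler, SCHED_FIFO
    #include <time.h>   // clock_gettime, CLOCK_MONOTONIC

    // Nanosecond timestamps from the monotonic clock instead of gettimeofday().
    inline int64_t timestamp_ns() {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return 1000000000LL * ts.tv_sec + ts.tv_nsec;
    }

    // Move the calling process into the SCHED_FIFO real-time class;
    // the priority value 10 is an arbitrary choice.
    bool raise_priority() {
        struct sched_param sp;
        sp.sched_priority = 10;
        return sched_setscheduler(0, SCHED_FIFO, &sp) == 0;
    }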

simon