
I've written a 'server' program that writes to shared memory, and a client program that reads from that memory. The server has different 'channels' it can write to, which are just different linked lists that it appends items to. The client is interested in some of the linked lists and wants to read every node added to those lists as it comes in, with the minimum latency possible.

I have 2 approaches for the client:

  1. For each linked list, the client keeps a 'bookmark' pointer to keep its place within the list. It round-robins the lists, iterating through all of them over and over (it loops forever), moving each bookmark one node forward whenever it can. Whether it can is determined by the value of the node's 'next' member: if it's non-null, jumping to the next node is safe (the server switches it from null to non-null atomically). This approach works OK, but if there are a lot of lists to iterate over and only a few of them are receiving updates, the latency gets bad.

  2. The server gives each list a unique ID. Each time the server appends an item to a list, it also appends that list's ID to a master 'update list'. The client keeps only one bookmark, into the update list. It endlessly checks whether that bookmark's next pointer is non-null ( while(node->next_ == NULL) {} ); when it is, it moves ahead, reads the ID, and then processes the new node on the linked list with that ID. This, in theory, should handle large numbers of lists much better, because the client doesn't have to iterate over all of them each time. (A rough sketch of both client loops follows this list.)
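
For concreteness, here's a minimal sketch of the two client loops. The struct layout and names here (Node, id_, the bookmark arrays) are simplified stand-ins for illustration, not my actual code:

struct Node {
    Node* volatile next_;  // the server flips this from NULL to non-NULL atomically
    int id_;               // in the master update list this holds the ID of the updated list
};

// Approach #1: one bookmark per list, polled round-robin.
void poll_round_robin(Node* bookmarks[], int num_lists) {
    for (;;) {
        for (int i = 0; i < num_lists; ++i) {
            if (bookmarks[i]->next_ != NULL) {
                bookmarks[i] = bookmarks[i]->next_;
                // ...consume bookmarks[i] here...
            }
        }
    }
}

// Approach #2: a single bookmark into the master update list.
void poll_update_list(Node* update_bookmark, Node* list_bookmarks[]) {
    for (;;) {
        while (update_bookmark->next_ == NULL) {}  // the spin loop where the latency shows up
        update_bookmark = update_bookmark->next_;
        int id = update_bookmark->id_;
        // an entry in the update list implies this list already has a new node linked in
        list_bookmarks[id] = list_bookmarks[id]->next_;
        // ...consume list_bookmarks[id] here...
    }
}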

When I benchmarked the latency of both approaches (using gettimeofday), to my surprise #2 was terrible. The first approach, for a small number of linked lists, would often be under 20us of latency. The second approach would have short stretches of low latency, but was often between 4,000 and 7,000us!

By inserting gettimeofday calls here and there, I've determined that all of the added latency in approach #2 is spent in the loop repeatedly checking whether the next pointer is non-null. This is puzzling to me; it's as if the change made in one process is taking longer to 'publish' to the second process under the second approach. I assume there's some sort of cache interaction going on that I don't understand. What's going on?
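
The instrumentation is roughly this shape (simplified, and the names are just for illustration):

#include <sys/time.h>

// Time how long one pass through the suspect spin loop takes, in microseconds.
long time_spin(Node* node) {
    struct timeval before, after;
    gettimeofday(&before, NULL);
    while (node->next_ == NULL) {}  // the loop the added latency disappears into
    gettimeofday(&after, NULL);
    return (after.tv_sec - before.tv_sec) * 1000000L
         + (after.tv_usec - before.tv_usec);
}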

Update: Originally, approach #2 used a condition variable, so that if node->next_ == NULL the client would wait on the condition, and the server would notify on the condition every time it issued an update. The latency was the same, and in trying to figure out why, I reduced the code down to the approach above. I'm running on a multicore machine, so one process spin-locking shouldn't affect the other.
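
The condition-variable version was roughly this shape (heavily simplified; I'm assuming a pthreads process-shared mutex/cond here for illustration, and the names aren't from the actual code):

#include <pthread.h>

pthread_mutex_t mutex;  // initialized elsewhere with PTHREAD_PROCESS_SHARED
pthread_cond_t  cond;   // likewise

// Client: block until the server links in a new node.
void wait_for_next(Node* node) {
    pthread_mutex_lock(&mutex);
    while (node->next_ == NULL)
        pthread_cond_wait(&cond, &mutex);
    pthread_mutex_unlock(&mutex);
}

// Server: after publishing a node, wake any waiting clients.
void notify_update() {
    pthread_mutex_lock(&mutex);
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&mutex);
}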

Update 2: node->next_ is volatile.

Joseph Garvin
  • Is a spinlock really necessary here? Why aren't you using something like an event object that the platform provides, so the client can wait nicely (without chewing up CPU, cache, or memory access resources) until the server signals there's new data? – Michael Burr Mar 26 '10 at 15:31
  • I tried a condition variable first, see update. – Joseph Garvin Mar 26 '10 at 15:56
  • Are you reading `node->next_` atomically? – Nikolai Fetissov Mar 26 '10 at 16:35
  • After your update - it seems that in #2 the client and server are communicating over just one shared bit of memory (the bookmark node), so there might be a lot more contention on this resource. Can you time how long it takes the client to read node->next, and how long it takes the server to write it? – Joris Timmermans Mar 26 '10 at 16:52
  • The read is always instantaneous (0 or 1 microseconds). I'll check the write... – Joseph Garvin Mar 26 '10 at 16:57
  • Nikolai: It's a one-word pointer, so I don't think there's any way to *not* read it atomically. – Joseph Garvin Mar 26 '10 at 16:58
  • I'm not sure about that - the type sig_atomic_t seems to be quite specific for atomic reads and writes. But there seem to be two potential issues here: multithreading, or the shared memory mechanism. – Joris Timmermans Mar 26 '10 at 17:05
  • The write is also always instantaneous, 0 or 1 microseconds. – Joseph Garvin Mar 26 '10 at 17:36
  • But there's a way for the compiler and/or processor to reorder the reads. – Nikolai Fetissov Mar 26 '10 at 17:39
  • Nikolai: node->next_ is volatile, so the compiler shouldn't reorder them. The processor may reorder them, but if I put __sync_synchronize (full memory barrier) before the loop and inside the loop, there is no difference. – Joseph Garvin Mar 26 '10 at 18:10

3 Answers


Since it sounds like reads and writes are occurring on separate CPUs, perhaps a memory barrier would help? Your writes may not be occurring when you expect them to be.
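
A sketch of where the barriers would go, using GCC's __sync_synchronize builtin (the structure and names here are illustrative, not your actual code):

// Server side: fill in the new node, then make its contents globally
// visible before the link that publishes it.
void server_append(Node* tail, Node* new_node) {
    // ...populate new_node's fields...
    new_node->next_ = NULL;
    __sync_synchronize();    // full memory barrier (GCC builtin)
    tail->next_ = new_node;  // publish the node
}

// Client side: barrier after observing the link, before reading the node's contents.
Node* client_follow(Node* bookmark) {
    while (bookmark->next_ == NULL) {}
    __sync_synchronize();
    return bookmark->next_;
}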

zdan
  • Memory barriers will control the *order* of the writes, which can make them take longer, but I don't think they ever make a write appear to another CPU faster. I've tried putting GCC's __sync_synchronize on both the writing and reading side to no effect. – Joseph Garvin Mar 26 '10 at 17:38

You are doing a spin lock in #2, which is generally not such a great idea, and it is chewing up cycles.

Joris Timmermans

Have you tried adding a yield after each failed polling attempt in your second approach? Just a guess, but it may reduce the impact of the busy loop.

With Boost.Thread this would look like this:

while(node->next_ == NULL) {
    boost::this_thread::yield();
}
Björn Pollex
  • I could, but how would this help? The problem is latency, not CPU usage. I'm running on a multicore machine, so the server and client aren't stealing cycles from one another. – Joseph Garvin Mar 26 '10 at 15:58