10

I have the following code for interprocess communication through shared memory. One process writes to a log and the other reads from it. One way is to use semaphores, but here I'm using an atomic flag (log_flag) of type atomic_t which resides inside the shared memory. The log (log_data) is also shared.

Now the question is: would this work on the x86 architecture, or do I need semaphores or mutexes? What if I make log_flag non-atomic? Given that x86 has a strict memory model and proactive cache coherence, and that optimizations are not applied to pointer accesses, I think it would still work?

EDIT: Note that I have a multicore processor with 8 cores, so I don't have any problem with busy waits here!

// log_flag (atomic_t *) and log_data (void *) both point into the
// shared-memory segment mapped by both processes.

// Process 1 calls this function
void write_log( void * data, size_t size )
{
    while( *log_flag )      // wait until the reader has consumed the previous entry
        ;
    memcpy( log_data, data, size );
    *log_flag = 1;          // publish the new entry
}

// Process 2 calls this function
void read_log( void * data, size_t size )
{
    while( !( *log_flag ) ) // wait until the writer has published an entry
        ;
    memcpy( data, log_data, size );
    *log_flag = 0;          // hand the buffer back to the writer
}
MetallicPriest
  • Having a multi-core processor doesn't make a busy-wait loop a good idea - you're needlessly burning power, and blocking out other unrelated processes. – Oliver Charlesworth Jan 02 '12 at 18:55
  • Because you're just sending data serially, blocking is acceptable, and you don't want to mess with semaphores, you should use pipes. – John K Jan 02 '12 at 19:49
  • You should declare `log_flag` with the `volatile` keyword at the least (to tell the compiler it could change without it knowing how). The unbounded busy loop is still a bad idea. Consider spinning your wheels for a small count, and then moving to a blocking mechanism. If you don't think you'll ever get the change while doing the small count, go for a blocking mechanism anyway. – Jonathan Leffler Jan 02 '12 at 19:56
  • @JonathanLeffler: [`volatile` is useless for multithreading](http://stackoverflow.com/a/4558031/87234). – GManNickG Jan 02 '12 at 20:17
  • @GMan, volatile is not useless in every case. Here it tells the compiler that it should re-read `*log_flag` from memory at every iteration and not cache the value in a register (which would turn the busy-loop into an infinite loop). When we have 2 processes on 2 CPUs and there is shared memory, the change in memory looks to the compiler very much like a `memory-mapped hardware` operation or a `signal handler` operation. – osgx Jan 02 '12 at 23:53
  • @osgx: Read the answer I linked to. `volatile` is literally useless for multithreading. Why do you think C++11 added atomics and a specified memory model? – GManNickG Jan 03 '12 at 00:02
  • @GMan, Not every linked answer is true. Even though C++11 added atomics, [there is still `volatile`](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2427.html). Volatile is needed in busy-waiting: ["when the processor is busy-waiting on the value of a variable"](http://alinux.tv/Kernel-2.6.34/volatile-considered-harmful.txt), if there is no "compiler barrier" inside the busy loop. – osgx Jan 03 '12 at 00:17
  • Without `volatile` the compiler is free to optimize this code `while( *log_flag ) ;` to `int tmp = *log_flag; while(tmp) ;`. `Volatile` is not useless, it just doesn't work like `volatile` in Java. – Bartosz Milewski Jan 03 '12 at 18:11
  • @osgx: You don't need or want `volatile` for a busy-loop, it won't prevent data races or enforce atomicity. I'm done arguing this, it's a decades-old misconception and rather uninteresting and easy to search for. – GManNickG Jan 03 '12 at 20:18
  • @BartoszMilewski: Sure, but what's guaranteed to prevent the race condition on `log_flag`? Not `volatile`, that's for sure. And by the time you actually get atomicity and ordering, you no longer need `volatile` because you don't *want* to suppress optimizations. Like I said to osgx, though, I'm done. Just do some searching online, this topic has been dead for years. – GManNickG Jan 03 '12 at 20:20
  • @BartoszMilewski,@osgx: The [Intel link](http://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming/) from @GMan's linked answer is very informative. The takeaway is that `volatile` solves ***some*** concurrent-access problems, whereas `std::atomic` or `std::mutex` solve ***all*** of them. 99% of the time `volatile` is insufficient protection. There is one exception, a dead-man loop: a while loop on a volatile variable that will be written to `false` once and only once. In any other use case the compiler/cpu/cache can introduce breaking changes. – deft_code Jan 05 '12 at 01:15

4 Answers

4

You may want to use the following macro in the loop, to avoid stressing the memory bus:

#if defined(__x86_64) || defined(__i386)
#define cpu_relax() __asm__("pause":::"memory")
#else
#define cpu_relax() __asm__("":::"memory")
#endif

Also, it acts as a compiler memory barrier (the "memory" clobber), so there is no need to declare log_flag as volatile.
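
For illustration, here is a minimal sketch (mine, not part of the original answer) of how the question's busy-wait loops could use cpu_relax(), assuming log_flag and log_data point into the shared segment as in the question:

// Sketch only: same protocol as the question, with cpu_relax() in the spin loops.
void write_log( void * data, size_t size )
{
    while( *log_flag )
        cpu_relax();            /* PAUSE hint + compiler barrier */
    memcpy( log_data, data, size );
    *log_flag = 1;
}

void read_log( void * data, size_t size )
{
    while( !( *log_flag ) )
        cpu_relax();
    memcpy( data, log_data, size );
    *log_flag = 0;
}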

But I think this is overkill; it should only be done for hard real-time stuff. You should be fine using a futex. And maybe you could simply use a pipe; it's sufficiently fast for almost all purposes.
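
As a rough illustration of the pipe suggestion (a sketch, not the answer's code): if the two processes are related, e.g. parent and child created with fork(), a pipe gives you blocking hand-off for free (for unrelated processes a named FIFO would serve the same role). The pipefd name below is an assumption for the example, and return values should be checked in real code.

#include <string.h>
#include <unistd.h>

/* pipefd is assumed to be created with pipe(pipefd) before fork():
 * the writer process keeps pipefd[1], the reader keeps pipefd[0]. */
static int pipefd[2];

void write_log( const void * data, size_t size )
{
    write( pipefd[1], data, size );   /* blocks when the pipe is full */
}

void read_log( void * data, size_t size )
{
    read( pipefd[0], data, size );    /* blocks until data arrives;
                                         partial reads not handled here */
}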

Ismael Luceno
2

I wouldn't recommend that, for two reasons: first, although pointer accesses may not be optimized by the compiler, that doesn't mean the pointed-to value won't be cached by the processor. Second, the fact that it is atomic won't prevent a read access between the end of the while loop and the line that does *log_flag = 0. A mutex is safer, though a lot slower.

If you're using pthreads, consider using an RW mutex to protect the whole buffer; that way you don't need a flag to control it (the mutex is itself the flag), and you'll have better performance when doing frequent reads.

I also don't recommend empty while() loops; you'll hog the processor that way. Put a usleep(1000) inside the loop to give the processor a chance to breathe.
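
A minimal sketch of the RW-mutex idea (my illustration, not the answer's code), assuming the lock is placed in the shared segment and initialized as process-shared; the struct layout and buffer size are made up for the example:

#include <pthread.h>
#include <string.h>

/* Hypothetical layout of the shared segment. */
struct shared_log {
    pthread_rwlock_t lock;
    char             data[4096];
};

/* One-time setup by whichever process creates the segment. */
void init_shared_log( struct shared_log * log )
{
    pthread_rwlockattr_t attr;
    pthread_rwlockattr_init( &attr );
    pthread_rwlockattr_setpshared( &attr, PTHREAD_PROCESS_SHARED );
    pthread_rwlock_init( &log->lock, &attr );
    pthread_rwlockattr_destroy( &attr );
}

void write_log( struct shared_log * log, const void * data, size_t size )
{
    pthread_rwlock_wrlock( &log->lock );   /* exclusive access for the writer */
    memcpy( log->data, data, size );
    pthread_rwlock_unlock( &log->lock );
}

void read_log( struct shared_log * log, void * data, size_t size )
{
    pthread_rwlock_rdlock( &log->lock );   /* readers can share the lock */
    memcpy( data, log->data, size );
    pthread_rwlock_unlock( &log->lock );
}

Note that, unlike the flag, the lock only guards the buffer; it does not by itself tell the reader that new data has arrived.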

Fabio Ceconello
  • `Sleep(1)` on Linux? I'd rather recommend `usleep(1)` or the same with a higher waiting time. – Niklas B. Jan 02 '12 at 18:46
  • Busy waits aren't as bad as you make them out to be. Particularly if you have a multiprocessor system and want to take advantage of it. – Jeff Mercado Jan 02 '12 at 18:50
  • @ Niklas you're right, I've been conditioned by too much Win32 programming. Sorry. – Fabio Ceconello Jan 02 '12 at 18:51
  • @Jeff agree, but in such case you have to make sure you control the affinity – Fabio Ceconello Jan 02 '12 at 18:53
  • @JeffMercado busy waits are pretty bad in a multi-process system, since you're using up resources that could be free for other processes. – João Portela Jan 02 '12 at 18:53
  • @Jeff: Can you elaborate? I don't see how causing 100% CPU utilization and completely using up your scheduled time slice *every time during the wait*, thus lowering the priority of the process should be "taking advantage of MP systems". – Niklas B. Jan 02 '12 at 18:55
  • @Fabio, I don't understand what you mean by, " Second, the fact that it is atomic won't prevent a read access between the end of the while loop and the line that does *log_flag=0. A mutex is safer, though a lot slower." Read by whom, process1(the one doing write_log)? If you mean that, there is no problem, as write_log sets log_flag to 1 in the end and on next write_log call checks if log_flag is still 1 before doing memcpy. – MetallicPriest Jan 02 '12 at 18:59
  • @MetallicPriest the first time I read your question it seemed to me log_flag was being used as a lock, that was a misinterpretation. I see it is an availability signal. In that case, my comment doesn't apply, but you're limited to using it only between a pair of processes. – Fabio Ceconello Jan 02 '12 at 19:14
  • @NiklasBaumstark: If you know you won't be waiting very long for a resource, the busy wait can save you the overhead of context switching out and waiting to be scheduled again. It won't work well on a single-processor system (since you'll have to context switch out anyway). They have their uses, you just have to be mindful of what resources are being protected and the system architecture. – Jeff Mercado Jan 02 '12 at 19:18
  • @Jeff: Now I see. I agree that the overhead of context switches has to be considered for very short waits, but only from the point of view of a single, *egoistic* application. On an interactive system, it might be more sensible to give other processes a chance to run when waiting. – Niklas B. Jan 02 '12 at 19:22
  • Some win32 functions, such as "InitializeCriticalSectionAndSpinCount" let you set the number of cycles it will sit in a busy loop before context switch. This gives you the best of both methods. – John K Jan 02 '12 at 19:54
  • @FabioCeconello, about the cache issue you mention, x86 uses a strongly coherent cache, but for the case you describe, only cache-incoherent systems would be in trouble. – Ismael Luceno Jan 02 '12 at 23:41
  • @Ismael consider that you can also have a multiprocessor, not just multicore machine. – Fabio Ceconello Jan 03 '12 at 00:19
  • @FabioCeconello: doesn't make any difference. Unless you're talking about the early (i.e. pre-Pentium) x86 SMP systems... – Ismael Luceno Jan 03 '12 at 01:33
  • @FabioCeconello: please see my comments under doron's answer. – Ismael Luceno Jan 03 '12 at 01:56
  • Ismail, I agree with you. Fabio, even if you have, say, dual-socket multicore processors, there will still be some cache that is shared between them. Usually the L3 cache is shared in that case. – MetallicPriest Jan 03 '12 at 12:05
  • @Ismael, I understand your point, but I also agree with what Janeb responded. Furthermore, I find it dangerous to rely on such intricacies, especially in an area in which it's difficult to validate the code and be sure it'll continue to work in the future. Also take a look at some relevant remarks in this paper http://msdn.microsoft.com/en-us/magazine/cc163715.aspx#S4 – Fabio Ceconello Jan 03 '12 at 13:10
  • @FabioCeconello: The only case I can think of right now (for both references) is memory bus contention, but that is irrelevant in the case of a single writer. – Ismael Luceno Jan 03 '12 at 16:44
  • @Ismael I think your suggestion about futex is probably the best balance of performance and safety. Didn't know it existed, one more thing learned :-) . – Fabio Ceconello Jan 03 '12 at 20:05
1

There are a whole bunch of reasons why you should use a semaphore and not rely on a flag.

  1. Your read-log while loop is spinning unnecessarily. This consumes system resources, like power, unnecessarily. It also means that the CPU cannot be used for other tasks.
  2. I will be surprised if x86 fully guarantees read and write ordering. Incoming data may set the log flag to 1 only to have outgoing data set it to 0. This may potentially mean that you end up losing data.
  3. I don't know where you got the idea that optimizations are not applied to pointers as a general rule. Optimizations can be applied anywhere there is no observable difference in behaviour. The compiler will probably not know that log_flag can be changed by a concurrent process.

Problem 2 may appear only rarely, and tracking down the issue will be hard. So do yourself a favour and use the correct operating system primitives; they will guarantee that things work as expected.
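
For concreteness, here is a minimal sketch (mine, not the answer's) of the semaphore approach using POSIX unnamed semaphores placed in the shared segment (sem_init with pshared = 1); the struct layout and buffer size are assumptions for the example:

#include <semaphore.h>
#include <string.h>

/* Hypothetical shared segment: two process-shared semaphores replace log_flag. */
struct shared_log {
    sem_t empty;    /* 1 while the buffer is free for the writer */
    sem_t full;     /* 1 while the buffer holds an unread entry  */
    char  data[4096];
};

/* One-time setup by the process that creates the segment. */
void init_shared_log( struct shared_log * log )
{
    sem_init( &log->empty, 1, 1 );   /* pshared = 1: usable across processes */
    sem_init( &log->full,  1, 0 );
}

void write_log( struct shared_log * log, const void * data, size_t size )
{
    sem_wait( &log->empty );         /* block until the reader has consumed */
    memcpy( log->data, data, size );
    sem_post( &log->full );          /* wake the reader */
}

void read_log( struct shared_log * log, void * data, size_t size )
{
    sem_wait( &log->full );          /* block until data is available */
    memcpy( data, log->data, size );
    sem_post( &log->empty );         /* hand the buffer back */
}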

doron
  • If pointers could be optimized, why was the restrict keyword invented in C99? The compiler takes no risk with pointers, as they can point anywhere in the memory. – MetallicPriest Jan 02 '12 at 19:27
  • x86 guarantees write ordering, but only within a CPU, and doesn't guarantee read ordering even within a CPU. – ugoren Jan 02 '12 at 19:44
  • `atomic_t` is defined to create memory barriers if needed to ensure proper visible ordering, isn't it? – Ben Voigt Jan 02 '12 at 19:53
  • @BenVoigt: GCC has no builtin knowledge of any "atomic_t" type, so without further information what it's supposed to be, no. – janneb Jan 02 '12 at 20:02
  • @janneb: Ah, I'm thinking of `std::atomic`, not `atomic_t`. – Ben Voigt Jan 02 '12 at 20:13
  • @doron: x86 snoops the memory bus and invalidates proactively. – Ismael Luceno Jan 03 '12 at 00:34
  • @ugoren: x86 cache is strongly coherent, so by-design it can't do write re-ordering, unless the corresponding memory page is marked for write-combining (via MTRRs or PAT), something you must do explicitly anyway... – Ismael Luceno Jan 03 '12 at 00:34
  • The x86 cache may be strongly coherent, but x86 as an architecture is weakly ordered, in the sense that the architecture doesn't guarantee sequential consistency in all cases (for normal write-back memory) without explicit fence instructions. – janneb Jan 03 '12 at 12:58
  • @janneb: what exactly do you mean? I'm inclined to think about bus contention, which doesn't apply to this case since there's a single writer... – Ismael Luceno Jan 03 '12 at 16:42
  • @Ismael: See e.g. https://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf for an easily understandable explanation of the x86 memory model. The allowed reorderings might not apply to this case, but then the OP's code looks so broken I can't make heads or tails of it anyway. – janneb Jan 03 '12 at 17:47
  • Janneb, why do you find my code broken? What is wrong with it? I think it should work perfectly under the x86 model. I have not tried it without atomic_t, but I think in that case (when atomic_t is not used) volatile would suffice. Prove to me how it would not work on x86, if you can. – MetallicPriest Jan 03 '12 at 18:13
  • Ismail: true, here there is a single writer and a single reader, so I don't think atomic_t is necessary. Volatile would be required though if we assume that the compiler can cache away log_flag. – MetallicPriest Jan 03 '12 at 18:17
  • @MetallicPriest: there's no convenience in using `volatile` over a memory barrier. In fact, unless you take a lot of care (which generally translates into writing more code), `volatile` will produce poor machine code. – Ismael Luceno Jan 04 '12 at 06:18
  • @janneb: I am well-aware of implementations since I've suffered x86 (i.e. worked with :P) on distributed computing in the past. But it's an interesting read. This has been traditionally a grey area, and I've heard of some x86 NUMA implementations that diverge (but never had the chance to try one :/). – Ismael Luceno Jan 04 '12 at 08:06
  • @Ismael: Yes, it was poorly specified in the past. I guess neither Intel nor AMD wanted to tie their hands, thus allowing a future processor to improve performance by doing more aggressive re-ordering. What changed, and caused the vendors to publish memory model specs (and thus guarantee the specified architectural behavior going into the future) was, I think, a combination of 1) Increased availability of SMP via cheap multicore processors forced them to specify something to enable programmers to write robust code, and 2) – janneb Jan 04 '12 at 09:27
  • @Ismael: 2) various memory speculation features in processors allowed good performance while still having a much less weak architectural model compared to some RISC architectures like PowerPC. – janneb Jan 04 '12 at 09:28
  • @MetallicPriest: Well, broken is perhaps a bit strong, sorry. Instead, I'll say subtle and fragile, and in the absence of benchmarks showing that the OS provided synchronization primitives are too slow, pointless for production code (it's fine as a learning exercise, of course!). Also, what is atomic_t? Neither C, C++, GCC or POSIX has any type like that, and Googling only finds some Linux kernel internal type. So presumably it's your own type(def). Thus for all we know, it might be a type for which x86 does not guarantee atomic reads/writes. – janneb Jan 04 '12 at 10:00
  • @janneb: Indeed (It's sad x86 isn't an incoherent NUMA-like model, that would make our lives a lot easier). – Ismael Luceno Jan 04 '12 at 10:13
1

As long as log_flag is atomic you will be fine.

If log_flag were just a regular bool, you would have no guarantee it will work.

The compiler could reorder your instructions:

*log_flag = 1;
memcpy( log_data, data, size );

This is semantically identical on a uniprocessor system, as long as log_flag is not accessed inside memcpy. Your only saving grace may be an inferior optimizer that can't deduce what variables are accessed in memcpy.

The CPU can reorder your instructions.
It may choose to load log_flag before the loop to optimize the pipeline.

The cache may reorder your memory writes.
The cache line that contains log_flag may get synced to the other processor before the cache line containing data.

What you need is a way to tell the compiler, CPU, and cache "hands off", so that they don't make assumptions about the order. That can only be done with a memory fence. std::atomic, std::mutex, and semaphores all have the correct memory-fence instructions embedded in their code.
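
As an illustration of what those fences buy you, here is a sketch of the question's protocol written with C11 <stdatomic.h> (the C analogue of std::atomic). This is my example, not the answer's; it assumes log_flag and log_data live in the shared segment as before, and that the atomic type is lock-free so it works across processes:

#include <stdatomic.h>
#include <string.h>

extern _Atomic int * log_flag;   /* in shared memory */
extern void        * log_data;   /* in shared memory */

void write_log( const void * data, size_t size )
{
    /* Acquire pairs with the reader's release-store of 0 below. */
    while( atomic_load_explicit( log_flag, memory_order_acquire ) )
        ;
    memcpy( log_data, data, size );
    /* Release: the memcpy cannot be moved after this store. */
    atomic_store_explicit( log_flag, 1, memory_order_release );
}

void read_log( void * data, size_t size )
{
    while( !atomic_load_explicit( log_flag, memory_order_acquire ) )
        ;
    memcpy( data, log_data, size );
    atomic_store_explicit( log_flag, 0, memory_order_release );
}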

deft_code