2

I've encountered a very strange issue that affects my code running only on OSX 10.10 systems. I've seen this anomaly occur on over 25 OSX 10.10 systems running my code, whereas the exact same code did not exhibit this behavior before upgrading (10.7). Furthermore, this issue is not 100% reproducible in that it occurs randomly ~0-5% of the time. While testing the code, nothing else critical or CPU exhaustive is occurring on the machine. Even if something else was happening, the very fact that the delays I'm experiencing are soooo ridiculously long make that conclusion seem suspicious. Anyhow, without further ado, take a look at the following NSLog prints:

12:00:05.766 MyApp[59957:6540517] Time_To_Wait: 679000000, Fire_Date: 270946738287700, Cur_Date: 270946059302734
12:00:26.446 MyApp[59957:6540517] Resume...

The times are in nanoseconds. According to the NSLog timestamps, we actually ended up waiting 20.68 seconds, rather than the desired 0.679 secs. Now, let's take a look at the code:

NSLog(@"Time_To_Wait: %lld, Fire_Date: %lld, Cur_Date: %lld", time_to_wait, fire_date, mach_absolute_time());
mach_wait_until(fire_date);
NSLog(@"Resume...");

In case you're wondering what mach_wait is, it's a high resolution timer available default. Just include

#include <mach/mach.h>
#include <mach/mach_time.h>

But that doesn't matter because I've experienced the exact same issue if I replace mach_wait_until with:

  • sleep()
  • usleep()
  • [NSThread sleepForTimeInterval:]
  • GCD's dispatch_after(,^{});
  • mach_wait_until()

Doesn't matter what method of delay I try to use, I use NSLog to print my 'delay' value to the console as a sanity check to ensure it's the correct # of ms, then fire one of the above. ~95% of the time it behaves according to specification. But occasionally randomly, I'll get some whack delay like 10-20 seconds. What gives? Is this some Yosemite kernel issue? Again, I've never experienced this w/ the same code running on 10.7. And I've tested all of the above methods for delaying the execution of code.


UPDATE:

Following comment advice, I went ahead and sudo dtruss -f -e sudo -u USER MyApp 2> ~/myapps.log. As usual, most of the mach_wait's and sleeps behaved the way they were supposed to. So to make things easier to identify, I added in a check where if the actual sleep duration is > 3x what it's supposed to be, it prints out DELAY ISSUE. Ran my program and did a search for delay issue. Listed below are the most frequent system calls (dtruss output) that occur between the log print immediately preceding and succeeding the mach_wait statements:

  • 99613/0xcf33b9: 30250 __semwait_signal(0xD07, 0x0, 0x1) = -1 Err#60
  • 99613/0xcf33b9: 16 workq_kernreturn(0x20, 0x0, 0x1) = 0 0

There are about ~250+ of the above two calls. Probably a few more workq than __semwait's. The semwait's all usually take about 30000 'units' of time to complete, whatever that is. Dunno how dtruss times. These make up the bulk of the calls.

  • 752/0xcf3320: 2787191 kevent64(0x3, 0x0, 0x0) = 1 0
  • 752/0xcf335d: 189948 select(0x40, 0x7FC080E18220, 0x7FC080E13B40, 0x0, 0x0) = 1 0
  • 752/0xcf335d: 1648403 select(0x40, 0x7FC080E18220, 0x7FC080E13B40, 0x0, 0x0) = 1 0

These three system calls took the longest # of units of time. And they were each appeared once in the output during our ~20sec wait gap.

Not sure what to do with the above =X

Pang
  • 9,564
  • 146
  • 81
  • 122
Authman Apatira
  • 3,994
  • 1
  • 26
  • 33
  • Should this not just be a bug report? (Not that I would blame anyone for not filing one on radar, one does rapidly get the feeling it's a waste of time) – Pascal Cuoq Feb 28 '15 at 18:42
  • Is it exactly 10s or 20s, or a random value in-beween? Because 20.68 is almost exactly 679 million nanoseconds + 20 seconds... – Iwillnotexist Idonotexist Feb 28 '15 at 18:43
  • Also, can you retry but with another logging API? I wonder if an unfair mutex is being heavily contended inside `NSLog`. Can you do the Mac OS X equivalent of Linux's `strace` on your application to see whether you're sleeping, locking locks, or instead spinning? – Iwillnotexist Idonotexist Feb 28 '15 at 18:56
  • I've seen 10sec, 11sec, 15sec, and 20sec delays. I really don't think it can be a NSLog issue (20sec for nslog mutex!?) but I'm of course willing to try other logging. I can do printf's with timer output. I dunno what OSX's equivalent to strace is though, unfortunately--? – Authman Apatira Feb 28 '15 at 20:39
  • [This link](https://opensourcehacker.com/2011/12/02/osx-strace-equivalent-dtruss-seeing-inside-applications-what-they-do-and-why-they-hang/) points to a roughly analogous tool called `dtruss`. Also, use unlocked versions of `fprintf` if possible. The idea is that you want to know where the stall is occuring: In the kernel scheduler (which would be an unlikely, and major, bug) or in Apple's userland libraries (much more likely)? – Iwillnotexist Idonotexist Feb 28 '15 at 21:19
  • @AuthmanApatira I'm digging in the source code for Yosemite's kernel. Download it from [here](https://opensource.apple.com/tarballs/xnu/xnu-2782.1.97.tar.gz) and dig starting in `xnu-2782.1.97/osfmk/kern/clock.c:810`. I strongly doubt it's the kernel's fault though. Can you report the return value of `mach_wait_until` in your NSLog those moments when there's a 20s delay? – Iwillnotexist Idonotexist Feb 28 '15 at 23:06
  • mach_wait_until gives me a 0 response. Today I ran 50 tests on various 10.10 machines, and noticed the delays were a lot of the cases precisely 20 seconds: 21:49:02.097 Time_To_Wait: 899000000. 21:49:22.997 Resume... 0. The last 0 is the mach_wait_until result. The 0.899s is the proper delay time. And from the prints 02.097->22.997 is exactly 20sec. Very strange!!! – Authman Apatira Mar 02 '15 at 03:42
  • @IwillnotexistIdonotexist I was able to capture dtruss output when a timing failure occurred. Updated the Post above with the appropriate info~ – Authman Apatira Mar 02 '15 at 06:00
  • Interesting.... `dtruss` times in microseconds. If I'm not mistaken, Err = 60 maps to `ETIMEDOUT`, an error that squares well with taking exactly 20s, but I don't see any sockets or network code! Also, can you sandwich the `mach_wait_until()` call with calls to `mach_absolute_time()` either side, and print both and their difference using `printf()` to make absolutely sure it's `mach_wait_until()` that's guilty? And can you run `dtruss` with instead of `-e`, the `-a` flag for all details, then put up the entire dtruss log on pastebin and drop a link here? – Iwillnotexist Idonotexist Mar 02 '15 at 06:31
  • You went from 10.7 to 10.10 did you notice the APIS changes regarding process priority and app nap/app sleep? – uchuugaka Mar 02 '15 at 06:32
  • @AuthmanApatira Basically, sandwich your `mach_wait_until` with a sequence of calls that resembles this: http://pastebin.com/RzLNxNa3 – Iwillnotexist Idonotexist Mar 02 '15 at 07:31
  • @AuthmanApatira I've also read that Mac OS X Mavericks and Yosemite introduced _timer coalescing_, which merges the firing times of timers together in order to do them in batches and save power. You can disable it by `sudo sysctl -w kern.timer.coalescing_enabled=0`. Does your bug get fixed by running the app with timer coalescing disabled? – Iwillnotexist Idonotexist Mar 02 '15 at 13:49
  • I've checked activity monitor to ensure the app (and all apps) had app nap disabled. I'll go ahead and try out the sandwiching mach_wait and also disabling coalescing of timers and report back shortly. – Authman Apatira Mar 02 '15 at 20:27
  • 1
    @IwillnotexistIdonotexist `sudo sysctl -w kern.timer.coalescing_enabled=0` did the trick. This bug has been haunting me for the past few months. I wasn't even aware T.C. was a separate feature from AppNap. I should buy you a cup of coffee ;-) Would you like to submit an official answer such that I can accept it? – Authman Apatira Mar 02 '15 at 21:04
  • @AuthmanApatira I'll submit an answer in a couple hours; Currently it is inconvenient for me to do so. In the meanwhile - are you sure TC off really put a full stop to your hangs, and that you aren't being lucky? Is Turned on - bug / Turned off - no bug reproducible? And what did the `mach_absolute_time()`'s around `mach_wait_until()` report? 20 seconds or 0.679? Definitely report this to Apple! – Iwillnotexist Idonotexist Mar 02 '15 at 21:50

1 Answers1

3

See comments above

The problem appears to be connected to Apple's Timer Coalescing, a new feature in Mac OS X 10.9 Mavericks that rejigs the firing times of timers by up to a few milliseconds to merge them, thus allowing the processor to make fewer power-state transitions and to remain idle for longer. The benefit is greatly reduced power consumption.

A white paper by Apple on the topic that mentions it briefly is here.

The solution that was retained in the comments is to disable TC, as follows:

sudo sysctl -w kern.timer.coalescing_enabled=0
Iwillnotexist Idonotexist
  • 13,297
  • 4
  • 43
  • 66
  • 1
    After running extensive tests, I've verified it was in fact TC that was causing the issue and (thank God) it is no longer reproducible with it disabled. – Authman Apatira Mar 04 '15 at 13:56