We're getting overnight lockups on our embedded (Arm) linux product but are having trouble pinning it down. It usually takes 12-16 hours from power on for the problem to manifest itself. I've installed sysstat so I can run sar logging, and I've got a bunch of data, but I'm having trouble interpreting the results.
The targets only have 512Mb RAM (we have other models which have 1Gb, but they see this issue much less often), and have no disk swap files to avoid wearing the eMMCs.
Some kind of paging / virtual memory event is initiating the problem. In the sar logs, pgpin/s, pgnscand/s and pgsteal/s, and majflt/s all increase steadily before snowballing to crazy levels. This puts the CPU up correspondingly high levels (30-60 on dual core Arm chips). At the same time, the frmpg/s values go very negative, whilst campg/s go highly positive. The upshot is that the system is trying to allocate a large amount of cache pages all at once. I don't understand why this would be.
The target then essentially locks up until it's rebooted or someone kills the main GUI process or it crashes and is restarted (We have a monolithic GUI application that runs all the time and generally does all the serious work on the product). The network shuts down, telnet blocks forever, as do /proc filesystem queries and things that rely on it like top. The memory allocation profile of the main application in this test is dominated by reading data in from file and caching it as textures in video memory (shared with main RAM) in an LRU using OpenGL ES 2.0. Most of the time it'll be accessing a single file (they are about 50Mb in size), but I guess it could be triggered by having to suddenly use a new file and trying to cache all 50Mb of it all in one go. I haven't done the test (putting more logging in) to correlate this event with these system effects yet.
The odd thing is that the actual free and cached RAM levels don't show an obvious lack of memory (I have seen oom-killer swoop in the kill the main application with >100Mb free and 40Mb cache RAM). The main application's memory usage seems reasonably well-behaved with a VmRSS value that seems pretty stable. Valgrind hasn't found any progressive leaks that would happen during operation.
The behaviour seems like that of a system frantically swapping out to disk and making everything run dog slow as a result, but I don't know if this is a known effect in a free<->cache RAM exchange system.
My problem is superficially similar to question: linux high kernel cpu usage on memory initialization but that issue seemed driven by disk swap file management. However, dirty page flushing does seem plausible for my issue.
I haven't tried playing with the various vm files under /proc/sys/vm yet. vfs_cache_pressure and possibly swappiness would seem good candidates for some tuning, but I'd like some insight into good values to try here. vfs_cache_pressure seems ill-defined as to what the difference between setting it to 200 as opposed to 10000 would be quantitatively.
The other interesting fact is that it is a progressive problem. It might take 12 hours for the effect to happen the first time. If the main app is killed and restarted, it seems to happen every 3 hours after that fact. A full cache purge might push this back out, though.
Here's a link to the log data with two files, sar1.log, which is the complete output of sar -A, and overview.log, a extract of free / cache mem, CPU load, MainGuiApp memory stats, and the -B and -R sar outputs for the interesting period between midnight and 3:40am:
https://drive.google.com/folderview?id=0B615EGF3fosPZ2kwUDlURk1XNFE&usp=sharing
So, to sum up, what's my best plan here? Tune vm to tend to recycle pages more often to make it less bursty? Are my assumptions about what's happening even valid given the log data? Is there a cleverer way of dealing with this memory usage model?
Thanks for your help.
Update 5th June 2013: I've tried the brute force approach and put a script on which echoes 3 to drop_caches every hour. This seems to be maintaining the steady state of the system right now, and the sar -B stats stay on the flat portion, with very few major faults and 0.0 pgscand/s. However, I don't understand why keeping the cache RAM very low mitigates a problem where the kernel is trying to add the universe to cache RAM.