
HW/OS: Linux 4.9, 64 GB RAM.

16 daemons are running. Each one reads short (100-byte) pieces at random offsets of a 5 GiB file, accessing it through a memory mapping created with mmap() at daemon startup. Each daemon reads its own file, so there are 16 files of 5 GiB in total.

Each daemon makes maybe 10 reads per second. That is not much; the disk load is rather small.
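
For reference, a minimal sketch of what each daemon is assumed to be doing (the path, sizes, read rate and error handling are placeholders, not details from the original post):

    /* Minimal sketch of the assumed per-daemon access pattern.
     * Path, sizes and read rate are placeholders, not taken from the post. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define FILE_SIZE (5ULL << 30)   /* 5 GiB data file */
    #define READ_SIZE 100            /* short random reads */

    int main(void)
    {
        int fd = open("/data/daemon0.bin", O_RDONLY);   /* hypothetical path */
        if (fd < 0) { perror("open"); return 1; }

        /* The whole file is mapped once at daemon startup. */
        char *base = mmap(NULL, FILE_SIZE, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        char buf[READ_SIZE];
        for (;;) {
            /* ~10 reads per second at random offsets; each access may fault. */
            uint64_t off = ((uint64_t)rand() << 31 | (uint64_t)rand())
                           % (FILE_SIZE - READ_SIZE);
            memcpy(buf, base + off, READ_SIZE);
            usleep(100 * 1000);
        }
    }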

Sometimes (about one event every 5 minutes, with no period, totally at random) some random daemon gets stuck in kernel code with the following stack (see picture) for 300 milliseconds. This does not correlate with major faults: major faults occur at a constant rate of about 100-200 per second. Disk reads are also constant.

What can cause this?

The stack was captured with perf record.

Text of the image (call stack):

    __list_del_entry
    isolate_lru_pages.isra.48
    shrink_inactive_list
    shrink_node_memcg
    shrink_node
    node_reclaim
    get_page_from_freelist
    enqueue_task_fair
    sched_clock
    __alloc_pages_nodemask
    alloc_pages_vma
    handle_mm_fault
    __do_page_fault
    page_fault

  • So you're sure this was one single soft page fault that stayed in the kernel for 300 ms? Can you tell if the free list is getting huge or fragmented or something? I don't think transparent hugepages are relevant for file-backed mmaps, so your `/sys/kernel/mm/transparent_hugepage/defrag` setting probably shouldn't matter, unless it's choosing this moment to defrag anonymous pages for another process? Or unless this is a fault on an anonymous page, separate from the file-backed mappings you're using. – Peter Cordes Oct 16 '20 at 02:26
  • @PeterCordes "soft page fault" - I don't know what a "soft page fault" is. I don't know what kind of page fault I'm dealing with. "if the free-list is getting huge or fragmented" - I don't know how to figure that out. "/sys/kernel/mm/transparent_hugepage/defrag" - thank you for that. I don't know how to find the answer to most of your questions. – pavelkolodin Oct 16 '20 at 12:35
  • 1
    @PeterCordes i think `madvise(MADV_RANDOM)` solved the problem. – pavelkolodin Oct 16 '20 at 20:31
  • 2
    Ah, the kernel was trying to pre-fault / readahead from disk, delaying its handling of the fault for the actual page you did touch? Re: soft page fault, see https://en.wikipedia.org/wiki/Page_fault#Minor as opposed to major / hard (needs I/O) or invalid (segfault). – Peter Cordes Oct 16 '20 at 20:34
  • @PeterCordes It seems my solution worked for one day. After an app restart, things went back to behaving badly. – pavelkolodin Oct 20 '20 at 01:32

1 Answer


You have the shrink_node and node_reclaim functions in your stack. They are called to free memory (which is shown as buff/cache by the free command-line tool): https://www.kernel.org/doc/html/latest/admin-guide/mm/concepts.html#reclaim

The process of freeing the reclaimable physical memory pages and repurposing them is called (surprise!) reclaim. Linux can reclaim pages either asynchronously or synchronously, depending on the state of the system. When the system is not loaded, most of the memory is free and allocation requests will be satisfied immediately from the free pages supply. As the load increases, the amount of the free pages goes down and when it reaches a certain threshold (high watermark), an allocation request will awaken the kswapd daemon. It will asynchronously scan memory pages and either just free them if the data they contain is available elsewhere, or evict to the backing storage device (remember those dirty pages?). As memory usage increases even more and reaches another threshold - min watermark - an allocation will trigger direct reclaim. In this case allocation is stalled until enough memory pages are reclaimed to satisfy the request.

So your 64 GB RAM system can run into situations where there is no free memory left. That much memory is only enough to hold a copy of about 12 files of 5 GiB each, and your daemons use 16 files (16 × 5 GiB = 80 GiB, more than the 64 GB of RAM). Linux may also read more data from files than the application asked for, using the readahead technique ("Linux readahead: less tricks for more", OLS 2007, pp. 273-284; man 2 readahead). MADV_SEQUENTIAL can make this mechanism more aggressive: https://man7.org/linux/man-pages/man2/madvise.2.html

   MADV_SEQUENTIAL

Expect page references in sequential order. (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.)

   MADV_RANDOM

Expect page references in random order. (Hence, read ahead may be less useful than normally.)
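
If readahead pollution turns out to be the issue, the mapping can be hinted explicitly. A minimal sketch, assuming the daemons map whole files as described in the question (the function name and path handling are illustrative, not the poster's actual code):

    /* Sketch: map a file read-only and hint random access so the kernel
     * limits readahead for this mapping.  Illustrative only. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    void *map_for_random_reads(const char *path, size_t *len_out)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return NULL; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return NULL; }

        void *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);                 /* the mapping keeps the file referenced */
        if (base == MAP_FAILED) { perror("mmap"); return NULL; }

        /* Random-access hint: readahead is unlikely to help, so the kernel
         * should fault in only the pages that are actually touched. */
        if (madvise(base, st.st_size, MADV_RANDOM) != 0)
            perror("madvise(MADV_RANDOM)");   /* only a hint, not fatal */

        *len_out = st.st_size;
        return base;
    }

Note that madvise() is only a hint, so whether it actually removes the stalls still has to be verified by observing the system.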

It is not clear how your daemons open and read the files, and whether MADV_SEQUENTIAL was active for them or not (or whether the flag was added by glibc or some other library). There can also be some effect from THP (transparent huge pages): https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html. A vanilla 4.9 kernel is from 2016 and THP support for filesystems was still being planned in 2019 (https://lwn.net/Articles/789159/), but if you use RHEL/CentOS, some features may have been backported into their fork of the 4.9 kernel.

You should check the output of free and cat /proc/meminfo periodically to see how your daemons and the kernel readahead use memory.
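
For example, a small watcher along these lines (a sketch; it only prints the MemFree and Cached lines) can log /proc/meminfo once per second so that drops in free memory can be correlated with the 300 ms stalls:

    /* Sketch: periodically log MemFree and Cached from /proc/meminfo so
     * reclaim episodes can be correlated with the observed stalls. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        for (;;) {
            FILE *f = fopen("/proc/meminfo", "r");
            if (!f) { perror("fopen"); return 1; }

            char line[256];
            while (fgets(line, sizeof line, f)) {
                if (strncmp(line, "MemFree:", 8) == 0 ||
                    strncmp(line, "Cached:", 7) == 0)
                    fputs(line, stdout);
            }
            fclose(f);
            fflush(stdout);
            sleep(1);
        }
    }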

osgx
  • Without either `MADV_SEQUENTIAL` or `MADV_RANDOM`, the kernel will do *some* read-ahead, just less aggressively than with `MADV_SEQUENTIAL`. So extra pollution from readahead can be explained without `MADV_SEQUENTIAL` being "on by default" or anything. – Peter Cordes Oct 17 '20 at 10:44