
I have a problem with high CPU consumption by the Linux kernel while bootstrapping my Java applications on a server. This problem occurs only in production; on dev servers everything is lightning fast.

upd9: There were two questions about this issue:

  1. How to fix it? - Nominal Animal suggested to sync and drop everything, and it really helps: sudo sh -c 'sync ; echo 3 > /proc/sys/vm/drop_caches' works. upd12: But indeed sync alone is enough.

  2. Why is this happening? - It is still open for me. I do understand that flushing dirty pages to disk consumes kernel CPU and IO time, that is normal. But what is strange is why even a single-threaded application written in "C" may load ALL cores to 100% in kernel space.

Due to ref-upd10 and ref-upd11 I have an idea that echo 3 > /proc/sys/vm/drop_caches is not required to fix my problem with slow memory allocation. It should be enough to run `sync' before starting the memory-consuming application. I will probably try this tomorrow in production and post results here.

upd10: Lots of FS cache pages case:

  1. I executed cat 10GB.file > /dev/null, then
  2. sync to be sure there were no dirty pages (cat /proc/meminfo |grep ^Dirty displayed 184 kB).
  3. Checking cat /proc/meminfo |grep ^Cached I got: 4GB cached (a small C helper for reading these two values is sketched after this list).
  4. Running my int main test (below) I got normal performance (like 50ms to initialize 32MB of allocated data).
  5. Cached memory was reduced to 900MB.
  6. Test summary: I think it is no problem for Linux to reclaim pages used as FS cache into allocated memory.
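
By the way, here is a minimal C sketch (just a convenience helper, not part of the test itself) that reads the same Dirty and Cached lines from /proc/meminfo that I check with cat/grep above:

#include <stdio.h>
#include <string.h>

int main(void){
    char line[256];
    FILE *f = fopen("/proc/meminfo", "r");
    if (f == NULL){
        perror("/proc/meminfo");
        return 1;
    }
    while (fgets(line, sizeof line, f) != NULL){
        /* print only the two lines of interest; values are reported in kB */
        if (strncmp(line, "Dirty:", 6) == 0 || strncmp(line, "Cached:", 7) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}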

upd11: LOTS of dirty pages case.

  1. I ran my HowMongoDdWorks example with the read part commented out, and after some time

  2. /proc/meminfo said that 2.8GB was Dirty and 3.6GB was Cached.

  3. I stopped HowMongoDdWorks and ran my int main test (below).

  4. Here is part of the result:

    init 15, time 0.00s
    x 0 [try 1/part 0] time 1.11s
    x 1 [try 2/part 0] time 0.04s
    x 0 [try 1/part 1] time 1.04s
    x 1 [try 2/part 1] time 0.05s
    x 0 [try 1/part 2] time 0.42s
    x 1 [try 2/part 2] time 0.04s

  5. Summary of the test: lots of dirty pages significantly slows down the first access to allocated memory (to be fair, this only starts to happen when total application memory becomes comparable to the whole OS memory, i.e. if you have 8 of 16 GB free, it is no problem to allocate 1GB; the slowdown starts from 3GB or so).

Now I have managed to reproduce this situation in my dev environment, so here are new details.

Dev machine configuration:

  1. Linux 2.6.32-220.13.1.el6.x86_64 - Scientific Linux release 6.1 (Carbon)
  2. RAM: 15.55 GB
  3. CPU: 1 X Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz (4 threads) (physical)

It's 99.9% certain that the problem is caused by a large amount of dirty pages in the FS cache. Here is the application which creates lots of dirty pages:

import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

/**
 * @author dmitry.mamonov
 *         Created: 10/2/12 2:53 PM
 */
public class HowMongoDdWorks{
    public static void main(String[] args) throws IOException {
        final long length = 10L*1024L*1024L*1024L;
        final int pageSize = 4*1024;
        final int lengthPages = (int) (length/pageSize);
        final byte[] buffer = new byte[pageSize];
        final Random random = new Random();
        System.out.println("Init file");
        final RandomAccessFile raf = new RandomAccessFile("random.file","rw");
        raf.setLength(length);
        int written = 0;
        int readed = 0;
        System.out.println("Test started");
        while(true){
            { //write.
                random.nextBytes(buffer);
                final long randomPageLocation = (long)random.nextInt(lengthPages)*(long)pageSize;
                raf.seek(randomPageLocation);
                raf.write(buffer);
                written++;
            }
            { //read.
                random.nextBytes(buffer);
                final long randomPageLocation = (long)random.nextInt(lengthPages)*(long)pageSize;
                raf.seek(randomPageLocation);
                raf.read(buffer);
                readed++;
            }
            if (written % 1024==0 || readed%1024==0){
                System.out.printf("W %10d R %10d pages\n", written, readed);
            }

        }
    }
}

And here is the test application which causes high (up to 100% on all cores) CPU load in kernel space (it is the same as the test below, but I will copy it here once again).

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(void){
   clock_t last = clock(); //remember the time
   for(int i=0;i<16;i++){ //repeat test several times
      int size = 256 * 1024 * 1024;
      int size4 = size/4;
      int* buffer = malloc(size); //allocate 256MB of memory (never freed, so total usage grows each iteration)
      for(int k=0;k<2;k++){ //initialize allocated memory twice
          for(int j=0;j<size4;j++){
              //memory initialization (if I skip this step my test ends in 0.000s)
              buffer[j]=k;
          }
          //printing stats
          printf("x [%d] %.2f\n", k+1, (clock()-last)/(double)CLOCKS_PER_SEC);
          last = clock();
      }
   }
   return 0;
}

While the HowMongoDdWorks program above is running, the int main test will show results like this:

x [1] 0.23
x [2] 0.19
x [1] 0.24
x [2] 0.19
x [1] 1.30 -- first initialization takes significantly longer
x [2] 0.19 -- than the second one (about 6x slower)
x [1] 10.94 -- and sometimes it is 50x slower!!!
x [2] 0.19
x [1] 1.10
x [2] 0.21
x [1] 1.52
x [2] 0.19
x [1] 0.94
x [2] 0.21
x [1] 2.36
x [2] 0.20
x [1] 3.20
x [2] 0.20 -- and the results are totally unstable
...

I keep everything below this line only for historical purposes.


upd1: both dev and production systems are big enough for this test. upd7: it is not paging; at least I have not seen any storage IO activity during the problem time.

  1. dev ~ 4 cores, 16 GB RAM, ~8 GB free
  2. production ~ 12 cores, 24 GB RAM, ~16 GB free (8 to 10 GB is under FS cache, but it makes no difference; same results even if all 16 GB is totally free); this machine is also loaded by CPU, but not too highly, ~10%.

upd8 (ref): A new test case and a potential explanation are at the end.

Here is my test case (I also tested Java and Python, but "C" should be the clearest):

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(void){
   clock_t last = clock(); //remember the time
   for(int i=0;i<16;i++){ //repeat test several times
      int size = 256 * 1024 * 1024;
      int size4 = size/4;
      int* buffer = malloc(size); //allocate 256MB of memory (never freed, so total usage grows each iteration)
      for(int k=0;k<2;k++){ //initialize allocated memory twice
          for(int j=0;j<size4;j++){
              //memory initialization (if I skip this step my test ends in 0.000s)
              buffer[j]=k;
          }
          //printing stats
          printf("x [%d] %.2f\n", k+1, (clock()-last)/(double)CLOCKS_PER_SEC);
          last = clock();
      }
   }
   return 0;
}

The output on the dev machine (partial):

x [1] 0.13 -- the first initialization takes a bit longer
x [2] 0.12 -- than the second one, but the difference is not significant.
x [1] 0.13
x [2] 0.12
x [1] 0.15
x [2] 0.11
x [1] 0.14
x [2] 0.12
x [1] 0.14
x [2] 0.12
x [1] 0.13
x [2] 0.12
x [1] 0.14
x [2] 0.11
x [1] 0.14
x [2] 0.12 -- and the results are quite stable
...

The output on the production machine (partial):

x [1] 0.23
x [2] 0.19
x [1] 0.24
x [2] 0.19
x [1] 1.30 -- first initialization takes significantly longer
x [2] 0.19 -- than the second one (about 6x slower)
x [1] 10.94 -- and sometimes it is 50x slower!!!
x [2] 0.19
x [1] 1.10
x [2] 0.21
x [1] 1.52
x [2] 0.19
x [1] 0.94
x [2] 0.21
x [1] 2.36
x [2] 0.20
x [1] 3.20
x [2] 0.20 -- and the results are totally unstable
...

While running this test on the development machine, CPU usage barely rises above idle; all cores show less than 5% usage in htop.

But running this test on the production machine I see up to 100% CPU usage on ALL cores (on average the load rises up to 50% on the 12-core machine), and it is all kernel time.

upd2: all machines have the same CentOS Linux (kernel 2.6) installed; I work with them over SSH.

upd3: A: It is unlikely to be swapping; I have not seen any disk activity during my test, and plenty of RAM is also free. (Also, the description is updated.) – Dmitry

upd4: htop shows high CPU utilisation by the kernel, up to 100% utilisation on all cores (on prod).

upd5: does CPU utilization settle down after initialization completes? In my simple test, yes. For the real application, the only thing that helps is stopping everything else before starting a new program (which is nonsense).

I have two questions:

  1. Why is this happening?

  2. How to fix it?

upd8: Improved test and explanation.

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(void){
   const int partition = 8;
   clock_t last = clock();
   for(int i=0;i<16;i++){
       int size = 256 * 1024 * 1024;
       int size4 = size/4;
       int* buffer = malloc(size); //never freed, so total usage grows each iteration
       buffer[0] = 123;
       printf("init %d, time %.2fs\n", i, (clock()-last)/(double)CLOCKS_PER_SEC);
       last = clock();
       for(int p=0;p<partition;p++){
           for(int k=0;k<2;k++){
               for(int j=p*size4/partition;j<(p+1)*size4/partition;j++){
                   buffer[j]=k;
               }
               printf("x [try %d/part %d] time %.2fs\n", k+1, p, (clock()-last)/(double)CLOCKS_PER_SEC);
               last = clock();
           }
       }
   }
   return 0;
}

And the result looks like this:

init 15, time 0.00s -- malloc call takes nothing.
x [try 1/part 0] time 0.07s -- usually the first try to fill a buffer part with values is fast enough.
x [try 2/part 0] time 0.04s -- the second try to fill a buffer part with values is always fast.
x [try 1/part 1] time 0.17s
x [try 2/part 1] time 0.05s -- second try...
x [try 1/part 2] time 0.07s
x [try 2/part 2] time 0.05s -- second try...
x [try 1/part 3] time 0.07s
x [try 2/part 3] time 0.04s -- second try...
x [try 1/part 4] time 0.08s
x [try 2/part 4] time 0.04s -- second try...
x [try 1/part 5] time 0.39s -- BUT sometimes it takes significantly longer than average to fill a part of the allocated buffer with values.
x [try 2/part 5] time 0.05s -- second try...
x [try 1/part 6] time 0.35s
x [try 2/part 6] time 0.05s -- second try...
x [try 1/part 7] time 0.16s
x [try 2/part 7] time 0.04s -- second try...

Facts I learned from this test:

  1. Memory allocation itself is fast.
  2. First access to allocated memory is fast (so it is not a lazy buffer allocation problem).
  3. I split the allocated buffer into parts (8 in the test).
  4. And filled each buffer part with value 0, and then with value 1, printing the consumed time.
  5. The second fill of each buffer part is always fast.
  6. BUT the first fill of each buffer part is always a bit slower than the second fill (I believe some extra work is done by the kernel on first page access).
  7. Sometimes it takes SIGNIFICANTLY longer to fill a buffer part with values the first time.

I tried the suggested answer and it seems it helped. I will recheck and post results again later.

Looks like Linux maps newly allocated pages onto dirty file system cache pages, and it takes a lot of time to flush those pages to disk one by one. But a full sync works fast and eliminates the problem.
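
As a minimal sketch of that idea (just an illustration, using the same 256MB fill as in my test), the test program can call sync(2) itself right before the allocation loop, which should be roughly equivalent to running sync from the shell just before starting it:

#include <unistd.h> /* sync() */
#include <stdlib.h>

int main(void){
    sync(); /* ask the kernel to write dirty pages out before we start allocating */

    /* ...then the same allocate-and-fill loop as in the test above... */
    int size = 256 * 1024 * 1024;
    int *buffer = malloc(size);
    if (buffer == NULL)
        return 1;
    for (int j = 0; j < size / 4; j++)
        buffer[j] = 1; /* touch every page */
    free(buffer);
    return 0;
}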

Dmitry
  • Q: What does "top" say? Q: does CPU utilization settle down after initialization completes? ... and, most important ... Q: Do you have enough RAM on your servers? Do you have any evidence that any "swapping" is occurring? – paulsm4 Sep 28 '12 at 22:13
  • Do you run the test on the development machine directly, and on the production machine through SSH or similar? – jpsecher Sep 28 '12 at 22:16
  • This looks like it has something to do with paging. I think on the long initialization the application has run out of pages, and so asks Linux for more. If memory is full (or not enough space for those pages), Linux has to spend time paging things out. I think you're experiencing thrashing. What else is running on the server? – Linuxios Sep 29 '12 at 02:01
  • 12 cores means multi-chip? With separate caches? Causing cache-sync problems? – Bo Persson Sep 29 '12 at 08:40
  • upd8 is correct. When an application allocates memory in Linux, the kernel only sets up the mappings (pages), but does not actually populate any RAM in them. That only happens when you first read or write to the mapping; it also happens on a page-by-page basis (a page being `sysconf(_SC_PAGESIZE)` bytes). The kernel uses clean (not dirty, i.e. not modified compared to disk) page cache pages first, so normally the initial delay is minimal. In your prod env, most page cache pages are dirty, and must be flushed to disk before reuse. That is why doing it by hand first makes everything faster. – Nominal Animal Sep 30 '12 at 17:54

2 Answers


Run

sudo sh -c 'sync ; echo 3 > /proc/sys/vm/drop_caches ; sync'

on your dev machine. It is a safe, nondestructive way to make sure your caches are empty. (You will not lose any data by running the above command, even if you happen to save or write to disk at the exact same time. It really is safe.)

Then, make sure you don't have any Java stuff running, and re-run the above command just to be sure. You can check whether any Java is running with, for example,

ps axu | sed -ne '/ sed -ne /d; /java/p'

It should output nothing. If it does, close your Java stuff first.

Now, re-run your application test. Does the same slowdown now occur on your dev machine, too?

If you care to leave the comment either way, Dmitry, I'd be happy to explore the issue further.

Edited to add: I suspect that the slowdown does occur, and is due to the large startup latency incurred by Java itself. It is a very common issue, and basically built-in to Java, a result of its architecture. For larger applications, startup latency is often a significant fraction of a second, no matter how fast the machine, simply because Java has to load and prepare the classes (mostly serially, too, so adding cores will not help).

In other words, I believe the blame should fall on Java, not Linux; quite the opposite, since Linux manages to mitigate the latency on your development machine through kernel-level caching -- and that only because you keep running those Java components practically all the time, so the kernel knows to cache them.

Edit 2: It would be very useful to see which files your Java environment accesses when your application is started. You can do this with strace:

strace -f -o trace.log -q -tt -T -e trace=open COMMAND...

which creates file trace.log containing the open() syscalls done by any of the processes started by COMMAND.... To save the output to trace.PID for each process the COMMAND... starts, use

strace -f -o trace -ff -q -tt -T -e trace=open COMMAND...

Comparing the outputs on your dev and prod installations will tell you if they are truly equivalent. One of them may have extra or missing libraries, affecting the startup time.

In case an installation is old and system partition is reasonably full, it is possible that those files have been fragmented, causing the kernel to spend more time waiting for I/O to complete. (Note that the amount of I/O stays the same; only the time it takes to complete will increase if the files are fragmented.) You can use command

LANG=C LC_ALL=C sed -ne 's|^[^"]* open("\(.*\)", O[^"]*$|\1|p' trace.* \
| LANG=C LC_ALL=C xargs -r -d '\n' filefrag \
| LANG=C LC_ALL=C awk '(NF > 3 && $NF == "found") { n[$(NF-2)]++ }
  END { for (i in n) printf "%d extents %d files\n", i, n[i] }' \
| sort -g

to check how fragmented the files used by your application are; it reports how many files use one extent, how many use two, and so on. Note that it does not include the original executable (COMMAND...), only the files it accesses.

If you just want to get the fragmentation statistics for files accessed by a single command, you can use

LANG=C LC_ALL=C strace -f -q -tt -T -e trace=open COMMAND... 2>&1 \
| LANG=C LC_ALL=C sed -ne 's|^[0-9:.]* open("\(.*\)", O[^"]*$|\1|p' \
| LANG=C LC_ALL=C xargs -r filefrag \
| LANG=C LC_ALL=C awk '(NF > 3 && $NF == "found") { n[$(NF-2)]++ }
  END { for (i in n) printf "%d extents %d files\n", i, n[i] }' \
| sort -g

If the problem is not due to caching, then I think it is most likely that the two installations are not truly equivalent. If they are, then I'd check the fragmentation. After that, I'd do a full trace (omitting the -e trace=open) on both environments to see where exactly the differences are.


I do believe I now understand your problem/situation.

On your prod environment, kernel page cache is mostly dirty, i.e. most cached stuff is stuff that is going to be written to disk.

When your application allocates new pages, the kernel only sets up the page mappings, it does not actually give physical RAM immediately. That only happens on the first access to each page.

On the first access, the kernel first locates a free page -- typically, a page that contains "clean" cached data, i.e. something read from the disk but not modified. Then, it clears it to zeros, to avoid information leaks between processes. (When using the C library allocation facilities like malloc() etc., instead of the direct mmap() family of functions, the library may use/reuse parts of the mapping. Although the kernel does clear the pages to zeros, the library may "dirty" them. Using mmap() to get anonymous pages you do get them zeroed out.)

If the kernel does not have suitable clean pages at hand, it must flush some of the oldest dirty pages to disk first. (There are processes inside the kernel that flush pages to disk and mark them clean, but if the server load is such that pages are continuously dirtied, it is usually desirable to have mostly dirty pages instead of mostly clean pages -- the server gets more work done that way. Unfortunately, it also means an increase in the first page access latency, which is what you have now encountered.)

Each page is sysconf(_SC_PAGESIZE) bytes long, aligned. In other words, pointer p points to the start of a page if and only if ((long)p % sysconf(_SC_PAGESIZE)) == 0. Most kernels, I believe, do actually populate groups of pages in most cases instead of individual pages, thus increasing the latency of the first access (to each group of pages).
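
To see this first-touch cost in isolation, here is a minimal sketch (an illustration, not your original test): it maps a 256 MB anonymous region, touches one byte per sysconf(_SC_PAGESIZE)-sized page, and times a first and a second pass; the first pass pays for the kernel populating the pages, the second does not:

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>

static double wall_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1000000000.0;
}

int main(void)
{
    const size_t page = (size_t)sysconf(_SC_PAGESIZE);
    const size_t size = 256UL * 1024UL * 1024UL; /* 256 MB, as in the question */
    unsigned char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    for (int pass = 1; pass <= 2; pass++) {
        const double started = wall_now();
        for (size_t off = 0; off < size; off += page)
            buf[off] = (unsigned char)pass; /* touch one byte in every page */
        printf("pass %d: %.3f s\n", pass, wall_now() - started);
    }

    munmap(buf, size);
    return 0;
}

(Compile with something like gcc -std=gnu99 -O0; on older glibc, also link with -lrt for clock_gettime().)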

Finally, there may be some compiler optimization that plays havoc with your benchmarking. I recommend you write a separate source file for a benchmarking main(), and the actual work done on each iteration in a separate file. Compile them separately, and just link them together, to make sure the compiler does not rearrange the time functions wrt. the actual work done. Basically, in benchmark.c:

#define _POSIX_C_SOURCE 200809L
#include <time.h>
#include <stdio.h>

/* in work.c, adjust as needed */
void work_init(void);      /* Optional, allocations etc. */
void work(long iteration); /* Completely up to you, including parameters */
void work_done(void);      /* Optional, deallocations etc. */

#define PRIMING    0
#define REPEATS  100

int main(void)
{
    double          wall_seconds[REPEATS];
    struct timespec wall_start, wall_stop;
    long            iteration;

    work_init();

    /* Priming: do you want caches hot? */
    for (iteration = 0L; iteration < PRIMING; iteration++)
        work(iteration);

    /* Timed iterations */
    for (iteration = 0L; iteration < REPEATS; iteration++) {
        clock_gettime(CLOCK_REALTIME, &wall_start);
        work(iteration);
        clock_gettime(CLOCK_REALTIME, &wall_stop);
        wall_seconds[iteration] = (double)(wall_stop.tv_sec - wall_start.tv_sec)
                                + (double)(wall_stop.tv_nsec - wall_start.tv_nsec) / 1000000000.0;
    }

    work_done();

    /* TODO: wall_seconds[0] is the first iteration.
     *       Comparing to successive iterations (assuming REPEATS > 0)
     *       tells you about the initial latency.
    */

    /* TODO: Sort wall_seconds, for easier statistics.
     *       Most reliable value is the median, with half of the
     *       values larger and half smaller.
     *       Personally, I like to discard first and last 15.85%
     *       of the results, to get "one-sigma confidence" interval.
    */

    return 0;
}

with the actual array allocation, deallocation, and filling (per repeat loop) done in the work() functions defined in work.c.
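
For example, a minimal work.c matching those declarations might be just a sketch like this (assuming the per-iteration work is the same 256 MB allocate-and-fill as in the question; adjust it to whatever you actually want to measure):

#include <stdlib.h>

#define WORK_BYTES (256UL * 1024UL * 1024UL) /* 256 MB per iteration, as in the question */

void work_init(void)
{
    /* nothing to set up in this sketch */
}

void work(long iteration)
{
    const size_t count = WORK_BYTES / sizeof (int);
    size_t       j;
    int         *buffer;

    buffer = malloc(count * sizeof (int));
    if (!buffer)
        return;

    for (j = 0; j < count; j++)
        buffer[j] = (int)iteration; /* touch every page of the fresh allocation */

    free(buffer);
}

void work_done(void)
{
    /* nothing to tear down */
}

Compile the two files separately and link them together, e.g. gcc -c benchmark.c, then gcc -c work.c, then gcc benchmark.o work.o -o benchmark (older glibc also needs -lrt for clock_gettime()).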

Nominal Animal
  • We tried sync, but without dropping caches. Can't say it helped much. Will try to follow your suggestion; the situation with unflushed pages looks quite real. Also, it might not help, because after several runs of my test program above the FS cache goes down significantly, but the problem still remains. Again, good idea, will try. – Dmitry Sep 29 '12 at 12:20
  • I have straced my application; while Python uses mmap for memory allocation, in Java there was NO kernel-level call in the memory allocation loop at all. – Dmitry Sep 30 '12 at 12:08
  • Wow, I ran the command you suggested `sudo sh -c 'sync ; echo 3 > /proc/sys/vm/drop_caches ; sync'` and it seems it really helped. Will try again tomorrow to make sure. Also, I updated my test case to display more details on the slowness (will update the description in a few minutes). – Dmitry Sep 30 '12 at 12:09
  • I confirm that the suggested fix with sync and drop_caches fixes my problem. Also I managed to reproduce the same situation in my test environment. P.s. it is probably easier to use GCC -O0 to turn off optimization. Will update the post with more details later. – Dmitry Oct 02 '12 at 16:17
  • The solution is fine, but the reason is still unclear to me. Will leave the answer unaccepted for a while. – Dmitry Oct 03 '12 at 13:12
  • I drop caches only for benchmarking purposes. I'll add a new answer on how to deal with the original issue (lots of dirty caches) separately; hopefully it's clearer that way. – Nominal Animal Oct 03 '12 at 16:52

When the kernel runs out of available clean pages, it must flush dirty pages to disk. Flushing a lot of dirty pages to disk looks like a high CPU load, because most kernel-side operations require one or more pages (temporarily) to work. Essentially, the kernel is waiting for I/O to complete, even when the userspace application called a non-I/O-related kernel function.

If you run a microbenchmark in parallel, say a program that just continuously dirties a very large mapping over and over, and measures the CPU time (__builtin_ia32_rdtsc() if using GCC on x86 or x86-64) without calling any syscalls, you should see that this one gets plenty of CPU time even when the kernel is seemingly eating "all" of CPU time. Only when a process calls a kernel function (syscall) that internally requires some memory, will that call "block", stuck waiting in the kernel for the page flushing to yield new pages.
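
Such a microbenchmark could be sketched like this (an illustration only, assuming GCC on x86 or x86-64): it keeps dirtying one large anonymous mapping and reports the cycle count of each round via __builtin_ia32_rdtsc(), with the only syscall (the printf) outside the measured loop:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    const size_t size = 1024UL * 1024UL * 1024UL; /* 1 GB mapping; adjust to taste */
    unsigned char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    for (int round = 0; round < 16; round++) {
        const unsigned long long start = __builtin_ia32_rdtsc();
        for (size_t i = 0; i < size; i++)
            buf[i] = (unsigned char)round; /* dirty every byte, no syscalls here */
        const unsigned long long cycles = __builtin_ia32_rdtsc() - start;
        printf("round %2d: %llu cycles\n", round, cycles); /* the only syscall per round */
    }
    return 0;
}

(Compile with e.g. gcc -std=gnu99 -O2.)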

When running benchmarks, it is usually enough to just run sudo sh -c 'sync ; echo 3 >/proc/sys/vm/drop_caches ; sync' a couple of times before running the benchmark, to make sure there should be no undue memory pressure during the benchmark. I never use it in a production environment. (While it is safe to run, i.e. does not lose data, it is like killing mosquitoes using a sledgehammer: the wrong tool for the job.)

When you find in a production environment that your latencies start to grow too large due to kernel flushing dirty pages -- which it does at maximum device speed I believe, possibly causing a hiccup in application I/O speeds too --, you can tune the kernel dirty page flushing mechanism. Basically, you can tell the kernel to flush dirty pages much sooner to disk, and make sure there won't be that many dirty pages at any point in time (if possible).

Gregory Smith has written about the theory and tuning of the flushing mechanism here. In short, /proc/sys/vm/ contains kernel tunables you can modify. They are reset to defaults at boot, but you can easily write a simple init script to echo the desired values to the files at boot. If the processes running on the production machine do heavy I/O, you might also look at the filesystem tunables. At minimum, you should mount your filesystems (see /etc/fstab) using the relatime flag, so that file access times are only updated for the first access after the file has been modified or its status changed.
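
To see your current settings, a small read-only sketch like this will print them (an illustration; these are the usual writeback tunables, and lowering the ratios makes the kernel start flushing dirty pages sooner):

#include <stdio.h>

int main(void)
{
    /* The dirty-page writeback tunables under /proc/sys/vm/ */
    const char *tunables[] = {
        "/proc/sys/vm/dirty_background_ratio",
        "/proc/sys/vm/dirty_ratio",
        "/proc/sys/vm/dirty_expire_centisecs",
        "/proc/sys/vm/dirty_writeback_centisecs",
    };
    int i;

    for (i = 0; i < (int)(sizeof tunables / sizeof tunables[0]); i++) {
        char  value[64];
        FILE *f = fopen(tunables[i], "r");
        if (f == NULL)
            continue; /* not all kernels expose every tunable */
        if (fgets(value, sizeof value, f) != NULL)
            printf("%-40s %s", tunables[i], value); /* value includes its own newline */
        fclose(f);
    }
    return 0;
}

Setting them at boot is then just a matter of echoing the desired values into the same files from an init script, as described above.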

Personally, I also use a low-latency pre-emptible kernel with 1000 Hz timer for multimedia workstations (and for multimedia servers if I had any right now). Such kernels run user processes in shorter slices, and usually provide much better latencies, although maximum computation capability is slightly less. If your production services are latency-sensitive, I do recommend switching your production servers to such kernels.

Many distributions do provide such kernels already, but I find it much simpler to recompile distro kernels instead, or even switch to kernel.org kernels. The procedure is simple: you need kernel development and tools installed (on Debian variants, make-kpkg is VERY useful). To upgrade a kernel, you obtain new sources, configure the kernel (usually using current configuration as the basis -- make oldconfig), build the new kernel, and install the packages, before rebooting. Most people do find that just upgrading the hardware is more cost effective than recompiling distro kernels, but I find recompiling kernels very effortless myself. I don't automatically reboot for kernel upgrades anyway, so adding a simple step (triggered by running a single script) prior to rebooting is not too much effort for me.

Nominal Animal
  • You're very welcome. I do not care whether any of my answers are accepted, but I'm very interested to find out which solution gives you the best results. There is yet another approach, dynamically adjusting process CPU and I/O priorities (`nice` and `ionice`), which can be *very* useful for service management, but I don't think they can help much with latencies. I for one will appreciate it very much, Dmitry, if you keep us informed of your results. :) – Nominal Animal Oct 04 '12 at 16:01