I have a problem with high CPU consumption by the Linux kernel while bootstrapping my Java applications on a server. This problem occurs only in production; on dev servers everything is lightning-fast.
upd9: There were two questions about this issue:

1. How to fix it? Nominal Animal suggested to sync and drop everything, and it really helps:
sudo sh -c 'sync ; echo 3 > /proc/sys/vm/drop_caches'
Works. upd12: But indeed sync alone is enough.
2. Why is this happening? It is still open for me. I do understand that flushing dirty pages to disk consumes kernel CPU and IO time, and that is normal. But what is strange: why does even a single-threaded application written in C load ALL cores to 100% in kernel space?
Based on upd10 and upd11, I have an idea that echo 3 > /proc/sys/vm/drop_caches is not required to fix my problem with slow memory allocation. It should be enough to run `sync` before starting the memory-consuming application. I will probably try this tomorrow in production and post the results here.
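To make that idea testable in code, here is a minimal sketch of mine (not part of the original tests) of the same workaround done programmatically: call sync(2) before the memory-hungry part starts touching its allocation. Note that POSIX allows sync() to return before writeback completes, though on Linux it waits.

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    sync(); // flush dirty pages to disk before allocating, like running `sync` from the shell

    size_t size = 256UL * 1024 * 1024;
    char* buffer = malloc(size); // the memory-consuming part starts here
    if (buffer == NULL)
        return 1;
    memset(buffer, 0, size); // first touch of every page should now be cheap
    free(buffer);
    return 0;
}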
upd10: Loss of FS cache pages case (a small sketch of the /proc/meminfo check follows after this list):

- I executed cat 10GB.file > /dev/null, then sync to be sure there were no dirty pages (cat /proc/meminfo |grep ^Dirty displayed 184 kB).
- Checking cat /proc/meminfo |grep ^Cached I got: 4 GB cached.
- Running my C test program, I got normal performance (like 50 ms to initialize 32 MB of allocated data).
- Cached memory reduced to 900 MB.
- Test summary: I think it is no problem for Linux to reclaim pages used as FS cache into allocated memory.
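For reference, the same /proc/meminfo check can be scripted; this is a small sketch of mine (not from the original tests) that prints just the Dirty and Cached lines:

#include <stdio.h>
#include <string.h>

int main(void) {
    FILE* f = fopen("/proc/meminfo", "r");
    if (f == NULL)
        return 1;
    char line[256];
    while (fgets(line, sizeof(line), f) != NULL) {
        // equivalent to: cat /proc/meminfo | grep -E '^(Dirty|Cached):'
        if (strncmp(line, "Dirty:", 6) == 0 || strncmp(line, "Cached:", 7) == 0)
            fputs(line, stdout); // values are reported in kB
    }
    fclose(f);
    return 0;
}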
upd11: LOTS of dirty pages case.

- I ran my HowMongoDdWorks example with the read part commented out, and after some time /proc/meminfo said that 2.8 GB was Dirty and 3.6 GB was Cached.
- I stopped HowMongoDdWorks and ran my C test program. Here is part of the result:

init 15, time 0.00s
x [try 1/part 0] time 1.11s
x [try 2/part 0] time 0.04s
x [try 1/part 1] time 1.04s
x [try 2/part 1] time 0.05s
x [try 1/part 2] time 0.42s
x [try 2/part 2] time 0.04s

- Summary by test: lots of dirty pages significantly slow down the first access to allocated memory (to be fair, this starts to happen only when total application memory becomes comparable to whole OS memory, i.e. if you have 8 of 16 GB free, it is no problem to allocate 1 GB; the slowdown starts from 3 GB or so).
Now I have managed to reproduce this situation in my dev environment, so here are the new details.
Dev machine configuration:
- Linux 2.6.32-220.13.1.el6.x86_64
- Scientific Linux release 6.1 (Carbon)
- RAM: 15.55 GB
- CPU: 1 X Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz (4 threads) (physical)
I'm 99.9% sure the problem is caused by a large amount of dirty pages in the FS cache. Here is the application which creates lots of dirty pages:
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

/**
 * @author dmitry.mamonov
 *         Created: 10/2/12 2:53 PM
 */
public class HowMongoDdWorks {
    public static void main(String[] args) throws IOException {
        final long length = 10L * 1024L * 1024L * 1024L; // 10 GB file
        final int pageSize = 4 * 1024;
        final int lengthPages = (int) (length / pageSize);
        final byte[] buffer = new byte[pageSize];
        final Random random = new Random();
        System.out.println("Init file");
        final RandomAccessFile raf = new RandomAccessFile("random.file", "rw");
        raf.setLength(length);
        int written = 0;
        int read = 0;
        System.out.println("Test started");
        while (true) {
            { // write one random page at a random location
                random.nextBytes(buffer);
                final long randomPageLocation = (long) random.nextInt(lengthPages) * (long) pageSize;
                raf.seek(randomPageLocation);
                raf.write(buffer);
                written++;
            }
            { // read one page back from another random location
                final long randomPageLocation = (long) random.nextInt(lengthPages) * (long) pageSize;
                raf.seek(randomPageLocation);
                raf.read(buffer);
                read++;
            }
            if (written % 1024 == 0 || read % 1024 == 0) {
                System.out.printf("W %10d R %10d pages%n", written, read);
            }
        }
    }
}
And here is the test application, which causes high CPU load in kernel space (up to 100% on all cores; it is the same code as below, but I will copy it once again).
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char** argv) {
    clock_t last = clock(); // remember the time
    for (int i = 0; i < 16; i++) { // repeat the test several times
        int size = 256 * 1024 * 1024;
        int size4 = size / 4;
        int* buffer = malloc(size); // allocate 256MB of memory (intentionally never freed, so total usage grows)
        for (int k = 0; k < 2; k++) { // initialize the allocated memory twice
            for (int j = 0; j < size4; j++) {
                buffer[j] = k; // memory initialization (if I skip this step my test ends in 0.000s)
            }
            // print statistics
            printf("x [%d] %.2f\n", k + 1, (clock() - last) / (double)CLOCKS_PER_SEC);
            last = clock();
        }
    }
    return 0;
}
While the previous HowMongoDdWorks program is running, the C test application will show results like this:
x [1] 0.23
x [2] 0.19
x [1] 0.24
x [2] 0.19
x [1] 1.30 -- first initialization takes significantly longer
x [2] 0.19 -- than the second one (6x slower)
x [1] 10.94 -- and sometimes it is 50x slower!!!
x [2] 0.19
x [1] 1.10
x [2] 0.21
x [1] 1.52
x [2] 0.19
x [1] 0.94
x [2] 0.21
x [1] 2.36
x [2] 0.20
x [1] 3.20
x [2] 0.20 -- and the results are totally unstable
...
I keep everything below this line only for historical purposes.
upd1: both dev and production systems are big enough for this test. upd7: it is not paging; at least I have not seen any storage IO activity during the problem time.
- dev: ~4 cores, 16 GB RAM, ~8 GB free
- production: ~12 cores, 24 GB RAM, ~16 GB free (8 to 10 GB is under FS cache, but it makes no difference; same results even if all 16 GB is totally free); this machine also has some CPU load, but not too high, ~10%.
upd8 (ref): see the new test case and a potential explanation at the tail.
Here is my test case (I also tested Java and Python, but "C" should be the clearest):
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char** argv) {
    clock_t last = clock(); // remember the time
    for (int i = 0; i < 16; i++) { // repeat the test several times
        int size = 256 * 1024 * 1024;
        int size4 = size / 4;
        int* buffer = malloc(size); // allocate 256MB of memory (intentionally never freed, so total usage grows)
        for (int k = 0; k < 2; k++) { // initialize the allocated memory twice
            for (int j = 0; j < size4; j++) {
                buffer[j] = k; // memory initialization (if I skip this step my test ends in 0.000s)
            }
            // print statistics
            printf("x [%d] %.2f\n", k + 1, (clock() - last) / (double)CLOCKS_PER_SEC);
            last = clock();
        }
    }
    return 0;
}
The output on dev machine (partial):
x [1] 0.13 -- first initialization takes a bit longer
x [2] 0.12 -- than the second one, but the difference is not significant.
x [1] 0.13
x [2] 0.12
x [1] 0.15
x [2] 0.11
x [1] 0.14
x [2] 0.12
x [1] 0.14
x [2] 0.12
x [1] 0.13
x [2] 0.12
x [1] 0.14
x [2] 0.11
x [1] 0.14
x [2] 0.12 -- and the results are quite stable
...
The output on production machine (partial):
x [1] 0.23
x [2] 0.19
x [1] 0.24
x [2] 0.19
x [1] 1.30 -- first initialization takes significantly longer
x [2] 0.19 -- than the second one (6x slower)
x [1] 10.94 -- and sometimes it is 50x slower!!!
x [2] 0.19
x [1] 1.10
x [2] 0.21
x [1] 1.52
x [2] 0.19
x [1] 0.94
x [2] 0.21
x [1] 2.36
x [2] 0.20
x [1] 3.20
x [2] 0.20 -- and the results are totally unstable
...
While running this test on the development machine, CPU usage barely rises from the ground: all cores show less than 5% usage in htop.
But running this test on the production machine I see up to 100% CPU usage on ALL cores (on average the load rises up to 50% on the 12-core machine), and it is all kernel time.
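One way to confirm the "all kernel time" observation from inside the test itself is to split user vs. system CPU time with getrusage(2) around the initialization. This is my own sketch; the original tests used clock(3) only:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

int main(void) {
    size_t size = 256UL * 1024 * 1024;
    char* buffer = malloc(size);
    if (buffer == NULL)
        return 1;

    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    memset(buffer, 1, size); // first touch: the page faults (and any reclaim work) land here
    getrusage(RUSAGE_SELF, &after);

    double user = (after.ru_utime.tv_sec - before.ru_utime.tv_sec)
                + (after.ru_utime.tv_usec - before.ru_utime.tv_usec) / 1e6;
    double sys  = (after.ru_stime.tv_sec - before.ru_stime.tv_sec)
                + (after.ru_stime.tv_usec - before.ru_stime.tv_usec) / 1e6;
    printf("user %.2fs, system %.2fs\n", user, sys); // system time dominating would match the htop picture
    free(buffer);
    return 0;
}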
upd2: all machines have the same CentOS Linux 2.6 kernel installed; I work with them over ssh.
upd3: A: It's unlikely to be swapping; I haven't seen any disk activity during my test, and plenty of RAM is also free. (Also, the description is updated.)
upd4: htop shows high CPU utilisation by the kernel, up to 100% utilisation on all cores (on prod).
upd5: does CPU utilization settle down after initialization completes? In my simple test, yes. For the real application, the only thing that helps is stopping everything else in order to start a new program (which is nonsense).
I have two questions:
Why is this happening?
How to fix it?
upd8: Improved test and explanation.
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char** argv) {
    const int partition = 8;
    clock_t last = clock();
    for (int i = 0; i < 16; i++) {
        int size = 256 * 1024 * 1024;
        int size4 = size / 4;
        int* buffer = malloc(size); // intentionally never freed, so total usage grows
        buffer[0] = 123; // touch the first page right after allocation
        printf("init %d, time %.2fs\n", i, (clock() - last) / (double)CLOCKS_PER_SEC);
        last = clock();
        for (int p = 0; p < partition; p++) { // fill the buffer part by part
            for (int k = 0; k < 2; k++) { // fill each part twice
                for (int j = p * size4 / partition; j < (p + 1) * size4 / partition; j++) {
                    buffer[j] = k;
                }
                printf("x [try %d/part %d] time %.2fs\n", k + 1, p, (clock() - last) / (double)CLOCKS_PER_SEC);
                last = clock();
            }
        }
    }
    return 0;
}
And the result looks like this:
init 15, time 0.00s -- the malloc call takes no time.
x [try 1/part 0] time 0.07s -- usually the first try to fill a buffer part with values is fast enough.
x [try 2/part 0] time 0.04s -- the second try to fill a buffer part with values is always fast.
x [try 1/part 1] time 0.17s
x [try 2/part 1] time 0.05s -- second try...
x [try 1/part 2] time 0.07s
x [try 2/part 2] time 0.05s -- second try...
x [try 1/part 3] time 0.07s
x [try 2/part 3] time 0.04s -- second try...
x [try 1/part 4] time 0.08s
x [try 2/part 4] time 0.04s -- second try...
x [try 1/part 5] time 0.39s -- BUT sometimes it takes significantly longer than average to fill a part of the allocated buffer with values.
x [try 2/part 5] time 0.05s -- second try...
x [try 1/part 6] time 0.35s
x [try 2/part 6] time 0.05s -- second try...
x [try 1/part 7] time 0.16s
x [try 2/part 7] time 0.04s -- second try...
Facts I learned from this test:
- Memory allocation itself is fast.
- First access to allocated memory is fast (so it is not a lazy buffer allocation problem).
- I split the allocated buffer into parts (8 in the test).
- I fill each buffer part first with value 0, then with value 1, printing the consumed time.
- The second fill of a buffer part is always fast.
- BUT the first fill of a buffer part is always a bit slower than the second fill (I believe some extra work is done by the kernel on first page access; see the pre-faulting sketch after this list).
- Sometimes it takes SIGNIFICANTLY longer to fill a buffer part with values the first time.
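If the first-touch page faults are where the kernel work happens, one possible mitigation (my own suggestion, not something tested in the original post) is to ask the kernel to pre-fault all pages at allocation time, for example with the Linux-specific MAP_POPULATE flag, so the cost is paid once up front instead of inside the timed loop:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t size = 256UL * 1024 * 1024;
    // MAP_POPULATE pre-faults the whole mapping, so later writes avoid first-touch faults
    int* buffer = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (buffer == MAP_FAILED)
        return 1;
    buffer[0] = 123; // the page backing this write is already resident
    munmap(buffer, size);
    return 0;
}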
I tried the suggested answer and it seems to have helped. I will recheck and post the results again later.
It looks like Linux maps allocated pages onto dirty file system cache pages, and it takes a lot of time to flush those pages to disk one by one. But a full sync works fast and eliminates the problem.
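For completeness, here is a minimal sketch of mine (assuming root privileges) of the full workaround from upd9 done in code: sync first, then write "3" to /proc/sys/vm/drop_caches, equivalent to sudo sh -c 'sync ; echo 3 > /proc/sys/vm/drop_caches'. Per upd12, the sync step alone may already be enough.

#include <stdio.h>
#include <unistd.h>

int main(void) {
    sync(); // flush dirty pages first, otherwise drop_caches cannot discard them

    FILE* f = fopen("/proc/sys/vm/drop_caches", "w");
    if (f == NULL) {
        perror("open /proc/sys/vm/drop_caches (root required)");
        return 1;
    }
    fputs("3\n", f); // 3 = drop page cache plus dentries and inodes
    fclose(f);
    return 0;
}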