We have very strange problem on our Web-project.
We use:
2 Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
12 GB memory
We have about 20 hits per seconds. 4-5 requests per second are heavy – it is a search requests.
We use nginx + php-fpm (5.3.22)
MySQL server installed on another machine.
Most of time we have load average less than 10 and cpu usage about 50%
Sometimes we get cpu usage about 95% and after that load average grows to 50 and more!!!
You can see Load Average and CPU Usage here (my reputation low to send images here)
Load Average
CPU Usage
We have to reload php-fpm ( /etc/init.d/php-fpm reload) to normalize situation.
This can happens 4-5 times per day.
I tried to use strace to exam this situation.
Sorry for long logs! This output of command strace -cp PID
PID – is the random php-fpm process id (We start 100 php-fpm processes).
This two results in the moment with high cpu usage.
Process 17272 attached - interrupt to quit
Process 17272 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
65.56 0.008817 267 33 munmap
13.38 0.001799 900 2 clone
9.66 0.001299 2 589 read
7.43 0.000999 125 8 mremap
2.84 0.000382 1 559 96 access
0.59 0.000080 40 2 waitpid
0.29 0.000039 0 627 gettimeofday
0.16 0.000022 0 346 write
0.04 0.000006 0 56 getcwd
0.04 0.000005 0 348 poll
0.00 0.000000 0 55 open
0.00 0.000000 0 69 close
0.00 0.000000 0 17 chdir
0.00 0.000000 0 189 time
0.00 0.000000 0 28 lseek
0.00 0.000000 0 2 pipe
0.00 0.000000 0 17 times
0.00 0.000000 0 8 brk
0.00 0.000000 0 8 getrusage
0.00 0.000000 0 18 setitimer
0.00 0.000000 0 8 flock
0.00 0.000000 0 1 nanosleep
0.00 0.000000 0 11 rt_sigaction
0.00 0.000000 0 13 rt_sigprocmask
0.00 0.000000 0 6 pread64
0.00 0.000000 0 7 pwrite64
0.00 0.000000 0 33 mmap2
0.00 0.000000 0 18 4 stat64
0.00 0.000000 0 34 lstat64
0.00 0.000000 0 92 fstat64
0.00 0.000000 0 63 fcntl64
0.00 0.000000 0 53 clock_gettime
0.00 0.000000 0 1 socket
0.00 0.000000 0 1 1 connect
0.00 0.000000 0 9 accept
0.00 0.000000 0 1 send
0.00 0.000000 0 21 recv
0.00 0.000000 0 9 1 shutdown
0.00 0.000000 0 1 getsockopt
------ ----------- ----------- --------- --------- ----------------
100.00 0.013448 3363 102 total
[root@hp-php ~]# strace -cp 30767
Process 30767 attached - interrupt to quit
Process 30767 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
52.88 0.016926 220 77 munmap
29.06 0.009301 2 4343 read
8.73 0.002794 466 6 clone
3.59 0.001149 0 5598 time
3.18 0.001017 0 3745 write
1.12 0.000358 0 7316 gettimeofday
0.64 0.000205 1 164 fcntl64
0.39 0.000124 21 6 waitpid
0.22 0.000070 0 1496 326 access
0.13 0.000041 0 3769 poll
0.03 0.000009 0 151 close
0.02 0.000008 0 114 clock_gettime
0.02 0.000007 0 110 getcwd
0.00 0.000000 0 112 open
0.00 0.000000 0 38 chdir
0.00 0.000000 0 47 lseek
0.00 0.000000 0 6 pipe
0.00 0.000000 0 38 times
0.00 0.000000 0 135 brk
0.00 0.000000 0 3 ioctl
0.00 0.000000 0 14 getrusage
0.00 0.000000 0 38 setitimer
0.00 0.000000 0 19 flock
0.00 0.000000 0 40 mlock
0.00 0.000000 0 40 munlock
0.00 0.000000 0 6 nanosleep
0.00 0.000000 0 27 rt_sigaction
0.00 0.000000 0 31 rt_sigprocmask
0.00 0.000000 0 13 pread64
0.00 0.000000 0 18 pwrite64
0.00 0.000000 0 78 mmap2
0.00 0.000000 0 111 10 stat64
0.00 0.000000 0 49 lstat64
0.00 0.000000 0 182 fstat64
0.00 0.000000 0 8 socket
0.00 0.000000 0 8 5 connect
0.00 0.000000 0 19 accept
0.00 0.000000 0 7 send
0.00 0.000000 0 66 recv
0.00 0.000000 0 3 recvfrom
0.00 0.000000 0 20 1 shutdown
0.00 0.000000 0 5 setsockopt
0.00 0.000000 0 4 getsockopt
------ ----------- ----------- --------- --------- ----------------
100.00 0.032009 28080 342 total
Yes, out scripts reads much information. This is normal.
But why munmap works very long??!! And when we have problem munmap ALWAYS in top!
For comparison this is result of trace random php-fpm process in regular situation:
Process 28606 attached - interrupt to quit
Process 28606 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
45.72 0.001816 1 2601 read
32.88 0.001306 435 3 clone
9.19 0.000365 0 2175 write
6.95 0.000276 0 7521 time
2.24 0.000089 0 4158 gettimeofday
2.01 0.000080 1 114 brk
0.28 0.000011 0 2166 poll
0.20 0.000008 0 833 155 access
0.20 0.000008 0 53 recv
0.18 0.000007 2 3 waitpid
0.15 0.000006 0 18 munlock
0.00 0.000000 0 69 open
0.00 0.000000 0 96 close
0.00 0.000000 0 29 chdir
0.00 0.000000 0 36 lseek
0.00 0.000000 0 3 pipe
0.00 0.000000 0 29 times
0.00 0.000000 0 10 getrusage
0.00 0.000000 0 5 munmap
0.00 0.000000 0 1 ftruncate
0.00 0.000000 0 29 setitimer
0.00 0.000000 0 1 sigreturn
0.00 0.000000 0 11 flock
0.00 0.000000 0 18 mlock
0.00 0.000000 0 5 nanosleep
0.00 0.000000 0 19 rt_sigaction
0.00 0.000000 0 24 rt_sigprocmask
0.00 0.000000 0 6 pread64
0.00 0.000000 0 12 pwrite64
0.00 0.000000 0 69 getcwd
0.00 0.000000 0 5 mmap2
0.00 0.000000 0 35 7 stat64
0.00 0.000000 0 41 lstat64
0.00 0.000000 0 96 fstat64
0.00 0.000000 0 108 fcntl64
0.00 0.000000 0 87 clock_gettime
0.00 0.000000 0 5 socket
0.00 0.000000 0 4 4 connect
0.00 0.000000 0 16 2 accept
0.00 0.000000 0 8 send
0.00 0.000000 0 15 shutdown
0.00 0.000000 0 4 getsockopt
------ ----------- ----------- --------- --------- ----------------
100.00 0.003972 20541 168 total
Process 29168 attached - interrupt to quit
Process 29168 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
54.81 0.002366 1 1717 read
26.41 0.001140 1 1696 poll
8.29 0.000358 0 1662 write
7.37 0.000318 2 131 121 stat64
1.53 0.000066 0 3249 gettimeofday
1.18 0.000051 0 746 525 access
0.23 0.000010 0 27 fcntl64
0.19 0.000008 0 62 brk
0.00 0.000000 0 1 restart_syscall
0.00 0.000000 0 7 open
0.00 0.000000 0 16 close
0.00 0.000000 0 3 chdir
0.00 0.000000 0 1039 time
0.00 0.000000 0 1 lseek
0.00 0.000000 0 3 times
0.00 0.000000 0 3 ioctl
0.00 0.000000 0 1 getrusage
0.00 0.000000 0 4 munmap
0.00 0.000000 0 3 setitimer
0.00 0.000000 0 1 sigreturn
0.00 0.000000 0 1 flock
0.00 0.000000 0 1 rt_sigaction
0.00 0.000000 0 1 rt_sigprocmask
0.00 0.000000 0 2 pwrite64
0.00 0.000000 0 3 getcwd
0.00 0.000000 0 4 mmap2
0.00 0.000000 0 7 fstat64
0.00 0.000000 0 9 clock_gettime
0.00 0.000000 0 6 socket
0.00 0.000000 0 5 1 connect
0.00 0.000000 0 3 2 accept
0.00 0.000000 0 5 send
0.00 0.000000 0 64 recv
0.00 0.000000 0 3 recvfrom
0.00 0.000000 0 2 shutdown
0.00 0.000000 0 1 getsockopt
------ ----------- ----------- --------- --------- ----------------
100.00 0.004317 10489 649 total
And you can see that munmap not in top.
Now we don’t have ideas how to solve this problem :(
We examined next potential problems and answers are "NO":
- additioan user activity
- long scripts execution (several seconds)
- using swap
Can you help us?