
I have a database server that uses a Samsung 840 Pro SSD. The load is constantly higher than usual, even when there is not much activity on the website, so I suspect the disk has worn out. But how can I check whether disk I/O is the bottleneck?

Here are some snapshots that might be relevant:

 top - 03:02:11 up 766 days, 20:45,  1 user,  load average: 7.42, 6.89, 6.72
Tasks: 325 total,   1 running, 321 sleeping,   3 stopped,   0 zombie
%Cpu(s): 17.3 us,  0.4 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem:  13227468+total, 27130284 used, 10514440+free,    94308 buffers
KiB Swap:  3906556 total,     9136 used,  3897420 free.  3833216 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                         
21764 mysql     20   0 27.058g 0.021t  12164 S 576.0 17.2  17369,44 mysqld                                                          
  574 root      20   0       0      0      0 S   0.3  0.0 280:00.66 jbd2/sda1-8                                                     
 5585 root      20   0       0      0      0 S   0.3  0.0   0:08.04 kworker/18:0                                                    
    1 root      20   0   28692   4540   2964 S   0.0  0.0  42:51.98 systemd                                                         
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.50 kthreadd                                                        
    3 root      20   0       0      0      0 S   0.0  0.0 894:44.38 ksoftirqd/0                                                     
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H                                                    
    6 root      20   0       0      0      0 S   0.0  0.0  21:07.91 kworker/u64:0                                                   
    8 root      20   0       0      0      0 S   0.0  0.0   2510:32 rcu_sched                                                       
    9 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh                                                          
   10 root      rt   0       0      0      0 S   0.0  0.0   5:28.52 migration/0                                                     
   11 root      rt   0       0      0      0 S   0.0  0.0   3:15.12 watchdog/0                                                      
   12 root      rt   0       0      0      0 S   0.0  0.0   3:27.27 watchdog/1                                                      
   13 root      rt   0       0      0      0 S   0.0  0.0   3:19.37 migration/1                                                     
   14 root      20   0       0      0      0 S   0.0  0.0 190:10.26 ksoftirqd/1                                                     
   16 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/1:0H                                                    
   17 root      rt   0       0      0      0 S   0.0  0.0   3:19.65 watchdog/2                                                      
   18 root      rt   0       0      0      0 S   0.0  0.0   2:52.44 migration/2                                                     
   19 root      20   0       0      0      0 S   0.0  0.0 194:18.02 ksoftirqd/2                                                     
   21 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/2:0H                                                    
   22 root      rt   0       0      0      0 S   0.0  0.0   3:21.4

iostat -m (the database is on sda and the Linux filesystem is on sdb):

Linux 3.16.0-4-amd64 (back)     03/27/20    _x86_64_    (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.76    0.00    0.38    0.07    0.00   92.79

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdd               0.03         0.00         0.01      43327     384521
sdc               0.08         0.00         0.01     166547     748630
sdb               0.37         0.00         0.02      78269    1076710
sda               8.46         0.00         0.11      54407    7463246
Milkyway
  • Is the stat additive on Linux? Because on Windows I would be more concerned about the extremely high (for a DB server) CPU usage combined with the extremely low (for a DB server) memory usage. – TomTom Mar 27 '20 at 07:13
  • Yes, it is on Linux, as is clear in the outputs. – Milkyway Mar 27 '20 at 07:14
  • And as per the answer, it is NOT - i.e. if the CPU is waiting for I/O, it does not add to CPU%. – TomTom Mar 27 '20 at 07:22
  • @TomTom correct. (It can, however, show up in the load figures, which just count the amount of work in the pipeline and do not look at the reason. A load of 20 could mean a server that is totally overloaded, or one that is quite responsive but has a lot of processes waiting on I/O - not uncommon on things like mail servers waiting on network I/O.) – davidgo Mar 27 '20 at 07:34
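
A quick way to see the effect davidgo describes above (processes waiting on I/O add to the load average without using CPU) is to count tasks in uninterruptible sleep; a rough sketch, assuming standard procps tools:

    # Tasks in state "D" (uninterruptible sleep) are usually blocked on I/O;
    # they count towards the load average but show ~0% CPU in top
    ps -eo state,pid,comm | awk '$1 ~ /^D/'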

1 Answer


Everything posted above implies that the CPU is the bottleneck, not the SSD. Most telling is the 576% CPU figure, which I take to mean about 5.76 CPU cores being consumed.

The problem appears to be related to MySQL - maybe a race condition, a complex query, a corrupted table, or bad indexing?

Were the problem the disk, I would have expected to see low CPU and high iowait, but %iowait is only 0.07.
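
To check this live, a rough sketch of what could be run (assuming the sysstat and smartmontools packages are installed; note that iostat without an interval argument, as in the snapshot above, reports averages since boot rather than current rates):

    # Per-device throughput, queue size and latency, refreshed every 5 seconds
    iostat -xm 5

    # Disk reads/writes per process, refreshed every 5 seconds
    pidstat -d 5

    # System-wide CPU breakdown; the "wa" column is the iowait percentage
    vmstat 1

    # SSD health and wear; on Samsung drives look at attributes such as
    # Wear_Leveling_Count (exact attribute names vary by model and firmware)
    smartctl -a /dev/sda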

davidgo
  • So if I understand correctly, the deciding factor is `%iowait`, right? If so, what is the acceptable range, beyond which one can point to disk I/O as the bottleneck? – Milkyway Mar 27 '20 at 07:20
  • Can't answer that with a simple number, as it's a continuum and depends on what else is in use, especially how multi-threaded the high-requirement tasks are. The iostat manual defines `%iowait` as "Show the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request." The higher the number, the more time the system is spending waiting for disk. – davidgo Mar 27 '20 at 07:27
  • If `%Cpu(s): wa` is constantly high, i.e. the processor is mostly waiting for I/O, then the high load averages are likely caused by the hard drive. – Esa Jokinen Mar 27 '20 at 14:56
  • That's not the case here. BTW, your uptime is very high, which usually means an unpatched system. This time it may be a reason for the unexpected state of mysqld. Wisdom: **Have you tried turning it off and on again?** – Esa Jokinen Mar 27 '20 at 14:58
  • @Milkyway With what I see in top and htop, how can you get to 27G of virtual RAM in use? There is less than 4G of swap available for your 13G machine. – Wilson Hauck Mar 27 '20 at 17:08
  • @Milkyway Before you turn it off and on again: at least try to close MySQL with an orderly shutdown, shut down your OS, then leave the power off for 30 seconds before powering up and restarting. 766 days of uptime is a long time. – Wilson Hauck Mar 27 '20 at 17:09
  • Additional information request. Post on pastebin.com and share the links. From your root SSH login, text results of: B) SHOW GLOBAL STATUS; (after a minimum of 24 hours of uptime), C) SHOW GLOBAL VARIABLES;, D) SHOW FULL PROCESSLIST;, and E) a complete MySQLTuner report. Optional but very helpful information, if available: htop or top for the most active apps, ulimit -a for the Linux/Unix list of limits, and iostat -xm 5 3 for IOPS by device and core/CPU count, for server workload tuning analysis to provide suggestions. – Wilson Hauck Mar 27 '20 at 17:12
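
For reference, the data asked for in the last comment could be gathered roughly like this (a sketch only: the output file names are arbitrary, mysql needs valid credentials, e.g. in ~/.my.cnf, SHOW FULL PROCESSLIST needs the PROCESS privilege, and the MySQLTuner report is omitted):

    # MySQL-side information (collect after at least 24 hours of server uptime)
    mysql -e 'SHOW GLOBAL STATUS;'    > global_status.txt
    mysql -e 'SHOW GLOBAL VARIABLES;' > global_variables.txt
    mysql -e 'SHOW FULL PROCESSLIST;' > processlist.txt

    # OS-side information
    ulimit -a      > ulimit.txt     # per-user resource limits
    iostat -xm 5 3 > iostat.txt     # 3 samples of per-device I/O, 5 seconds apart
    top -b -n 1    > top.txt        # one batch snapshot of the most active processes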