
I have multiple processes running simultaneously on the same box under CentOS 7, each one running as a separate Linux user.

I use Zabbix for monitoring.

Sometimes the following pattern appears on the CPU utilization graph.

[Image: CPU utilization graph]

If you zoom in, it looks like this:

[Image: zoomed-in CPU utilization graph]

So the server freezes for some time: even SSH login does not work, other processes do not behave as expected, and the Zabbix agent fails to send its data to the Zabbix server (which is located on a separate host).

As I understand from the Zabbix legend, the yellow part of the chart is iowait.

[Image: Zabbix graph legend]

So could you explain how the iowait of one process can affect the whole system so drastically?

And how can this behaviour be prevented or limited?

zavg

1 Answer


iowait is not a process; it is the time the CPU sits idle while things are waiting for IO to complete.
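For reference, the iowait figure that monitoring tools chart can be derived from the aggregate `cpu` line in `/proc/stat`. Below is a minimal sketch, assuming the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal, ...). The function names are made up for illustration and this is not how Zabbix itself is implemented.

```python
import time


def parse_cpu_line(stat_text):
    """Return the counters of the aggregate 'cpu' line of /proc/stat as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Share of CPU time (in percent) spent in iowait between two samples."""
    deltas = [new - old for old, new in zip(before, after)]
    # Only the first 8 fields are summed: guest and guest_nice (fields 9, 10)
    # are already accounted for inside user and nice.
    total = sum(deltas[:8])
    # Index 4 is the iowait counter.
    return 100.0 * deltas[4] / total if total else 0.0


def sample_iowait(interval=1.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart, and return iowait %."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return iowait_percent(before, after)
```

Calling `sample_iowait()` on the affected box during one of the spikes should give a number comparable to the yellow area in the graph.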

I would say you possibly have a hard disc there that is totally overloaded at those times. Like ridiculously overloaded - possibly by:

  • Extremely bad programming that does not buffer things in memory.
  • Extremely bad hardware selection (i.e. a hard disc where there simply is not enough IO budget and an SSD is needed).
  • Extremely faulty hardware (bad sectors on a HD) that makes it go into some sort of retry pattern that takes some time.
  • Standard usage. If you have a database that reorganizes indices, it will try to do so as fast as possible, and that can cause serious IO spikes regardless of what hardware you throw at it.
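One way to check whether a disc is overloaded is to look at how long requests take on average. The sketch below computes that from two snapshots of `/proc/diskstats`, using the same idea as the `await` column of `iostat`: time spent on reads and writes divided by the number of completed requests. It assumes the usual column layout (device name in column 3, completed reads and writes in columns 4 and 8, milliseconds spent in columns 7 and 11). The function names are hypothetical.

```python
def parse_diskstats(text):
    """Map device name -> completed IO count and milliseconds spent on IO."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or unexpected lines
        stats[fields[2]] = {
            "ios": int(fields[3]) + int(fields[7]),   # reads + writes completed
            "ms": int(fields[6]) + int(fields[10]),   # ms reading + ms writing
        }
    return stats


def average_wait_ms(before, after):
    """Average time per completed request, per device, between two snapshots."""
    result = {}
    for name, new in after.items():
        old = before.get(name)
        if old is None:
            continue  # device appeared between the snapshots
        ios = new["ios"] - old["ios"]
        if ios > 0:
            result[name] = (new["ms"] - old["ms"]) / ios
    return result


def read_diskstats(path="/proc/diskstats"):
    """Take one snapshot of the kernel's per-device counters."""
    with open(path) as f:
        return parse_diskstats(f.read())
```

Take one `read_diskstats()` snapshot, wait a few seconds, take another, and pass both to `average_wait_ms`. Values in the hundreds or thousands of milliseconds during a spike would point at the disc rather than the CPU.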

OBVIOUSLY it could also be some software bug in a driver, but given that this is a pro forum I would assume you have made sure to be current on updates.

You will have to start analyzing what is happening that causes the excessive IO, i.e. you have to look at the IO wait statistics of the individual processes, not the system totals.
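Tools such as `iotop` or `pidstat -d` give a per-process view. As a rough sketch of the same idea, the code below reads the `read_bytes` and `write_bytes` counters from `/proc/<pid>/io` and ranks processes by bytes moved between two snapshots. Note that this measures IO volume per process, not wait time, so it is only a proxy for finding the culprit. Reading other users' `io` files generally requires root, and the function names are made up for illustration.

```python
import os


def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def read_all_process_io(proc_root="/proc"):
    """Snapshot the IO counters of every process we are allowed to read."""
    usage = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "io")) as f:
                usage[int(entry)] = parse_proc_io(f.read())
        except OSError:
            continue  # process exited or permission denied
    return usage


def top_io(before, after, count=5):
    """Return (bytes_moved, pid) pairs for the busiest processes, largest first."""
    deltas = []
    for pid, new in after.items():
        old = before.get(pid)
        if old is None:
            continue  # process started between the snapshots
        moved = (new["read_bytes"] - old["read_bytes"]) + (
            new["write_bytes"] - old["write_bytes"]
        )
        deltas.append((moved, pid))
    return sorted(deltas, reverse=True)[:count]
```

Take two `read_all_process_io()` snapshots a few seconds apart while the spike is happening and feed them to `top_io` to see which PIDs are responsible for most of the disc traffic.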

Given that a LOT of things are doing IO - and often wait for it to complete - it is not surprising that a total IO overload causes all kinds of weird behavior.

Esa Jokinen
TomTom
  • Thank you so much for such a detailed answer. Just after posting the question and reading some docs I understood that excessive disk IO is the cause of the issue. However, thank you again for your contribution! – zavg Jan 04 '20 at 09:44
  • In my experience, I have seen discs so overloaded that the response time was measured in seconds (instead of single or low double digit - 12 or 14 - milliseconds). This KILLS performance on anything. Pre-SSD, IO is the hardest issue to scale - you end up with hundreds of hard discs before even coming close to the IO capacity of a low-end SSD. – TomTom Jan 04 '20 at 10:39
  • So there is a high chance that using an SSD instead of an HDD would eliminate the issue, what do you think? – zavg Jan 04 '20 at 11:04
  • Given that SSDs have 1000+ times the IO budget - yes, this should cover the issue. Make sure to have an SSD with an appropriate write budget - not a cheap model with very limited writes per day. THAT SAID: that may just cover up the problem (which may be extremely bad programming - been there, seen that, too), and in that case throwing hardware at the problem will NOT solve it mid term. Some people must learn what indices are ;) – TomTom Jan 04 '20 at 14:59