
This server runs several satellite imagery processing jobs. It has 256 GB of RAM, a 12 TB disk, and 64 CPU cores (Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz). It should not fail under this load, but sometimes it does. This is a screen capture of typical htop output:

[Screenshot: capture of a typical htop]

When the system fails, I can capture its last console messages through the IPMI remote console. The most recent capture is this:

[Screenshot: last console before crash]

With systemd failing to provide these services, the server becomes unusable and we cannot log in over SSH to fix it, so we have to hard reset it. What can we do to prevent this problem?

EDIT: The server has one 240 GB M.2 disk for the operating system, mounted at /, and the 12 TB disk mounted at /data. The system is ...

Linux tsom02 5.10.0-12-amd64 #1 SMP Debian 5.10.103-1 (2022-03-07) x86_64 GNU/Linux

The M.2 disk is partitioned with only 28 GB for /. Could that be the cause? Should I allocate more space to /?

The output of vmstat 5 5 is:

[Screenshot: output of vmstat 5 5]
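For anyone who wants the memory and swap figures as plain text rather than a screenshot, here is a minimal sketch that reads `/proc/meminfo`. The function names `parse_meminfo` and `summarize` are hypothetical helpers made up for illustration, and the sketch only reports numbers; it does not diagnose why the OOM killer fires.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers.

    Most fields are reported in kB; a few (such as HugePages_Total)
    are plain counts and are stored as-is.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def summarize(info):
    """Convert the main memory and swap fields from kB to GiB."""
    kb_per_gib = 1024 * 1024
    names = {
        "mem_total_gib": "MemTotal",
        "mem_available_gib": "MemAvailable",
        "swap_total_gib": "SwapTotal",
        "swap_free_gib": "SwapFree",
    }
    # Missing fields are reported as 0 rather than raising an error.
    return {out: info.get(src, 0) / kb_per_gib for out, src in names.items()}


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as handle:
        for name, value in summarize(parse_meminfo(handle.read())).items():
            print(f"{name}: {value:.2f}")
```

Running it periodically (for example from cron, appending to a file on /data) would leave a text record of available memory and swap in the minutes before a crash, assuming the machine stays responsive enough to run it.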

  • Does this machine only have 1 disk? It looks like it could be under heavy IO stress. Unfortunately htop doesn't display this information. Please provide the output from 'iostat 5 5' or 'vmstat 5 5' if iostat isn't installed. – wazoox Mar 11 '22 at 16:47
  • Done wazoox, I hope that helps. – user2309000 Mar 11 '22 at 17:06
  • 1
    please provide clear text instead of the pictures. it reduces a lot of load tine and makes the read easier, if you still want to use pictures, use an exclamation mark infront of. please provide the output of `df -h` – djdomi Mar 12 '22 at 07:50
  • OK, the disk doesn't look too heavily loaded. In "last console before crash" there are lots of OOM errors (which can be fatal), and the OOM killer kills python3 processes. Do you know which application was killed? It's extremely bizarre to see python3 processes killed on a machine with 256GB of RAM... – wazoox Mar 12 '22 at 18:31
  • Also do you have a swap partition? – wazoox Mar 12 '22 at 18:31
  • Yes, I have a 1Gi swap partition – user2309000 Mar 16 '22 at 20:35
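As a follow-up to the OOM discussion in the comments, one rough way to see which processes the kernel would currently favour as OOM victims is to rank them by `/proc/<pid>/oom_score`. This is only a sketch: `oom_candidates` and `read_text` are hypothetical names, and the `proc_root` parameter exists just so the function can be pointed at a different directory for testing.

```python
import os


def read_text(path):
    """Return the stripped contents of a file, or None if unreadable."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None  # the process may have exited, or access was denied


def oom_candidates(proc_root="/proc", limit=10):
    """List (oom_score, pid, command) tuples, highest score first."""
    rows = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return rows
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are processes
        base = os.path.join(proc_root, entry)
        score = read_text(os.path.join(base, "oom_score"))
        if score is None or not score.isdigit():
            continue
        name = read_text(os.path.join(base, "comm")) or "?"
        rows.append((int(score), int(entry), name))
    rows.sort(reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    for score, pid, name in oom_candidates():
        print(f"{score:>6} {pid:>8} {name}")
```

A higher score means the kernel considers the process a more likely candidate to kill under memory pressure, so seeing which python3 jobs sit at the top may help identify the application wazoox asks about.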

0 Answers