
I suspect my root file system crashed.

`tail /var/log/syslog` locks up.

dmesg doesn't show anything interesting

I figure I can reboot and rebuild the system if necessary, since it's imaged. However, I'm wondering if there's another way to recover the system?

I can't use sudo because it locks up the system, and the root user is locked, so I need another way to gain root access.

My next step after that would be to unmount the root filesystem and remount it read-only, then start killing processes to reclaim resources.

- update -

So it looks like two runaway processes were locking up the system. I'm trying to `kill -9` them and it's not working. One turned zombie but still occasionally uses a lot of CPU; the other ignores all kill commands: -9, -1, -5, -15.
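
To check whether processes like these are stuck in uninterruptible sleep (state `D`, where even SIGKILL is not acted on until the system call returns) or are already zombies (state `Z`), something like the following can help. This is only a sketch: `filter_stuck` and `list_stuck` are hypothetical helper names, and the `ps` options assume the procps version of `ps`, not BusyBox.

```shell
# Keep the header line plus any row whose STAT column (2nd field)
# starts with D (uninterruptible sleep) or Z (zombie).
# filter_stuck is a hypothetical helper name, not a standard tool.
filter_stuck() {
    awk 'NR == 1 || $2 ~ /^[DZ]/'
}

# Assumes procps ps. The wchan column shows the kernel function the
# process is sleeping in, when the kernel exposes it.
list_stuck() {
    ps -eo pid,stat,wchan:32,comm | filter_stuck
}
```

Running `list_stuck` prints the PID, state, wait channel and command name of each stuck process. A process shown in state `D` will only react to signals once the blocking system call completes.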

I was finally able to sudo into root but sudo is behaving very strangely.

It doesn't prompt for a password until I press Ctrl+C, and password entry doesn't always seem to succeed. I wonder if I've been hacked or if the system is just behaving strangely.

Now the CPU is at 99% sys, 0% wa.

root:/tmp# shutdown -r now
Failed to start reboot.target: Connection timed out
See system logs and 'systemctl status reboot.target' for details.

root:/tmp# systemctl status reboot.target
Failed to get properties: Connection timed out
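
When `systemctl` cannot reach the service manager like this, one possible last resort is the kernel's magic SysRq interface, which does not go through systemd at all. The sketch below assumes a root shell and a kernel built with SysRq support. `emergency_reboot`, `SYSRQ_TRIGGER` and `SYSRQ_DELAY` are hypothetical names introduced here so the function can be tried against an ordinary file first.

```shell
# Hypothetical helper: sync, remount read-only, then reboot via SysRq.
# SYSRQ_TRIGGER and SYSRQ_DELAY are illustrative overrides, not
# standard variables. Point SYSRQ_TRIGGER at a scratch file for a dry run.
emergency_reboot() {
    trigger="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
    delay="${SYSRQ_DELAY:-2}"

    echo s > "$trigger"   # s: sync all mounted filesystems
    sleep "$delay"
    echo u > "$trigger"   # u: remount all filesystems read-only
    sleep "$delay"
    echo b > "$trigger"   # b: reboot immediately, without further cleanup
}
```

Calling `emergency_reboot` as root reboots the machine right away, so unsaved data in running processes is lost. `systemctl reboot --force --force` is documented to behave similarly: it performs the reboot from `systemctl` itself without contacting the system manager.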
John K. N.
encore2097
  • Find and replace the offending disk. – Michael Hampton Dec 18 '17 at 06:06
  • Just to be clear, processes can't ignore SIGKILL (-9). The reason it doesn't die is that processes in a blocked state (currently executing a system call such as writing to the filesystem) can't be killed. – jordanm Dec 18 '17 at 06:22
  • You missed the hardest kill, `kill -11`. Also, you want `ps -ww -fp ` to see exactly what is executing, there. @Michael, failing hard drives usually cause IOWait, in my experience. – thecarpy Dec 18 '17 at 06:46
  • @thecarpy *You missed the hardest kill, `kill -11`.* Despite it [going to 11](https://en.wikipedia.org/wiki/Up_to_eleven), there's nothing harder than `kill -9`. If that doesn't work, no other `kill` signal will, either. – Andrew Henle Dec 18 '17 at 12:28
  • More info: this is an embedded system, swap enabled on usb disk. Two processes heavily using swap but were nice/ionice to idle/best-effort . It seems this may be some race conditions / bugs across a bunch of programs and the kernel with this environment. A reboot set everything back to normal - my disk and rootfs appear OK. My conclusion is disk write / read failed for a program and the process hung on retry. After terminating the processes, system was no longer stuck on io but on waiting for a system call to return, which is why it went from 99% wa -> 0% wa and 0 % sys -> 100% sys. – encore2097 Dec 18 '17 at 22:19
  • Aside: having system management, ie. systemctl, 'talk' to core system programs over a communication protocol is not a good idea. It may be useful for scripting / API, but terrible when a system is heavily loaded. Documentation on bypassing the protocol would be helpful. It appears systemd is not a good use for me and I'll revert to init or roll my own. I like the unix philosophy, simple, effective and limited in scope. Complex actions achieved by chaining simple pieces, systemd while well intentioned adds unnecessary complexity. – encore2097 Dec 18 '17 at 22:38
  • Welcome to the edge cases that systemd doesn't like to acknowledge. Systemd is hosed and your machine needs to be hard-rebooted. Take your short list of services that *must* be shut down gracefully [eg: databases] and then go read their service definitions to figure out what signal to send to what PID, because systemd is in such a broken state that it can't do anything anymore. In the future you need to more closely monitor your resource usage because systemd performs *very* poorly under resource starvation compared to previous init daemons. – Sammitch Dec 19 '17 at 00:38

0 Answers