
Twice I've noticed out-of-space errors in my PHP app, and both times I've received the error "No space left on device while writing config" when attempting to log in via SSH to troubleshoot the problem.

I have plenty of free disk space, and both times my app worked again after restarting the server manually through my hosting company's website. Obviously this is an inconvenience now, but once I have customers it will be completely unacceptable.

I've checked /var/log/messages for anything that may help diagnose the problem. All I could find that seems relevant is:

rsyslogd[1032]: imjournal: fopen() failed for path: '/var/lib/rsyslog/imjournal.state.tmp': No space left on device [v8.2102.0-5.el8 try https://www.rsyslog.com/e/2013 ]

I do have a cron job running twice a day: `find /tmp -atime +1 -delete`. I don't think this is causing the issue, but I'm not certain. Is this even a good way to clear /tmp?
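For what it's worth, a slightly safer variant of that cron job would restrict the match to regular files, never touch /tmp itself, and avoid crossing into other mounted filesystems. This is only a sketch, assuming GNU find as shipped on AlmaLinux 8:

```shell
# Sketch of a safer /tmp cleanup (GNU find assumed):
#  -xdev       stay on /tmp's filesystem, don't descend into other mounts
#  -mindepth 1 never match /tmp itself
#  -type f     only regular files; leave directories and sockets alone
find /tmp -xdev -mindepth 1 -type f -atime +1 -delete
```

AlmaLinux 8 also ships systemd-tmpfiles with a periodic cleanup timer for /tmp, which may be a better fit than a hand-rolled cron job.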

As a quick fix, I guess I could have PHP call a bash script to restart the server every time it encounters an out-of-space error. That doesn't feel like a good idea, though, without understanding exactly what has malfunctioned and why.

I am using AlmaLinux 8.5 (very similar to CentOS) with Nginx and PHP-FPM, and I only have a VPS. I'll edit my question if you think there's any relevant information I should include.

Edit

There's no point showing the results of any commands until I encounter the error again. I've created a webpage that executes the commands using `shell_exec` and displays the results on screen. At the time of the error, I should still be able to run the commands, because nothing is written to disk.
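As a sketch, the commands worth wiring into that page are the read-only ones the comments ask about; none of them need to write anything. The `/var/lib/php` path in the `du` line is an assumption and may not exist on every setup:

```shell
# Read-only snapshots; nothing here writes to disk.
df -h    # block (space) usage per filesystem
df -i    # inode usage: can hit 100% while df -h still shows free space
# A few likely growth spots; /var/lib/php is an assumption and may be absent.
du -xsh /var/log /tmp /var/lib/php 2>/dev/null || true
```

Capturing `df -i` alongside `df -h` matters here: "No space left on device" is raised for inode exhaustion too, even when plenty of blocks are free.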


NB

A client SSL certificate and login details that only I have are required to run PHP as root and to access this page, so I'm not worried about the security implications of running PHP as root or of calling `shell_exec` with user data.

@NikitaKipriyanov suggested trying to keep my SSH connection open. If I didn't already have things set up this way (PHP as root for admin tasks), it would of course make more sense to stop SSH from timing out instead.

I will provide an update when I encounter this error again and have some results from my tests. Feel free to put the commands you think I should be executing into an answer, as I may upvote, and I'll accept the answer if it leads to me fixing the problem.

Edit - potential progress

Since I am sure that my system does not actually run out of disk space, I was expecting the problem to be a process still holding a deleted file open, or an error relating to a process that has crashed. There are plenty of articles stating these can cause out-of-disk-space errors, but nothing about diagnosing this specific cause.
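One way to test the deleted-but-still-open-file theory directly is the sketch below; `lsof` may need installing first, and the fallback assumes a Linux `/proc` layout:

```shell
# A file that is deleted but still held open by a process keeps
# consuming disk space until that process closes it or is restarted.
command -v lsof >/dev/null && lsof +L1 || true   # +L1: link count < 1, i.e. deleted
# Fallback without lsof: fd symlinks in /proc name their deleted targets.
ls -l /proc/[0-9]*/fd 2>/dev/null | grep '(deleted)' || echo "no deleted-but-open files"
```

If restarting the server frees the space, this is a prime suspect, since a reboot closes every stale handle at once.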

However, I've noticed my inode usage has increased from 3% to 7% overnight. I do intentionally store data in lots of small files; however, that should only account for the inode count increasing by a handful. A crontab entry automatically creates and stores backups, and I monitor those, so I would notice any anomalies there.

I think the problem is my PHP `$_SESSION` handling creating far too many temporary session files. The size of the `$_SESSION` array only grows linearly, so at present the amount of data stored is always insignificant in size; it's the number of files that matters. I don't create any backups of the session data, so currently I wouldn't notice it increasing. This is very easy for me to test and observe, so that will be the next step. I don't want to make any assumptions, so I'm going to wait and see whether the inode count approaches 100% and causes a crash before I attempt to fix the problem.
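A quick way to check this theory is to count the files in PHP's session directory. `/var/lib/php/session` is the usual default on RHEL-family systems, but that path is an assumption here and worth verifying first:

```shell
# Assumed default save_path on RHEL-family systems; verify with:
#   php -r 'echo session_save_path(), PHP_EOL;'
SESSDIR=/var/lib/php/session
# Each session file costs one inode, however small the file is.
find "$SESSDIR" -maxdepth 1 -type f 2>/dev/null | wc -l
```

If that number tracks the overnight inode growth in `df -i`, tightening session garbage collection (`session.gc_maxlifetime` and the `session.gc_probability`/`session.gc_divisor` settings in php.ini) would be the usual fix.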

NB

As soon as I've confirmed the problem, I will move this extra information into an answer.

Dan Bray
  • /var/lib/rsyslog/imjournal.state.tmp is not the same thing as /tmp, I guess, so your (unsafe!) cron job has nothing to do with the problem. Where are `df -h`, `df -i`, analysis with `du`, and so on? – Nikita Kipriyanov Mar 09 '23 at 17:21
  • What is the result of `df -k /var/lib/rsyslog/`? – Romeo Ninov Mar 09 '23 at 17:22
  • @NikitaKipriyanov I haven't included them because I have lots of free space. At the time of the error, running `df` is not possible because I can't login via `ssh`. After a server restart, those commands simply show lots of free space. – Dan Bray Mar 09 '23 at 17:24
  • @RomeoNinov I've edited my question to include the result – Dan Bray Mar 09 '23 at 17:29
  • @NikitaKipriyanov what would you suggest to make my cronjob safe? Even if it's completely unrelated to the problem, I'd much rather it be safe. – Dan Bray Mar 09 '23 at 17:32
  • What is the output of `ipcs`? – Matthew Ife Mar 09 '23 at 17:33
  • @MatthewIfe `------ Message Queues -------- key msqid owner perms used-bytes messages ------ Shared Memory Segments -------- key shmid owner perms bytes nattch status ------ Semaphore Arrays -------- key semid owner perms nsem` – Dan Bray Mar 09 '23 at 17:35
  • Please ask a dedicated question about that job; don't ask multiple questions in one. Regarding "can't ssh": you might want to, for example, have an already-running SSH connection to catch the moment when this happens. Or set up monitoring, but that way it will take longer to diagnose. Either way, you need to have diagnostic information gathered when this happens, or at least when it is about to happen; you need to know what is going on, not guess. – Nikita Kipriyanov Mar 09 '23 at 17:39
  • You may want to check the inode count too, but it's typically not that. What is the output of `df -i`? – Matthew Ife Mar 09 '23 at 17:39
  • @NikitaKipriyanov the out-of-space errors in PHP and when attempting to log in via SSH are the same problem, though. Setting up monitoring sounds like a good idea; any ideas how I should proceed with that? I could edit my settings so that my `ssh` connection never times out, but I'm not certain it would stay open after the error occurs. I could certainly prepare some bash files and attempt to execute them from `php` the next time I receive the error. – Dan Bray Mar 09 '23 at 17:49
  • Install and configure Zabbix or Nagios; there are plenty of options. At least prepare `df -h` (space, human readable), `df -i` (inodes), and various others. If you manage to capture those at the moment of the error, that will give you something to diagnose further. Then you'll prepare new scripts, again and again; this is why I say it "will take longer". On the other hand, if you happen to have a working shell, you'll be able to diagnose this in one shot. It's worth trying. – Nikita Kipriyanov Mar 09 '23 at 17:54
  • @NikitaKipriyanov I am going to create a page to run the ssh commands from `php`. I have php running as root when a client-side SSL certificate is provided so I should be able to run those commands at the time of error and echo to the screen. I'll have a look at `Zabbix` and `Nagios`. I would expect anything that requires writing to disk at the time of error to fail though. Would have to send live data to the screen. – Dan Bray Mar 09 '23 at 18:07
  • I see this is going nowhere. Just connect beforehand and hope it will survive (it should, disk overflow doesn't break running ssh). Really. Don't assume, do things. – Nikita Kipriyanov Mar 09 '23 at 18:12
  • @NikitaKipriyanov I believe you. I only said I wasn't certain. Anyway, I am going to create a diagnostic page in php that can run the commands, simply because it's something I should have anyway (my app has already evolved into a cPanel). `I see this is going nowhere` - You've given me some good ideas that I can try so this is going somewhere. It might take a while though because I have no idea how long it will be before I encounter the error again. You could even put what you suggested as an answer, although I can't accept it until I've successfully diagnosed the problem. – Dan Bray Mar 09 '23 at 18:39

0 Answers