
Linux's "man close" warns (SVr4, 4.3BSD, POSIX.1-2001):

Not checking the return value of close() is a common but nevertheless serious programming error. It is quite possible that errors on a previous write(2) operation are first reported at the final close(). Not checking the return value when closing the file may lead to silent loss of data. This can especially be observed with NFS and with disk quota.

I can believe that this error is common (at least in applications; I'm no kernel hacker). But how serious is it, today or at any point in the past three decades? In particular:

Is there a simple, reproducible example of such silent loss of data? Even a contrived one like sending SIGKILL during close()?

If such an example exists, can the data loss be handled more gracefully than just

printf("Sorry, dude, you lost some data.\n"); ?

Jonathan Leffler
Camille Goudeseune
  • Even though I generally do check the result, after many years, it does seem to come to naught. Look forward to this answer. – chux - Reinstate Monica Sep 27 '13 at 17:06
  • I generally never care about the result or failure of `close`. I guess you would care if you wanted to develop very robust server software, but there are many other possible sources of bugs :-) BTW, few free software projects care about `close` failure. – Basile Starynkevitch Sep 27 '13 at 19:03
  • @BasileStarynkevitch: "*... there are many other possible sources of bugs ...*" how right you are! :-)) – alk Sep 27 '13 at 19:57
  • Here's an interesting [LWN article](https://lwn.net/Articles/576478/) regarding checking `close()`'s return value on Linux. According to Torvalds himself, _" 'careful' users that want to hear about IO errors have to really do an fsync(), so any IO errors should show up there. Of course, checking the return value of 'close()' in addition to the fsync() is always a good idea"_ – user986730 Mar 14 '22 at 09:29

2 Answers


[H]ow serious is it, today or at any point in the past three decades?

Typical applications process data. They consume some input, and produce a result. So, there are two general cases where close() may return an error: when closing an input (read-only?) file, and when closing a file that was just generated or modified.

The known situations where close() returns an error are specific to writing/flushing data to permanent storage. In particular, it is common for an operating system to cache data locally, before actually writing to the permanent storage (at close(), fsync(), or fdatasync()); this is very common with remote filesystems, and is the reason why NFS is mentioned on the man page.
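
To make the pattern concrete, here is a minimal sketch of the careful approach (my own illustration, not code from the man page; the function name `save_file` and the messages are placeholders): check every write(), push the data to permanent storage with fsync(), and still check close().

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write a buffer to a new file, reporting any deferred write errors. */
static int save_file(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd == -1) {
        fprintf(stderr, "%s: open: %s\n", path, strerror(errno));
        return -1;
    }

    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n == -1) {
            if (errno == EINTR)
                continue;                   /* interrupted; retry the write */
            fprintf(stderr, "%s: write: %s\n", path, strerror(errno));
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Force the data out of the local cache; on NFS and with disk
       quotas, this is typically where deferred errors first surface. */
    if (fsync(fd) == -1) {
        fprintf(stderr, "%s: fsync: %s\n", path, strerror(errno));
        close(fd);
        return -1;
    }

    /* Still check close(); it may report what fsync() could not see. */
    if (close(fd) == -1) {
        fprintf(stderr, "%s: close: %s\n", path, strerror(errno));
        return -1;
    }
    return 0;
}
```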

I have never encountered an error while closing a read-only input file. All the cases I can think of where it might happen in real life using any of the common filesystems are ones where there is a catastrophic failure, something like kernel data structure corruption. If that happens, I think the close() error cannot be the only sign that something is terribly wrong.

When writing to a file on a remote filesystem, close()-time errors are woefully common if the local network is prone to glitches or just drops a lot of packets. As an end user, I want my applications to tell me if there was an error when writing to a file. Usually the connection to the remote filesystem is broken altogether, and the fact that writing to a new file failed is the first indicator the user gets.

If you don't check the close() return value, the application will lie to the user. It will indicate (by the lack of an error message, if not otherwise) that the file was correctly written, when in fact it wasn't and the application was told so; the application just ignored the indication. If the user is like me, they'll be very unhappy with the application.

The question is, how important is user data to you? Most current application programmers don't care at all. Basile Starynkevitch (in a comment to the original question) is absolutely right; checking for close() errors is not something most programmers bother to do.

I believe that attitude is reprehensible: cavalier disregard for user data.

It is natural, though, because users have no clear indication of which application corrupted their data. In my experience, end users end up blaming the OS, the hardware, open source or free software in general, or the local IT support; so there is no pressure, social or otherwise, on a programmer to care. Because only programmers are aware of details such as this, and most programmers don't care, there is no pressure to change the status quo.

(I know saying the above will make a lot of programmers hate my guts, but at least I'm being honest. The typical response I get for pointing out things such as this is that it is such a rare occurrence that it would be a waste of resources to check for it. That is likely true... but I, for one, am willing to spend more CPU cycles, and pay a few percent more to the programmers, if it means my machine actually works more predictably and tells me when it has lost the plot, rather than silently corrupting my data.)

Is there a simple, reproducible example of such silent loss of data?

I know of three approaches:

  1. Use a USB stick, and yank it out after the final write() but before the close(). Unfortunately, most USB sticks have hardware that is not designed to survive that, so you may end up bricking the stick. Depending on the filesystem, your kernel may also panic, because most filesystems are written with the assumption that this will never ever happen.

  2. Set up an NFS server, and simulate intermittent packet drops by using iptables to drop all packets between the NFS server and the client. The exact scenario depends on the server and client, mount options, and versions used. A test bed should be relatively easy to set up using two or three virtual machines, however.

  3. Use a custom filesystem to simulate a write error at close() time. Current kernels do not let you force-unmount tmpfs or loopback mounts, only NFS mounts; otherwise this would be easy to simulate by force-unmounting the filesystem after the final write but prior to the close(). (Current kernels simply deny the umount if there are open files on that filesystem.) For application testing, creating a variant of tmpfs that returns an error at close() if the file mode indicates it is desirable (for example, other-writable but not other-readable or other-executable, i.e. `-??????-w-`) would be quite easy, and safe. It would not actually corrupt the data, but it would make it easy to check how the application behaves if the kernel reports (the risk of) data corruption at close time. A user-space shortcut that needs no kernel code at all is shown in the sketch after this list.
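
If you only need to exercise an application's error path, a lighter-weight alternative to writing a filesystem is an LD_PRELOAD interposer. The sketch below is my own illustration (the file name `close_eio.c` is arbitrary): it really closes the descriptor, then reports EIO to the caller, so you can watch how any dynamically linked program reacts.

```c
/* close_eio.c - build:  gcc -shared -fPIC close_eio.c -o close_eio.so -ldl
 * use:  LD_PRELOAD=./close_eio.so ./program-under-test
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>

int close(int fd)
{
    /* Look up the real close() in libc. */
    int (*real_close)(int) = (int (*)(int))dlsym(RTLD_NEXT, "close");

    if (fd <= 2)                 /* leave stdin/stdout/stderr alone */
        return real_close(fd);

    real_close(fd);              /* really close, so descriptors do not leak */
    errno = EIO;                 /* ...but report a (simulated) failure */
    return -1;
}
```

Like the tmpfs variant, this does not corrupt any data; it only lets you see whether the application notices the reported failure at all.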

Nominal Animal
  • The USB stick scenario certainly counts as simple and everyday. And reporting the data loss, while not as happy as restoring the lost data, *is* better than silent data loss. – Camille Goudeseune Sep 30 '13 at 21:18
  • Paraphrase of a friend's remarks: POSIX used to forbid close() from returning an I/O error; it still doesn't require it. From the Linux kernel source: ext2, ext3, ext4, NTFS and FAT can't return an error; NFS can; other filesystems probably can't. (NFS never respected POSIX much, though.) So checking close() may *not* detect a prematurely removed thumbdrive. – Camille Goudeseune Oct 04 '13 at 20:09
  • @CamilleGoudeseune: In Linux, close() errors occur when the `->flush` handler in the kernel's filesystem-specific `struct file_operations` returns an error. On 3.11, only exofs, fuse, nfs, and cifs specify one (ecryptfs does too, but it just calls the underlying filesystem's handler), so *currently* they are the only ones that can return an error during `close()`. That does not mean they never will; progress happens. On all other filesystems, an `fsync()`/`fdatasync()` is required (*for now*) to ensure the data actually hits the storage successfully, and it does not hurt even on these. – Nominal Animal Oct 05 '13 at 05:27
  • @CamilleGoudeseune: IOW, you're right: unless you mount USB sticks using fuse, you won't get a `close()` error if you yank out the USB stick prematurely. I thought this had been fixed already. This might warrant an RFC patch to LKML, actually... – Nominal Animal Oct 05 '13 at 05:33

Calling POSIX's close() may lead to errno being set to:

  1. EBADF: Bad file number
  2. EINTR: Interrupted system call
  3. EIO: I/O error (from POSIX Specification Issue 6 on)

Different errors indicate different issues:

  1. EBADF indicates a programming error, as the program should have kept track of which file/socket descriptors are still open. I'd consider testing for this error a quality-management measure.

  2. EINTR seems to be the most difficult to handle, as it is not clear whether the file/socket descriptor passed is still valid after the function returns (under Linux it probably is not: http://lkml.org/lkml/2002/7/17/165). If you observe this error, you should perhaps review the program's signal handling.

  3. EIO is expected to appear only under special conditions, as mentioned in the man pages. However, one should track this error precisely because of that: if it occurs, something most likely went really wrong.

All in all, each of these errors gives at least one good reason to catch it, so just do it! ;-)

Possible specific reactions:

  1. In terms of stability, ignoring an EBADF might be acceptable; however, the error should not happen in the first place. As stated above, fix your code: the program does not seem to really know what it is doing.

  2. Observing an EINTR could indicate that signals are running wild. This is not nice. Definitely go for the root cause. As it is unclear whether descriptors got closed or not, go for a system restart ASAP.

  3. Running into an EIO could definitely indicate a serious failure in the hardware*1 involved. However, before the strongly recommended shutdown of the system, it might be worth simply retrying the operation, although the same concern applies as for EINTR: it is uncertain whether the descriptor really got closed or not. In case it did get closed, it is a bad idea to close it again, as the descriptor might already be in use by another thread. Go for shutdown and hardware*1 replacement ASAP.


*1 "Hardware" is to be understood in a broader sense here: an NFS server acts as a disk, so the EIO could simply be due to a misconfigured server, the network, or whatever else is involved in the NFS connection.
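
A minimal sketch of a close() wrapper along these lines (my own illustration; the logging and the reactions are placeholders for whatever policy your application actually needs):

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Close fd, reacting to each documented errno as discussed above. */
static void checked_close(int fd)
{
    if (close(fd) == 0)
        return;

    switch (errno) {
    case EBADF:
        /* Programming error: descriptor bookkeeping is broken. */
        fprintf(stderr, "close(%d): EBADF - fix the program\n", fd);
        abort();                /* fail loudly during development */
    case EINTR:
        /* Descriptor state is uncertain; do not blindly retry the
           close() - on Linux the descriptor is probably already gone. */
        fprintf(stderr, "close(%d): EINTR - check signal handling\n", fd);
        break;
    case EIO:
        /* Possible data loss on the underlying storage. */
        fprintf(stderr, "close(%d): EIO - data may not be on disk\n", fd);
        /* ...escalate: warn the user, plan shutdown/replacement. */
        break;
    default:
        fprintf(stderr, "close(%d): unexpected %s\n", fd, strerror(errno));
        break;
    }
}
```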

alk
  • Hmmm ... what are you supposed to do when `close` fails? Abort? Retry? Cancel? Ignore? – Jongware Sep 27 '13 at 19:19
  • @Jongware: In any case, log it as a serious incident, find the root cause, and fix it! Whether to "Abort, Retry, Ignore" depends on the criticality of the application, e.g. whether it's a plane or a game, or whether you're the NSA or a script kiddie. – alk Sep 27 '13 at 19:23
  • At least EBADF can't lead to data loss. EINTR and EIO sure could, but the "simple reproducible case" that I seek might involve physical destruction of hardware... – Camille Goudeseune Sep 27 '13 at 21:14
  • `EBADF` can't happen without a bug in your program. And `EINTR` can't happen without installing an interrupting signal handler. – R.. GitHub STOP HELPING ICE Jun 29 '14 at 16:35