
I'd like to implement a proper write(2) loop that takes a buffer and keeps calling write until the entire buffer is written.

I guess the basic approach is something like:

#include <unistd.h>   /* write(), ssize_t */

/** write len bytes of buf to fd, returns 0 on success */
int write_fully(int fd, char *buf, size_t len) {
  while (len > 0) {
    ssize_t written = write(fd, buf, len);
    if (written < 0) {
      // some kind of error, probably should try again if it's EINTR?
      return written;
    }
    buf += written;
    len -= written;
  }
  return 0;
}

... but this raises the question of whether write() can validly return 0 bytes written, and what to do in that case. If the situation persists, the above code will just hot-spin on the write call, which seems like a bad idea. As long as something other than zero is returned, you are making forward progress.

The man page for write is a bit ambiguous. It says, for example:

On success, the number of bytes written is returned (zero indicates nothing was written).

Which seems to indicate that it is possible in some scenarios. Only one such scenario is explicitly called out:

If count is zero and fd refers to a regular file, then write() may return a failure status if one of the errors below is detected. If no errors are detected, or error detection is not performed, 0 will be returned without causing any other effect. If count is zero and fd refers to a file other than a regular file, the results are not specified.

That case is avoided above because I never call write with len == 0. There are a lot of other cases where nothing could be written, but in general they all have specific error codes associated with them.

The file itself will be opened from a path/name given on the command line. So it will usually be a regular file, but users may of course pass things like pipes, do input redirection, pass special devices like /dev/stdout and so on. I am at least in control of the open call, and the O_NONBLOCK flag is not passed to open. I can't reasonably check the behavior for all the file systems and all the special devices (and even if I could, more will be added), so I want to know how to handle this in a reasonable and general way.


* ... for a non-zero buffer size.

BeeOnRope
  • I don't think it's guaranteed by POSIX, but I can't think of any scenarios where a blocking descriptor would write 0 bytes. – Barmar Jan 27 '17 at 22:54
  • I think the spec is deliberately silent on this because device drivers can do almost arbitrary things, and they didn't want to preclude drivers that return 0 for some reason. – Barmar Jan 27 '17 at 22:59
  • POSIX doesn't seem to be very clear about this. It requires that errors (-1) be returned for nonblocking pipes and fifos that are full and for interrupted writes that didn't write anything, but I couldn't find any actual prohibition of the 0 return value for nonzero write requests. I think it's handle the 0 case too in case the OS is crazy enough to ever return it. – Petr Skocik Jan 27 '17 at 23:00
  • @PSkocik - but how can I _handle_ it? It is not clear what to do. Indeed, the call may keep returning 0 forever, right? I could treat it like a fatal error, I suppose... – BeeOnRope Jan 27 '17 at 23:03
  • @BeeOnRope I'd just keep looping. In the very unlikely scenario it does come up, it should be transient. If it isn't the OS is nuts and it's not your fault. – Petr Skocik Jan 27 '17 at 23:09
  • @BeeOnRope BTW, ditch the above code and use this instead: http://poincare.matf.bg.ac.rs/~ivana/courses/tos/sistemi_knjige/pomocno/apue/APUE/0201433079/ch14lev1sec8.html – Petr Skocik Jan 27 '17 at 23:13
  • @PSkocik - looks like essentially the same code, with slightly different error handling semantics, and with an unknown license and copyright. Also, FWIW, it doesn't take your advice of looping forever - it exits with a partial write on a zero length `write` return. Finally, it does arithmetic on `void *` which is [illegal](http://stackoverflow.com/questions/3523145/pointer-arithmetic-for-void-pointer-in-c). – BeeOnRope Jan 27 '17 at 23:17
  • @BeeOnRope Fair point, but at least it's returning the number of bytes actually written so you can actually try again and know where to try again from. – Petr Skocik Jan 27 '17 at 23:23
  • If you want to prevent waiting indefinitely, then there is only one thing you can do: increment a counter and exit when it reaches a certain threshold. But: can the delay be due to some real-time condition? (For example, a slow printer.) In that case, you may need to wait for *seconds*. – Jongware Jan 27 '17 at 23:23
  • @PSkocik - in my case, any failure to write the full file is a fatal error, so I went with the simple approach. In any case, the idea is that all retryable errors are handled by this method, so it shouldn't return if the caller should retry (that just leads to writing another loop like this method - we don't need doubly-nested loops here!). It either writes the whole file or has failed... – BeeOnRope Jan 27 '17 at 23:26
  • Toybox does this with `xwrite()` see: https://github.com/landley/toybox/blob/master/lib/xwrap.c#L434 and `writeall()` https://github.com/landley/toybox/blob/master/lib/lib.c#L120 – technosaurus Jan 31 '17 at 19:48
  • @BeeOnRope: I wish your question would not ask *"can"*, but rather about the various strategies as to what to do when `read()` or `write()` returns an unexpected value -- just because the C standard (or, say POSIX) says that something should never happen, does not mean it does not happen in real life. I personally use three different strategies with `write()` (using blocking descriptors, i.e. excluding nonblocking stuff), depending on whether it is writing important data, trivial information, or error messages. It may be paranoid, but it works very well for me and my data. – Nominal Animal Feb 02 '17 at 22:43
  • @NominalAnimal - that was actually my question, although perhaps I wasn't totally clear. I updated the title to make it clearer. It's important to know what the standards say - if it isn't allowed and it occurs with some weird device or file type, I'll just die with a fatal error, but if it's more like "yes, it can happen, and here's a scenario and how to handle it", I'd rather do that. – BeeOnRope Feb 02 '17 at 22:53
  • @BeeOnRope: There used to be a Linux kernel bug where some filesystems would return an invalid count if writes over 2GB were attempted. (Because of this, single `write()`s are now capped at under 2GB.) In my opinion, the mitigation or error handling strategy should depend on *what kind* of data is being written -- a single approach is as useful as a hammer. For example, if writing important data, I'd consider `0` the same as `-EIO`. For other types, I might retry (as for `-EINTR` or `-EWOULDBLOCK`), perhaps just once. So, *purpose* of the write matters for me, and one approach is not enough. – Nominal Animal Feb 03 '17 at 04:00

6 Answers


TL;DR summary

Unless you go out of your way to invoke unspecified behaviour, you will not get a zero result back from write() unless, perhaps, you attempt to write zero bytes (which the code in the question avoids doing).

POSIX says:

The POSIX specification for write() covers the issue, I believe.

The write() function shall attempt to write nbyte bytes from the buffer pointed to by buf to the file associated with the open file descriptor, fildes.

Before any action described below is taken, and if nbyte is zero and the file is a regular file, the write() function may detect and return errors as described below. In the absence of errors, or if error detection is not performed, the write() function shall return zero and have no other results. If nbyte is zero and the file is not a regular file, the results are unspecified.

This states that if you request a write of zero bytes, you may get a return value of zero, but there are a bundle of caveats — it must be a regular file, and you might get an error if errors like EBADF are detected, and it is unspecified what happens if the file descriptor does not refer to a regular file.

If a write() requests that more bytes be written than there is room for (for example, [XSI]⌦ the file size limit of the process or ⌫ the physical end of a medium), only as many bytes as there is room for shall be written. For example, suppose there is space for 20 bytes more in a file before reaching a limit. A write of 512 bytes will return 20. The next write of a non-zero number of bytes would give a failure return (except as noted below).

[XSI]⌦ If the request would cause the file size to exceed the soft file size limit for the process and there is no room for any bytes to be written, the request shall fail and the implementation shall generate the SIGXFSZ signal for the thread. ⌫

If write() is interrupted by a signal before it writes any data, it shall return -1 with errno set to [EINTR].

If write() is interrupted by a signal after it successfully writes some data, it shall return the number of bytes written.

If the value of nbyte is greater than {SSIZE_MAX}, the result is implementation-defined.

These rules do not really give permission to return 0 (though a pedant might say that a value of nbyte that's too large might be defined to return 0).

When attempting to write to a file descriptor (other than a pipe or FIFO) that supports non-blocking writes and cannot accept the data immediately:

  • If the O_NONBLOCK flag is clear, write() shall block the calling thread until the data can be accepted.

  • If the O_NONBLOCK flag is set, write() shall not block the thread. If some data can be written without blocking the thread, write() shall write what it can and return the number of bytes written. Otherwise, it shall return -1 and set errno to [EAGAIN].

…details for obscure file types — a number of them with unspecified behaviour…

Return value

Upon successful completion, these functions shall return the number of bytes actually written to the file associated with fildes. This number shall never be greater than nbyte. Otherwise, -1 shall be returned and errno set to indicate the error.

So, since your code avoids attempting to write zero bytes, as long as len is not larger than {SSIZE_MAX}, and as long as you aren't writing to obscure file types (like a shared memory object or a typed memory object) you should not see zero returned by write().
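
For what it's worth, here is a minimal sketch (mine, not from the standard) of how the question's loop might look under this reading: retry on EINTR, and treat an unexpected zero return as an I/O error rather than spinning on it. Mapping zero to EIO is my own choice of convention, not something POSIX mandates.

#include <errno.h>
#include <unistd.h>

/* Retry interrupted writes; treat an unexpected zero return as EIO. */
int write_fully(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t written = write(fd, buf, len);
        if (written < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data was written: retry */
            return -1;          /* real error; errno is already set */
        }
        if (written == 0) {
            /* Should not happen for len > 0, per the analysis above;
               report it as an error instead of looping forever. */
            errno = EIO;
            return -1;
        }
        buf += written;
        len -= (size_t)written;
    }
    return 0;
}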


POSIX Rationale says:

Later in the POSIX page for write(), in the Rationale section, there is the information:

Where this volume of POSIX.1-2008 requires -1 to be returned and errno set to [EAGAIN], most historical implementations return zero (with the O_NDELAY flag set, which is the historical predecessor of O_NONBLOCK, but is not itself in this volume of POSIX.1-2008). The error indications in this volume of POSIX.1-2008 were chosen so that an application can distinguish these cases from end-of-file. While write() cannot receive an indication of end-of-file, read() can, and the two functions have similar return values. Also, some existing systems (for example, Eighth Edition) permit a write of zero bytes to mean that the reader should get an end-of-file indication; for those systems, a return value of zero from write() indicates a successful write of an end-of-file indication.

Thus, although POSIX (largely if not wholly) precludes the possibility of a zero return from write(), there was prior art on related systems that did have write() return zero.

Jonathan Leffler
  • I agree with your conclusion of *you should not see zero returned by `write()`*, but I just can't convince myself that the standard *precludes* returning zero. How hard is it to reliably handle a zero-byte `write()` result? – Andrew Henle Feb 01 '17 at 01:48
  • While coding is inherently *pedantic* it is not *procrustean* (meaning 2) `:)` – David C. Rankin Feb 02 '17 at 22:59
  • @AndrewHenle: How would/could you handle a zero-length write? Should you retry, perhaps after a small pause, to see if space is now available that was not available before? My take would be 'report it as an error via the appropriate logging mechanism and return value' and see whether it is actually a problem in practice. I'd expect it not to be a problem. If it turns out to be a problem, your logging will have identified it (as well as the calling code handling a write error that, presumably, shouldn't actually be a write error). Personally, I don't think I'd worry about having that occur. – Jonathan Leffler Feb 02 '17 at 23:06
  • Thanks, this was the most conclusive answer and it deserved the full bounty, but I had a period unexpectedly without Internet access right as it expired. So, well, enjoy your half bounty. – BeeOnRope Feb 08 '17 at 21:53
  • Thanks. Funnier things have happened, but I'm grateful for the explanation. I last missed some time on SO because of the unexpected absence of internet connectivity (in Darkest Northern England at the time). – Jonathan Leffler Feb 08 '17 at 21:58

It depends on what the file descriptor refers to. When you call write on a file descriptor, the kernel ultimately ends up calling the write routine in the associated file operations vector, which corresponds to the underlying file system or device that the file descriptor refers to.

Most normal file systems will never return 0, but devices might do just about anything. You need to look at the documentation for the device in question to see what it might do. It is legal for a device driver to return 0 bytes written (the kernel won't flag it as an error or anything), and if it does, the write system call will return 0.
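
If you do want to defend against such a driver, one possibility (a sketch of a retry-once policy; the limit of a single retry is an arbitrary choice, not anything the kernel guarantees) is:

#include <errno.h>
#include <unistd.h>

/* Tolerate a single zero-byte return (e.g. from an odd device driver),
   then give up with EIO rather than spinning forever. */
int write_fully_bounded(int fd, const char *buf, size_t len)
{
    int zero_returns = 0;

    while (len > 0) {
        ssize_t written = write(fd, buf, len);
        if (written < 0)
            return -1;                  /* errno describes the failure */
        if (written == 0) {
            if (++zero_returns >= 2) {  /* already retried once: give up */
                errno = EIO;
                return -1;
            }
            continue;
        }
        zero_returns = 0;               /* forward progress: reset the count */
        buf += written;
        len -= (size_t)written;
    }
    return 0;
}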

Chris Dodd
  • Thanks. There is no particular device in question - it's just a file open from a path passed on the command line. Of course, the user might pass whatever. I suppose the reasonable thing to do is to try once more after a 0 write and if you get another one treat it as an unrecoverable write error. – BeeOnRope Jan 28 '17 at 04:53

POSIX defines it for pipes, FIFOs, and FDs that support non-blocking operations, in the case that nbyte (the third parameter) is positive and the call wasn't interrupted:

if O_NONBLOCK is clear ... it shall return nbyte.

In other words, not only can it not return 0 unless nbyte was zero, it can't return a short length either, in the cases mentioned.

user207421
  • There are similar guarantees even when `O_NONBLOCK` is set. For fds other than pipes and FIFOs, POSIX says: "[if no data could be written without blocking the thread], it shall return -1 and set errno to [EAGAIN]" which clearly means that zero cannot be returned. For pipes and FIFOs some of the details of what can be returned are different depending on how large the write request is. But the bottom line though is that if no data can be written immediately the return must be `-1`. See http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html for details. – Michael Burr Feb 01 '17 at 00:54

I think that the only feasible approach (apart from ignoring the problem altogether, which seems the thing to do according to the documentation) is to allow "spinning in place".

You can implement a retry count, but if this extremely unlikely "0 return with nonzero length" is due to some transient situation (a full LapLink queue, maybe; I remember that driver doing weird things), the loop will probably be so fast that any reasonable retry count would be overwhelmed anyway; and an unreasonably large retry count is not advisable in case you have other devices that instead take a non-negligible time to return 0.

So I'd try something like this. You might want to use gettimeofday() instead, for greater precision.

(We're introducing a negligible performance penalty for an event that seems to have a negligible chance of ever happening).

#include <time.h>     /* time() */
#include <unistd.h>   /* write() */

/** write len bytes of buf to fd, returns 0 on success */
int write_fully(int fd, char *buf, size_t len) {
  time_t timeout = 0;
  while (len > 0) {
    ssize_t written = write(fd, buf, len);
    if (written < 0) {
      // some kind of error, probably should try again if it's EINTR?
      return written;
    }

    if (!written) {
      if (!timeout) {
        // First time around, set the timeout
        timeout = time(NULL) + 2; // prepare to wait "between" 1 and 2 seconds
        // A nanosleep() here would reduce CPU load (see the sketch below)
      } else if (time(NULL) >= timeout) {
        // Weird status lasted too long
        return -42;
      }
    } else {
      // Reset the timeout at every success; otherwise the second zero-return
      // after a recovery would immediately abort (which could be desirable).
      timeout = 0;
    }

    buf += written;
    len -= written;
  }
  return 0;
}
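
If you also want to avoid burning CPU while waiting out the timeout, here is a sketch of the nanosleep() pause mentioned in the code comment above; the 10 ms interval is an arbitrary choice, and it would be called from the zero-return branch:

#include <time.h>

/* Brief pause between retries; the interval is arbitrary. EINTR is
   deliberately ignored, since the caller just retries the write anyway. */
static void short_pause(void) {
  struct timespec ts = { 0, 10 * 1000 * 1000 };  /* 10 ms */
  nanosleep(&ts, NULL);
}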
LSerni
  • Why not use `sleep()` or `nanosleep()` to avoid the 100% cpu load? – chqrlie Feb 01 '17 at 08:06
  • tipo: `time(NULL) == timeout` => `time(NULL) >= timeout`? – Stargateur Feb 01 '17 at 10:09
  • Thanks Stargateur and @chqrlie, both excellent suggestions. I've left nanosleep as a comment though, since the delay might or might not be desirable. Also, I'm not too comfortable with adding features to what ought to be an unlikely border case. – LSerni Feb 01 '17 at 10:35

I personally use several approaches to this problem.

Below are three examples, which all expect to work on a blocking descriptor. (That is, they consider EAGAIN/EWOULDBLOCK an error.)


When saving important user data, without a time limit (and thus with the write not supposed to be interrupted by signal delivery), I prefer to use

#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int write_uninterruptible(const int descriptor, const void *const data, const size_t size)
{
    const unsigned char       *p = (const unsigned char *)data;
    const unsigned char *const q = (const unsigned char *)data + size;
    ssize_t                    n;

    if (descriptor == -1)
        return errno = EBADF;

    while (p < q) {

        n = write(descriptor, p, (size_t)(q - p));
        if (n > 0)
            p += n;
        else
        if (n != -1)
            return errno = EIO;
        else
        if (errno != EINTR)
            return errno;
    }

    if (p != q)
        return errno = EIO;

    return 0;
}

This will abort if an error (other than EINTR) occurs, or if write() returns zero or a negative value other than -1.

Because there is no sane reason for the above to return the partial write count, it instead returns 0 on success, and a nonzero errno error code otherwise.
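
A hypothetical caller (mine, just to illustrate the return convention) would then look like:

#include <stdio.h>
#include <string.h>

/* Illustrates the convention: 0 on success, errno code on failure. */
int save_data(int fd, const void *data, size_t size)
{
    int err = write_uninterruptible(fd, data, size);
    if (err) {
        fprintf(stderr, "write failed: %s\n", strerror(err));
        return -1;
    }
    return 0;
}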


When writing important data, but the write is to be interrupted if a signal is delivered, the interface is a bit different:

size_t write_interruptible(const int descriptor, const void *const data, const size_t size)
{
    const unsigned char       *p = (const unsigned char *)data;
    const unsigned char *const q = (const unsigned char *)data + size;
    ssize_t                    n;

    if (descriptor == -1) {
        errno = EBADF;
        return 0;
    }

    while (p < q) {

        n = write(descriptor, p, (size_t)(q - p));
        if (n > 0)
            p += n;
        else
        if (n != -1) {
            errno = EIO;
            return (size_t)(p - (const unsigned char *)data);
        } else
            return (size_t)(p - (const unsigned char *)data);
    }

    errno = 0;
    return (size_t)(p - (const unsigned char *)data);
}

In this case, the amount of data written is always returned. This version also sets errno in all cases -- normally errno is not set except in error cases.

This means that if an error occurs partway through, the function returns the amount of data that was successfully written by the prior write() calls. The reason for always setting errno is to make error detection easier, essentially separating the status (errno) from the write count.


Occasionally, I need a function that writes a debugging message to standard error from a signal handler. (Standard <stdio.h> I/O is not async-signal safe, so a special function is needed in any case.) I want that function to abort even on signal delivery -- it's no big deal if the write fails, as long as it does not futz with the rest of the program -- but to keep errno unchanged. It prints strings exclusively, as that is the intended use case. Note that strlen() is not async-signal safe, so an explicit loop is used instead.

int stderr_note(const char *message)
{
    int retval = 0;

    if (message && *message) {
        int         saved_errno;
        const char *ends = message;
        ssize_t     n;

        saved_errno = errno;
        while (*ends)
            ends++;

        while (message < ends) {
            n = write(STDERR_FILENO, message, (size_t)(ends - message));
            if (n > 0)
                message += n;
            else {
                if (n == -1)
                    retval = errno;
                else
                    retval = EIO;
                break;
            }
        }

        if (!retval && message != ends)
            retval = EIO;

        errno = saved_errno;
    }

    return retval;
}

This version returns 0 if the message was successfully written to standard error, and a nonzero error code otherwise. As mentioned, it always keeps errno unchanged, to avoid unexpected side effects in the main program if used in a signal handler.


I use very simple principles when dealing with unexpected errors or return values from syscalls. The main principle is to never silently discard or mangle user data. If data is lost or mangled, the program should always notify the user. Everything unexpected should be considered an error.

Only some of the writes in a program involve user data. A lot is informational, like usage information, or a progress report. For those, I'd prefer to either ignore the unexpected condition, or skip that write altogether. It depends on the nature of the data written.

In summary, I do not care what the standards say about the return values: I handle them all. The response to each (type of) result depends on the data being written -- specifically, the importance of that data to the user. Because of this, I do use several different implementations even in a single program.

Nominal Animal

I would say that the whole question is unnecessary. You are simply being too careful. You expect the file to be a regular file, not a socket, not a device, not a fifo, etc. I would say that any return from a write to a regular file that isn't equal to len is an unrecoverable error. Don't try to fix it. You probably filled the filesystem, or your disk is broken, or something like that. (this all assumes that you haven't configured your signals to interrupt system calls)

For regular files I don't know any kernel that doesn't already do all the necessary retrying to get your data written and if that fails the error is most likely severe enough that it is beyond the application to fix it. If the user decides to pass a non-regular file as argument, so what? It's their problem. Their foot and their gun, let them shoot it.

By trying to fix this in your code you are much more likely to make things worse by creating an endless loop eating CPU or filling the filesystem journal or just hang.

Don't handle 0 or other short writes; just print an error on any return other than len and exit. Once you get a proper bug report from a user that actually has a legitimate reason for the writes to fail, fix it then. Most likely this will never happen because this is what almost everyone does.
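
In code, that policy is about as small as it gets (a sketch; the message wording and exit status are arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* "Report and exit" policy: anything other than a complete write is fatal. */
void write_or_die(int fd, const void *buf, size_t len)
{
    ssize_t written = write(fd, buf, len);
    if (written < 0) {
        perror("write");
        exit(EXIT_FAILURE);
    }
    if ((size_t)written != len) {
        fprintf(stderr, "write: short write (%zd of %zu bytes)\n", written, len);
        exit(EXIT_FAILURE);
    }
}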

Yes, sometimes it's fun to read POSIX and find the edge cases and write code to deal with them. But operating system developers don't get sent to prison for violating POSIX, so even if your clever code perfectly matches what the standard says, that is no guarantee that things will then always work. Sometimes it's better to just get things done and rely on being in good company when they break. If regular file writes start returning short, you'll be in such good company that most likely it will get fixed long before any of your users notice.

N.B. Almost 20 years ago I worked on a filesystem implementation and we tried to be standards lawyers about the behavior of one of the operations (not write, but the same principle applies). Our "it is legal to return the data in this order" was silenced by the deluge of bug reports of broken applications that expected things in a certain way and in the end it was simply faster to just fix it instead of fighting the same fight in each and every bug report. For anyone who wonders, lots of things back then (and probably still today) expected readdir to return . and .. as the first two entries in a directory which (at least back then) wasn't mandated by any standard.

Art
  • To be clear, I'm not trying to "fix it", necessarily. The code needs to do _something_, and once you are aware of it, even if you don't change your code, you are always doing *something* in the event of a zero-length `write`: just eyeball your code and see what it does! Typical code (that checks for -1 or negative as error), I think, will _default_ to looping forever, so it's weird you suggest that as an outcome of "trying to fix it". – BeeOnRope Feb 08 '17 at 21:56
  • @BeeOnRope Typical code that sees any return value other than len will print an error message and exit. This is what pretty much everyone does. Because it's never supposed to happen and if it does it is beyond the program to deal with it. – Art Feb 09 '17 at 05:58
  • On the contrary, return values less than `len` but greater than zero are very common and have to be supported! If the file is a socket, pipe or any number of other things, partial writes are common, and can simply be retried after adjusting the buffer for the written length. For a program that opens files passed on the command line, you have to accommodate those, since the shell allows passing various pipes, redirects and special devices. That's exactly why you loop until everything is written. For zero-length returns, however, the case is unclear.