
Boehm GC only deals with memory allocation. But what if one wants to use garbage collection to deal with fopen() so that fclose() is no longer needed? Is there a way to do so in C?

P.S. For example, PyPy takes the garbage-collection approach to dealing with open files. Its documentation notes:

The most obvious effect of this is that files (and sockets, etc) are not promptly closed when they go out of scope. For files that are opened for writing, data can be left sitting in their output buffers for a while, making the on-disk file appear empty or truncated.

http://doc.pypy.org/en/latest/cpython_differences.html

user1424739

2 Answers


In case it's not obvious, nothing Boehm GC does is possible in portable, standard C. The whole library is a huge heap of undefined behavior that kinda happens to work on some (many?) real-world implementations. The more advanced C implementations become, especially in the area of safety, the less likely any of it is to keep working.

With that said, I don't see any reason the same principle couldn't be extended to FILE* handles. The problem, however, is that since this is necessarily a conservative GC, false positives for remaining references would prevent the file from being closed, and that has visible consequences for the state of the process and the filesystem. If you explicitly fflush in the right places, it might be acceptably only-half-broken, though.
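
As a minimal sketch of the "fflush in the right places" idea (the write_through() helper name is hypothetical), one option is to flush after every write, so that a late fclose() at least doesn't strand data in stdio buffers:

    #include <stdio.h>

    /* Hypothetical helper: write and flush immediately, so data is not
     * left sitting in stdio buffers while the GC delays the eventual
     * fclose(). */
    static size_t write_through(const void *buf, size_t size, size_t n, FILE *fp)
    {
        size_t written = fwrite(buf, size, n, fp);
        if (written > 0)
            fflush(fp);    /* push the data to the OS right away */
        return written;
    }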

There's absolutely no meaningful way to do this with file descriptors, on the other hand, because they are small integers. You'll essentially always have false positives for remaining references.

R.. GitHub STOP HELPING ICE

TL;DR: Yes, but. More but than yes.

First things first. Since the standard C library must itself automatically garbage collect open file handles in the exit() function (see standard quotes below), it is not necessary to ever call fclose as long as:

  1. You are absolutely certain that your program will eventually terminate either by returning from main() or by calling exit().

  2. You don't care how much time elapses before the file is closed (making data written to the file available to other processes).

  3. You don't need to be informed if the close operation failed (perhaps because of disk failure).

  4. Your process will not open more than FOPEN_MAX files, and will not attempt to open the same file twice. (FOPEN_MAX must be at least eight, but that includes the three standard streams.)
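
As a toy illustration of a program that satisfies all four conditions (the file name is hypothetical), the following never calls fclose() and relies on the return from main() to flush and close the stream:

    #include <stdio.h>

    int main(void)
    {
        FILE *out = fopen("result.txt", "w");   /* never fclose()d */
        if (out != NULL)
            fprintf(out, "42\n");
        return 0;   /* equivalent to exit(0): all open streams are flushed
                       and closed (see the standard quotes at the end) */
    }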

Of course, aside from very simple toy applications, those guarantees are pretty restrictive, particularly for files opened for writing. For a start, how are you going to guarantee that the host does not crash or get powered down (voiding condition 1)? So most programmers regard it as very bad style to not close all open files.

All the same, it is possible to imagine an application which only opens files for reading. In that case, the most serious issue with never calling fclose will be the last one, the simultaneous open file limit. Five (the minimum FOPEN_MAX of eight, minus the three standard streams) is a pretty small number, and even though most systems have much higher limits, they almost all have limits; if an application runs long enough, it will inevitably open too many files. (The other part of condition 4, not opening the same file twice, might be a problem too, although not all operating systems impose such a restriction, and few impose it on files opened only for reading.)

As it happens, these are precisely the issues that garbage collection can, in theory, help solve. With a bit of work, it is possible to get a garbage collector to help manage the number of simultaneously open files. But... as mentioned, there are a number of Buts. Here's a few:

  1. The standard library is under no obligation to dynamically allocate FILE objects using malloc, or indeed to dynamically allocate them at all. (A library which only allowed eight open files might have an internal statically allocated array of eight FILE structures, for example.) So the garbage collector might never see the storage allocations. In order to involve the garbage collector in the removal of FILE objects, every FILE* needs to be wrapped inside a dynamically-allocated proxy (a "handle"), and every interface which takes or returns FILE* pointers must be wrapped with one which creates a proxy. That's not too much work, but there are a lot of interfaces to wrap and the use of the wrappers basically relies on source modification; you might find it difficult to introduce FILE* proxies if some files are opened by external library functions. (A sketch of such a proxy appears after this list.)

  2. Although the garbage collector can be told what to do before it deletes certain objects (see below), most garbage collector libraries have no interface which provides for an object creation limit other than the availability of memory. The garbage collector can only solve the "too many open files" problem if it knows how many files are allowed to be open simultaneously, but it doesn't know and it doesn't have a way for you to tell it. So you have to arrange for the garbage collector to be called manually when this limit is about to be breached. Of course, since you are already wrapping all calls to fopen, as per point 1, you can add this logic to your wrapper, either by tracking the open file count, or by reacting to an error indication from fopen(). (The C standard doesn't specify a portable mechanism for detecting this particular error, but Posix says that fopen should fail and set errno to EMFILE if the process has too many files open. Posix also defines the ENFILE error value for the case where there are too many files open in total over all processes; it's probably worthwhile to consider both of these cases.)

  3. In addition, the garbage collector doesn't have a mechanism to limit garbage collection to a single resource type. (It would be very difficult to implement this in a mark-sweep garbage collector, such as the BDW collector, because all used memory needs to be scanned to find live pointers.) So triggering garbage collection whenever all file descriptor slots are used up could turn out to be quite expensive.

  4. Finally, the garbage collector does not guarantee that garbage will be collected in a timely manner. If there is no resource pressure, the garbage collector could stay dormant for a long time, and if you are relying on the garbage collector to close your files, that means that the files could remain open for an unlimited amount of time even though they are no longer in use. So the first two conditions in the original list of requirements for omitting fclose() continue to be in force, even with a garbage collector.
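
To make point 1 concrete, here is a minimal sketch of what such a proxy and a couple of wrapped interfaces could look like; the names file_handle, gc_fgets and gc_fclose are hypothetical, and allocation assumes the Boehm collector's GC_MALLOC:

    #include <stdio.h>
    #include <gc.h>    /* Boehm GC: GC_MALLOC */

    /* Hypothetical proxy ("handle"): a GC-visible object that owns a FILE*.
     * The collector scans this allocation, so any live pointer to the
     * handle keeps the underlying file open. */
    typedef struct file_handle {
        FILE *fp;      /* NULL once the file has been closed */
    } file_handle;

    /* Every stdio interface the program uses must be wrapped so that it
     * takes the proxy instead of a raw FILE*. */
    char *gc_fgets(char *buf, int n, file_handle *h)
    {
        return h->fp != NULL ? fgets(buf, n, h->fp) : NULL;
    }

    /* Explicit closing remains available; the GC finalizer (see the sketch
     * further down) is only a backup for handles that become unreachable
     * without ever being closed. */
    int gc_fclose(file_handle *h)
    {
        int rc = (h->fp != NULL) ? fclose(h->fp) : 0;
        h->fp = NULL;
        return rc;
    }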

So. Yes, but, but, but, but. Here's what the Boehm GC documentation recommends (abbreviated):

  • Actions that must be executed promptly… should be handled by explicit calls in the code.
  • Scarce system resources should be managed explicitly whenever convenient. Use [garbage collection] only as a backup mechanism for the cases that would be hard to handle explicitly.
  • If scarce resources are managed with [the garbage collector], the allocation routine for that resource (e.g. open file handles) should force a garbage collection (two if that doesn't suffice) if it finds itself short of the resource.
  • If extremely scarce resources are managed (e.g. file descriptors on systems which have a limit of 20 open files), it may be necessary to introduce a descriptor caching scheme to hide the resource limit.

Now, suppose you've read all of that, and you still want to do it. It's actually pretty simple. As mentioned above, you need to define a proxy object, or handle, which holds a FILE*. (If you are using Posix interfaces like open() which use file descriptors -- small integers -- instead of FILE structures, then the handle holds the fd. This is a different object type, obviously, but the mechanism is identical.)

In your wrapper for fopen() (or open(), or any of the other calls which return open FILE*s or file descriptors), you dynamically allocate a handle, and then (in the case of the Boehm GC) call GC_register_finalizer to tell the garbage collector what function to call when the resource is about to be deleted. Almost all GC libraries have some such facility; search for finalizer in their documentation. Here's the documentation for the Boehm collector, out of which I extracted the list of warnings above.

Take care to avoid race conditions when wrapping the open call. The recommended practice is as follows:

  1. Dynamically allocate the handle.
  2. Initialize its contents to a sentinel value (such as -1 or NULL) which indicates that the handle has not yet been assigned to an open file.
  3. Register a finalizer for the handle. The finalizer function should check for the sentinel value before attempting to call fclose(), so registering the handle at this point is fine.
  4. Open the file (or other such resource).
  5. If the open succeeds, reset the handle to hold the FILE* (or descriptor) returned by the open. If the failure has to do with resource exhaustion, trigger a manual garbage collection and repeat as necessary. (Be careful to limit the number of times you do that for a single open wrapper. Sometimes you need to do it twice, but three consecutive failures probably indicate some other kind of problem.)
  6. If the open eventually succeeded, return the handle. Otherwise, optionally deregister the finalizer (if your GC library allows that) and return an error indication.
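
Putting that procedure together, a sketch of an fopen() wrapper for the Boehm collector might look like this (file_handle and gc_fopen are the same hypothetical names as in the earlier sketch, and error handling is kept minimal):

    #include <errno.h>
    #include <stdio.h>
    #include <gc.h>

    /* Same hypothetical proxy as above, repeated so the sketch stands alone. */
    typedef struct file_handle { FILE *fp; } file_handle;

    /* Finalizer: called by the collector when a handle becomes unreachable.
     * It must tolerate the sentinel (fp == NULL) as well as handles that
     * were already closed explicitly. */
    static void close_file_finalizer(void *obj, void *client_data)
    {
        file_handle *h = obj;
        (void)client_data;
        if (h->fp != NULL) {
            fclose(h->fp);
            h->fp = NULL;
        }
    }

    file_handle *gc_fopen(const char *path, const char *mode)
    {
        /* Steps 1-3: allocate the handle, set the sentinel, and register
         * the finalizer before the file is opened. */
        file_handle *h = GC_MALLOC(sizeof *h);
        h->fp = NULL;
        GC_register_finalizer(h, close_file_finalizer, NULL, NULL, NULL);

        /* Steps 4-5: open, forcing at most two collections if the failure
         * looks like descriptor exhaustion. */
        for (int attempt = 0; ; attempt++) {
            h->fp = fopen(path, mode);
            if (h->fp != NULL)
                return h;                          /* step 6: success */
            if (attempt >= 2 || (errno != EMFILE && errno != ENFILE))
                break;                             /* give up */
            GC_gcollect();                         /* reclaim unreachable handles */
            GC_invoke_finalizers();                /* ...and run their finalizers */
        }

        /* Step 6, failure path: deregister the finalizer and report the error. */
        GC_register_finalizer(h, NULL, NULL, NULL, NULL);
        return NULL;
    }

Because the finalizer is registered before the file is opened, it has to tolerate the NULL sentinel, which is exactly what step 3 above requires.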

Obligatory C standard quotes

  1. Returning from main() is the same as calling exit()

    §5.1.2.2.3 (Program termination): (Only applies to hosted implementations)

    1. If the return type of the main function is a type compatible with int, a return from the initial call to the main function is equivalent to calling the exit function with the value returned by the main function as its argument; reaching the } that terminates the main function returns a value of 0.
  2. Calling exit() flushes all file buffers and closes all open files

    §7.22.4.4 (The exit function):

    1. Next, all open streams with unwritten buffered data are flushed, all open streams are closed, and all files created by the tmpfile function are removed…
rici