
I work with some very large files and use several different I/O functions to access them, most commonly through the bigmemory package.

When writing to the files, I've learned the hard way to flush output buffers, otherwise all bets are off on whether the data was saved. However, this can lead to some very long wait times while bigmemory does its thing (many minutes). I don't know why this happens - it doesn't always occur and it's not easily reproduced.

Is there some way to determine whether or not I/O buffers have been flushed in R, especially for bigmemory? If the operating system matters, then feel free to constrain the answer in that way.

If an answer can be generalized beyond bigmemory, that would be great, as I sometimes rely on other memory mapping functions or I/O streams.

If there are no good solutions to checking whether buffers have been flushed, are there cases in which it can be assumed that buffers have been flushed? I.e. besides using flush().

Update: I should clarify that these are all binary connections. @RichieCotton suggested isIncomplete(), though the help documentation only mentions text connections. It's not clear whether it is usable for binary connections.

Iterator
  • Not sure about use with `bigmemory`, but `isIncomplete` works for regular connections. – Richie Cotton Aug 08 '11 at 22:04
  • Thanks! The very limited help info on connections only mentions that isIncomplete is suitable for output of text connections. Have you had luck with binary connections? – Iterator Aug 08 '11 at 22:09

2 Answers


Is this more convincing evidence that isIncomplete() works with binary files?

# R process 1: write 100,000 integers to a binary file
zz <- file("~/test", "wb")
writeBin(1:100000, con = zz)
close(zz)

# R process 2: read the file back in chunks of 10,000 integers
zz2 <- file("~/test", "rb")
inpp <- readBin(con = zz2, integer(), 10000)
while (isIncomplete(zz2)) {
  Sys.sleep(1)
  inpp <- c(inpp, readBin(zz2, integer(), 10000))
}
close(zz2)

(Modified from the help(connections) file.)

IRTFM
  • Thanks for testing this. However, unless I'm misreading that, your example only uses it in the case of input buffers. I'm not really clear that it works on the output buffers. I am not familiar enough with output buffering to determine whether or not we can test it the same way. I'm just reluctant to go beyond the documentation - if its behavior is random, rather than deterministic, then I risk a bunch of corrupted data. I've been down that road, so I'm cautious. :) – Iterator Aug 09 '11 at 00:27
  • After further testing, I don't think `isIncomplete()` works for `bigmemory` objects: It seems that the objects are pointers of some sort, rather than connections. :( – Iterator Aug 17 '11 at 19:35
  • Thanks for the suggestion and example. It turns out that in this case the buffers are handled outside of R. – Iterator Aug 22 '11 at 15:31

I'll put forward my own answer, but I welcome anything that is clearer.

From what I've seen so far, the various connection functions, e.g. file, open, close, flush, isOpen, and isIncomplete (among others), are based on specific connection types, e.g. files, pipes, URLs, and a few other things.
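As a minimal sketch of how those connection functions behave on a binary file connection (base R only; the temporary file and the variable names are made up for illustration), the following writes integers, calls flush(), and then looks at the on-disk size. It only shows that R has handed the data to the OS, not that the OS has committed it to physical storage.

```r
# Illustrative only: a temporary file stands in for a real data file.
path <- tempfile(fileext = ".bin")
x <- 1:100000

con <- file(path, "wb")          # binary output connection
writeBin(x, con)                 # data may still sit in a user-space buffer
flush(con)                       # push the buffered output to the OS
flushed_size <- file.size(path)  # size the file system reports after flush()

stopifnot(isOpen(con))           # flush() leaves the connection open
close(con)                       # close() also flushes any remaining output

expected_size <- 4 * length(x)   # writeBin() uses 4 bytes per integer by default
```

Comparing flushed_size with expected_size is one indirect way to confirm that a flush on an ordinary connection has taken effect.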

In contrast, bigmemory uses its own object type rather than a connection: the bigmemory object is an S4 object with a slot holding a memory address that points to operating system buffers. Once data is placed there, the OS is in charge of flushing those buffers. Because that is an OS responsibility, getting information on "dirty" buffers requires interacting with the OS, not with R.

Thus, the answer for bigmemory is "no" as the data is stored in the kernel buffer, though it may be "yes" for other connections that are handled through STDIO (i.e. stored in "user space").
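For the kernel side, one rough option on Linux is to ask the OS how much dirty data it is still holding. The sketch below assumes a Linux system with /proc/meminfo; dirty_kb() is a hypothetical helper, not part of R or bigmemory, and the figures it returns are system-wide totals, so they cannot say which file or process owns the dirty pages.

```r
# Assumption: Linux only. /proc/meminfo reports system-wide totals in kB.
# dirty_kb() is a hypothetical helper, not part of R or bigmemory.
dirty_kb <- function(meminfo = "/proc/meminfo") {
  if (!file.exists(meminfo)) {
    # Not on Linux (or /proc unavailable): nothing to report.
    return(c(Dirty = NA_real_, Writeback = NA_real_))
  }
  lines <- readLines(meminfo)
  get_field <- function(name) {
    hit <- grep(paste0("^", name, ":"), lines, value = TRUE)
    if (length(hit) == 0) return(NA_real_)
    as.numeric(gsub("[^0-9]", "", hit[1]))  # keep only the digits of the kB value
  }
  c(Dirty = get_field("Dirty"), Writeback = get_field("Writeback"))
}

res <- dirty_kb()
```

If both values drop to near zero after a write, the kernel has probably finished writing back, but on a busy machine other processes will also contribute to these counters.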

For more insight on the OS / kernel side of things, see this question on SO; I am investigating a couple of programs (not just R + bigmemory) that are producing buffer flushing curiosities, and that thread helped to enlighten me about the kernel side of things.

Iterator