When I mmap a file in a mounted storage bucket, how/when is it downloaded?

Question

I'm using a library that mmaps a large resource file. I'm considering storing that file in a Google Cloud Storage bucket and mounting the bucket with gcsfuse so the library can mmap the file directly, instead of building my own solution to download it manually.

For performance reasons, I want to know when the file is actually downloaded if I mmap a file in a bucket mounted over gcsfuse. If it's downloaded all at once when I mmap it, that's ideal. If chunks are downloaded as I access different parts of the file through the mmapped pointer, I imagine that will be slower because of the multiple requests to the bucket, and I'd likely use another method in that case.

score 0 · Accepted Answer · answered Feb 23 '19 at 14:31

This is an implementation-specific detail, so be sure to read the documentation. The gcsfuse README.md as of 6ab0a79 has this to say:

Downloading object contents

Behind the scenes, when a newly-opened file is first modified, gcsfuse downloads the entire backing object's contents from GCS. The contents are stored in a local temporary file whose location is controlled by the flag --temp-dir. Later, when the file is closed or fsync'd, gcsfuse writes the contents of the local file back to GCS as a new object generation.

Files that have not been modified are read portion by portion on demand. gcsfuse uses a heuristic to detect when a file is being read sequentially, and will issue fewer, larger read requests to GCS in this case.

The consequence of this is that gcsfuse is relatively efficient when reading or writing entire large files, but will not be particularly fast for small numbers of random writes within larger files, and to a lesser extent the same is true of small random reads. Performance when copying large files into GCS is comparable to gsutil (see issue #22 for testing notes). There is some overhead due to the staging of data in a local temporary file, as discussed above.

Note that new and modified files are also fully staged in the local temporary directory until they are written out to GCS due to being closed or fsync'd. Therefore the user must ensure that there is enough free space available to handle staged content when writing large files.

Note three things: a write downloads the entire object, reading or writing whole objects is more efficient than small random accesses, and semantics.md documents some surprising behaviors. It will be more efficient to skip the FUSE file system layer and read and write entire chunks of your data directly as storage blobs with the GCS SDK. That is, however, a significant change to how this app uses storage.
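If you stay with the mount, one middle ground is to copy the file off the mount in a single sequential pass (the access pattern the README describes as efficient) and then mmap the local copy, so page faults never reach GCS. A minimal sketch, assuming Python; `stage_and_mmap`, the chunk size, and the example path are hypothetical names chosen for illustration, not part of gcsfuse:

```python
import mmap
import os
import shutil
import tempfile


def stage_and_mmap(mounted_path, local_dir=None, chunk_size=16 * 1024 * 1024):
    """Copy mounted_path to local disk sequentially, then mmap the copy read-only.

    Returns (mapped, local_path). The caller closes the mapping and deletes
    local_path when finished.
    """
    fd, local_path = tempfile.mkstemp(dir=local_dir)
    try:
        # One front-to-back read of the mounted file, in large chunks.
        with open(mounted_path, "rb") as src, os.fdopen(fd, "wb") as dst:
            shutil.copyfileobj(src, dst, chunk_size)
        # Map the local copy; the mapping stays valid after the file object closes.
        with open(local_path, "rb") as f:
            mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    except BaseException:
        os.unlink(local_path)
        raise
    return mapped, local_path


# Usage (hypothetical mount point):
#   mapped, local_path = stage_and_mmap("/mnt/bucket/resource.bin")
#   header = mapped[:16]
#   mapped.close()
#   os.unlink(local_path)
```

If the library performs the mmap itself, the same idea applies: do the sequential copy and hand the library the local path. This needs enough local disk space for the whole file.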
