How are sparse files handled in Google Cloud Storage?

Question

We have a 200GB sparse file which is about 80GB in actual size (VMware disk).

How does Google calculate the space for this file, 200GB or 80GB?
What would be the best practice to store it in the Google Cloud using gsutil (similar to rsync -S)?
Would it be solved by using tar cSf, and then upload via gsutil? How slow could it be?

I answered your question below as stated, but it might help to describe your actual use case/goals in more detail as there might be alternative solutions to it as well. For example, what are you doing specifically? Is this just a backup solution? Are you mutating the image offline and uploading it? Can you do all your operations in the cloud and thus avoid the massive file uploads? Consider also other techniques like containers, etc. — Misha Brukman, Aug 20 '14 at 01:49

score 1 · Answer 1 · answered Aug 20 '14 at 01:46

We have a 200GB sparse file which is about 80GB in actual size (VMware disk).

How does Google calculate the space for this file, 200GB or 80GB?

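Google Cloud Storage does not introspect your files to understand what they are, so what matters is the size of the object as it is uploaded. If the upload carries only the 80GB of real data (for example, an archive format that records the holes), that is what you are charged for. A tool that simply reads the sparse file from start to end sees the holes as zeros and will produce a 200GB object.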
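The two sizes in question can be compared locally before uploading anything. Below is a minimal Python sketch, assuming a Unix-like system where `os.stat` reports `st_blocks` in 512-byte units; the helper names and the demo file name are made up for illustration.

```python
import os
import tempfile


def sparse_sizes(path):
    """Return (apparent_bytes, allocated_bytes) for a file.

    apparent_bytes is the length a sequential reader sees;
    allocated_bytes is what the filesystem has actually reserved.
    Assumes a Unix-like system, where st_blocks counts 512-byte units.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


def make_sparse_file(path, size):
    """Create a file of `size` bytes that consists of a single hole."""
    with open(path, "wb") as f:
        f.truncate(size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.img")  # hypothetical file name
        make_sparse_file(demo, 10 * 1024 * 1024)
        apparent, allocated = sparse_sizes(demo)
        print(f"apparent: {apparent} bytes, allocated: {allocated} bytes")
```

On a filesystem that supports holes, the allocated figure for the demo file should be far below the apparent one; for the VMware disk in the question the corresponding numbers would be roughly 200GB and 80GB.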

What would be the best practice to store it in the Google Cloud using gsutil (similar to rsync -S)?

There's gsutil rsync, but it does not support -S, so that won't be very efficient. Also, Google Cloud Storage does not store files as blocks that can be accessed and rewritten randomly; it stores blobs keyed by bucket name + object name, so you'll essentially be re-uploading the entire file whenever it changes.

One alternative you might consider is to use Persistent Disks, which provide block-level access to your files, with the following workflow:

One-time setup:

create a persistent disk and use it only for storage of your VM image

Pre-sync setup:

create a Linux VM instance with its own boot disk
attach the persistent disk in read-write mode to the instance
mount the attached disk as a file system

Synchronize:

use ssh+rsync to synchronize your VM image to the persistent disk on the VM

Post-sync teardown:

unmount the disk within the instance
detach the persistent disk from the instance
delete the VM instance

You can script the setup and teardown steps, so the synchronization should be easy to run on a regular basis.
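As a rough illustration of such a script, here is a Python sketch that only builds the command sequence for the pre-sync, sync and teardown steps and prints it, rather than running anything. It makes several assumptions that are not in the answer: the persistent disk already exists and has a filesystem on it, the instance is reachable through an SSH host alias equal to its name, the remote user can write to the mount point, and all names (`sync-vm`, `vm-image-store`, `us-central1-a`, `disk.vmdk`, `/mnt/vmstore`) are placeholders.

```python
import shlex


def build_sync_plan(instance, disk, zone, image_path, mount_point="/mnt/vmstore"):
    """Return the pre-sync, sync and teardown commands as argument lists.

    Every name passed in is a placeholder chosen by the caller. The plan
    assumes the persistent disk already exists and carries a filesystem.
    """
    device = f"/dev/disk/by-id/google-{disk}"
    gcloud = ["gcloud", "compute"]
    zone_flag = f"--zone={zone}"

    def remote(cmd):
        # Run a shell command on the instance through gcloud's ssh wrapper.
        return gcloud + ["ssh", instance, zone_flag, f"--command={cmd}"]

    return [
        # Pre-sync setup
        gcloud + ["instances", "create", instance, zone_flag],
        gcloud + ["instances", "attach-disk", instance, f"--disk={disk}",
                  f"--device-name={disk}", "--mode=rw", zone_flag],
        remote(f"sudo mkdir -p {mount_point} && sudo mount {device} {mount_point}"),
        # Synchronize: -S asks rsync to recreate holes on the receiving side.
        ["rsync", "-avS", "-e", "ssh", image_path, f"{instance}:{mount_point}/"],
        # Post-sync teardown
        remote(f"sudo umount {mount_point}"),
        gcloud + ["instances", "detach-disk", instance, f"--disk={disk}", zone_flag],
        gcloud + ["instances", "delete", instance, zone_flag, "--quiet"],
    ]


if __name__ == "__main__":
    plan = build_sync_plan("sync-vm", "vm-image-store", "us-central1-a", "disk.vmdk")
    for cmd in plan:
        print(shlex.join(cmd))  # dry run: print instead of executing
```

Printing the plan first makes it easy to review the commands; once they look right for your project, each list could be passed to `subprocess.run(cmd, check=True)` so the script stops at the first failing step.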

Would it be solved by using tar cSf, and then upload via gsutil? How slow could it be?

The method above will be limited by your network connection, and would be no different from ssh+rsync to any other server. To get a feel for the speed, throttle your bandwidth to another server on your own network so that it matches your external upload speed, then run rsync over ssh against that server.
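For a first-order answer to "how slow could it be", the transfer time can be estimated from the amount of data and the uplink speed. The sketch below is simple arithmetic, not a measurement; the 10 Mbit/s uplink is a made-up figure, and the `efficiency` parameter is a hypothetical fudge factor for protocol overhead.

```python
def upload_seconds(size_gb, uplink_mbps, efficiency=1.0):
    """Estimate transfer time in seconds.

    size_gb is in decimal gigabytes, uplink_mbps in megabits per second,
    and efficiency is the usable fraction of the link (0 < efficiency <= 1).
    """
    if uplink_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("uplink_mbps must be > 0 and efficiency in (0, 1]")
    bits = size_gb * 1e9 * 8
    return bits / (uplink_mbps * 1e6 * efficiency)


if __name__ == "__main__":
    for size in (80, 200):  # actual vs. apparent size from the question
        hours = upload_seconds(size, uplink_mbps=10) / 3600
        print(f"{size} GB at 10 Mbit/s: about {hours:.1f} hours")
```

At that assumed speed, 80GB takes about 17.8 hours and 200GB about 44.4 hours for a full transfer, which is why preserving sparseness and sending only changed blocks matters.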

Pricing is not covered above, so here are some pointers that may be relevant to your analysis.

Using the Google Cloud Storage approach, you'll incur:

Google Cloud Storage pricing: currently $0.026 / GB / month
Network egress (ingress is free): varies by total amount of data

Using the Persistent Disk approach, you'll incur:

Persistent Disk pricing: currently $0.04 / GB / month
VM instance: needs to be up only while you're running the sync
Network egress (ingress is free): varies by total amount of data

Most of the traffic should be uploaded rather than downloaded, and the amount of data you download should be small, since minimizing transfer is what rsync is designed for. Your network cost should therefore be low, but that depends on the actual rsync implementation, which I cannot speak for.
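To compare the storage part of the two approaches, the per-GB prices quoted above can be plugged into a small calculation. This is only a sketch: the prices are the ones listed in this answer and will have changed, the sizes come from the question, and it ignores the VM instance and network egress. It also assumes the persistent disk is charged on its provisioned size, so the disk size you choose is what counts.

```python
GCS_PER_GB_MONTH = 0.026  # USD, price quoted in the answer above
PD_PER_GB_MONTH = 0.04    # USD, price quoted in the answer above


def monthly_storage_cost(stored_gb, price_per_gb_month):
    """Return the monthly storage cost in USD, rounded to cents."""
    return round(stored_gb * price_per_gb_month, 2)


if __name__ == "__main__":
    for size in (80, 200):  # actual vs. apparent size from the question
        gcs = monthly_storage_cost(size, GCS_PER_GB_MONTH)
        pd = monthly_storage_cost(size, PD_PER_GB_MONTH)
        print(f"{size} GB: Cloud Storage ${gcs:.2f}/month, Persistent Disk ${pd:.2f}/month")
```

With those inputs, 80GB comes to $2.08 per month in Cloud Storage and $3.20 on a persistent disk; 200GB comes to $5.20 and $8.00 respectively.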

Hope this helps.
