We have a 200GB sparse file which is about 80GB in actual size (VMware disk).
- How does Google calculate the space for this file, 200GB or 80GB?
Google Cloud Storage does not introspect your files to understand what they are, so it's the actual size (80GB) that it takes on disk that matters.
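If you want to check the two sizes for your own file, here is a minimal Python sketch that reports the apparent size and the space actually allocated on disk. It assumes a POSIX system where `st_blocks` is counted in 512-byte units (as on Linux); the helper names and the demo file are made up for illustration.

```python
import os
import tempfile


def file_sizes(path):
    """Return (apparent_bytes, allocated_bytes) for a file."""
    st = os.stat(path)
    # Assumption: st_blocks is in 512-byte units, as on Linux.
    return st.st_size, st.st_blocks * 512


def make_sparse_file(path, apparent_bytes, data=b"x" * 4096):
    """Create a demo file: a little real data followed by a hole."""
    with open(path, "wb") as f:
        f.write(data)
        # Extends the file to apparent_bytes without writing the tail.
        f.truncate(apparent_bytes)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.img")
        make_sparse_file(demo, 10 * 1024 * 1024)
        apparent, allocated = file_sizes(demo)
        print(f"apparent:  {apparent} bytes")
        print(f"allocated: {allocated} bytes")
```

On a filesystem that supports sparse files, the allocated figure for the demo file should come out far below the apparent one; calling `file_sizes` on your VMware disk gives the corresponding pair for the real image.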
- What would be the best practice to store it in the Google Cloud using `gsutil` (similar to `rsync -S`)?
There's `gsutil rsync`, but it does not support `-S`, so that won't be very efficient. Also, Google Cloud Storage does not store files as blocks that can be accessed and rewritten randomly, but as blobs keyed by bucket name + object name, so you'll essentially be uploading the entire 80GB file every time.
One alternative you might consider is to use Persistent Disks, which provide block-level access to your files, with the following workflow:
One-time setup:
- create a persistent disk and use it only for storage of your VM image
Pre-sync setup:
- create a Linux VM instance with its own boot disk
- attach the persistent disk in read-write mode to the instance
- mount the attached disk as a file system
Synchronize:
- use ssh+rsync to synchronize your VM image to the persistent disk on the VM
Post-sync teardown:
- unmount the disk within the instance
- detach the persistent disk from the instance
- delete the VM instance
You can automate the setup and teardown steps with scripts so it should be very easy to run on a regular basis whenever you want to do the synchronization.
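As a rough illustration of such a script, here is a Python sketch that builds the `gcloud` and `rsync` commands for each phase and prints them (it only executes them if you pass `execute=True`). Every name in it (disk, instance, zone, device name, mount point, disk size, remote address) is a placeholder I made up, and it assumes the persistent disk already carries a file system from a one-time format step.

```python
import shlex
import subprocess

# Placeholder names, not values from any real setup.
DISK = "vm-image-disk"
INSTANCE = "sync-helper"
ZONE = "us-central1-a"
DEVICE = "vmimage"  # appears in the guest as /dev/disk/by-id/google-vmimage
MOUNT_POINT = "/mnt/vmimage"


def one_time_commands():
    """One-time setup: create the disk (size is a placeholder).

    The new disk still has to be formatted once before its first mount.
    """
    return [["gcloud", "compute", "disks", "create", DISK,
             "--size=200GB", f"--zone={ZONE}"]]


def setup_commands():
    """Pre-sync: create the VM, attach the disk read-write, mount it."""
    mount = (f"sudo mkdir -p {MOUNT_POINT} && "
             f"sudo mount /dev/disk/by-id/google-{DEVICE} {MOUNT_POINT}")
    return [
        ["gcloud", "compute", "instances", "create", INSTANCE,
         f"--zone={ZONE}"],
        ["gcloud", "compute", "instances", "attach-disk", INSTANCE,
         f"--disk={DISK}", f"--device-name={DEVICE}", "--mode=rw",
         f"--zone={ZONE}"],
        ["gcloud", "compute", "ssh", INSTANCE, f"--zone={ZONE}",
         f"--command={mount}"],
    ]


def sync_command(local_image, remote):
    """Synchronize: -S (--sparse) asks rsync to recreate holes remotely.

    The remote user needs write access to the mount point.
    """
    return ["rsync", "-avS", "-e", "ssh", local_image,
            f"{remote}:{MOUNT_POINT}/"]


def teardown_commands():
    """Post-sync: unmount, detach the disk, delete the VM."""
    return [
        ["gcloud", "compute", "ssh", INSTANCE, f"--zone={ZONE}",
         f"--command=sudo umount {MOUNT_POINT}"],
        ["gcloud", "compute", "instances", "detach-disk", INSTANCE,
         f"--disk={DISK}", f"--zone={ZONE}"],
        ["gcloud", "compute", "instances", "delete", INSTANCE,
         f"--zone={ZONE}", "--quiet"],
    ]


def run(commands, execute=False):
    """Print each command; run it only when execute is True."""
    for cmd in commands:
        print(shlex.join(cmd))
        if execute:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(setup_commands())
    run([sync_command("disk.vmdk", "user@203.0.113.10")])
    run(teardown_commands())
```

Printing first and executing only on request makes it easy to review the exact commands before anything is created or deleted in your project.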
- Would it be solved by using `tar cSf`, and then upload via `gsutil`? How slow could it be?
The method above will be limited by your network connection, and would be no different from ssh+rsync to any other server. You can try it out by artificially throttling your bandwidth to another server on your own network to match your external upload speed, then running rsync over ssh.
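For a back-of-the-envelope number before running that test, here is a small Python sketch. The 10 Mbit/s uplink is a made-up figure, and the estimate ignores protocol overhead and whatever rsync saves on later runs, so treat it as a rough bound for the initial full upload only. The second helper converts a link speed into a value for rsync's `--bwlimit` option, which takes units of 1024 bytes per second.

```python
def transfer_hours(size_bytes, link_bits_per_second):
    """Hours to push size_bytes over the link, ignoring overhead."""
    return size_bytes * 8 / link_bits_per_second / 3600


def rsync_bwlimit_kib(link_bits_per_second):
    """Link speed as an rsync --bwlimit value (1024-byte units per second)."""
    return int(link_bits_per_second / 8 / 1024)


if __name__ == "__main__":
    uplink = 10_000_000  # hypothetical 10 Mbit/s upload speed
    print(f"{transfer_hours(80 * 10**9, uplink):.1f} h")   # 17.8 h
    print(f"rsync --bwlimit={rsync_bwlimit_kib(uplink)}")  # rsync --bwlimit=1220
```

So at that assumed speed the first 80GB sync would take the better part of a day; subsequent runs should move much less if only parts of the image change.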
Something not covered above is pricing, so I'll leave these pointers, as they may be relevant to your analysis.
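Using Google Cloud Storage, you'll incur storage charges for the object plus network charges.
Using the Persistent Disk approach, you'll incur charges for the persistent disk, for the VM instance while it is running, and for network traffic.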
The amount of data you download should be small, since minimizing transfer is what rsync is designed for; most of the traffic will be uploads, so your network cost should be low. That depends on the actual rsync implementation, though, which I cannot speak for.
Hope this helps.