Disk snapshot size created from a google compute engine exceeds used space

Question

I have a Google Compute Instance (VM) that has a 2TB disk and around 80GB used space. I wanted to archive this VM so that I don't get billed for the whole 2TB, and also so that it is ready to be recreated quickly if needed. Disk Snapshots seemed to be the best option since it is mentioned that I only get billed for the disk space used in that case. But when I try this, the snapshot size I get is around 600GB, almost 10 times the used space, but still less than the full 2TB.

I tried defragmenting the disk, but that didn't help. I also tried using "zerofree" to write zeros to unused space, which reduced the snapshot size to 20GB, 4x lower than the used space. zerofree takes a lot of time and effort to run, but I'm guessing it helps with compression of the disk.

Is there a better way to improve disk compression efficiency in this case? Maybe any crucial step that I am missing while generating the disk snapshot?

NOTE: I also tried Machine Images but that seems to use disk snapshots under the hood, and they cost more for some reason.

I would say to use `fstrim` rather than `zerofree`, but if that worked and got the size of the snapshot down, then you're done. What exactly are you asking for? — psusi, Sep 10 '21 at 18:32
@psusi - Why do you think fstrim would help with disk snapshots? — John Hanley, Sep 10 '21 at 18:38
@psusi zerofree takes around 6-7 hours to run for a 5TB drive, and I have multiple of those. My question is if there is a more efficient way to create a snapshot-like copy of a disk, but one that does not bill me for the whole 5TB. — mrtksy, Sep 11 '21 at 16:43
@JohnHanley, because (assuming the VM supports it) it can instruct the VM to discard the data and free the unused space, without the bother of writing zeros to the space. — psusi, Feb 24 '22 at 20:30
@mrtksy, Could you just make a backup and then decommission the VM? Or maybe use a smaller system disk with an additional disk that you can plug in for data storage when you need it, and dispose of it when you don't (possibly after making a backup if you may need to restore it later)? — psusi, Feb 24 '22 at 20:32

score 0 · Answer 1 · answered Sep 09 '21 at 22:26

Disks normally have file systems. File systems have user data and file system metadata. The details depend on the disk partitioning scheme and file system type. The snapshot consists of changed disk blocks. This includes disk data blocks that were allocated, modified, and then deallocated by the file system.
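To illustrate why deallocated blocks can inflate a snapshot, here is a toy model in Python. It is only a sketch under stated assumptions: it assumes the snapshot cannot tell freed blocks from live ones and stores every non-zero block after compression. How Compute Engine actually stores snapshots is not described in this thread, and the block size, block counts, and function name are made up for the example.

```python
import os
import zlib

BLOCK = 4096            # hypothetical block size in bytes
ZERO = bytes(BLOCK)     # an all-zero block


def snapshot_size(disk):
    """Toy estimate: total compressed size of every block that is not all zeros."""
    return sum(len(zlib.compress(block)) for block in disk if block != ZERO)


# A fresh 100-block "disk" containing only zeros.
disk = [ZERO for _ in range(100)]

# Files are written to the first 60 blocks (random data stands in for file contents).
for i in range(60):
    disk[i] = os.urandom(BLOCK)

# Files covering blocks 10..59 are deleted. The file system only updates its
# metadata, so the old contents stay on the block device.
used_blocks = set(range(10))
size_after_delete = snapshot_size(disk)

# Zero every block the file system no longer uses (the effect zerofree aims for).
for i in range(len(disk)):
    if i not in used_blocks:
        disk[i] = ZERO
size_after_zero = snapshot_size(disk)

print("after delete:", size_after_delete, "bytes")
print("after zeroing free blocks:", size_after_zero, "bytes")
```

In this model the snapshot taken after the delete still pays for all 60 written blocks, while the one taken after zeroing pays only for the 10 blocks still in use. Real file data usually compresses better than the random bytes used here, which would be consistent with a snapshot ending up smaller than the used space.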

Your block-zero strategy is increasing the number of changed blocks which means recovering the snapshot will take longer. Note: persistent disks can be recovered from snapshots in a lazy manner which gives the appearance of fast recovery while the actual data restoration takes place. However, that process consumes disk bandwidth transferring the data in the background.

Recommendation:

Use tar or a similar archive tool to save the files to Cloud Storage as a compressed archive. Recreating, partitioning, and formatting a persistent disk is easy and in most cases takes seconds. Then restore the saved files.
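As a rough sketch of the archive step, the following Python uses the standard-library `tarfile` module to build a gzip-compressed archive of a directory. The helper name `archive_directory`, the `data` directory, and the file names are hypothetical, and the demo runs against a temporary directory rather than a real disk.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_directory(source_dir, archive_path):
    """Write source_dir into a gzip-compressed tar archive and return its path."""
    source = Path(source_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname keeps only the directory name, not the full absolute path.
        tar.add(str(source), arcname=source.name)
    return Path(archive_path)


# Demo on a throwaway directory; on a real VM source_dir would be the data to keep.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    (data_dir / "notes.txt").write_text("hello\n" * 1000)

    archive = archive_directory(data_dir, Path(tmp) / "data.tar.gz")

    with tarfile.open(str(archive), "r:gz") as tar:
        members = sorted(tar.getnames())
    archive_bytes = archive.stat().st_size

print(members)
print(archive_bytes, "bytes")
```

The resulting file could then be copied to a bucket, for example with `gsutil cp data.tar.gz gs://BUCKET/`, where `BUCKET` is a placeholder for your own bucket name. Note that an archive like this holds files only, not the partition table or boot configuration of the disk.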

Thanks for the quick response. When you say "changed blocks", what is this change with respect to? To make my question clearer, I don't have hourly or daily snapshots - I have a single snapshot that I manually create based on the disk, so it should contain all the information on the disk as opposed to any incremental information. Your idea makes perfect sense, but I want to capture the whole state of the disk, including any installed repos using apt. Do you have any recommendations for this use-case? Thanks — mrtksy, Sep 09 '21 at 22:42
@mrtksy Changed blocks are simply that: disk blocks that are modified, even if the new data is the same as what was previously there. If you want the convenience and features of snapshots, then use snapshots. If disk space used is your goal, then either clone the files to another persistent disk and then snapshot OR use a file archiver. — John Hanley, Sep 09 '21 at 22:53
