I have a bucket folder in Google Cloud with about 47GB of data in it. I start a new Kubernetes StatefulSet (in my Google Cloud Kubernetes cluster). The first thing the container inside the StatefulSet does is run `gsutil -m rsync -r gs://<BUCKET_PATH> <LOCAL_MOUNT_PATH>` to sync the bucket folder contents to a locally mounted folder, which corresponds to a Kubernetes Persistent Volume. The Persistent Volume Claim for this StatefulSet requests 125Gi of storage and is only used for this rsync. But the `gsutil` sync eventually hits a wall where the pod runs out of disk space (space in the Persistent Volume) and `gsutil` throws an error: `[Errno 28] No space left on device`. This is weird, because I only need to copy 47GB of data over from the bucket, and the Persistent Volume should have 125Gi of storage available.
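
For context, the StatefulSet is roughly shaped like this (a simplified sketch, not the real manifest; the image, names, and paths are placeholders):

```yaml
# Simplified sketch of the StatefulSet; names, image, and paths are placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: bucket-sync
spec:
  serviceName: bucket-sync
  replicas: 1
  selector:
    matchLabels:
      app: bucket-sync
  template:
    metadata:
      labels:
        app: bucket-sync
    spec:
      containers:
        - name: sync
          image: google/cloud-sdk:slim   # placeholder image that ships gsutil
          # Sync the bucket into the mount backed by the Persistent Volume
          command: ["sh", "-c", "gsutil -m rsync -r gs://<BUCKET_PATH> <LOCAL_MOUNT_PATH>"]
          volumeMounts:
            - name: data
              mountPath: <LOCAL_MOUNT_PATH>
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 125Gi
```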
I can confirm that the Persistent Volume Claim and the Persistent Volume have been provisioned with the appropriate sizes using `kubectl get pvc` and `kubectl get pv`. If I run `df -h` inside the pod (`kubectl exec -it <POD_NAME> -- df -h`), I can see that the mounted path exists and that it has the expected size (125Gi). Running `df -h` during the sync, I can see that it does indeed take up all the available space in the Persistent Volume when it finally hits `No space left on device`.
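
For reference, these are the checks I am running (the pod name is a placeholder):

```shell
# Confirm the claim and the volume were provisioned at the requested size
kubectl get pvc
kubectl get pv

# Check the size and usage of the mounted path from inside the pod
kubectl exec -it <POD_NAME> -- df -h
```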
Further, if I provision a Persistent Volume of 200Gi and retry the sync, it finishes successfully, and `df -h` shows that the used space in the Persistent Volume is 47GB, as expected (this is after `gsutil rsync` has completed).
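
(For completeness, the used space after the successful run can also be cross-checked directly on the mount, for example with `du`; pod name and path are placeholders:)

```shell
# Cross-check actual on-disk usage on the mounted volume after the sync
kubectl exec -it <POD_NAME> -- du -sh <LOCAL_MOUNT_PATH>
```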
So it seems that `gsutil rsync` uses far more space while syncing than I would expect. Why is this? Is there a way to change how the `gsutil rsync` is performed so that it doesn't require a larger Persistent Volume than necessary?
It should be noted that there are a lot of individual files, and that the pod is restarted about 8 times during the sync.