
I am using the gcloud storage cp command to copy a large number of files from one GCS bucket to another using the command below:

gcloud storage cp -r "gs://test-1/*" "gs://test-3" --encryption-key=XXXXXXXXXXXXXXXXXXXXXXX --storage-class=REGIONAL

I have a use case where I want to copy files but skip those that have already been copied.

--manifest-path can solve this problem for me, using the command below:

gcloud storage cp -r "gs://test-1/*" "gs://test-3" --encryption-key=XXXXXXXXXXXXXXXXXXXXXXX --manifest-path=manifest.csv --storage-class=REGIONAL

However, I will be running this command on Kubernetes, where pod storage is ephemeral, so the manifest file will be lost between runs. I therefore want to keep it hosted somewhere persistent.
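One workaround is to keep the manifest in GCS yourself and stage it through a local file around the copy. This is only a sketch (untested at scale); it assumes the pod's service account can read and write a `gs://manifests-bucket` bucket and that the encryption key is supplied via an `ENCRYPTION_KEY` environment variable:

```shell
#!/usr/bin/env bash
set -euo pipefail

MANIFEST_GCS="gs://manifests-bucket/manifest.csv"   # assumed bucket/object, substitute your own
MANIFEST_LOCAL="/tmp/manifest.csv"

# Pull the manifest from the previous run if it exists; start fresh otherwise.
gcloud storage cp "$MANIFEST_GCS" "$MANIFEST_LOCAL" || true

# Persist the manifest back to GCS on exit, even if the copy fails part-way,
# so the next pod can resume from where this one stopped.
trap 'gcloud storage cp "$MANIFEST_LOCAL" "$MANIFEST_GCS"' EXIT

# Run the copy against the local manifest so already-copied files are skipped.
gcloud storage cp -r "gs://test-1/*" "gs://test-3" \
  --encryption-key="$ENCRYPTION_KEY" \
  --manifest-path="$MANIFEST_LOCAL" \
  --storage-class=REGIONAL
```

The `trap` matters on Kubernetes: if the pod is evicted or the copy errors mid-run, the partially updated manifest is still uploaded, so the next run doesn't start from scratch.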

I tried passing a Google Cloud Storage location for the manifest file, but it gave me an error:

gcloud storage cp -r "gs://test-1/*" "gs://test-3" --encryption-key=XXXXXXXXXXXXXXXXXXXXXXX --manifest-path=gs://manifests-bucket/manifest.csv --storage-class=REGIONAL
ERROR: (gcloud.storage.cp) Unable to write file [gs://manifests-bucket/manifest.csv]: [Errno 2] No such file or directory: 'gs://manifests-bucket/manifest.csv'

How can I pass a Google Cloud Storage path as the manifest file path?

References : https://cloud.google.com/sdk/gcloud/reference/storage/cp#--manifest-path

EDIT 1 :

Tried granting permissions on the bucket, assuming gcloud storage cp uses the Storage Transfer Service service account behind the scenes:

gsutil iam ch serviceAccount:project-XXXXXXXXX@storage-transfer-service.iam.gserviceaccount.com:objectCreator,legacyBucketReader gs://manifests-bucket/

References :

https://cloud.google.com/storage-transfer/docs/manifest
https://cloud.google.com/storage-transfer/docs/source-cloud-storage#grant_the_required_permissions

EDIT 2

Tried the gsutil rsync command, passing the encryption key; it doesn't do anything. The output of the command is attached below as well.

➜ gsutil -m -o "GSUtil:encryption_key=XXXXXXXXXXXXXXXXX" rsync gs://test-1 gs://test-3

WARNING: gsutil rsync uses hashes when modification time is not available at
both the source and destination. Your crcmod installation isn't using the
module's C extension, so checksumming will run very slowly. If this is your
first rsync since updating gsutil, this rsync can take significantly longer than
usual. For help installing the extension, please see "gsutil help crcmod".

Building synchronization state...
If you experience problems with multiprocessing on MacOS, they might be related to https://bugs.python.org/issue33725. You can disable multiprocessing by editing your .boto config or by adding the following flag to your command: `-o "GSUtil:parallel_process_count=1"`. Note that multithreading is still available even if you disable multiprocessing.

Starting synchronization...
If you experience problems with multiprocessing on MacOS, they might be related to https://bugs.python.org/issue33725. You can disable multiprocessing by editing your .boto config or by adding the following flag to your command: `-o "GSUtil:parallel_process_count=1"`. Note that multithreading is still available even if you disable multiprocessing.
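For reference, the `-o "GSUtil:encryption_key=…"` override used above is equivalent to setting the key in the gsutil Boto configuration file (usually `~/.boto`), as suggested in the comments; the behaviour is the same either way:

```ini
[GSUtil]
encryption_key = XXXXXXXXXXXXXXXXX
```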
  • did you try with rsync? – guillaume blaquiere Mar 29 '23 at 17:57
  • Afaik and tried as well, rsync doesn't support syncing of buckets where source bucket is encrypted with csek. – SRJ Mar 29 '23 at 18:18
  • Never tried that combination.... Sorry – guillaume blaquiere Mar 29 '23 at 19:21
  • Try setting the `encryption_key` in the gsutil `Boto configuration file`: https://cloud.google.com/storage/docs/boto-gsutil The setting is under `[GSUtil]` Then you can use `gsutil rsync`. – John Hanley Mar 29 '23 at 20:00
  • @JohnHanley Thanks and I already tried with gsutil rsync by passing encryption key and it didn't work. Updated Question :) – SRJ Mar 29 '23 at 20:37
  • @JohnHanley gsutil rsync with source bucket encrypted with AES works fine only for individual objects. but for syncing bucket to bucket, it is unfortunately not working. – SRJ Mar 29 '23 at 20:39
  • Two options come to mind: a) review the gsutil source code, which is public, and fix the feature; b) file a bug report: https://cloud.google.com/support/docs/issue-trackers for `gsutil` and file a feature request for `gcloud storage`. If this was my problem, I would just write a Python script to implement this requirement but that takes knowledge of the storage SDK and JSON API. Google has a GitHub repo with example code for CSEK: https://github.com/googleapis/python-storage/tree/main/samples/snippets – John Hanley Mar 29 '23 at 21:11
  • Thanks @JohnHanley I can write Java program to copy the data but the challenge is I am dealing at a scale with million of files so I want to use something that is provided out of box by google like data transfer service. since data transfer is not supporting currently transfers with CSEKs So I have to resort on `gcloud storage cp` which is also very fast but with this minor challenge I posted in question. But again thanks for your input :) – SRJ Mar 29 '23 at 21:17
  • Why are you implementing CSEK? If the reason is for security, then why do you want to use public tools? You do not trust Google-managed encryption but trust public open-source tools? I realize that this is not the solution you are looking for, but think about what you want to accomplish, why and the correct secure solution. – John Hanley Mar 29 '23 at 21:32
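Following up on the SDK suggestion in the comments, here is a minimal Python sketch of a bucket-to-bucket copy that supplies the CSEK on both source and destination and skips objects that already exist at the destination. It assumes the `google-cloud-storage` package is installed, ambient credentials are available, and the key is the same base64-encoded AES-256 key passed to `--encryption-key` (the placeholder value below is illustrative):

```python
import base64

from google.cloud import storage

# Placeholder for the base64-encoded AES-256 key; substitute your own.
CSEK_B64 = "XXXXXXXXXXXXXXXXXXXXXXX"
key = base64.b64decode(CSEK_B64)

client = storage.Client()
src_bucket = client.bucket("test-1")
dst_bucket = client.bucket("test-3")

for listed in client.list_blobs("test-1"):
    # Skip objects already present at the destination. Reading object
    # metadata does not require the CSEK, so exists() works without the key.
    if dst_bucket.blob(listed.name).exists():
        continue

    # Attach the key to both sides and use the rewrite API, which copies
    # server-side without downloading the object data.
    src = src_bucket.blob(listed.name, encryption_key=key)
    dst = dst_bucket.blob(listed.name, encryption_key=key)

    token, _, _ = dst.rewrite(src)
    while token is not None:
        token, _, _ = dst.rewrite(src, token=token)
```

The `rewrite` loop is needed because large objects may require multiple rewrite calls; the returned token is passed back until it is `None`. At the scale of millions of files this single-threaded sketch would need to be parallelized, but it shows the core CSEK-aware copy-and-skip logic.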

0 Answers