
I referred to this question: How to upload folder on Google Cloud Storage using Python API

I want to create a script that uploads a folder to GCS asynchronously, like gsutil rsync, but in Python, and that excludes image and video file types.

I have a script that uploads a folder to a GCS bucket synchronously.

import glob
import os
import time

# Note: 'from gcloud import storage' shadowed this import, and the
# gcloud.aio.storage import was unused, so both have been dropped.
from google.cloud import storage
from oauth2client.service_account import ServiceAccountCredentials

#GCS_CLIENT = storage.Client()
credentials_dict = {
  XXXXXX
}
credentials = ServiceAccountCredentials.from_json_keyfile_dict(
    credentials_dict
)        
GCS_CLIENT = storage.Client(credentials=credentials, project='XXXXXXXXXXXX')
def upload_from_directory(directory_path: str, dest_bucket_name: str, dest_blob_name: str):
    rel_paths = glob.glob(directory_path + '/**', recursive=True)
    bucket = GCS_CLIENT.get_bucket(dest_bucket_name)
    for local_file in rel_paths:
        remote_path = f'{dest_blob_name}/{"/".join(local_file.split(os.sep)[1:])}'
        if os.path.isfile(local_file):
            blob = bucket.blob(remote_path)
            blob.upload_from_filename(local_file)
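
Note that this code does not yet exclude images and videos. A minimal sketch of such a filter, assuming a hypothetical extension list (adjust it to your data):

# Hypothetical set of extensions to skip -- adjust as needed.
EXCLUDED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".mp4", ".avi", ".mov"}

def should_upload(local_file: str) -> bool:
    # Upload only regular files whose extension is not excluded.
    if not os.path.isfile(local_file):
        return False
    return os.path.splitext(local_file)[1].lower() not in EXCLUDED_EXTENSIONS

With this in place, the `if os.path.isfile(local_file):` check in the loop becomes `if should_upload(local_file):`.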

This is the calling code:

s = time.time()
upload_from_directory("C:/Users/New folder","XXXXXXXXXX","New folder")
print(time.time()-s)

This works for me, but it takes too long: I have 5 TB of data and I want to upload it in parallel. I tried using joblib to parallelize the function, but it gave me this error:

PicklingError: Could not pickle the task to send it to the workers.

This is my Joblib code:

import glob
import os
import time

from google.cloud import storage
from oauth2client.service_account import ServiceAccountCredentials

credentials_dict = {
    XXXXXXXX
}
credentials = ServiceAccountCredentials.from_json_keyfile_dict(
    credentials_dict
)
GCS_CLIENT = storage.Client(credentials=credentials, project='XXXXXXXXXX')

def upload_from_directory(local_file):
    dest_bucket_name = "XXXXXXXXXX"
    dest_blob_name = "New folder"
    bucket = GCS_CLIENT.get_bucket(dest_bucket_name)
    remote_path = f'{dest_blob_name}/{"/".join(local_file.split(os.sep)[1:])}'
    if os.path.isfile(local_file):
        blob = bucket.blob(remote_path)
        blob.upload_from_filename(local_file)

Below is the code that runs it in parallel:

from joblib import Parallel, delayed

s = time.time()

directory_path = "C:/Users/New folder"
rel_paths = glob.glob(directory_path + '/**', recursive=True)
Parallel(n_jobs=8)(delayed(upload_from_directory)(local_file) for local_file in rel_paths)
print(time.time() - s)
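
The PicklingError most likely occurs because joblib's default process-based backend has to pickle the task, and the module-level storage.Client (which holds credentials and network state) cannot be pickled. Since uploads are I/O-bound, a thread pool sidesteps pickling entirely while still overlapping transfers. A minimal sketch using the standard library (the bucket name, directory path, and worker count are placeholders from the question):

import glob
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

from google.cloud import storage

GCS_CLIENT = storage.Client()  # or pass credentials=... as in the question

def upload_one(bucket, local_file: str, dest_blob_name: str) -> str:
    # Mirror the remote-path layout used in the question.
    remote_path = f'{dest_blob_name}/{"/".join(local_file.split(os.sep)[1:])}'
    bucket.blob(remote_path).upload_from_filename(local_file)
    return remote_path

directory_path = "C:/Users/New folder"
rel_paths = [p for p in glob.glob(directory_path + '/**', recursive=True)
             if os.path.isfile(p)]

bucket = GCS_CLIENT.get_bucket("XXXXXXXXXX")
# Threads share the one client, so nothing needs to be pickled.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(upload_one, bucket, p, "New folder") for p in rel_paths]
    for f in as_completed(futures):
        f.result()  # re-raises any upload error from the worker thread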
  • What is the speed of your Internet upload connection? What is the size of the largest files in the 5 TB dataset? I would not attempt this task in Python and I have a Gigabit connection - about 80 MByte/sec upload speed. I recommend that you reconsider and use the `gsutil` tool. You will be hard-pressed to write more performant parallel code in Python to transfer files to Cloud Storage. – John Hanley Aug 02 '22 at 22:57
  • The source code for `gsutil` is public. I recommend that you review how Google performs this task before reinventing the wheel. – John Hanley Aug 02 '22 at 22:59
  • According to the [GCP official docs](https://cloud.google.com/storage-transfer/docs/cloud-storage-to-cloud-storage#when_to_use), If you are transferring more than 1TB of data, they suggest using the Storage transfer service. – Darwin Aug 03 '22 at 00:51
  • You can check this [Storage transfer service Client Library](https://cloud.google.com/storage-transfer/docs/reference/libraries#linux-or-macos) using python for more information – Darwin Aug 03 '22 at 00:54
  • @JohnHanley please suggest how can we do error handling in gsutil. If any documentation you can share. – Siddhesh Parkar Aug 03 '22 at 05:56
  • Refer to the gsutil source code. That is your best source of documentation. – John Hanley Aug 03 '22 at 05:58
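
Following up on the comments: gsutil rsync already does parallel transfers and supports exclusion patterns, so a single command along these lines may be enough (the -x regex and bucket name are placeholders; -m enables parallel operations):

gsutil -m rsync -r -x ".*\.(jpg|jpeg|png|gif|mp4|avi|mov)$" "C:/Users/New folder" "gs://XXXXXXXXXX/New folder"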
