
I have a number of .csv files of tabular data, imported from an external data source, stored in different folders of a Cloud Storage bucket. Every day, a new file is imported into each folder of the bucket. Each filename contains a space (" ") and has the ".csv" extension. I have written a Cloud Function that copies every existing file from this source bucket to a newly created cleaned bucket and replaces each space (" ") in the filename with a dash ("-"). Is there a way, using Cloud Functions and Pub/Sub, to have the function process only the newly uploaded file instead of manually scanning which files are in both buckets? Essentially, I would like to access the filename and file metadata in the Pub/Sub event, but I do not know how to send and access this data in the event.

Thanks in advance!

Kindly,

Bertan

Berra
  • I down-voted because https://idownvotedbecau.se/noattempt/ – Renaud Pacalet Jul 25 '23 at 07:53
  • Welcome to Stack Overflow! You seem to be asking for someone to write some code for you. Stack Overflow is a question and answer site, not a code-writing service. Please [see here](http://stackoverflow.com/help/how-to-ask) to learn how to write effective questions. – John Hanley Jul 25 '23 at 13:06
  • Hi, I have provided an answer below. Please check it and let me know whether the suggestions were helpful. – Sathi Aiswarya Jul 31 '23 at 12:40

1 Answer


This answer by Marc Anthony B explains how to rename files by removing square brackets ([]). You can follow the same approach to replace whitespace with an underscore by changing the regex pattern, as shown below.

The code follows these three steps:

  1. List the objects that you want to rename.
  2. Iterate that list.
  3. For each object, change the name. The files aren't renamed in place in the backend: each rename performs a copy followed by a delete of the original object.
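
import re
from google.cloud import storage

storage_client = storage.Client()

bucket_name = "my_bucket"
bucket = storage_client.bucket(bucket_name)

# Materialize the listing first so the renames do not affect the iteration
blobs = list(storage_client.list_blobs(bucket_name))
pattern = r"\s"  # regex matching any whitespace character
for blob in blobs:
    # re.search looks anywhere in the name; re.match would only test the start
    if re.search(pattern, blob.name):
        fixed_name = re.sub(pattern, "_", blob.name)
        # rename_blob copies to the new name, then deletes the original
        new_blob = bucket.rename_blob(blob, fixed_name)
        print(f"Changed {blob.name} to {new_blob.name}")
    else:
        print(f"No change required for {blob.name}")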
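
The loop above lists every object in the bucket. If the name of the new object is already known, the same idea can be applied to that one object only. The following is a minimal sketch, not a definitive implementation: it copies a single object into a separate cleaned bucket and replaces spaces with dashes, as described in the question. The names `clean_name` and `copy_with_clean_name` and the optional `client` parameter are hypothetical helpers introduced for illustration; `Bucket.copy_blob` is the client-library call that copies without deleting the source.

```python
def clean_name(name: str) -> str:
    """Replace each space in an object name with a dash."""
    return name.replace(" ", "-")


def copy_with_clean_name(source_bucket_name, object_name, dest_bucket_name, client=None):
    """Copy one object into the destination bucket under its cleaned name."""
    if client is None:
        # Imported lazily so the helper can also be called with a stand-in client.
        from google.cloud import storage

        client = storage.Client()

    source_bucket = client.bucket(source_bucket_name)
    dest_bucket = client.bucket(dest_bucket_name)
    source_blob = source_bucket.blob(object_name)

    # copy_blob leaves the source object in place, unlike rename_blob.
    return source_bucket.copy_blob(source_blob, dest_bucket, clean_name(object_name))
```

Because only one object is touched per call, no listing or comparison of the two buckets is needed.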

You can also use the gsutil mv command to rename all objects with a given prefix so that they have a new prefix. Refer to the gsutil mv documentation for more information.

gsutil mv gs://my_bucket/oldprefix gs://my_bucket/newprefix
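
To handle only the newly uploaded file, as the question asks, one option is to configure a Pub/Sub notification on the source bucket and trigger the function from that topic. The sketch below is hedged: it assumes a 1st gen Pub/Sub-triggered background function with the `(event, context)` signature and a notification using the JSON payload format, in which case the message attributes carry `eventType`, `bucketId`, and `objectId`, and the base64-encoded `data` field carries the object metadata. The function names `parse_gcs_notification` and `on_new_file` are hypothetical.

```python
import base64
import json


def parse_gcs_notification(event: dict) -> dict:
    """Extract the bucket, object name, and metadata from a Pub/Sub event."""
    attributes = event.get("attributes") or {}
    raw = event.get("data")
    # With the JSON payload format, data is the base64-encoded object resource.
    metadata = json.loads(base64.b64decode(raw).decode("utf-8")) if raw else {}
    return {
        "event_type": attributes.get("eventType"),
        "bucket": attributes.get("bucketId"),
        "name": attributes.get("objectId"),
        "metadata": metadata,
    }


def on_new_file(event, context):
    """Entry point (assumed): react only to newly created objects."""
    info = parse_gcs_notification(event)
    name = info["name"]
    if info["event_type"] != "OBJECT_FINALIZE" or not name:
        return None  # ignore deletes, metadata updates, and malformed messages

    new_name = name.replace(" ", "-")
    print(f"New object {name} in {info['bucket']}; cleaned name is {new_name}")
    # The copy into the cleaned bucket would go here, for example with
    # Bucket.copy_blob from the google-cloud-storage client library.
    return new_name
```

The filename and metadata arrive with each message, so the function never has to scan either bucket.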

Sathi Aiswarya
  • Suggestion: only call `rename_blob()` for objects that are to be renamed. Your code will call `rename_blob()` for all objects in the list. A simple string compare will do the trick. – John Hanley Jul 25 '23 at 13:05
  • Thank you, @John Hanley, for the suggestion. I have updated my answer. – Sathi Aiswarya Jul 31 '23 at 13:19