
I'm using Google Cloud Storage as the source and BigQuery as the destination for the Data Transfer feature that's available in BigQuery. In the data source details, when creating a new data transfer, I need to input the 'Cloud Storage URI', but I don't know how to exclude files with certain text such as '_All_Data' in the file name.

I've tried to use the pattern bucket/filepath1/2023/*/*/filename_2023*!_All_Data.csv, which was suggested by ChatGPT, but it didn't work. I was expecting it to pick up all the files under 2023 whose names start with 'filename_2023' and exclude all the files that contain '_All_Data'.

Basically, my expectation is that I would be able to extract files such as:

  • bucket/filepath1/2023/05/03/filename_20230503.csv
  • bucket/filepath1/2023/05/02/filename_20230502.csv
  • bucket/filepath1/2023/03/03/filename_20230303.csv

And exclude files such as:

  • bucket/filepath1/2023/05/03/filename_20230503_All_Data.csv
  • bucket/filepath1/2023/05/02/filename_20230502_All_Data.csv
  • bucket/filepath1/2023/03/03/filename_20230303_All_Data.csv
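
For reference, a plain wildcard with no exclusion (for example, listing against the same placeholder bucket and path with gsutil) would match both groups of files above, so what I'm after is a way to express the exclusion in the URI itself:

gsutil ls 'gs://bucket/filepath1/2023/*/*/filename_2023*.csv'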

1 Answer


Give this a try; I checked that it works. Basically, you will have to copy the required files to another bucket and then use the Data Transfer Service from this new bucket.

The grep -v 'All' ensures that file names containing 'All' are excluded, and the remaining files are copied to target-gcs-bucket:

gsutil ls gs://your-source-gcs-bucket/*.* | grep -v 'All' | gsutil cp -I gs://your-target-gcs-bucket/
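
Adapted to the layout in the question, the two steps would look roughly like the sketch below. The staging bucket, dataset, and table names are placeholders, and the transfer parameters are from memory, so please verify them against the Cloud Storage transfer documentation.

# 1. Copy everything except the _All_Data files into a staging bucket
gsutil ls 'gs://your-source-gcs-bucket/filepath1/2023/*/*/filename_2023*.csv' \
  | grep -v '_All_Data' \
  | gsutil cp -I gs://your-staging-bucket/filtered/

# 2. Point a Cloud Storage transfer at the staging bucket
bq mk --transfer_config \
  --data_source=google_cloud_storage \
  --target_dataset=your_dataset \
  --display_name='Filtered 2023 files' \
  --params='{"data_path_template":"gs://your-staging-bucket/filtered/*.csv","destination_table_name_template":"your_table","file_format":"CSV"}'

The data_path_template in step 2 can then use a simple * wildcard, because the unwanted _All_Data files never reach the staging bucket.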
Bihag Kashikar
  • I've tried it and it gives me the error 'Bucket names may only contain lowercase letters, numbers, dashes, underscores, and dots' when I input it into the Cloud Storage URI field. – Ronald Dacayanan May 04 '23 at 07:06
  • You probably have not replaced your bucket name properly; can you check this? – Bihag Kashikar May 04 '23 at 07:40
  • Were you supposed to submit a link? I've used this 'gsutil ls gs://stor-track/*.* | grep -v '_All_Data' | gsutil cp -I gs://stor-track/' when I did my test. Also, I've pasted the code into ChatGPT and it mentioned that 'gsutil cp -I gs://stor-track/' copies the remaining files to the Google Cloud Storage bucket. What I wanted to do is to transfer the data into a table in BigQuery. Just a note, I'm using the 'Data Transfer' feature in BigQuery. – Ronald Dacayanan May 04 '23 at 08:10
  • Apologies, what I meant is that you will have to copy the files to another temporary bucket using the command above, and then use the Data Transfer Service to copy to BigQuery. I will edit my answer above. Hope this helps. – Bihag Kashikar May 04 '23 at 11:20
  • Why do you need a temporary bucket if you can just directly transfer it to a table in BigQuery? Wouldn't there be a way to fix this pattern that I've tried to use (bucket/filepath1/2023/*/*/filename_2023*!_All_Data.csv)? – Ronald Dacayanan May 04 '23 at 12:19