
I have about 54,000 files in my Cloud Storage bucket. When I try to schedule a BigQuery Data Transfer Service job to move the files from the bucket into BigQuery, I get the following error:

Error code 9 : Transfer Run limits exceeded. Max size: 15.00 TB. Max file count: 10000. Found: size = 267065994 B (0.00 TB) ; file count = 54824.

I thought the max file count was 10 million.

codechef
  • Looks like you're hitting BigQuery's _"Maximum number of source URIs in job configuration — 10,000 URIs"_ quota. Can you share more details on how you are setting up your DTJ i.e. what does your "Data Source Details" look like? – Graham Polley Jun 21 '20 at 07:48
  • @GrahamPolley Source is Google Cloud Storage. Destination dataset is a dataset in BigQuery. Destination table is a table already created in BigQuery. Cloud Storage URI is /bucket_name/folder/* – this folder contains 54,000 files. Write preference is APPEND. File format is Parquet. – codechef Jun 21 '20 at 12:44
  • I basically want to transfer all the Parquet files in the folder to BigQuery. – codechef Jun 21 '20 at 22:08
  • On the website it says _"Maximum number of files per load job — 10 million total files including all files matching all wildcard URIs"_. Does this mean that I can only transfer 10 million files in total over the job's life cycle? – codechef Jun 22 '20 at 00:00

2 Answers


I think that the BigQuery Data Transfer Service lists all the files matching the wildcard and then uses that list to load them. So it is the same as providing the full list to `bq load ...`, and it therefore hits the 10,000-URI limit. This is probably necessary because the transfer service skips already-loaded files, so it needs to look at them one by one to decide which ones to actually load.

I think that your only option is to schedule a job yourself and load the files directly into BigQuery, for example using Cloud Composer, or writing a small Cloud Run service that can be invoked by Cloud Scheduler.
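For illustration, here is a minimal sketch of that "load them directly" approach using the Python clients for Cloud Storage and BigQuery. The bucket, prefix, table ID and batch size below are placeholders, not values confirmed by the question; the same logic could run inside a Cloud Composer task or a Cloud Run service triggered by Cloud Scheduler:

```python
# Hypothetical sketch: load ~54k Parquet files in batches that stay under
# the 10,000-URI-per-load-job limit. Bucket, prefix and table names are
# placeholders -- adjust them to your project.
from google.cloud import bigquery, storage

BUCKET = "bucket_name"                        # placeholder
PREFIX = "folder/"                            # placeholder
TABLE_ID = "my-project.my_dataset.my_table"   # placeholder
BATCH_SIZE = 9000                             # stays safely below the 10,000-URI limit

storage_client = storage.Client()
bq_client = bigquery.Client()

# List every Parquet file under the prefix.
uris = [
    f"gs://{BUCKET}/{blob.name}"
    for blob in storage_client.list_blobs(BUCKET, prefix=PREFIX)
    if blob.name.endswith(".parquet")
]

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Submit one load job per batch of URIs.
for start in range(0, len(uris), BATCH_SIZE):
    batch = uris[start:start + BATCH_SIZE]
    job = bq_client.load_table_from_uri(batch, TABLE_ID, job_config=job_config)
    job.result()  # wait for this batch before submitting the next one
    print(f"Loaded {len(batch)} files")
```

Keeping each batch well under 10,000 URIs keeps every individual load job inside the quota, and `WRITE_APPEND` reproduces the append behaviour of the transfer configuration described in the question.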

Alessandro

The error message Transfer Run limits exceeded is, as mentioned before, related to a known limit for load jobs in BigQuery. Unfortunately, this is a hard limit and cannot be changed. There is an ongoing Feature Request to increase it, but for now there is no ETA for it to be implemented.

The main recommendation for this issue is to split the single operation into multiple processes that send data in requests that don't exceed this limit. That covers the main question: "Why do I see this error message, and how do I avoid it?"

It is natural to ask next, "how can I automate or simplify these actions?" I can think of involving more products:

  • Dataflow, which will help you process the data that will be loaded into BigQuery; this is where you can split the work into multiple requests (see the sketch below).

  • Pub/Sub, which will help you listen to events and automate when the processing starts.

Please take a look at this suggested implementation, where the aforementioned scenario is described in more detail.
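To make the Dataflow bullet more concrete, here is a rough Apache Beam sketch (it is not the linked implementation; the bucket, folder and table names are placeholders, and depending on your Beam version you may need to pass the destination table schema explicitly):

```python
# Rough sketch: a Beam pipeline that reads the Parquet files and appends them
# to the existing BigQuery table. Run it on Dataflow by passing the usual
# --runner=DataflowRunner, --project, --region and --temp_location flags.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    options = PipelineOptions()  # picks up command-line flags such as --runner=DataflowRunner
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadParquet" >> beam.io.ReadFromParquet("gs://bucket_name/folder/*.parquet")  # placeholder path
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.my_table",                                 # placeholder table spec
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,      # table already exists
            )
        )


if __name__ == "__main__":
    run()
```

A Cloud Scheduler job or a Pub/Sub-triggered function could then launch this pipeline whenever new files land in the bucket.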

Hope this is helpful! :)

Kevin Quinzel
  • If you have issues or additional questions about the provided solution, I would suggest you create new questions so we can focus on your concerns about Dataflow or Pub/Sub. – Kevin Quinzel Jun 26 '20 at 00:14