4

I really liked BigQuery's Data Transfer Service. I have flat files in the exact schema waiting to be loaded into BQ. It would have been awesome to just set up a DTS schedule that picked up GCS files matching a pattern and loaded them into BQ. I like the built-in option to delete source files after copy and the email notification in case of trouble. But the biggest bummer is that the minimum interval is 60 minutes. That is crazy. I could have lived with a 10 min delay perhaps.

So if I set up the DTS to be on demand, how can I invoke it from an API? I am thinking of creating a cron job that calls it on demand every 10 mins. But I can't figure out from the docs how to call it.

Also, what is my second-best, most reliable and cheapest way of moving GCS files (no ETL needed) into BQ tables that match the exact schema? Should I use Cloud Scheduler, Cloud Functions, Dataflow, Cloud Run, etc.?

If I use a Cloud Function, how can I submit all files in my GCS bucket at the time of invocation as one BQ load job?

Lastly, does anyone know if DTS will lower the limit to 10 mins in the future?

AIK DO

4 Answers

3

So if I set up the DTS to be on demand, how can I invoke it from an API? I am thinking of creating a cron job that calls it on demand every 10 mins. But I can't figure out from the docs how to call it.

StartManualTransferRuns is part of the RPC library but does not have a REST API equivalent as of now. How to use that will depend on your environment. For instance, you can use the Python Client Library (docs).

As an example, I used the following code (you'll need to run pip install google-cloud-bigquery-datatransfer for the dependencies):

import time

from google.cloud import bigquery_datatransfer_v1


# Note: this snippet uses the older google-cloud-bigquery-datatransfer API
# (positional arguments and the project_transfer_config_path helper);
# newer releases changed these names, see the updated answer further down.
client = bigquery_datatransfer_v1.DataTransferServiceClient()

PROJECT_ID = 'PROJECT_ID'
TRANSFER_CONFIG_ID = '5e6...7bc'  # alphanumeric ID you'll find in the UI

# Fully qualified resource name of the transfer config
parent = client.project_transfer_config_path(PROJECT_ID, TRANSFER_CONFIG_ID)

# Requested run time, 10 seconds ahead of now, as a protobuf Timestamp
start_time = bigquery_datatransfer_v1.types.Timestamp(seconds=int(time.time() + 10))

response = client.start_manual_transfer_runs(parent, requested_run_time=start_time)
print(response)

Note that you'll need to use the right Transfer Config ID and the requested_run_time has to be of type bigquery_datatransfer_v1.types.Timestamp (for which there was no example in the docs). I set a start time 10 seconds ahead of the current execution time.

You should get a response such as:

runs {
  name: "projects/PROJECT_NUMBER/locations/us/transferConfigs/5e6...7bc/runs/5e5...c04"
  destination_dataset_id: "DATASET_NAME"
  schedule_time {
    seconds: 1579358571
    nanos: 922599371
  }
  ...
  data_source_id: "google_cloud_storage"
  state: PENDING
  params {
    ...
  }
  run_time {
    seconds: 1579358581
  }
  user_id: 28...65
}

and the transfer is triggered as expected (nevermind the error):

[Screenshot: the manually triggered run appearing in the transfer's run history]

Also, what is my second-best, most reliable and cheapest way of moving GCS files (no ETL needed) into BQ tables that match the exact schema? Should I use Cloud Scheduler, Cloud Functions, Dataflow, Cloud Run, etc.?

With this you can set up a cron job to execute your function every ten minutes. However, as discussed in the comments, files need to be at least 60 minutes old before the transfer will pick them up, so running more often won't help with files less than one hour old (docs).

Apart from that, this is not a very robust solution, and this is where your follow-up questions come into play. I think these might be too broad to address in a single StackOverflow question, but I would say that, for on-demand refresh, Cloud Scheduler + Cloud Functions/Cloud Run can work very well, as sketched below.
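To make that concrete, here is a minimal hedged sketch of the combination: an HTTP-triggered Cloud Function (Python) wrapping the snippet above, which a Cloud Scheduler HTTP job can then call every ten minutes. The entry point trigger_dts and the PROJECT_ID / TRANSFER_CONFIG_ID environment variables are placeholders of mine, and the client calls assume the same (older) library version as above:

# main.py for an HTTP-triggered Cloud Function (sketch, same older client as above).
# PROJECT_ID and TRANSFER_CONFIG_ID are environment variables you would set yourself.
import os
import time

from google.cloud import bigquery_datatransfer_v1


def trigger_dts(request):
    """Starts a manual DTS run; point a Cloud Scheduler HTTP job at this function."""
    client = bigquery_datatransfer_v1.DataTransferServiceClient()

    parent = client.project_transfer_config_path(
        os.environ['PROJECT_ID'], os.environ['TRANSFER_CONFIG_ID'])
    start_time = bigquery_datatransfer_v1.types.Timestamp(seconds=int(time.time()))

    response = client.start_manual_transfer_runs(
        parent, requested_run_time=start_time)
    return str(response)

A Cloud Scheduler HTTP job with schedule */10 * * * * pointed at the function's trigger URL (with suitable authentication) then gives you the ten-minute cadence, subject to the file-age limitation discussed in the comments.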

Dataflow would be best if you needed ETL, but it has a GCS connector that can watch a file pattern (example). With this you would skip the transfer entirely, setting the watch interval and the load-job triggering frequency to write the files into BigQuery. VM(s) would be running constantly in a streaming pipeline, as opposed to the previous approach, but a 10-minute watch period is possible.

If you have complex workflows/dependencies, Airflow has recently introduced operators to start manual runs.
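As a rough, hedged illustration of that Airflow option (the operator lives in the Google provider package; treat the exact argument names as assumptions and verify them against your installed provider version):

# Sketch of an Airflow DAG that starts a DTS manual run every 10 minutes.
# Operator and argument names are assumptions based on the
# apache-airflow-providers-google package; check your installed version.
import time
from datetime import datetime, timezone

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery_dts import (
    BigQueryDataTransferServiceStartTransferRunsOperator,
)

with DAG(
    dag_id='trigger_dts_manual_run',           # hypothetical DAG name
    schedule_interval='*/10 * * * *',          # every ten minutes
    start_date=datetime(2020, 1, 1, tzinfo=timezone.utc),
    catchup=False,
) as dag:
    start_run = BigQueryDataTransferServiceStartTransferRunsOperator(
        task_id='start_manual_transfer_run',
        project_id='PROJECT_ID',               # placeholder
        transfer_config_id='5e6...7bc',        # same transfer config ID as above
        requested_run_time={'seconds': int(time.time())},  # assumed dict form
    )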

If I use a Cloud Function, how can I submit all files in my GCS bucket at the time of invocation as one BQ load job?

You can use wildcards to match a file pattern when you create the transfer:

[Screenshot: the transfer's Cloud Storage URI field using a wildcard file pattern]

Also, this can be done on a file-by-file basis using Pub/Sub notifications for Cloud Storage to trigger a Cloud Function.
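If you go the Cloud Function route instead of DTS, one way to load everything present at invocation time as a single load job is a wildcard URI. Below is a minimal hedged sketch using the google-cloud-bigquery client; the bucket, prefix, dataset and table names are placeholders, and it assumes CSV files whose schema already matches the destination table:

# Sketch of a Cloud Function body that loads every CSV matching a wildcard
# into an existing BigQuery table as a single load job (no ETL).
# Bucket, dataset and table names are placeholders.
from google.cloud import bigquery


def load_gcs_to_bq(request):
    client = bigquery.Client()

    # One wildcard URI covers all matching files in a single load job.
    uri = 'gs://YOUR_BUCKET/path/prefix_*.csv'
    table_id = 'PROJECT_ID.DATASET_NAME.TABLE_NAME'

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,                      # assumes a header row
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        autodetect=False,                         # table schema already matches
    )

    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # wait for completion
    return f'Loaded {load_job.output_rows} rows into {table_id}'

Because the wildcard is resolved by BigQuery when the job starts, every matching file present in the bucket at that moment ends up in one load job rather than one job per file.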

Lastly, does anyone know if DTS will lower the limit to 10 mins in the future?

There is already a Feature Request here. Feel free to star it to show your interest and to receive updates.

Guillem Xercavins
  • Thank you so much for the very detailed answer. Quick question, even if I'm able to trigger the DTS run, is it the case that it will not look at any files unless older than 60 minutes? Reason I ask is that I manually triggered a DTS job and it said no files older than 60 minutes found? So it seems like I can not get away with importing files every 10 mins from moment of creation and there would always be a delay of 60 mins if using DTS. Can you confirm? – AIK DO Jan 18 '20 at 17:16
  • Yes, seems to be documented [here](https://cloud.google.com/bigquery-transfer/docs/cloud-storage-transfer#minimum_intervals). If you need more frequent updates you can use Dataflow or [Pub/Sub notifications for Cloud Storage](https://cloud.google.com/storage/docs/pubsub-notifications) to trigger a Cloud Function. – Guillem Xercavins Jan 18 '20 at 19:10
  • That's just a super weird limitation that files need to be an hour old to be picked up. Oh well. – AIK DO Jan 18 '20 at 21:21
  • I get that "project_transfer_config_path" is not part of client – Thadeu Melo Oct 13 '20 at 20:22
  • Be aware that requested_run_time (requestedRunTime) must NOT be in the future: https://cloud.google.com/bigquery-transfer/docs/reference/datatransfer/rest/v1/projects.locations.transferConfigs/startManualRuns?hl=en#request-body So, "set a start time 10 seconds ahead of the current execution time" won't have the effect of a scheduled start in 10 sec from now. requested_run_time helps to perform backfills by virtually run the data transfer on a specific time in the past: https://cloud.google.com/bigquery/docs/scheduling-queries?hl=en#set_up_a_manual_run_on_historical_dates – Yves Jun 10 '22 at 11:13
  • The 1-hour limitation has been removed. https://cloud.google.com/bigquery-transfer/docs/release-notes#February_24_2021 – MonicaPC Mar 24 '23 at 20:17
2

Now you can easily run a manual BigQuery data transfer using the REST API:

HTTP request
POST https://bigquerydatatransfer.googleapis.com/v1/{parent=projects/*/locations/*/transferConfigs/*}:startManualRuns
  • For the {parent=projects/*/locations/*/transferConfigs/*} part, open the CONFIGURATION tab of your transfer and look for the resource name, as shown in the image below.

[Screenshot: the transfer's resource name shown on its CONFIGURATION tab]

More here: https://cloud.google.com/bigquery-transfer/docs/reference/datatransfer/rest/v1/projects.locations.transferConfigs/startManualRuns
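If you need to call that endpoint programmatically (for instance from a Cloud Function, as asked in the comment below), here is a minimal hedged sketch using Application Default Credentials; the project, location and transfer config ID in the parent string are placeholders:

# Sketch: trigger startManualRuns over REST with Application Default Credentials.
# The parent string below is a placeholder; use your own project/location/config ID.
from datetime import datetime, timezone

import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, _ = google.auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])
session = AuthorizedSession(credentials)

parent = 'projects/PROJECT_ID/locations/us/transferConfigs/5e6...7bc'
url = f'https://bigquerydatatransfer.googleapis.com/v1/{parent}:startManualRuns'

# requestedRunTime is a JSON timestamp string and must not be in the future
body = {'requestedRunTime': datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}

response = session.post(url, json=body)
print(response.status_code, response.json())

Inside a Cloud Function, google.auth.default() picks up the function's service account automatically, so no key file is needed.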

  • I believe the address has changed. I was able to run it from the documentation page. But I need to run it from a cloud function, with authentication. Could you help with a more complete solution? – Thadeu Melo Oct 13 '20 at 21:54
1

Following Guillem's answer and the API updates, this is my new code:

import time

from google.cloud.bigquery import datatransfer_v1
from google.protobuf.timestamp_pb2 import Timestamp


# Updated client API: the path helper is now transfer_config_path and the
# parameters are wrapped in a StartManualTransferRunsRequest object.
client = datatransfer_v1.DataTransferServiceClient()
config = '34y....654'  # alphanumeric transfer config ID from the UI

PROJECT_ID = 'PROJECT_ID'
TRANSFER_CONFIG_ID = config

parent = client.transfer_config_path(PROJECT_ID, TRANSFER_CONFIG_ID)

# Run time set to "now" (it must not be in the future, see the comments above)
start_time = Timestamp(seconds=int(time.time()))

request = datatransfer_v1.types.StartManualTransferRunsRequest(
    {"parent": parent, "requested_run_time": start_time}
)

response = client.start_manual_transfer_runs(request, timeout=360)
print(response)
lisandro
-1

For this to work, you need to know the correct TRANSFER_CONFIG_ID.

In my case, I wanted to list all the BigQuery scheduled queries to find a specific ID. You can do it like this:

from google.cloud import bigquery_datatransfer_v1

# Put your project ID here
PROJECT_ID = 'PROJECT_ID'

# Note: project_path and the positional list_transfer_configs(parent) call
# come from the older client API and may differ in newer releases.
bq_transfer_client = bigquery_datatransfer_v1.DataTransferServiceClient()
parent = bq_transfer_client.project_path(PROJECT_ID)

# Iterate over all transfer configs in the project
for element in bq_transfer_client.list_transfer_configs(parent):

    # Print the display name of each scheduled query / transfer
    print(f'[Schedule Query Name]:\t{element.display_name}')

    # Print the full resource name (it contains the ID)
    print(f'[Name]:\t\t{element.name}')

    # Extract the ID:
    TRANSFER_CONFIG_ID = element.name.split('/')[-1]
    print(f'[TRANSFER_CONFIG_ID]:\t\t{TRANSFER_CONFIG_ID}')

    # You can print the entire element for debug purposes
    print(element)