
I want to schedule a data transfer job from Cloud Storage to BigQuery. I have an application that continuously dumps data to a GCS bucket path (let's say gs://test-bucket/data1/*.avro), and I want to move each object to BigQuery as soon as it is created in GCS.

I don't want to migrate all the files in the folder again and again. I only want to move the objects that have been added to the folder since the last run.

The BigQuery Data Transfer Service is available and takes Avro files as input, but it does not accept a folder, and it loads all objects rather than only the newly added ones.

I am new to this, so I might be missing some functionality. How can I achieve it?

Please note: I want to schedule a job that loads data at a certain frequency (every 10 or 15 minutes). I don't want a trigger-based solution, since the number of objects that will be generated will be huge.

Nikhil Suthar

1 Answer


You can use a Cloud Function with a Storage event trigger. Just launch a Cloud Function that loads the data into BigQuery when a new file arrives: https://cloud.google.com/functions/docs/calling/storage

EDIT: If you have more than 1,500 loads per day, you can work around the limit by inserting rows with the BigQuery Storage Write API instead of running load jobs.
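A minimal sketch of such a function, assuming the Python client and a placeholder destination table `my-project.my_dataset.my_table` (not taken from the question):

```python
# Minimal sketch: a Cloud Function triggered by google.storage.object.finalize
# that loads the newly created Avro object into BigQuery with a load job.
from google.cloud import bigquery

TABLE_ID = "my-project.my_dataset.my_table"  # assumed destination table

def load_avro_to_bq(event, context):
    uri = f"gs://{event['bucket']}/{event['name']}"
    if not uri.endswith(".avro"):
        return  # ignore non-Avro objects

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    # Each invocation runs one load job, which counts toward the
    # 1,500 loads per table per day limit mentioned in the comments.
    client.load_table_from_uri(uri, TABLE_ID, job_config=job_config).result()
```

Deploy it with a `google.storage.object.finalize` trigger on the bucket so it fires for every new object.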

If you do not need top performance, you can just create an external table over that folder and query it instead of loading every file.
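For the external-table approach, here is a minimal sketch with the Python client; the project, dataset, and table names are assumptions:

```python
# Minimal sketch: define an external table over the Avro files in the folder.
# Queries against it read the files in place, so no load jobs are needed.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("AVRO")
external_config.source_uris = ["gs://test-bucket/data1/*.avro"]

table = bigquery.Table("my-project.my_dataset.data1_external")  # assumed name
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```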

Piotr Klos
  • Just a limitation: you can perform up to 1,500 load jobs per day on the same table (whether the job succeeds or fails) – guillaume blaquiere Jan 18 '23 at 19:45
  • I don't want to go with a trigger-based approach; it would be a very expensive solution and not feasible if we cross the 1,500 limit. My data arrives continuously, which means around 1,000 files could be loaded in one hour, and a trigger mechanism would fire 1,000 triggers. I want a scheduled approach where my job runs every 10 minutes, checks for the incremental objects, and loads them into BigQuery. – Nikhil Suthar Jan 19 '23 at 05:53
  • 1
    if files are small then you can insert rows with BQ Storage API instead loading, but there is more development required. Then limit 1500 is not applicable anymore. Or I'd do VIEW on TABLE and EXTERNAL TABLE. TABLE should keep all partitioned data and EXTERNAL should keep current day. VIEW should UNION ALL both. You can query all via VIEW, but at the end of every day just LOAD all daily files to main table. I've used both approaches in my projects, the choice depends on volume, querying needs, budget. – Piotr Klos Jan 20 '23 at 12:06
  • That will work, thanks – Nikhil Suthar Jan 24 '23 at 05:16
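To make the VIEW + EXTERNAL TABLE suggestion from the comments concrete, here is a minimal sketch; all table names and the per-day folder layout are assumptions, not something stated in the question:

```python
# Minimal sketch of the pattern described in the comments:
# - a partitioned main table holds everything already loaded,
# - an external table exposes the current day's files in place,
# - a view UNION ALLs both so queries always see all the data,
# - once a day, a single load job moves the day's files into the main table.
# All table names and the dated folder layout are assumptions.
from google.cloud import bigquery

client = bigquery.Client()

# External table over today's files (assumed gs://test-bucket/data1/<date>/ layout).
ext = bigquery.Table("my-project.my_dataset.data1_today")
ext_config = bigquery.ExternalConfig("AVRO")
ext_config.source_uris = ["gs://test-bucket/data1/2023-01-20/*.avro"]
ext.external_data_configuration = ext_config
client.create_table(ext, exists_ok=True)

# View that unions the loaded main table with the external table.
client.query(
    """
    CREATE OR REPLACE VIEW `my-project.my_dataset.data1_all` AS
    SELECT * FROM `my-project.my_dataset.data1_main`
    UNION ALL
    SELECT * FROM `my-project.my_dataset.data1_today`
    """
).result()

# End-of-day job: load the day's files into the main table in one load job.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
client.load_table_from_uri(
    "gs://test-bucket/data1/2023-01-20/*.avro",
    "my-project.my_dataset.data1_main",
    job_config=job_config,
).result()
```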