
I'm working with another team that has already achieved a near-real-time (NRT) load into a dataset in their GCP BigQuery project. The objective on our side is to use their NRT datasets to create one or more NRT tables of our own. As an initial test, this could involve left joining two of their NRT tables, aggregating with GROUP BY, and so on.
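
For illustration, this is the kind of query we'd want to keep refreshing, sketched with the Python BigQuery client (every project, dataset, table, and column name below is a placeholder, not the upstream team's real schema):

```python
# Illustrative sketch only: all project, dataset, table, and column names
# are placeholders, not the upstream team's real schema.
from google.cloud import bigquery

client = bigquery.Client(project="our-project")

QUERY = """
SELECT
  o.order_id,
  o.status,
  COUNT(e.event_id) AS event_count
FROM `upstream-project.nrt_dataset.orders` AS o
LEFT JOIN `upstream-project.nrt_dataset.order_events` AS e
  ON o.order_id = e.order_id
GROUP BY o.order_id, o.status
"""

# Write the joined/aggregated result into a table in our own dataset.
job_config = bigquery.QueryJobConfig(
    destination="our-project.our_dataset.orders_enriched",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(QUERY, job_config=job_config).result()  # blocks until the job finishes
```

The open question is what should kick this query off whenever the upstream NRT tables receive new data.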

Is there a way to achieve this using something like an event trigger (or whatever the equivalent concept is called in GCP)?

What I've found so far is to use Pub/Sub and Dataflow in GCP. However, my understanding is that if I go that route, my whole process becomes independent of what our upstream team has already built for us.

Can someone give me some suggestions?

Lambo
  • I think for your use case, Google Eventarc will work: with Eventarc you can trigger an event based on table creation in a dataset. See here; I think the use case is similar to yours: https://cloud.google.com/eventarc/docs/run/bigquery – Bihag Kashikar Apr 11 '23 at 06:11 (a sketch of this approach follows the comments)
  • I would suggest a couple of things to consider. First, it is not necessary to copy data from one project to another to read it: one can run SQL query jobs that read data across projects and datasets. The only restriction is the location, which should be the same. The second thing: when you use the word “trigger”, what is the trigger in your case? Taking into account that the source is near-real-time streaming, I guess. – al-dann Apr 11 '23 at 08:50
  • @al-dann, in our case we want to bring any processed data into our project. As for the trigger, we want the simplest scenario for now: when there is a new record, sync it to our table. – Lambo Apr 11 '23 at 11:20
  • When a SQL query job is executed, it is always possible to save the result in some dataset, isn't it? – al-dann Apr 11 '23 at 12:07
  • Just to clarify the triggers: a trigger per appended record, and no windows for grouping and aggregation. – al-dann Apr 11 '23 at 12:10
  • I agree with @al-dann: query the NRT data and sink the result into your own tables. Otherwise you have to fork the upstream process, and it seems weird to duplicate data. What is your source of truth in that case? – guillaume blaquiere Apr 13 '23 at 09:59
  • Well... if one (we, or a company) really worries about duplicating/copying data, it is always possible to run a SQL query job reading the original table (subject to IAM roles and data location); that way, storing a copy of the data can be eliminated completely. – al-dann Apr 13 '23 at 10:18
  • From my personal point of view, in our spatial-temporal continuum, the classification of any data depends on the data consumer (who makes such classification from their frame of reference, context and scope), so some of them, might classify the data as 'true', others as 'false' and some others in some other way... – al-dann Apr 13 '23 at 10:37
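
Following up on the Eventarc suggestion in the first comment, here is a minimal sketch of a Cloud Run service that an Eventarc trigger could invoke when a BigQuery job completes in the upstream dataset (Eventarc can filter on BigQuery Cloud Audit Log events, per the linked tutorial). Everything here is illustrative: the table names and the refresh query are placeholders carried over from the sketch in the question.

```python
# Minimal sketch of a Cloud Run service invoked by an Eventarc trigger on
# BigQuery audit-log events. All project/dataset/table names are placeholders.
import os

from flask import Flask, request
from google.cloud import bigquery

app = Flask(__name__)
client = bigquery.Client()

# Re-run the join/aggregation and overwrite our downstream table.
REFRESH_QUERY = """
CREATE OR REPLACE TABLE `our-project.our_dataset.orders_enriched` AS
SELECT
  o.order_id,
  o.status,
  COUNT(e.event_id) AS event_count
FROM `upstream-project.nrt_dataset.orders` AS o
LEFT JOIN `upstream-project.nrt_dataset.order_events` AS e
  ON o.order_id = e.order_id
GROUP BY o.order_id, o.status
"""

@app.route("/", methods=["POST"])
def handle_event():
    # Eventarc posts the audit-log entry as a CloudEvent; here it is used
    # only as a signal that new data may have landed upstream.
    event = request.get_json(silent=True)
    print(f"Received event: {event}")
    client.query(REFRESH_QUERY).result()
    return ("", 204)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```

One caveat worth hedging: with NRT streaming, job-completion or insert events can fire very frequently, so in practice you would probably debounce these refreshes or fall back to a scheduled query rather than re-running the query once per event.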

0 Answers