I am looking for the best Google Cloud solution to stream and batch process files into a clean database on BigQuery. I could use Dataflow or Cloud Functions to achieve this. However, I am struggling with the architecture/design and with finding the right solution for my needs, because there is one particular complexity.

The issue is that the source files are delivered asynchronously, and I can only parse an individual file if I also have the previous file.

Therefore, if the delivery of one file triggers the parsing mechanism, I need to find the previous file first before I can actually parse it and store the results in BigQuery. To illustrate:

| t | file delivery | file parsing              | comment                                                           |
|---|---------------|---------------------------|-------------------------------------------------------------------|
| 1 | date3         | date3+date2 & date4+date3 | date2 and date4 are not there yet, so I cannot parse these combos |
| 2 | date2         | date2+date1 & date3+date2 | date1 is not there yet, so I cannot parse this combo              |
| 3 | date6         | date6+date5 & date7+date6 | date5 and date7 are not there yet, so I cannot parse these combos |
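The pairing rule in the table can be sketched as a small function. This is only an illustration under assumptions: files are keyed by consecutive calendar days, and `parseable_pairs` and the concrete dates are hypothetical names and values, not part of any Google Cloud API.

```python
from datetime import date, timedelta


def parseable_pairs(arrived: date, delivered: set[date]) -> list[tuple[date, date]]:
    """Return the (current, previous) pairs that a new delivery completes.

    `delivered` holds the dates that were received before `arrived`.
    """
    one_day = timedelta(days=1)
    pairs = []
    # The new file can be parsed if its predecessor is already there.
    if arrived - one_day in delivered:
        pairs.append((arrived, arrived - one_day))
    # The new file may also be the missing predecessor of its successor.
    if arrived + one_day in delivered:
        pairs.append((arrived + one_day, arrived))
    return pairs


# Replay the delivery order from the table: date3, date2, date6.
delivered: set[date] = set()
for day in (3, 2, 6):
    arrived = date(2024, 1, day)
    print(arrived, parseable_pairs(arrived, delivered))
    delivered.add(arrived)
```

In this replay only the second delivery (date2) yields a pair, namely date3+date2; the other two deliveries have no neighbour yet.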

Another solution is to stream the raw files into BigQuery and then periodically check whether I have the required dates, processing them into a clean table. However, this means that I cannot create a simple trigger on the BigQuery table, as the condition is too complex.
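The periodic check described here could look roughly like the sweep below. It is a sketch only: days are represented as plain integers (date1 is `1`, date2 is `2`, and so on), and `pending_pairs`, `raw_days`, and `processed` are hypothetical names standing in for whatever the raw and clean tables would actually hold.

```python
def pending_pairs(raw_days: set[int], processed: set[tuple[int, int]]) -> list[tuple[int, int]]:
    """Find (current, previous) day pairs that are complete but not yet processed."""
    return sorted(
        (day, day - 1)
        for day in raw_days
        # A pair is complete when both the day and its predecessor are in the raw data.
        if day - 1 in raw_days and (day, day - 1) not in processed
    )


# First sweep: date2, date3 and date6 have been loaded, nothing processed yet.
first = pending_pairs({2, 3, 6}, processed=set())
print(first)

# Later sweep: date5 has arrived and the earlier pair is already in the clean table.
second = pending_pairs({2, 3, 5, 6}, processed=set(first))
print(second)
```

Because the sweep compares against what has already been processed, running it more often than necessary should not produce duplicate work.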

Any suggestions on the architecture and/or solutions to handle this?

WJA

1 Answer

Your architecture should be something like Pub/Sub, GCS buckets, Cloud Functions, and Dataflow.

You could configure Pub/Sub notifications on the bucket so that each file delivery triggers a check for the previous file before the current one is processed. Pub/Sub notifications are well documented here: Pub/Sub notifications for Cloud Storage.
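As a rough sketch of that idea, a notification handler might look like the following. Several things here are assumptions rather than anything stated above: the file naming scheme (`YYYY-MM-DD.csv`), the function names, and the injected `blob_exists` callable, which stands in for a real storage lookup so the example runs without any cloud libraries. The `bucket` and `name` keys mirror the object metadata carried in a Cloud Storage notification.

```python
from datetime import date, timedelta
from typing import Callable


def object_name(day: date) -> str:
    # Hypothetical naming scheme: one file per day, e.g. "2024-01-03.csv".
    return f"{day.isoformat()}.csv"


def on_file_delivered(
    event: dict, blob_exists: Callable[[str, str], bool]
) -> list[tuple[str, str]]:
    """Return the (current, previous) object names that are ready to parse."""
    bucket = event["bucket"]
    name = event["name"]
    day = date.fromisoformat(name.removesuffix(".csv"))
    previous = object_name(day - timedelta(days=1))
    following = object_name(day + timedelta(days=1))

    ready = []
    # The delivered file is parseable if the previous day's file exists.
    if blob_exists(bucket, previous):
        ready.append((name, previous))
    # The delivered file may also unblock the next day's file.
    if blob_exists(bucket, following):
        ready.append((following, name))
    return ready


# Local stand-in for the bucket contents, for illustration only.
stored = {("my-bucket", "2024-01-03.csv")}
event = {"bucket": "my-bucket", "name": "2024-01-02.csv"}
print(on_file_delivered(event, lambda b, n: (b, n) in stored))
```

In a deployed function, `blob_exists` could wrap `Blob.exists()` from the `google-cloud-storage` client, and the returned pairs could be handed to a Dataflow job or parsed in place. Two neighbouring files that arrive at nearly the same time may each see the other and report the same pair, so the parsing step should probably be idempotent.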

A very good post that will help you is this one: Input source reading patterns in Google Cloud Dataflow.

Paddy Popeye