I am looking for the best Google Cloud solution to stream and batch process files into a clean database on BigQuery. I could use Dataflow or Cloud Functions to achieve this, but I am struggling with the architecture/design and with finding the right solution for my needs, because there is one particular complexity.
The issue is that the source files are delivered asynchronously, but I can only parse an individual file if I also have the previous file.
Therefore, if the delivery of one file triggers the parsing mechanism, I first need to find the previous file before I can actually parse and store the results in BigQuery. To illustrate:
| t | file delivery | file parsing              | comment                                                           |
|---|---------------|---------------------------|-------------------------------------------------------------------|
| 1 | date3         | date3+date2 & date4+date3 | date2 and date4 are not there yet, so I cannot parse these combos |
| 2 | date2         | date2+date1 & date3+date2 | date1 is not there yet, so I cannot parse this combo              |
| 3 | date6         | date6+date5 & date7+date6 | date5 and date7 are not there yet, so I cannot parse these combos |
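For the trigger-based route, something like the sketch below is what I have in mind: a Cloud Function on the GCS `finalize` event that checks whether the neighbouring files already exist before parsing a combo. The bucket name, the `data_YYYY-MM-DD.csv` naming scheme and `parse_and_load_pair` are assumptions/placeholders, not my actual setup.

```python
from datetime import datetime, timedelta

from google.cloud import storage

BUCKET = "my-source-bucket"            # assumption: single landing bucket
FILENAME_FORMAT = "data_%Y-%m-%d.csv"  # assumption: file name encodes its date


def parse_and_load_pair(newer_blob, older_blob):
    """Hypothetical: domain-specific parsing of two consecutive files plus
    the load into the clean BigQuery table."""
    raise NotImplementedError


def on_file_delivered(event, context):
    """Background Cloud Function triggered by google.storage.object.finalize."""
    bucket = storage.Client().bucket(BUCKET)
    delivered = datetime.strptime(event["name"], FILENAME_FORMAT).date()

    # The newly delivered file can complete at most two combos:
    # (delivered, delivered-1) and (delivered+1, delivered).
    for newer, older in [
        (delivered, delivered - timedelta(days=1)),
        (delivered + timedelta(days=1), delivered),
    ]:
        newer_blob = bucket.blob(newer.strftime(FILENAME_FORMAT))
        older_blob = bucket.blob(older.strftime(FILENAME_FORMAT))
        if newer_blob.exists() and older_blob.exists():
            parse_and_load_pair(newer_blob, older_blob)
        # If one of the files is missing, do nothing: its later delivery
        # will trigger this function again and the combo is picked up then.
```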
Another solution is to stream the raw files into BigQuery and then periodically check whether I have the consecutive dates, processing them into a clean table. However, this means I cannot work with a simple trigger on the BigQuery table, which makes this approach more complex.
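For this second option I imagine something like a scheduled job (e.g. Cloud Scheduler invoking a Cloud Function) that looks for consecutive-date pairs present in the raw table but not yet in the clean table. The dataset/table/column names (`raw_files`, `clean`, `file_date`, `newer_date`) and `process_pair` are assumptions for the sketch.

```python
from google.cloud import bigquery

# Assumed schema: raw_files holds the streamed raw data with a file_date column;
# clean records the newer date of every combo that has already been parsed.
FIND_READY_PAIRS = """
SELECT DISTINCT cur.file_date AS newer_date, prev.file_date AS older_date
FROM `my_project.my_dataset.raw_files` AS cur
JOIN `my_project.my_dataset.raw_files` AS prev
  ON prev.file_date = DATE_SUB(cur.file_date, INTERVAL 1 DAY)
LEFT JOIN `my_project.my_dataset.clean` AS done
  ON done.newer_date = cur.file_date
WHERE done.newer_date IS NULL
"""


def process_pair(newer_date, older_date):
    """Hypothetical: parse the raw rows for the two dates and write to clean."""
    raise NotImplementedError


def check_and_process(request):
    """HTTP Cloud Function, invoked periodically by Cloud Scheduler."""
    client = bigquery.Client()
    for row in client.query(FIND_READY_PAIRS).result():
        process_pair(row.newer_date, row.older_date)
    return "ok"
```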
Any suggestions on the architecture and/or solutions to handle this?