I have a large batch of new files that can't be matched exclusively by a wildcard string (i.e. there may be other files with the same structure, in the same folders, that were already uploaded and processed). I want to process each of those new files through a Dataflow job.
I was originally thinking I would use a Cloud Function with a Cloud Storage trigger to launch a Dataflow job for each new file, but the files can arrive in bursts of more than 25, which would exceed the 25-concurrent-jobs quota, and then the launches would start failing.
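For reference, this is roughly what I had in mind for the per-file approach (a minimal Python sketch; the project, region, template path, and the `inputFile` parameter name are placeholders for whatever the real template expects):

```python
# main.py - Cloud Function sketch, triggered on google.storage.object.finalize.
# Launches one templated Dataflow job per new object. Placeholder names throughout.
import hashlib
from googleapiclient.discovery import build

PROJECT = "my-project"                               # placeholder project id
REGION = "us-central1"                               # placeholder region
TEMPLATE = "gs://my-bucket/templates/process-file"   # placeholder template path

def on_new_file(event, context):
    """Background Cloud Function entry point for a GCS finalize event."""
    bucket = event["bucket"]
    name = event["name"]

    # Derive a job name that satisfies Dataflow's lowercase/alphanumeric rules.
    job_name = "process-" + hashlib.md5(name.encode()).hexdigest()[:12]

    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    response = dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE,
        body={
            "jobName": job_name,
            "parameters": {"inputFile": f"gs://{bucket}/{name}"},
        },
    ).execute()
    print(f"Launched job {response['job']['id']} for gs://{bucket}/{name}")
```

This works fine for a trickle of files, but as soon as a burst of 25+ lands, the later launches hit the concurrent-jobs quota.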
The best I've come up with is queuing the file names in Pub/Sub. Since Dataflow's only built-in option for Pub/Sub is a streaming read, and these files don't arrive often enough to justify a long-running streaming job, I was thinking of writing a custom Dataflow source that we could schedule to run on roughly an hourly basis and process the queued files (rough sketch of the idea below).
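The variant I've sketched so far sidesteps the custom source by draining the subscription at launch time instead: a small scheduled launcher (e.g. Cloud Scheduler plus a Cloud Function) pulls the queued paths synchronously and hands them all to one batch job. This assumes each message body is a `gs://` path, and the subscription name and per-file transforms are placeholders:

```python
# Scheduled launcher sketch: drain queued file paths from Pub/Sub, then run one
# batch Dataflow job over all of them. Assumes google-cloud-pubsub >= 2.x.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1

SUBSCRIPTION = "projects/my-project/subscriptions/new-files"  # placeholder

def pull_file_paths(max_messages=1000):
    """Synchronously pull queued messages; each message's data is a gs:// path.

    Note: a single pull may not return everything in the backlog, and acking
    before the job succeeds risks dropping files if the job fails.
    """
    subscriber = pubsub_v1.SubscriberClient()
    response = subscriber.pull(
        request={"subscription": SUBSCRIPTION, "max_messages": max_messages}
    )
    paths = [m.message.data.decode("utf-8") for m in response.received_messages]
    ack_ids = [m.ack_id for m in response.received_messages]
    if ack_ids:
        subscriber.acknowledge(
            request={"subscription": SUBSCRIPTION, "ack_ids": ack_ids}
        )
    return paths

def run():
    paths = pull_file_paths()
    if not paths:
        return  # nothing queued this hour
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                 # placeholder
        region="us-central1",
        temp_location="gs://my-bucket/tmp",   # placeholder
    )
    with beam.Pipeline(options=options) as p:
        (p
         | "Files" >> beam.Create(paths)
         | "Read" >> beam.io.ReadAllFromText()          # placeholder per-file read
         | "Process" >> beam.Map(lambda line: line))    # placeholder transform

if __name__ == "__main__":
    run()
```

That collapses each hourly burst into a single job, but it feels like I'm working around the platform rather than with it.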
Is there a better option?