
I have a large batch of new files that can't be matched exclusively by a wildcard string (i.e. there may be other files of the same structure in the same folders that were already uploaded and processed), and I want to process each of those files through a Dataflow job.

I was originally thinking that I would use a Cloud Function with a Cloud Storage trigger to launch a Dataflow job for each new file, but those files can show up in bursts of more than 25, which would exceed the 25-concurrent-jobs quota and the launches would start failing.
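For reference, here is roughly what I had in mind for that approach: a minimal sketch of a Python background Cloud Function launching a pre-staged Dataflow template per file. The project, region, template path, and parameter names are illustrative, not an existing setup.

```python
# main.py -- Cloud Function triggered by google.storage.object.finalize.
# Sketch only: project, region, template path and parameters are made up.
from googleapiclient.discovery import build

PROJECT = "my-project"
REGION = "us-central1"
TEMPLATE = "gs://my-bucket/templates/process-file"  # hypothetical pre-staged template

def on_new_file(event, context):
    """Launches one Dataflow template job for the file that triggered the function."""
    gcs_path = f"gs://{event['bucket']}/{event['name']}"
    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE,
        body={
            # Dataflow job names must be lowercase letters, digits and hyphens.
            "jobName": "process-" + event["name"].lower().replace("/", "-").replace(".", "-"),
            "parameters": {"inputFile": gcs_path},
        },
    ).execute()
```

The problem is that each invocation launches its own job, which is exactly what runs into the concurrent-jobs quota during bursts.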

The best I've come up with is queuing them up in Pub/Sub and, since the only Pub/Sub option in Dataflow is streaming and this isn't happening often enough to make that worthwhile, writing a custom Dataflow source that we could schedule to run on an hourly or so basis to process the files.
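Concretely, the scheduled batch alternative would look something like the sketch below: match the wildcard, drop files already handled by earlier runs, and read the rest. The bucket pattern and the "already processed" bookkeeping are assumptions for illustration only.

```python
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical bookkeeping: paths handled by earlier runs, loaded from wherever
# they are tracked (a GCS manifest, Datastore, etc.).
ALREADY_PROCESSED = set()

def process_record(line):
    # Placeholder for the existing per-record transform.
    return line

def run(argv=None):
    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        (p
         | "MatchFiles" >> fileio.MatchFiles("gs://my-bucket/incoming/*.csv")
         | "OnlyNewFiles" >> beam.Filter(lambda m: m.path not in ALREADY_PROCESSED)
         | "ReadMatches" >> fileio.ReadMatches()
         | "ReadLines" >> beam.FlatMap(lambda f: f.read_utf8().splitlines())
         | "Process" >> beam.Map(process_record))

if __name__ == "__main__":
    run()
```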

Is there a better option?

Steve
  • Realizing my solution of queueing them up in Pub/Sub is a no-no: because Pub/Sub requires acknowledging a message to get the next one, it doesn't really fit the batch architecture in Dataflow that would take advantage of processing multiple files at a time. Bottom line: the Dataflow solution was great for transforming and loading all the files that already existed in our system, but a single Compute Engine task seems like a better solution for handling new files as they come in. – Steve Jun 16 '17 at 20:32
  • Now there is: please see https://stackoverflow.com/questions/47896488/watching-for-new-files-matching-a-filepattern-in-apache-beam/47896489#47896489. – jkff Dec 19 '17 at 23:09
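Following up on jkff's link: in a streaming pipeline, the file-watching approach from that answer maps to something like this sketch, assuming a Beam Python SDK recent enough to include fileio.MatchContinuously; the pattern and polling interval are illustrative.

```python
# Streaming pipeline that continuously watches for new files matching a pattern.
# Sketch only: assumes fileio.MatchContinuously is available in the SDK in use.
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions

def run(argv=None):
    opts = PipelineOptions(argv, streaming=True)
    with beam.Pipeline(options=opts) as p:
        (p
         | "WatchForFiles" >> fileio.MatchContinuously(
               "gs://my-bucket/incoming/*.csv", interval=300)  # poll every 5 minutes
         | "ReadMatches" >> fileio.ReadMatches()
         | "ReadLines" >> beam.FlatMap(lambda f: f.read_utf8().splitlines())
         | "Process" >> beam.Map(lambda line: line))  # placeholder transform

if __name__ == "__main__":
    run()
```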

0 Answers