You can try using BlockingDataflowPipelineRunner and running arbitrary logic in your main program after p.run() (it will wait for the pipeline to finish). See Specifying Execution Parameters, specifically the "Blocking execution" section.
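For example, a minimal sketch against the Dataflow Java SDK (the class name and pipeline contents are placeholders):

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.runners.BlockingDataflowPipelineRunner;

    public class OneShotImport {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation()
                .as(DataflowPipelineOptions.class);
        // With the blocking runner, p.run() does not return until the job finishes.
        options.setRunner(BlockingDataflowPipelineRunner.class);

        Pipeline p = Pipeline.create(options);
        // ... build your import pipeline here ...
        p.run();

        // Anything here runs only after the job has completed, e.g.
        // kicking off the next import or recording that this batch is done.
      }
    }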
However, in general, it seems that you really want a continuously running pipeline that watches the directory with CSV files and imports new files as they appear, never importing the same file twice. This would be a great case for a streaming pipeline: you could write a custom UnboundedSource (see also Custom Sources and Sinks) that watches a directory and returns the filenames in it (i.e. the element type T would probably be String or GcsPath):
    p.apply(Read.from(new DirectoryWatcherSource(directory)))
     .apply(ParDo.of(new ReadCSVFileByName()))
     .apply(/* the rest of your pipeline */);
where DirectoryWatcherSource is your UnboundedSource, and ReadCSVFileByName is another transform you'll need to write: it takes a file path, reads it as a CSV file, and returns the records in it. (Unfortunately, right now you cannot use transforms like TextIO.Read in the middle of a pipeline, only at the beginning; we're working on fixing this.)
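For instance, a bare-bones ReadCSVFileByName might look like this (a hypothetical sketch that reads a locally visible path with plain Java I/O and emits one String per line; for files on GCS you'd open the file through the SDK's GCS utilities instead, and you'd likely want real CSV parsing rather than raw lines):

    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import java.io.BufferedReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Takes a file path and emits each line of the CSV as a String record.
    // Parsing lines into typed records is left to a downstream transform.
    class ReadCSVFileByName extends DoFn<String, String> {
      @Override
      public void processElement(ProcessContext c) throws Exception {
        try (BufferedReader reader = Files.newBufferedReader(
            Paths.get(c.element()), StandardCharsets.UTF_8)) {
          String line;
          while ((line = reader.readLine()) != null) {
            c.output(line);
          }
        }
      }
    }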
This may be somewhat tricky, and as I said we have some features in the works to make it a lot simpler (we're also considering creating a built-in source like that), but it's possible that for now this would still be easier than "pinball jobs". Please give it a try and let us know at dataflow-feedback@google.com if anything is unclear!
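In case it helps to see the shape, here is a heavily simplified, hypothetical sketch of such a DirectoryWatcherSource: it watches a locally visible directory, uses the set of already-emitted filenames as its checkpoint, and punts on watermarks (a real implementation would need to handle those carefully):

    import com.google.cloud.dataflow.sdk.coders.Coder;
    import com.google.cloud.dataflow.sdk.coders.SerializableCoder;
    import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;
    import com.google.cloud.dataflow.sdk.io.UnboundedSource;
    import com.google.cloud.dataflow.sdk.options.PipelineOptions;
    import org.joda.time.Instant;
    import java.io.File;
    import java.io.IOException;
    import java.io.Serializable;
    import java.util.ArrayDeque;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.List;
    import java.util.NoSuchElementException;
    import java.util.Queue;
    import java.util.Set;

    public class DirectoryWatcherSource
        extends UnboundedSource<String, DirectoryWatcherSource.SeenFiles> {

      private final String directory;

      public DirectoryWatcherSource(String directory) { this.directory = directory; }

      // The checkpoint is the set of filenames already emitted, so a restarted
      // reader never returns the same file twice.
      public static class SeenFiles implements CheckpointMark, Serializable {
        final Set<String> seen;
        SeenFiles(Set<String> seen) { this.seen = seen; }
        @Override public void finalizeCheckpoint() {}
      }

      @Override
      public List<DirectoryWatcherSource> generateInitialSplits(
          int desiredNumSplits, PipelineOptions options) {
        // One watcher is enough; downstream stages still parallelize.
        return Collections.singletonList(this);
      }

      @Override
      public UnboundedReader<String> createReader(
          PipelineOptions options, SeenFiles checkpoint) {
        Set<String> seen = (checkpoint == null)
            ? new HashSet<String>() : new HashSet<String>(checkpoint.seen);
        return new Reader(this, seen);
      }

      @Override
      public Coder<SeenFiles> getCheckpointMarkCoder() {
        return SerializableCoder.of(SeenFiles.class);
      }

      @Override public void validate() {}
      @Override public Coder<String> getDefaultOutputCoder() { return StringUtf8Coder.of(); }

      private static class Reader extends UnboundedReader<String> {
        private final DirectoryWatcherSource source;
        private final Set<String> seen;
        private final Queue<String> pending = new ArrayDeque<>();
        private String current;

        Reader(DirectoryWatcherSource source, Set<String> seen) {
          this.source = source;
          this.seen = seen;
        }

        @Override public boolean start() throws IOException { return advance(); }

        @Override
        public boolean advance() throws IOException {
          if (pending.isEmpty()) {
            // Re-list the directory; queue filenames we haven't emitted before.
            File[] files = new File(source.directory).listFiles();
            if (files != null) {
              for (File f : files) {
                if (f.isFile() && seen.add(f.getAbsolutePath())) {
                  pending.add(f.getAbsolutePath());
                }
              }
            }
          }
          current = pending.poll();
          return current != null;
        }

        @Override public String getCurrent() {
          if (current == null) { throw new NoSuchElementException(); }
          return current;
        }
        @Override public Instant getCurrentTimestamp() { return Instant.now(); }
        // A real source needs a meaningful watermark; "now" is a placeholder.
        @Override public Instant getWatermark() { return Instant.now(); }
        @Override public CheckpointMark getCheckpointMark() {
          return new SeenFiles(new HashSet<String>(seen));
        }
        @Override public UnboundedSource<String, ?> getCurrentSource() { return source; }
        @Override public void close() {}
      }
    }

You'd run this in streaming mode (--streaming=true), and for a GCS directory you'd list objects via the SDK's GCS utilities rather than java.io.File. Note also that the checkpoint here grows without bound as files accumulate; a production version would want something more compact, e.g. a last-modified-time high-water mark.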
Meanwhile, you can also store information about which files you have or haven't processed in Cloud Bigtable. It'd be a better fit for that than BigQuery: Bigtable is more suited to random writes and lookups, while BigQuery is more suited to large bulk writes and queries over the full dataset.
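If you go that route, a rough sketch of the bookkeeping using a recent Cloud Bigtable HBase client might look like this (the project, instance, table, and column names are all placeholders; checkAndPut gives you an atomic "claim this file if it hasn't been processed yet" operation):

    import com.google.cloud.bigtable.hbase.BigtableConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ProcessedFiles {
      private static final byte[] FAMILY = Bytes.toBytes("f");
      private static final byte[] QUALIFIER = Bytes.toBytes("processed");

      // Returns true if we claimed the file (i.e. it was not processed before).
      static boolean tryClaim(Table table, String filename) throws Exception {
        byte[] row = Bytes.toBytes(filename);
        Put put = new Put(row).addColumn(FAMILY, QUALIFIER, Bytes.toBytes(true));
        // Atomically write only if the cell is still absent (null = "no value").
        return table.checkAndPut(row, FAMILY, QUALIFIER, null, put);
      }

      public static void main(String[] args) throws Exception {
        try (Connection conn = BigtableConfiguration.connect("my-project", "my-instance");
             Table table = conn.getTable(TableName.valueOf("processed-files"))) {
          System.out.println(tryClaim(table, "gs://my-bucket/file1.csv"));
        }
      }
    }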