
Is it possible to move a file in GCS after the Dataflow pipeline has finished running? If so, how? Should it be the last .apply? I can't imagine that being the case.

The case here is that we are importing a lot of CSVs from a client. We need to keep those CSVs indefinitely, so we either need to mark each CSV as already handled, or move it out of the initial folder that TextIO uses to find the CSVs. The only thing I can currently think of is storing the file name (I'm not even sure how I'd get that; I'm a Dataflow newbie) in BigQuery, perhaps, and then somehow excluding files that have already been stored from the execution pipeline. But there has to be a better approach.

Is this possible? What should I check out?

Thanks for any help!

iLikeBreakfast

1 Answer


You can try using BlockingDataflowPipelineRunner and running arbitrary logic in your main program after p.run() returns (that runner waits for the pipeline to finish).

See Specifying Execution Parameters, specifically the section "Blocking execution".
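
For example, here's a minimal sketch of that approach. It assumes the Dataflow Java SDK 1.x and the google-cloud-storage client library for the post-run file move; the bucket and object names are placeholders:

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.BlockingDataflowPipelineRunner;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class MoveCsvAfterPipeline {
  public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(BlockingDataflowPipelineRunner.class);

    Pipeline p = Pipeline.create(options);
    // ... build the pipeline that reads the CSVs with TextIO.Read, etc. ...
    p.run();  // blocks until the pipeline has finished

    // The pipeline is done, so it's now safe to move the input file(s).
    // GCS has no atomic "move", so copy then delete.
    Storage storage = StorageOptions.getDefaultInstance().getService();
    Blob source = storage.get(BlobId.of("my-bucket", "incoming/data.csv"));
    source.copyTo(BlobId.of("my-bucket", "processed/data.csv"));
    source.delete();
  }
}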

However, in general, it seems that you really want a continuously running pipeline that watches the directory with CSV files and imports new files as they appear, never importing the same file twice. This would be a great case for a streaming pipeline: you could write a custom UnboundedSource (see also Custom Sources and Sinks) that watches a directory and returns the filenames in it (i.e. the element type T would probably be String or GcsPath):

p.apply(Read.from(new DirectoryWatcherSource(directory)))
 .apply(ParDo.of(new ReadCSVFileByName()))
 .apply(/* the rest of your pipeline */);

where DirectoryWatcherSource is your UnboundedSource, and ReadCSVFileByName is another transform you'll need to write: it takes a file path, reads it as a CSV file, and returns the records in it (unfortunately, right now you cannot use transforms like TextIO.Read in the middle of a pipeline, only at the beginning - we're working on fixing this).
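
To make that concrete, here's a rough sketch of what ReadCSVFileByName could look like. This is an assumed shape, not a prescribed implementation: it treats each input element as a full gs:// path, opens it through the SDK's GcsUtil, and emits the raw lines for downstream parsing (class and package names are from the 1.x Java SDK):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.channels.Channels;
import com.google.cloud.dataflow.sdk.options.GcsOptions;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.util.GcsUtil;
import com.google.cloud.dataflow.sdk.util.gcsfs.GcsPath;

class ReadCSVFileByName extends DoFn<String, String> {
  @Override
  public void processElement(ProcessContext c) throws IOException {
    GcsPath path = GcsPath.fromUri(c.element());
    GcsUtil gcsUtil = c.getPipelineOptions().as(GcsOptions.class).getGcsUtil();
    try (BufferedReader reader = new BufferedReader(
        Channels.newReader(gcsUtil.open(path), "UTF-8"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        c.output(line);  // emit one raw CSV line; parse it in a later transform
      }
    }
  }
}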

This may be somewhat tricky, and, as I said, we have some features in the works to make it a lot simpler (we're also considering creating a built-in source like that), but it's possible that for now this would still be easier than "pinball jobs". Please give it a try and let us know at dataflow-feedback@google.com if anything is unclear!

Meanwhile, you can also store information about which files you have or haven't processed in Cloud Bigtable - it'd be a better fit for that than BigQuery, because it's more suited for random writes and lookups, while BigQuery is more suited for large bulk writes and queries over the full dataset.
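
For illustration, here's a minimal sketch of that bookkeeping, assuming the Cloud Bigtable HBase client; the table name processed_files, the column family cf, and the helper itself are hypothetical:

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import com.google.cloud.bigtable.hbase.BigtableConfiguration;

class ProcessedFilesTracker {
  // Returns true if this call claimed the file (i.e. it hadn't been processed yet).
  static boolean markIfNew(String projectId, String instanceId, String gcsPath)
      throws IOException {
    try (Connection connection = BigtableConfiguration.connect(projectId, instanceId);
         Table table = connection.getTable(TableName.valueOf("processed_files"))) {
      byte[] row = Bytes.toBytes(gcsPath);  // row key = full gs:// path
      Put put = new Put(row);
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("processed_at"),
          Bytes.toBytes(System.currentTimeMillis()));
      // checkAndPut only applies the Put if the cell is currently absent,
      // so two jobs racing on the same file can't both claim it.
      return table.checkAndPut(row, Bytes.toBytes("cf"),
          Bytes.toBytes("processed_at"), null, put);
    }
  }
}

Using an atomic checkAndPut like this also avoids the race you'd get from checking and writing in two separate steps.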

jkff
  • Thanks, I was under the impression the `BlockingDataflowPipelineRunner` was an asynchronous operation (should have properly thought through the "Blocking" part of the name). Thanks for clearing that up. Now that you mention the continuous directory watching, are you saying that it's not possible at the moment? We'll probably then have a couple of pinball jobs executing previously mentioned checks (whether the file has already been done) every couple of hours, but that could potentially cause some race conditions - any suggestions? – iLikeBreakfast Oct 11 '15 at 10:13
  • I edited my answer to provide another alternative and explain more about the potential "in the works" features. – jkff Oct 11 '15 at 19:15
  • Excellent, makes much more sense now! Thanks! – iLikeBreakfast Oct 12 '15 at 09:34
  • @jkff Have any new features in Google Cloud Dataflow been implemented since the original question was asked in 2015? It seems really cumbersome and hacky to do usual ETL pre/post processing steps for a pipeline. What is the best practice approach today? – jimmy Mar 27 '17 at 21:58
  • I think my response to the other question makes sense here too: http://stackoverflow.com/questions/36365058/triggering-a-dataflow-job-when-new-files-are-added-to-cloud-storage/36365371?noredirect=1#comment73198732_36365371 – jkff Mar 28 '17 at 15:51