
I am trying to insert data from Cloud Storage into BigQuery using Dataflow (Java). I can batch-load the data; however, I want to set up a streaming load instead, so that as new objects are added to my bucket, they get pushed to BigQuery.

I have set the PipelineOptions to streaming, and the GCP Console UI shows the Dataflow pipeline as a streaming type. My initial set of files/objects in the bucket gets pushed to BigQuery.

But as I add new objects to my bucket, these do not get pushed to BigQuery. Why is that? How can I push objects that are added to my Cloud Storage bucket to BigQuery using a streaming Dataflow pipeline?

// Imports (Apache Beam 2.x with the Dataflow runner)
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Specify PipelineOptions
DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setProject(<project-name>);
options.setStagingLocation(<bucket/staging folder>);
options.setStreaming(true);
options.setRunner(DataflowRunner.class);
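
For context (this is not code from the question), the read/write side of such a pipeline would typically look something like the sketch below; the names in angle brackets and the parsing logic are placeholders. Note that a plain TextIO.read() over a file pattern is a bounded source, which is relevant to the answer further down.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

Pipeline p = Pipeline.create(options);

p.apply("ReadFromGCS", TextIO.read().from("gs://<bucket>/input/*.json"))  // bounded: only the files present at launch
 .apply("ParseLines", ParDo.of(new DoFn<String, TableRow>() {             // placeholder parsing logic
     @ProcessElement
     public void processElement(ProcessContext c) {
       c.output(new TableRow().set("raw_line", c.element()));
     }
 }))
 .apply("WriteToBQ", BigQueryIO.writeTableRows()
     .to("<project>:<dataset>.<table>")                                   // assumes the table already exists
     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

p.run();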

My interpretation is that because this is a streaming pipeline, as I add objects to Cloud Storage, they will get pushed to BigQuery.

Please suggest.

– Andy Cooper
  • Related: https://stackoverflow.com/questions/48197916/automate-file-upload-from-google-cloud-storage-to-bigquery – Lee Jun 03 '18 at 11:25

1 Answer


How do you create your input collection? You need an unbounded input for the streaming pipeline to keep running; otherwise the job finishes once the bounded input has been processed (although it will still use streaming inserts). You could achieve this by reading from a Pub/Sub subscription that receives all the changes in your bucket; see https://cloud.google.com/storage/docs/pubsub-notifications for details.

– selator
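
Not part of the answer above, but a rough sketch of what the suggested approach could look like once Pub/Sub notifications are enabled on the bucket; the subscription path, table name, and the notification-handling DoFn are placeholders/assumptions.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

Pipeline p = Pipeline.create(options);  // options with setStreaming(true) as in the question

p.apply("ReadGcsNotifications",
        // Unbounded source: the job keeps running and receives one message per bucket change
        PubsubIO.readStrings().fromSubscription("projects/<project>/subscriptions/<subscription>"))
 .apply("HandleNotification", ParDo.of(new DoFn<String, TableRow>() {
     @ProcessElement
     public void processElement(ProcessContext c) {
       // The message body is object metadata (bucket, object name, ...), not the object contents.
       // A real DoFn would read the object from GCS, parse each line, and emit rows;
       // this placeholder just wraps the raw notification payload.
       c.output(new TableRow().set("raw_notification", c.element()));
     }
 }))
 .apply("WriteToBQ", BigQueryIO.writeTableRows()
     .to("<project>:<dataset>.<table>")
     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)  // table assumed to exist
     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

p.run();

Enabling the notifications themselves is done outside the pipeline (for example with gsutil notification create, as described in the linked docs).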
  • thanks for your response. The input will be files uploaded regularly by me or someone else. I was thinking that since I have created a streaming pipeline, it would just take any input from Cloud Storage and push it to Pub/Sub via a streaming data pipeline. From there another data pipeline would carry it over to BigQuery. But I see your point - because I am manually uploading the files regularly to Cloud Storage, it represents a "bounded" input. – Andy Cooper Jun 02 '18 at 23:57
  • As an alternate architecture - can I use Cloud Functions to create a Dataflow pipeline whenever there are changes to the Cloud Storage bucket? That way the Cloud Function + Dataflow pipeline would carry the data to Pub/Sub, and from there another streaming Dataflow pipeline would carry it over to BigQuery. As an example: https://codelabs.developers.google.com/codelabs/iot-data-pipeline/index.html?index=..%2F..%2Findex#0 See step #7. – Andy Cooper Jun 03 '18 at 00:04
  • The notification configuration sends object metadata to Pub/Sub. What if I wanted the actual object data to be pushed to Pub/Sub? My use case is that I need to take the object/file, read each line, parse it, do some transforms, then push it to BigQuery. – Andy Cooper Jun 03 '18 at 00:07
  • @AndyCooper https://stackoverflow.com/questions/48197916/automate-file-upload-from-google-cloud-storage-to-bigquery – Lee Jun 03 '18 at 11:25
  • I noticed that in Apache Beam 2.2 you can watch for new files. – Andy Cooper Jun 03 '18 at 14:07
  • Here's a post that shows how you can use watchForNewFiles in GCS: https://stackoverflow.com/questions/47896488/watching-for-new-files-matching-a-filepattern-in-apache-beam/47896489#47896489 I wrote the same code, but in my ParDo method I transform the input data & delete the file in GCS. My code doesn't run as expected: it only deletes the file in GCS but doesn't do the transform & write to BigQuery. Why is that? What is the difference between a ParDo (DoFn) & a Splittable DoFn? (See the watchForNewFiles sketch after this comment thread.) – Andy Cooper Jun 03 '18 at 14:15
  • To sum it up, there are 3 ways to do this: 1. Use a Splittable DoFn (watchForNewFiles), but runner support is limited and it requires a splittable DoFn, while I am only doing a simple ParDo/DoFn. 2. Use cron jobs with Google App Engine to trigger a Dataflow job. 3. Use Cloud Functions to trigger a Dataflow job. – Andy Cooper Jun 03 '18 at 23:05
  • Question - how do you trigger a Dataflow job from Cloud Functions? I have written the Dataflow source and it works fine when I run it locally or with DataflowRunner, but I would like to trigger it using Cloud Functions. This URL asks me to compile the Dataflow code as a JAR: https://cloud.google.com/blog/big-data/2016/04/scheduling-dataflow-pipelines-using-app-engine-cron-service-or-cloud-functions while this one suggests creating a template: https://dzone.com/articles/triggering-dataflow-pipelines-with-cloud-functions Which one is correct? I am confused - can anyone elaborate? Newbie trying to learn. (A rough sketch of the template launch call follows this comment thread.) – Andy Cooper Jun 03 '18 at 23:06
  • Dataflow jobs can be launched from a JAR file or from a staged template so the same applies to jobs launched with Cloud Functions. In the first case, the Node.js spawn command will use a local Java runtime folder to execute the JAR. You can test the template approach with one of the [Google-provided ones](https://cloud.google.com/dataflow/docs/templates/provided-templates) such as in [this answer](https://stackoverflow.com/a/48601579/6121516). – Guillem Xercavins Jun 04 '18 at 10:59
  • @GuillemXercavins - Thank you so much for clarifying. As a newbie I am struggling to understand the code. Here is another link: https://stackoverflow.com/questions/35415868/launching-cloud-dataflow-from-cloud-functions What do the lines mean? ['-cp', 'MY_JAR.jar', 'com.google.cloud.dataflow.examples.WordCount'.. I am asking some really stupid questions but I can't find enough details to walk me through. – Andy Cooper Jun 04 '18 at 16:08
  • Hi Andy, if selator's answer helped you please also consider voting it up :)! – Willian Fuks Jun 05 '18 at 02:02
  • With `-cp` you specify the classpath or location of the user-defined classes and packages ([see here](https://en.wikipedia.org/wiki/Classpath_(Java))). `com.google.cloud.dataflow.examples.WordCount` will be the main class here and the rest are the runtime parameters such as `project` or `inputFile`. – Guillem Xercavins Jun 05 '18 at 08:10
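
Following up on the watchForNewFiles comments above, here is a minimal sketch of that approach (available since Apache Beam 2.2). The file pattern, poll interval, and termination condition are illustrative assumptions, and it needs a runner with Splittable DoFn support, as discussed in the comments.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

Pipeline p = Pipeline.create(options);

// watchForNewFiles turns the file-pattern read into an unbounded source that keeps
// polling the bucket, so the streaming job stays up and picks up new objects.
PCollection<String> lines = p.apply("WatchGCS",
    TextIO.read()
        .from("gs://<bucket>/incoming/*.json")
        .watchForNewFiles(
            Duration.standardSeconds(30),   // how often to poll for new files
            Watch.Growth.never()));         // keep watching indefinitely

// ...then parse `lines` and write to BigQuery exactly as in a bounded pipeline...
p.run();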
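
And on triggering a job from Cloud Functions: whichever environment does the triggering (a Node.js Cloud Function, App Engine cron, or plain Java), the template route ultimately calls the Dataflow templates.launch API. Below is a rough Java sketch using the google-api-services-dataflow client; the template location and the parameter names/values are placeholders to replace with those of a Google-provided or self-staged template.

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.LaunchTemplateParameters;
import com.google.api.services.dataflow.model.LaunchTemplateResponse;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class LaunchTemplateSketch {
  public static void main(String[] args) throws Exception {
    GoogleCredential credential = GoogleCredential.getApplicationDefault()
        .createScoped(Collections.singleton("https://www.googleapis.com/auth/cloud-platform"));

    Dataflow dataflow = new Dataflow.Builder(
            GoogleNetHttpTransport.newTrustedTransport(),
            JacksonFactory.getDefaultInstance(),
            credential)
        .setApplicationName("gcs-to-bq-launcher")   // arbitrary name
        .build();

    // Runtime parameters: names depend on the template you launch (placeholders here).
    Map<String, String> params = new HashMap<>();
    params.put("inputFilePattern", "gs://<bucket>/incoming/*.json");
    params.put("outputTable", "<project>:<dataset>.<table>");

    LaunchTemplateParameters launchParams = new LaunchTemplateParameters()
        .setJobName("gcs-to-bq-" + System.currentTimeMillis())
        .setParameters(params);

    LaunchTemplateResponse response = dataflow.projects().templates()
        .launch("<project>", launchParams)
        .setGcsPath("gs://<template-location>")   // a Google-provided template or one you staged yourself
        .execute();

    System.out.println("Launched Dataflow job: " + response.getJob().getId());
  }
}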