Questions tagged [google-cloud-dataflow]

Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.

Cloud Dataflow is part of the Google Cloud Platform.

5328 questions

7 votes · 1 answer

Custom DNS resolver for Google Cloud Dataflow pipeline

I am trying to access Kafka and 3rd-party services (e.g., InfluxDB) running in GKE, from a Dataflow pipeline. I have a DNS server for service discovery, also running in GKE. I also have a route in my network to access the GKE IP range from Dataflow…
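
Not from the question, but a minimal sketch of the usual prerequisite for this kind of setup: the Dataflow workers have to be launched in the same VPC network as the GKE cluster before any in-cluster service (or DNS server) is reachable at all. The project, region and network names below are placeholders, and this does not by itself change DNS resolution on the workers.

    # Sketch (Python SDK): place Dataflow workers in the VPC that can reach the GKE cluster.
    # All project/region/network values are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="europe-west1",
        temp_location="gs://my-bucket/tmp",
        network="my-vpc",                                         # VPC shared with GKE
        subnetwork="regions/europe-west1/subnetworks/my-subnet",  # must be in the worker region
    )

    with beam.Pipeline(options=options) as p:
        _ = p | beam.Create(["ping"]) | beam.Map(print)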

7 votes · 1 answer

GCP Dataflow: System Lag for streaming from Pub/Sub IO

We use "System Lag" to check the health of our Dataflow jobs. For example if we see an increase in system lag, we will try to see how to bring this metric down. There are few question regarding this metric. 1) What does system lag exactly…
user_1357

7 votes · 2 answers

Cancelling jobs without data loss on Dataflow

I'm trying to find a way to gracefully end my jobs, so as not to lose any data, streaming from PubSub and writing to BigQuery. A possible approach I can envision is to have the job stop pulling new data and then run until it has processed everything,…
MffnMn
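
For reference (not part of the question): Dataflow distinguishes cancelling a job, which stops it immediately and can drop in-flight data, from draining it, which stops pulling new Pub/Sub data and finishes processing what has already been read before shutting down. A drain can be requested from the console or the gcloud CLI; JOB_ID and the region below are placeholders.

    gcloud dataflow jobs drain JOB_ID --region=us-central1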

7 votes · 1 answer

How do I make sure my Dataflow pipeline scales?

We've often seen people write Dataflow pipelines that don't scale well. This is frustrating since Dataflow is meant to scale transparently, but there still are some antipatterns in Dataflow pipelines that make it difficult to scale. What are some…
Reuven Lax
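
One commonly cited antipattern, sketched here as a general illustration rather than an answer from the thread: a high fan-out step gets fused with the expensive work that follows it, so the expanded elements never get redistributed across workers. Inserting a Reshuffle breaks that fusion; the transforms below are placeholders.

    # Sketch: break fusion after a high fan-out step so Dataflow can rebalance the work.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | "Read" >> beam.Create(["a", "b", "c"])            # placeholder source
         | "FanOut" >> beam.FlatMap(lambda x: [x] * 10000)   # emits many elements per input
         | "BreakFusion" >> beam.Reshuffle()                  # redistributes elements across workers
         | "HeavyWork" >> beam.Map(lambda x: x.upper()))      # placeholder expensive step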

7 votes · 1 answer

Using a custom Dataflow unbounded source on DirectPipelineRunner

I'm writing a custom Dataflow unbounded data source that reads from Kafka 0.8. I'd like to run it locally using the DirectPipelineRunner. However, I'm getting the following stack trace: Exception in thread "main" java.lang.IllegalStateException: no…
Thomas

7 votes · 2 answers

Long lived state with Google Dataflow

Just trying to get my head around the programming model here. The scenario: I'm using Pub/Sub + Dataflow to instrument analytics for a web forum. I have a stream of data coming from Pub/Sub that looks like:

    ID | TS | EventType
    1  | 1  | Create
    1  | 2  | …
bfabry
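
The question predates it, but in current Beam the usual way to keep long-lived per-key state in a streaming pipeline is a stateful DoFn. A minimal sketch, assuming the Pub/Sub stream has already been turned into (entity_id, event_type) pairs; that element shape and the coder are assumptions, not from the question.

    # Sketch: per-key state that accumulates the event types seen for each ID.
    import apache_beam as beam
    from apache_beam.coders import StrUtf8Coder
    from apache_beam.transforms.userstate import BagStateSpec

    class TrackEntityEvents(beam.DoFn):
        EVENTS = BagStateSpec("events", StrUtf8Coder())

        def process(self, element, events=beam.DoFn.StateParam(EVENTS)):
            entity_id, event_type = element
            events.add(event_type)
            # Emit the full event history observed so far for this key.
            yield entity_id, list(events.read())

State is scoped per key and window, so a pipeline that wants state to live indefinitely typically keeps its elements in the global window.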

7 votes · 1 answer

Skipping header rows - is it possible with Cloud DataFlow?

I've created a Pipeline, which reads from a file in GCS, transforms it, and finally writes to a BQ table. The file contains a header row (fields). Is there any way to programmatically set the "number of header rows to skip" like you can do in BQ when…
Graham Polley
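
In the Python SDK this is a built-in option on the text source; a minimal sketch with a placeholder path and a naive CSV split:

    # Sketch: drop the header row while reading a CSV file from GCS.
    import apache_beam as beam
    from apache_beam.io import ReadFromText

    with beam.Pipeline() as p:
        rows = (p
                | ReadFromText("gs://my-bucket/input.csv", skip_header_lines=1)
                | beam.Map(lambda line: line.split(",")))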

6 votes · 1 answer

Dataflow job failing due to ZONE_RESOURCE_POOL_EXHAUSTED in europe-west3 region

My Dataflow job has been failing since 7 AM this morning with the error: Startup of the worker pool in zone europe-west3-c failed to bring up any of the desired 1 workers. ZONE_RESOURCE_POOL_EXHAUSTED: Instance '' creation failed: The zone…
marcoseu
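
ZONE_RESOURCE_POOL_EXHAUSTED is a Compute Engine capacity error: the chosen zone had no VMs of the requested type available at that moment. A common workaround is to pin the workers to a different zone, or to specify only the region and let the service pick a zone. A minimal sketch with placeholder values (worker_zone is the newer option name; older SDKs used zone):

    # Sketch: steer the workers to a specific zone, or omit worker_zone entirely
    # and let Dataflow choose one within the region. All values are placeholders.
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="europe-west3",
        worker_zone="europe-west3-a",
    )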

6 votes · 1 answer

Debugging a Google Dataflow Streaming Job that does not work as expected

I am following this tutorial on migrating data from an Oracle database to a Cloud SQL PostgreSQL instance. I am using the Google-provided streaming template Datastream to PostgreSQL. At a high level this is what is expected: Datastream exports in…

6 votes · 1 answer

What is a correct RestrictionT to use for Splittable DoFn reading an unbounded Iterable?

I am writing a Splittable DoFn to read a MongoDB change stream. It allows me to observe events describing changes to a collection, and I can start reading at an arbitrary cluster timestamp, provided the oplog has enough history. Cluster…
Patryk Koryzna

6 votes · 3 answers

Apache Beam - BigQuery streaming insert showing RuntimeException: ManagedChannel allocation site

I am running a streaming Apache Beam pipeline in Google Dataflow. It reads data from Kafka and streams inserts into BigQuery. But in the BigQuery streaming insert step it is throwing a large number of warnings: java.lang.RuntimeException:…

6 votes · 1 answer

How to publish to Pub/Sub from Dataflow in batch (efficiently)?

I want to publish messages to a Pub/Sub topic with some attributes from a Dataflow job in batch mode. My Dataflow pipeline is written with Python 3.8 and apache-beam 2.27.0. It works with the @Ankur solution here:…
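
One common workaround, sketched under the assumption that elements are (data_bytes, attributes_dict) pairs and that the project/topic names are placeholders: wrap the google-cloud-pubsub client in a plain DoFn and let the client's batch settings group the publishes, blocking at the end of each bundle.

    # Sketch: publish to Pub/Sub from a batch Dataflow pipeline via the client library.
    import apache_beam as beam

    class PublishToPubSub(beam.DoFn):
        def __init__(self, project, topic):
            self._project = project
            self._topic = topic

        def setup(self):
            from google.cloud import pubsub_v1
            self._client = pubsub_v1.PublisherClient(
                pubsub_v1.types.BatchSettings(max_messages=1000, max_latency=1))
            self._topic_path = self._client.topic_path(self._project, self._topic)

        def start_bundle(self):
            self._futures = []

        def process(self, element):
            data, attributes = element
            self._futures.append(
                self._client.publish(self._topic_path, data, **attributes))

        def finish_bundle(self):
            for future in self._futures:   # wait for the bundle's messages to be sent
                future.result()

Usage would be something like beam.ParDo(PublishToPubSub("my-project", "my-topic")) as the last step of the pipeline.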

6 votes · 1 answer

Experiencing slow streaming writes to BigQuery from Dataflow pipeline?

I experience unexpected performance issues when writing to BigQuery with streaming inserts and Python SDK 2.23. Without the write step the pipeline runs on one worker with ~20-30% CPU. Adding the BigQuery step, the pipeline scales up to 6 workers, all…
Philipp
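
For context rather than as an answer: in the Python SDK, WriteToBigQuery's streaming-insert path groups rows client-side and exposes that batch size, so it is one of the knobs worth checking when the write step dominates. The table, schema and values below are placeholders.

    # Sketch: explicit streaming inserts with a tuned client-side batch size.
    import apache_beam as beam

    write_step = beam.io.WriteToBigQuery(
        table="my-project:my_dataset.events",
        schema="id:STRING,ts:TIMESTAMP,event_type:STRING",
        method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
        batch_size=500,   # rows per insert request
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    )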

6 votes · 1 answer

Dataset was not found in location US

I'm testing a pipeline on a small set of data, and then suddenly my pipeline breaks down during one of the test runs with this message: Not found: Dataset thijs-dev:nlthijs_ba was not found in location US. Never have I run, deployed or used any US…
Thijs

6 votes · 1 answer

How to install a private repository on a Dataflow worker?

We're facing issues during Dataflow job deployment. We are using CustomCommands to install a private repo on the workers, but we now see an error in the worker-startup logs of our jobs: Running command: ['pip', 'install',…
Colin Le Nost
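
For reference, the "CustomCommands" the question mentions is the setup.py pattern from the Beam examples, where each worker runs extra pip commands at startup; a compressed sketch with a placeholder index URL and package name (reachability of the private index from the workers is the usual failure point):

    # setup.py sketch: run an extra pip install on each Dataflow worker at startup.
    import subprocess
    import setuptools
    from distutils.command.build import build as _build

    CUSTOM_COMMANDS = [
        ["pip", "install", "--extra-index-url",
         "https://pypi.example.com/simple/", "my-private-package"],
    ]

    class build(_build):
        sub_commands = _build.sub_commands + [("CustomCommands", None)]

    class CustomCommands(setuptools.Command):
        user_options = []

        def initialize_options(self):
            pass

        def finalize_options(self):
            pass

        def run(self):
            for command in CUSTOM_COMMANDS:
                subprocess.check_call(command)

    setuptools.setup(
        name="my-pipeline",
        version="0.0.1",
        packages=setuptools.find_packages(),
        cmdclass={"build": build, "CustomCommands": CustomCommands},
    )

The pipeline is then launched with --setup_file ./setup.py so Dataflow builds and installs the package on each worker.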