Questions tagged [google-cloud-dataflow]

Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.

Cloud Dataflow is part of the Google Cloud Platform.

5328 questions

7 votes · 1 answer

Custom DNS resolver for Google Cloud Dataflow pipeline

I am trying to access Kafka and 3rd-party services (e.g., InfluxDB) running in GKE, from a Dataflow pipeline. I have a DNS server for service discovery, also running in GKE. I also have a route in my network to access the GKE IP range from Dataflow…
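
Not from the question, but a minimal sketch of the usual prerequisite for this kind of setup: the Dataflow workers have to be launched in the same VPC network as the GKE cluster before any in-cluster service (or DNS server) is reachable at all. The project, region and network names below are placeholders, and this does not by itself change DNS resolution on the workers.

    # Sketch (Python SDK): place Dataflow workers in the VPC that can reach the GKE cluster.
    # All project/region/network values are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="europe-west1",
        temp_location="gs://my-bucket/tmp",
        network="my-vpc",                                         # VPC shared with GKE
        subnetwork="regions/europe-west1/subnetworks/my-subnet",  # must be in the worker region
    )

    with beam.Pipeline(options=options) as p:
        _ = p | beam.Create(["ping"]) | beam.Map(print)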

7 votes · 1 answer

GCP Dataflow: System Lag for streaming from Pub/Sub IO

We use "System Lag" to check the health of our Dataflow jobs. For example if we see an increase in system lag, we will try to see how to bring this metric down. There are few question regarding this metric. 1) What does system lag exactly…
user_1357

7 votes · 2 answers

Cancelling jobs without data loss on Dataflow

I'm trying to find a way to gracefully end my jobs, so as not to lose any data, streaming from PubSub and writing to BigQuery. A possible approach I can envision is to have the job stop pulling new data and then run until it has processed everything,…
MffnMn
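
For reference (not part of the question): Dataflow distinguishes cancelling a job, which stops it immediately and can drop in-flight data, from draining it, which stops pulling new Pub/Sub data and finishes processing what has already been read before shutting down. A drain can be requested from the console or the gcloud CLI; JOB_ID and the region below are placeholders.

    gcloud dataflow jobs drain JOB_ID --region=us-central1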

7 votes · 1 answer

How do I make sure my Dataflow pipeline scales?

We've often seen people write Dataflow pipelines that don't scale well. This is frustrating since Dataflow is meant to scale transparently, but there still are some antipatterns in Dataflow pipelines that make it difficult to scale. What are some…
Reuven Lax
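
One commonly cited antipattern, sketched here as a general illustration rather than an answer from the thread: a high fan-out step gets fused with the expensive work that follows it, so the expanded elements never get redistributed across workers. Inserting a Reshuffle breaks that fusion; the transforms below are placeholders.

    # Sketch: break fusion after a high fan-out step so Dataflow can rebalance the work.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | "Read" >> beam.Create(["a", "b", "c"])            # placeholder source
         | "FanOut" >> beam.FlatMap(lambda x: [x] * 10000)   # emits many elements per input
         | "BreakFusion" >> beam.Reshuffle()                  # redistributes elements across workers
         | "HeavyWork" >> beam.Map(lambda x: x.upper()))      # placeholder expensive step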

7 votes · 1 answer

Using a custom Dataflow unbounded source on DirectPipelineRunner

I'm writing a custom Dataflow unbounded data source that reads from Kafka 0.8. I'd like to run it locally using the DirectPipelineRunner. However, I'm getting the following stack trace: Exception in thread "main" java.lang.IllegalStateException: no…
Thomas

7 votes · 2 answers

Long lived state with Google Dataflow

Just trying to get my head around the programming model here. The scenario: I'm using Pub/Sub + Dataflow to instrument analytics for a web forum. I have a stream of data coming from Pub/Sub that looks like:

    ID | TS | EventType
    1  | 1  | Create
    1  | 2  | …
bfabry
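
The question predates it, but in current Beam the usual way to keep long-lived per-key state in a streaming pipeline is a stateful DoFn. A minimal sketch, assuming the Pub/Sub stream has already been turned into (entity_id, event_type) pairs; that element shape and the coder are assumptions, not from the question.

    # Sketch: per-key state that accumulates the event types seen for each ID.
    import apache_beam as beam
    from apache_beam.coders import StrUtf8Coder
    from apache_beam.transforms.userstate import BagStateSpec

    class TrackEntityEvents(beam.DoFn):
        EVENTS = BagStateSpec("events", StrUtf8Coder())

        def process(self, element, events=beam.DoFn.StateParam(EVENTS)):
            entity_id, event_type = element
            events.add(event_type)
            # Emit the full event history observed so far for this key.
            yield entity_id, list(events.read())

State is scoped per key and window, so a pipeline that wants state to live indefinitely typically keeps its elements in the global window.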

7 votes · 1 answer

Skipping header rows - is it possible with Cloud DataFlow?

I've created a Pipeline, which reads from a file in GCS, transforms it, and finally writes to a BQ table. The file contains a header row (fields). Is there any way to programmatically set the "number of header rows to skip" like you can do in BQ when…
Graham Polley
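
In the Python SDK this is a built-in option on the text source; a minimal sketch with a placeholder path and a naive CSV split:

    # Sketch: drop the header row while reading a CSV file from GCS.
    import apache_beam as beam
    from apache_beam.io import ReadFromText

    with beam.Pipeline() as p:
        rows = (p
                | ReadFromText("gs://my-bucket/input.csv", skip_header_lines=1)
                | beam.Map(lambda line: line.split(",")))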

6 votes · 1 answer

Dataflow job failing due to ZONE_RESOURCE_POOL_EXHAUSTED in europe-west3 region

My Dataflow job has been failing since 7 AM this morning with the error: Startup of the worker pool in zone europe-west3-c failed to bring up any of the desired 1 workers. ZONE_RESOURCE_POOL_EXHAUSTED: Instance '' creation failed: The zone…
marcoseu
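
ZONE_RESOURCE_POOL_EXHAUSTED is a Compute Engine capacity error: the chosen zone had no VMs of the requested type available at that moment. A common workaround is to pin the workers to a different zone, or to specify only the region and let the service pick a zone. A minimal sketch with placeholder values (worker_zone is the newer option name; older SDKs used zone):

    # Sketch: steer the workers to a specific zone, or omit worker_zone entirely
    # and let Dataflow choose one within the region. All values are placeholders.
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="europe-west3",
        worker_zone="europe-west3-a",
    )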

6 votes · 1 answer

Debugging a Google Dataflow Streaming Job that does not work as expected

I am following this tutorial on migrating data from an Oracle database to a Cloud SQL PostgreSQL instance. I am using the Google-provided streaming template Datastream to PostgreSQL. At a high level this is what is expected: Datastream exports in…

6 votes · 1 answer

What is a correct RestrictionT to use for Splittable DoFn reading an unbounded Iterable?

I am writing a Splittable DoFn to read a MongoDB change stream. It allows me to observe events describing changes to a collection, and I can start reading at an arbitrary cluster timestamp, provided the oplog has enough history. Cluster…
Patryk Koryzna

6 votes · 3 answers

Apache Beam - BigQuery streaming insert showing RuntimeException: ManagedChannel allocation site

I am running a streaming Apache Beam pipeline in Google Dataflow. It reads data from Kafka and streams inserts into BigQuery. But in the BigQuery streaming insert step it is throwing a large number of warnings: java.lang.RuntimeException:…

6 votes · 1 answer

How to publish to Pub/Sub from Dataflow in batch (efficiently)?

I want to publish messages to a Pub/Sub topic with some attributes from a Dataflow job in batch mode. My Dataflow pipeline is written with Python 3.8 and apache-beam 2.27.0. It works with the @Ankur solution here:…
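
One common workaround, sketched under the assumption that elements are (data_bytes, attributes_dict) pairs and that the project/topic names are placeholders: wrap the google-cloud-pubsub client in a plain DoFn and let the client's batch settings group the publishes, blocking at the end of each bundle.

    # Sketch: publish to Pub/Sub from a batch Dataflow pipeline via the client library.
    import apache_beam as beam

    class PublishToPubSub(beam.DoFn):
        def __init__(self, project, topic):
            self._project = project
            self._topic = topic

        def setup(self):
            from google.cloud import pubsub_v1
            self._client = pubsub_v1.PublisherClient(
                pubsub_v1.types.BatchSettings(max_messages=1000, max_latency=1))
            self._topic_path = self._client.topic_path(self._project, self._topic)

        def start_bundle(self):
            self._futures = []

        def process(self, element):
            data, attributes = element
            self._futures.append(
                self._client.publish(self._topic_path, data, **attributes))

        def finish_bundle(self):
            for future in self._futures:   # wait for the bundle's messages to be sent
                future.result()

Usage would be something like beam.ParDo(PublishToPubSub("my-project", "my-topic")) as the last step of the pipeline.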

6 votes · 1 answer

Experiencing slow streaming writes to BigQuery from Dataflow pipeline?

I experience unexpected performance issues when writing to BigQuery with streaming inserts and Python SDK 2.23. Without the write step the pipeline runs on one worker with ~20-30% CPU. Adding the BigQuery step, the pipeline scales up to 6 workers, all…
Philipp
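
For context rather than as an answer: in the Python SDK, WriteToBigQuery's streaming-insert path groups rows client-side and exposes that batch size, so it is one of the knobs worth checking when the write step dominates. The table, schema and values below are placeholders.

    # Sketch: explicit streaming inserts with a tuned client-side batch size.
    import apache_beam as beam

    write_step = beam.io.WriteToBigQuery(
        table="my-project:my_dataset.events",
        schema="id:STRING,ts:TIMESTAMP,event_type:STRING",
        method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
        batch_size=500,   # rows per insert request
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    )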

6 votes · 1 answer

Dataset was not found in location US

I'm testing a pipeline on a small set of data, and then suddenly my pipeline breaks down during one of the test runs with this message: Not found: Dataset thijs-dev:nlthijs_ba was not found in location US. Never have I run, deployed or used any US…
Thijs

6 votes · 1 answer

How to install a private repository on a Dataflow worker?

We're facing issues during Dataflow job deployment. We are using CustomCommands to install a private repo on the workers, but we now see an error in the worker-startup logs of our jobs: Running command: ['pip', 'install',…
Colin Le Nost
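
For reference, the "CustomCommands" the question mentions is the setup.py pattern from the Beam examples, where each worker runs extra pip commands at startup; a compressed sketch with a placeholder index URL and package name (reachability of the private index from the workers is the usual failure point):

    # setup.py sketch: run an extra pip install on each Dataflow worker at startup.
    import subprocess
    import setuptools
    from distutils.command.build import build as _build

    CUSTOM_COMMANDS = [
        ["pip", "install", "--extra-index-url",
         "https://pypi.example.com/simple/", "my-private-package"],
    ]

    class build(_build):
        sub_commands = _build.sub_commands + [("CustomCommands", None)]

    class CustomCommands(setuptools.Command):
        user_options = []

        def initialize_options(self):
            pass

        def finalize_options(self):
            pass

        def run(self):
            for command in CUSTOM_COMMANDS:
                subprocess.check_call(command)

    setuptools.setup(
        name="my-pipeline",
        version="0.0.1",
        packages=setuptools.find_packages(),
        cmdclass={"build": build, "CustomCommands": CustomCommands},
    )

The pipeline is then launched with --setup_file ./setup.py so Dataflow builds and installs the package on each worker.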