Questions tagged [google-cloud-dataflow]

Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes. Cloud Dataflow is part of the Google Cloud Platform.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.

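As a quick illustration of the programming model, here is a minimal word-count sketch written with the Beam Python SDK. It is only a sketch: it runs locally on the DirectRunner by default, and the input and output paths are placeholders.

    import apache_beam as beam

    # Runs on the local DirectRunner unless Dataflow options are supplied.
    with beam.Pipeline() as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("input.txt")        # placeholder path
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "PairWithOne" >> beam.Map(lambda word: (word, 1))
            | "CountPerWord" >> beam.CombinePerKey(sum)
            | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
            | "Write" >> beam.io.WriteToText("counts")           # placeholder prefix
        )

The same pipeline can be submitted to Dataflow by passing DataflowRunner, a project, a region, and a temp_location through the pipeline options.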

5328 questions
7 votes • 1 answer

Is there any way I can use preemptible instances for Dataflow jobs?

It's evident that preemptible instances are cheaper than non-preemptible instances. On a daily basis, 400-500 Dataflow jobs run in my organisation's project, of which some are time-sensitive and others are not. So is there any way I…
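Dataflow does not expose preemptible VMs as a direct worker setting, but one commonly suggested option for non-time-sensitive batch jobs is FlexRS (Flexible Resource Scheduling), which runs work on a mix of preemptible and regular VMs in exchange for delayed scheduling. A minimal sketch using the Python SDK's flexrs_goal pipeline option; the project and bucket are placeholders, and FlexRS applies to batch jobs only:

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",              # placeholder
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
        flexrs_goal="COST_OPTIMIZED",      # cost-optimized, delayed scheduling
    )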
7 votes • 1 answer

Kafka cluster loses or duplicates messages

While working to adapt Java's KafkaIOIT to work with a large dataset, I encountered a problem. I want to push 100M records through a Kafka topic, verify data correctness and, at the same time, check the performance of KafkaIO.Write and KafkaIO.Read. To…
7 votes • 2 answers

Problem specifying the network in Cloud Dataflow

I didn't configure the project, and I get this error whenever I run my job: 'The network default doesn't have rules that open TCP ports 1-65535 for internal connection with other VMs. Only rules with a target tag 'dataflow' or empty target tags set…
Rim • 1,735
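For reference, the network the Dataflow workers use can be set explicitly through pipeline options rather than relying on the default network; the VPC and subnetwork names below are placeholders, and whichever network is chosen still needs a firewall rule that allows the internal worker-to-worker TCP traffic the error message describes.

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
        network="my-vpc",                                        # placeholder VPC
        subnetwork="regions/us-central1/subnetworks/my-subnet",  # placeholder subnet
    )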
7 votes • 1 answer

How to make a generic Protobuf Parser DoFn in Python Beam?

Context: I'm working with a streaming pipeline which has a protobuf data source in Pub/Sub. I wish to parse this protobuf into a Python dict because the data sink requires the input to be a collection of dicts. I had developed a Protobuf Parser…
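One way to sketch such a generic parser in the Python SDK is to pass the generated message class into the DoFn and convert each message with google.protobuf.json_format. The my_pb2.Event name in the usage comment is hypothetical, and the generated protobuf module must be importable on the workers (for example by shipping it with a setup.py).

    import apache_beam as beam
    from google.protobuf import json_format

    class ParseProto(beam.DoFn):
        """Parses serialized protobuf bytes into a Python dict."""

        def __init__(self, message_class):
            self._message_class = message_class   # generated protobuf class

        def process(self, element):
            msg = self._message_class()
            msg.ParseFromString(element)          # element is the raw payload bytes
            yield json_format.MessageToDict(msg, preserving_proto_field_name=True)

    # Usage sketch (my_pb2.Event is a hypothetical generated class):
    #   dicts = payloads | beam.ParDo(ParseProto(my_pb2.Event))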
7 votes • 2 answers

Issues with Dynamic Destinations in Dataflow

I have a Dataflow job that reads data from Pub/Sub and, based on the time and filename, writes the contents to GCS, where the folder path is based on YYYY/MM/DD. This allows files to be generated in folders based on date, and uses Apache Beam's…
7 votes • 2 answers

How to create a Dataflow pipeline from Pub/Sub to GCS in Python

I want to use Dataflow to move data from Pub/Sub to GCS. So basically I want Dataflow to accumulate some messages for a fixed amount of time (15 minutes for example), then write that data as a text file to GCS when that amount of time has passed. My…
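A sketch of one common shape for this in the Python SDK, and only one of several possibilities: window the messages into fixed 15-minute windows, collect each window's messages with a GroupByKey, and write one file per window from a DoFn. The topic and bucket are placeholders, and the usual Dataflow options (runner, project, temp_location) are omitted.

    import apache_beam as beam
    from apache_beam import window
    from apache_beam.io import filesystems
    from apache_beam.options.pipeline_options import PipelineOptions

    class WriteBatchToGCS(beam.DoFn):
        def __init__(self, output_prefix):
            self._output_prefix = output_prefix

        def process(self, batch, window=beam.DoFn.WindowParam):
            # One file per 15-minute window, named after the window end time.
            path = f"{self._output_prefix}{window.end.to_utc_datetime():%Y%m%d-%H%M%S}.txt"
            f = filesystems.FileSystems.create(path)
            try:
                for message in batch:
                    f.write(message + b"\n")      # Pub/Sub payloads arrive as bytes
            finally:
                f.close()

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            | beam.io.ReadFromPubSub(topic="projects/my-project/topics/my-topic")
            | beam.WindowInto(window.FixedWindows(15 * 60))
            | beam.Map(lambda msg: (None, msg))   # single key per window
            | beam.GroupByKey()                   # fires when the window closes
            | beam.MapTuple(lambda _, msgs: list(msgs))
            | beam.ParDo(WriteBatchToGCS("gs://my-bucket/output/"))
        )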
7 votes • 3 answers

How to get Apache Beam for Dataflow on GCP with Python 3.x

I'm very new to GCP and Dataflow. However, I would like to start testing and deploying a few pipelines using Dataflow on GCP. According to the documentation and everything around Dataflow, it is imperative to use the Apache Beam project. Therefore and…
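For what it's worth, recent Apache Beam releases support Python 3, and the Dataflow runner is selected purely through pipeline options. A minimal sketch with placeholder project and bucket names; the install command is shown as a comment because it is a shell step, not Python:

    # Install the SDK with the GCP extras first:
    #   pip install "apache-beam[gcp]"
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",          # use "DirectRunner" to test locally
        project="my-project",             # placeholder
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )

    with beam.Pipeline(options=options) as p:
        (
            p
            | beam.Create(["hello", "dataflow"])
            | beam.Map(str.upper)
            | beam.io.WriteToText("gs://my-bucket/output/result")
        )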
7 votes • 2 answers

Apache Beam Coder for GenericRecord

I am building a pipeline that reads Avro generic records. To pass GenericRecord between stages, I need to register AvroCoder. The documentation says that if I use a generic record, the schema argument can be arbitrary:…
Nutel • 2,244
7 votes • 2 answers

Maven conflict in Java app with google-cloud-core-grpc dependency

(I've also raised a GitHub issue for this - https://github.com/googleapis/google-cloud-java/issues/4095) I have the latest versions of the following 2 dependencies for Apache Beam: Dependency 1 - google-cloud-dataflow-java-sdk-all (A…
7 votes • 1 answer

Image preprocessing with Dataflow

Task: I am to run an ETL job that will Extract TIFF images from GCS, Transform those images to text with a combination of open source computer vision tools such as OpenCV + Tesseract, and ultimately Load the data into BigQuery. Problem: I am trying…
Ryan Stack • 1,231
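A rough sketch of the shape such a pipeline often takes in the Python SDK, under the assumption that pytesseract and Pillow are installed on the workers via a custom setup.py; the bucket, table, and field names are placeholders.

    import apache_beam as beam
    from apache_beam.io import filesystems
    from apache_beam.options.pipeline_options import PipelineOptions

    def extract_text(gcs_path):
        # Import inside the function so the dependencies only need to exist
        # on the workers (shipped via --setup_file).
        import io
        import pytesseract
        from PIL import Image
        with filesystems.FileSystems.open(gcs_path) as f:
            image = Image.open(io.BytesIO(f.read()))
        return {"uri": gcs_path, "text": pytesseract.image_to_string(image)}

    options = PipelineOptions(setup_file="./setup.py")

    with beam.Pipeline(options=options) as p:
        (
            p
            | beam.Create(["gs://my-bucket/images/sample.tif"])   # placeholder input
            | beam.Map(extract_text)
            | beam.io.WriteToBigQuery(
                "my-project:my_dataset.ocr_results",              # placeholder table
                schema="uri:STRING,text:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )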
7 votes • 2 answers

Google DataFlow/Python: Import errors with save_main_session and custom modules in __main__

Could somebody please clarify the expected behavior when using save_main_session and custom modules imported in __main__? My Dataflow pipeline imports 2 non-standard modules - one via requirements.txt and the other via setup_file. Unless I move…
kpax • 621
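For context, these knobs all live in the pipeline options; a minimal sketch of how they are typically combined (the file paths are placeholders). save_main_session pickles the global state of __main__ so that names imported there are visible inside DoFns on the workers, while modules shipped through requirements_file or setup_file are installed on the workers and can simply be imported inside the functions that need them.

    from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

    options = PipelineOptions(
        requirements_file="requirements.txt",   # pip-installable deps for the workers
        setup_file="./setup.py",                # local package shipped to the workers
    )
    # Ship the global state of __main__ (its imports and globals) to the workers.
    options.view_as(SetupOptions).save_main_session = True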
7 votes • 1 answer

autoscaling in Google Cloud Dataflow

We have a streaming pipeline with autoscaling enabled. Generally, one worker is enough to process the incoming data, but we want to automatically increase the number of workers if there is a backlog. Our pipeline reads from Pub/Sub, and…
Chris Heath • 136
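For reference, streaming autoscaling on Dataflow is controlled through pipeline options; a minimal sketch with placeholder values, assuming the standard Python worker options:

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
        streaming=True,
        autoscaling_algorithm="THROUGHPUT_BASED",  # scale workers on throughput/backlog
        max_num_workers=10,                        # upper bound for scaling
    )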
7 votes • 1 answer

BigQueryIO.read().fromQuery performance slow

One of the things I've noticed is that the performance of BigQueryIO.read().fromQuery() is considerably slower than the performance of BigQueryIO.read().from() in Apache Beam. Why does this happen? And is there any way to improve it?
rish0097 • 1,024
7 votes • 3 answers

Dataflow/Apache Beam: limit input to the first X elements?

I have a bounded PCollection, but I only want to get the first X inputs and discard the rest. Is there a way to do this using Dataflow 2.x/Apache Beam?
shockawave123 • 699
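Since a PCollection is unordered, "the first X" usually becomes "any X elements", and the built-in Sample combiner is one way to get that. A minimal runnable sketch in the Python SDK (the Java SDK has an analogous Sample transform):

    import apache_beam as beam
    from apache_beam.transforms.combiners import Sample

    with beam.Pipeline() as p:
        (
            p
            | beam.Create(range(1000))
            # Yields a single list of 100 elements picked from the input.
            | Sample.FixedSizeGlobally(100)
            | beam.FlatMap(lambda sample: sample)   # back to individual elements
            | beam.Map(print)
        )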
7 votes • 1 answer

Controlling Dataflow/Apache Beam output sharding

We've found experimentally that setting an explicit number of output shards in Dataflow/Apache Beam pipelines results in much worse performance. Our evidence suggests that Dataflow secretly does another GroupBy at the end. We've moved to letting…
Josh Sacks • 93
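A small sketch of where this knob sits in the Python SDK (the Java TextIO writer has an analogous withNumShards setter); the output prefix is a placeholder. Leaving num_shards at 0 lets the runner pick the sharding, while forcing an explicit count makes the runner redistribute the data to produce exactly that many files, which is consistent with the extra GroupBy the question describes.

    import apache_beam as beam

    with beam.Pipeline() as p:
        (
            p
            | beam.Create(["a", "b", "c"])
            | beam.io.WriteToText(
                "/tmp/results",           # placeholder output prefix
                file_name_suffix=".txt",
                num_shards=0,             # 0 = let the runner decide the sharding
            )
        )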