Questions tagged [google-cloud-dataflow]

Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes. Cloud Dataflow is part of the Google Cloud Platform.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.

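As a quick illustration of the programming model, here is a minimal word-count sketch written with the Beam Python SDK. It is only a sketch: it runs locally on the DirectRunner by default, and the input and output paths are placeholders.

    import apache_beam as beam

    # Runs on the local DirectRunner unless Dataflow options are supplied.
    with beam.Pipeline() as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("input.txt")        # placeholder path
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "PairWithOne" >> beam.Map(lambda word: (word, 1))
            | "CountPerWord" >> beam.CombinePerKey(sum)
            | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
            | "Write" >> beam.io.WriteToText("counts")           # placeholder prefix
        )

The same pipeline can be submitted to Dataflow by passing DataflowRunner, a project, a region, and a temp_location through the pipeline options.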

5328 questions
7 votes • 1 answer

Is there any way I can use preemptible instances for Dataflow jobs?

It's evident that preemptible instances are cheaper than non-preemptible instances. On a daily basis, 400-500 Dataflow jobs run in my organisation's project, of which some are time-sensitive and others are not. So is there any way I…
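Dataflow does not expose preemptible VMs as a direct worker setting, but one commonly suggested option for non-time-sensitive batch jobs is FlexRS (Flexible Resource Scheduling), which runs work on a mix of preemptible and regular VMs in exchange for delayed scheduling. A minimal sketch using the Python SDK's flexrs_goal pipeline option; the project and bucket are placeholders, and FlexRS applies to batch jobs only:

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",              # placeholder
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
        flexrs_goal="COST_OPTIMIZED",      # cost-optimized, delayed scheduling
    )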
7 votes • 1 answer

Kafka cluster loses or duplicates messages

While working to adapt Java's KafkaIOIT to work with a large dataset, I encountered a problem. I want to push 100M records through a Kafka topic, verify data correctness and, at the same time, check the performance of KafkaIO.Write and KafkaIO.Read. To…
7 votes • 2 answers

Problem specifying the network in Cloud Dataflow

I didn't configure the project, and I get this error whenever I run my job: 'The network default doesn't have rules that open TCP ports 1-65535 for internal connection with other VMs. Only rules with a target tag 'dataflow' or empty target tags set…
Rim • 1,735
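For reference, the network the Dataflow workers use can be set explicitly through pipeline options rather than relying on the default network; the VPC and subnetwork names below are placeholders, and whichever network is chosen still needs a firewall rule that allows the internal worker-to-worker TCP traffic the error message describes.

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
        network="my-vpc",                                        # placeholder VPC
        subnetwork="regions/us-central1/subnetworks/my-subnet",  # placeholder subnet
    )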
7 votes • 1 answer

How to make a generic Protobuf Parser DoFn in Python Beam?

Context: I'm working with a streaming pipeline which has a protobuf data source in Pub/Sub. I wish to parse this protobuf into a Python dict because the data sink requires the input to be a collection of dicts. I had developed a Protobuf Parser…
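One way to sketch such a generic parser in the Python SDK is to pass the generated message class into the DoFn and convert each message with google.protobuf.json_format. The my_pb2.Event name in the usage comment is hypothetical, and the generated protobuf module must be importable on the workers (for example by shipping it with a setup.py).

    import apache_beam as beam
    from google.protobuf import json_format

    class ParseProto(beam.DoFn):
        """Parses serialized protobuf bytes into a Python dict."""

        def __init__(self, message_class):
            self._message_class = message_class   # generated protobuf class

        def process(self, element):
            msg = self._message_class()
            msg.ParseFromString(element)          # element is the raw payload bytes
            yield json_format.MessageToDict(msg, preserving_proto_field_name=True)

    # Usage sketch (my_pb2.Event is a hypothetical generated class):
    #   dicts = payloads | beam.ParDo(ParseProto(my_pb2.Event))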
7 votes • 2 answers

Issues with Dynamic Destinations in Dataflow

I have a Dataflow job that reads data from Pub/Sub and, based on the time and filename, writes the contents to GCS, where the folder path is based on YYYY/MM/DD. This allows files to be generated in folders based on date, and uses Apache Beam's…
7 votes • 2 answers

How to create a Dataflow pipeline from Pub/Sub to GCS in Python

I want to use Dataflow to move data from Pub/Sub to GCS. So basically I want Dataflow to accumulate some messages for a fixed amount of time (15 minutes for example), then write that data as a text file to GCS when that amount of time has passed. My…
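A sketch of one common shape for this in the Python SDK, and only one of several possibilities: window the messages into fixed 15-minute windows, collect each window's messages with a GroupByKey, and write one file per window from a DoFn. The topic and bucket are placeholders, and the usual Dataflow options (runner, project, temp_location) are omitted.

    import apache_beam as beam
    from apache_beam import window
    from apache_beam.io import filesystems
    from apache_beam.options.pipeline_options import PipelineOptions

    class WriteBatchToGCS(beam.DoFn):
        def __init__(self, output_prefix):
            self._output_prefix = output_prefix

        def process(self, batch, window=beam.DoFn.WindowParam):
            # One file per 15-minute window, named after the window end time.
            path = f"{self._output_prefix}{window.end.to_utc_datetime():%Y%m%d-%H%M%S}.txt"
            f = filesystems.FileSystems.create(path)
            try:
                for message in batch:
                    f.write(message + b"\n")      # Pub/Sub payloads arrive as bytes
            finally:
                f.close()

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            | beam.io.ReadFromPubSub(topic="projects/my-project/topics/my-topic")
            | beam.WindowInto(window.FixedWindows(15 * 60))
            | beam.Map(lambda msg: (None, msg))   # single key per window
            | beam.GroupByKey()                   # fires when the window closes
            | beam.MapTuple(lambda _, msgs: list(msgs))
            | beam.ParDo(WriteBatchToGCS("gs://my-bucket/output/"))
        )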
7 votes • 3 answers

How to get Apache Beam for Dataflow on GCP with Python 3.x

I'm very new to GCP and Dataflow. However, I would like to start testing and deploying a few pipelines using Dataflow on GCP. According to the documentation and everything around Dataflow, it is imperative to use the Apache Beam project. Therefore and…
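For what it's worth, recent Apache Beam releases support Python 3, and the Dataflow runner is selected purely through pipeline options. A minimal sketch with placeholder project and bucket names; the install command is shown as a comment because it is a shell step, not Python:

    # Install the SDK with the GCP extras first:
    #   pip install "apache-beam[gcp]"
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",          # use "DirectRunner" to test locally
        project="my-project",             # placeholder
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )

    with beam.Pipeline(options=options) as p:
        (
            p
            | beam.Create(["hello", "dataflow"])
            | beam.Map(str.upper)
            | beam.io.WriteToText("gs://my-bucket/output/result")
        )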
7 votes • 2 answers

Apache Beam Coder for GenericRecord

I am building a pipeline that reads Avro generic records. To pass GenericRecord between stages, I need to register AvroCoder. The documentation says that if I use a generic record, the schema argument can be arbitrary:…
Nutel • 2,244
7 votes • 2 answers

Maven conflict in Java app with google-cloud-core-grpc dependency

(I've also raised a GitHub issue for this - https://github.com/googleapis/google-cloud-java/issues/4095) I have the latest versions of the following 2 dependencies for Apache Beam: Dependency 1 - google-cloud-dataflow-java-sdk-all (A…
7 votes • 1 answer

Image preprocessing with Dataflow

Task: I am to run an ETL job that will Extract TIFF images from GCS, Transform those images to text with a combination of open source computer vision tools such as OpenCV + Tesseract, and ultimately Load the data into BigQuery. Problem: I am trying…
Ryan Stack • 1,231
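A rough sketch of the shape such a pipeline often takes in the Python SDK, under the assumption that pytesseract and Pillow are installed on the workers via a custom setup.py; the bucket, table, and field names are placeholders.

    import apache_beam as beam
    from apache_beam.io import filesystems
    from apache_beam.options.pipeline_options import PipelineOptions

    def extract_text(gcs_path):
        # Import inside the function so the dependencies only need to exist
        # on the workers (shipped via --setup_file).
        import io
        import pytesseract
        from PIL import Image
        with filesystems.FileSystems.open(gcs_path) as f:
            image = Image.open(io.BytesIO(f.read()))
        return {"uri": gcs_path, "text": pytesseract.image_to_string(image)}

    options = PipelineOptions(setup_file="./setup.py")

    with beam.Pipeline(options=options) as p:
        (
            p
            | beam.Create(["gs://my-bucket/images/sample.tif"])   # placeholder input
            | beam.Map(extract_text)
            | beam.io.WriteToBigQuery(
                "my-project:my_dataset.ocr_results",              # placeholder table
                schema="uri:STRING,text:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )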
7 votes • 2 answers

Google DataFlow/Python: Import errors with save_main_session and custom modules in __main__

Could somebody please clarify the expected behavior when using save_main_session and custom modules imported in __main__? My Dataflow pipeline imports 2 non-standard modules - one via requirements.txt and the other via setup_file. Unless I move…
kpax • 621
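For context, these knobs all live in the pipeline options; a minimal sketch of how they are typically combined (the file paths are placeholders). save_main_session pickles the global state of __main__ so that names imported there are visible inside DoFns on the workers, while modules shipped through requirements_file or setup_file are installed on the workers and can simply be imported inside the functions that need them.

    from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

    options = PipelineOptions(
        requirements_file="requirements.txt",   # pip-installable deps for the workers
        setup_file="./setup.py",                # local package shipped to the workers
    )
    # Ship the global state of __main__ (its imports and globals) to the workers.
    options.view_as(SetupOptions).save_main_session = True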
7 votes • 1 answer

autoscaling in Google Cloud Dataflow

We have a streaming pipeline with autoscaling enabled. Generally, one worker is enough to process the incoming data, but we want to automatically increase the number of workers if there is a backlog. Our pipeline reads from Pub/Sub, and…
Chris Heath • 136
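For reference, streaming autoscaling on Dataflow is controlled through pipeline options; a minimal sketch with placeholder values, assuming the standard Python worker options:

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
        streaming=True,
        autoscaling_algorithm="THROUGHPUT_BASED",  # scale workers on throughput/backlog
        max_num_workers=10,                        # upper bound for scaling
    )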
7 votes • 1 answer

BigQueryIO.read().fromQuery performance slow

One of the things I've noticed is that the performance of BigQueryIO.read().fromQuery() is considerably slower than the performance of BigQueryIO.read().from() in Apache Beam. Why does this happen? And is there any way to improve it?
rish0097 • 1,024
7 votes • 3 answers

Dataflow/Apache Beam: limit input to the first X elements?

I have a bounded PCollection, but I only want to get the first X inputs and discard the rest. Is there a way to do this using Dataflow 2.x/Apache Beam?
shockawave123 • 699
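Since a PCollection is unordered, "the first X" usually becomes "any X elements", and the built-in Sample combiner is one way to get that. A minimal runnable sketch in the Python SDK (the Java SDK has an analogous Sample transform):

    import apache_beam as beam
    from apache_beam.transforms.combiners import Sample

    with beam.Pipeline() as p:
        (
            p
            | beam.Create(range(1000))
            # Yields a single list of 100 elements picked from the input.
            | Sample.FixedSizeGlobally(100)
            | beam.FlatMap(lambda sample: sample)   # back to individual elements
            | beam.Map(print)
        )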
7 votes • 1 answer

Controlling Dataflow/Apache Beam output sharding

We've found experimentally that setting an explicit number of output shards in Dataflow/Apache Beam pipelines results in much worse performance. Our evidence suggests that Dataflow secretly does another GroupBy at the end. We've moved to letting…
Josh Sacks • 93
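A small sketch of where this knob sits in the Python SDK (the Java TextIO writer has an analogous withNumShards setter); the output prefix is a placeholder. Leaving num_shards at 0 lets the runner pick the sharding, while forcing an explicit count makes the runner redistribute the data to produce exactly that many files, which is consistent with the extra GroupBy the question describes.

    import apache_beam as beam

    with beam.Pipeline() as p:
        (
            p
            | beam.Create(["a", "b", "c"])
            | beam.io.WriteToText(
                "/tmp/results",           # placeholder output prefix
                file_name_suffix=".txt",
                num_shards=0,             # 0 = let the runner decide the sharding
            )
        )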