Questions tagged [google-cloud-dataflow]

Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.

Cloud Dataflow is part of the Google Cloud Platform.
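
For readers new to the tag, the sketch below shows roughly what a minimal Beam Python pipeline looks like; switching the runner from DirectRunner to DataflowRunner (plus project, region, and temp_location settings, all placeholders here) is what moves execution onto the managed service.

```python
# Minimal sketch of a Beam Python pipeline; runner and resource names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Use "DataflowRunner" (with --project, --region, --temp_location) to run on Dataflow.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Create" >> beam.Create(["alpha", "beta", "gamma"])
        | "Upper" >> beam.Map(str.upper)
        | "Print" >> beam.Map(print)
    )
```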

5328 questions
9 votes, 1 answer

Google Cloud Dataflow ETL (Datastore -> Transform -> BigQuery)

We have an application running on Google App Engine using Datastore as the persistence back-end. Currently the application has mostly 'OLTP' features and some rudimentary reporting. While implementing reports we found that processing large amounts of…
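
The shape this question describes is a classic batch export: read entities from Datastore, transform them, and write rows to BigQuery for reporting. A rough sketch under assumed names (the Order kind, project, dataset, and field names are all invented for illustration):

```python
# Hedged sketch of a Datastore -> Transform -> BigQuery batch pipeline.
import apache_beam as beam
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1new.types import Query

def entity_to_row(entity):
    # Flatten a Datastore entity into a BigQuery-friendly dict (fields are illustrative).
    props = entity.properties
    return {"order_id": str(entity.key.path_elements[-1]), "amount": props.get("amount")}

with beam.Pipeline() as p:
    (
        p
        | "ReadOrders" >> ReadFromDatastore(Query(kind="Order", project="my-project"))
        | "ToRow" >> beam.Map(entity_to_row)
        | "WriteBQ" >> beam.io.WriteToBigQuery(
            "my-project:reporting.orders",
            schema="order_id:STRING,amount:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```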
9 votes, 2 answers

Google Dataflow vs Apache Storm

Reading Google's Dataflow API, I have the impression that it is very similar to what Apache Storm does: real-time data processing through a pipelined flow. Unless I completely miss the point here, instead of building bridges on how to execute…
user1400995
8 votes, 1 answer

Apache Beam Cloud Dataflow Streaming Stuck Side Input

I'm currently building a PoC Apache Beam pipeline in GCP Dataflow. In this case, I want to create a streaming pipeline with its main input from PubSub and a side input from BigQuery, and store the processed data back to BigQuery. Side pipeline code: side_pipeline =…
fahmiduldul
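
For context, the shape being described (streaming main input from Pub/Sub, a side input built from BigQuery, results written back to BigQuery) looks roughly like the sketch below; topic, table, and field names are invented, and in a real streaming job the side input's windowing and triggering is exactly the part that tends to cause the "stuck" behaviour the question mentions.

```python
# Rough sketch only: Pub/Sub main input enriched with a BigQuery-backed side input.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    lookup = (
        p
        | "ReadLookup" >> beam.io.ReadFromBigQuery(
            query="SELECT id, label FROM `my-project.ds.lookup`", use_standard_sql=True)
        | "KeyById" >> beam.Map(lambda row: (row["id"], row["label"]))
    )

    (
        p
        | "ReadPubSub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
        | "Enrich" >> beam.Map(
            lambda msg, labels: {"raw": msg, "label": labels.get(msg, "unknown")},
            labels=beam.pvalue.AsDict(lookup))
        | "WriteBQ" >> beam.io.WriteToBigQuery(
            "my-project:ds.enriched", schema="raw:STRING,label:STRING")
    )
```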
8 votes, 1 answer

Go + Apache Beam GCP Dataflow: Could not find the sink for pubsub, Check that the sink library specifies alwayslink = 1

I am using the Go SDK with Apache Beam to build a simple Dataflow pipeline that will get data from a query and publish the data to pub/sub with the following code: package main import ( "context" "flag" …
8 votes, 4 answers

Including another file in Dataflow Python flex template, ImportError

Is there an example of a Python Dataflow Flex Template with more than one file where the script is importing other files included in the same folder? My project structure is like this: ├── pipeline │ ├── __init__.py │ ├── main.py │ ├──…
8 votes, 3 answers

No module named airflow.gcp - how to run a Dataflow job that uses python3/beam 2.15?

When I go to use operators/hooks like the BigQueryHook I see a message that these operators are deprecated and to use the airflow.gcp... operator version. However, when I try to use it in my DAG it fails and says no module named airflow.gcp. I have…
WIT
8 votes, 2 answers

Dataflow Template Cloud Pub/Sub Topic vs Subscription to BigQuery

I'm setting up a simple Proof of Concept to learn some of the concepts in Google Cloud, specifically PubSub and Dataflow. I have a PubSub topic greeting. I've created a simple cloud function that publishes a message to that topic: const…
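
The key distinction behind the topic-vs-subscription templates is visible directly in the Beam API: reading from a topic makes the runner create and manage its own subscription, while reading from a subscription consumes one you created yourself. A Python sketch with placeholder resource names:

```python
# Sketch showing the two ReadFromPubSub forms; resource names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    # Option A: give a topic and let the runner create a temporary subscription for the job.
    from_topic = p | "FromTopic" >> beam.io.ReadFromPubSub(
        topic="projects/my-project/topics/greeting")

    # Option B: consume an existing, user-managed subscription.
    from_sub = p | "FromSub" >> beam.io.ReadFromPubSub(
        subscription="projects/my-project/subscriptions/greeting-sub")
```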
8 votes, 1 answer

How to extract Google PubSub publish time in Apache Beam

My goal is to be able to access PubSub message Publish Time as recorded and set by Google PubSub in Apache Beam (Dataflow). PCollection pubsubMsg = pipeline.apply("Read Messages From PubSub", …
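
The original snippet is Java, but the same idea is easy to sketch in Python: by default ReadFromPubSub stamps each element with the Pub/Sub publish time, so a DoFn can read it back through DoFn.TimestampParam (the topic name below is a placeholder).

```python
# Hedged sketch: reading back the publish time that ReadFromPubSub assigns as the element timestamp.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class AddPublishTime(beam.DoFn):
    def process(self, element, timestamp=beam.DoFn.TimestampParam):
        # `timestamp` carries the Pub/Sub publish time by default.
        yield {"payload": element.decode("utf-8"),
               "publish_time": timestamp.to_utc_datetime().isoformat()}

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Stamp" >> beam.ParDo(AddPublishTime())
        | "Print" >> beam.Map(print)
    )
```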
8 votes, 3 answers

Apache Beam: DoFn.Setup equivalent in Python SDK

What is the recommended way to do expensive one-off initialization in a Beam Python DoFn? The Java SDK has DoFn.Setup, but there doesn't appear to be an equivalent in Beam Python. Is the best way currently to attach objects to threading.local() in…
Andreas Jansson
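
Worth noting that newer Beam Python releases did grow a direct counterpart: DoFn.setup() and DoFn.teardown(). A sketch of the pattern, where load_expensive_model() is a hypothetical stand-in for whatever one-off initialization is needed:

```python
# Sketch: per-instance initialization via DoFn.setup() in the Python SDK
# (available in recent Beam releases; older SDKs used start_bundle() or lazy init).
import apache_beam as beam

class ScoreFn(beam.DoFn):
    def setup(self):
        # Runs once per DoFn instance before any bundles: open clients, load models, etc.
        self._model = load_expensive_model()  # hypothetical helper

    def process(self, element):
        yield self._model.score(element)

    def teardown(self):
        # Release whatever setup() acquired.
        self._model = None
```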
8 votes, 2 answers

Throttling a step in beam application

I'm using Python Beam on Google Dataflow; my pipeline looks like this: Read image urls from file >> Download images >> Process images. The problem is that I can't let the Download images step scale as much as it needs to, because my application can get…
Xitrum
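
Beam has no built-in rate limiter, but one common workaround for exactly this shape is to cap the effective parallelism of the expensive step by fanning elements into a fixed number of keys and grouping before the heavy work. A sketch (the shard count, URLs, and download function are illustrative):

```python
# Sketch of limiting parallelism for an expensive step via a fixed number of shards.
import random
import apache_beam as beam

MAX_PARALLEL_DOWNLOADS = 10  # at most this many groups are processed at once

def download(url):
    # Placeholder for the real HTTP fetch.
    return url, b"..."

with beam.Pipeline() as p:
    (
        p
        | "Urls" >> beam.Create(["https://example.com/a.png", "https://example.com/b.png"])
        | "Shard" >> beam.Map(lambda url: (random.randint(0, MAX_PARALLEL_DOWNLOADS - 1), url))
        | "Group" >> beam.GroupByKey()
        | "Download" >> beam.FlatMap(lambda kv: [download(u) for u in kv[1]])
        | "Print" >> beam.Map(print)
    )
```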
8 votes, 1 answer

Dataflow/apache beam: manage custom module dependencies

I have a .py pipeline using Apache Beam that imports another module (.py), which is my custom module. I have a structure like this: ├── mymain.py └── myothermodule.py I import myothermodule.py in mymain.py like this: import myothermodule. When I run…
mee
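
A common way to ship a sibling module like myothermodule.py to the Dataflow workers is to package the project and point the pipeline at it with the --setup_file option; the sketch below assumes the setup.py sits next to mymain.py, and the package metadata is invented.

```python
# setup.py placed next to mymain.py (metadata values are illustrative).
import setuptools

setuptools.setup(
    name="mypipeline",
    version="0.1.0",
    py_modules=["myothermodule"],          # ship the sibling module to the workers
    packages=setuptools.find_packages(),   # plus any real packages in the project
)
```

The job is then launched with something like python mymain.py --runner DataflowRunner --setup_file ./setup.py so each worker installs the package before executing the pipeline.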
8 votes, 2 answers

Google Dataflow - Failed to import custom python modules

My Apache Beam pipeline implements custom Transforms and ParDo's in Python modules which in turn import other modules written by me. On the local runner this works fine, as all the files are available in the same path. In the case of the Dataflow runner,…
Karthik N
8 votes, 3 answers

Dataprep vs Dataflow vs Dataproc

To perform source data preparation, data transformation or data cleansing, in what scenario should we use Dataprep vs Dataflow vs Dataproc?
8 votes, 2 answers

Apache Beam - Unable to infer a Coder on a DoFn with multiple output tags

I am trying to execute a pipeline using Apache Beam but I get an error when trying to put some output tags: import com.google.cloud.Tuple; import com.google.gson.Gson; import com.google.gson.reflect.TypeToken; import…
Jac
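
The question itself is about the Java SDK's Coder inference, but for readers landing here from Python, the multi-output pattern it exercises looks like the sketch below (tag names and parsing logic are invented):

```python
# Sketch of a DoFn with a main output and an extra tagged output in the Python SDK.
import json
import apache_beam as beam
from apache_beam import pvalue

class SplitValid(beam.DoFn):
    def process(self, line):
        try:
            yield json.loads(line)                  # main output: parsed records
        except ValueError:
            yield pvalue.TaggedOutput("bad", line)  # tagged output: unparseable lines

with beam.Pipeline() as p:
    results = (
        p
        | "Lines" >> beam.Create(['{"id": 1}', "not json"])
        | "Split" >> beam.ParDo(SplitValid()).with_outputs("bad", main="good")
    )
    results.good | "PrintGood" >> beam.Map(print)
    results.bad | "PrintBad" >> beam.Map(print)
```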
8 votes, 3 answers

Using custom docker containers in Dataflow

From this link I found that Google Cloud Dataflow uses Docker containers for its workers: Image for Google Cloud Dataflow instances. I see it's possible to find out the image name of the Docker container. But is there a way I can get this docker…
Jonathan Sylvester
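
On newer SDKs the supported route for this is the custom-containers feature, where the worker image is set through a pipeline option (older releases used the experimental worker_harness_container_image flag instead). A sketch with placeholder project, bucket, and image names:

```python
# Sketch: pointing a Dataflow job at a custom worker container image.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",
    "--sdk_container_image=gcr.io/my-project/my-beam-worker:latest",  # custom image
])
```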