Questions tagged [google-cloud-dataflow]

Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes. Cloud Dataflow is part of the Google Cloud Platform.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.

5328 questions
8 votes, 2 answers

Missing object or bucket in path when running on Dataflow

When trying to run a pipeline on the Dataflow service, I specify the staging and temp buckets (in GCS) on the command line. When the program executes, I get a RuntimeException before my pipeline runs, where the root cause is that I'm missing…
Thomas Groh
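This error usually means one of the GCS paths names only a bucket with no object prefix. A minimal sketch of correctly formed locations, using the Beam Python SDK (the question itself concerns the Java SDK, and the project and bucket names here are placeholders):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical project and bucket. The key detail is that the staging and temp
# locations include an object prefix after the bucket ("gs://my-bucket/staging"),
# not just a bare bucket ("gs://my-bucket").
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    staging_location="gs://my-bucket/staging",
    temp_location="gs://my-bucket/temp",
)

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["a", "b", "c"])
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/result"))
```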
8 votes, 6 answers

Permissions error with Apache Beam example on Google Dataflow

I'm having trouble submitting an Apache Beam example from a local machine to our cloud platform. Using gcloud auth list I can see that the correct account is currently active. I can use gsutil and the web client to interact with the file system. I…
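Problems like this are often a mismatch between the account gcloud is using and the credentials the client libraries pick up when the job is submitted. Running gcloud auth application-default login, or pointing GOOGLE_APPLICATION_CREDENTIALS at a service-account key, is a common fix; a hedged sketch of the latter (the key path and names are placeholders):

```python
import os
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical service-account key; Application Default Credentials will use
# it when the job is submitted to Dataflow.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
)

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)
```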
8 votes, 3 answers

Gradle Support for GCP Dataflow Templates?

According to Google's Dataflow documentation, Dataflow job template creation is "currently limited to Java and Maven." However, the documentation for Java across GCP's Dataflow site is... messy, to say the least. The 1.x and 2.x versions of Dataflow…
KristinaTracer
8 votes, 1 answer

Steps to create Cloud Dataflow template using the Python SDK

I have created a pipeline in Python using the Apache Beam SDK, and the Dataflow jobs run perfectly from the command line. Now I'd like to run those jobs from the UI. For that I have to create a template file for my job. I found steps to create a template in Java…
Shilpa G
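With the Python SDK the pattern mirrors the Java one: declare runtime parameters as ValueProviders and run the pipeline with --template_location so it is staged rather than executed. A hedged sketch (argument names, bucket and paths are placeholders):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class MyTemplateOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # ValueProvider arguments are resolved when the template is launched,
        # not when it is created.
        parser.add_value_provider_argument("--input", type=str)
        parser.add_value_provider_argument("--output", type=str)

# Creating the template instead of running a job:
#   python my_pipeline.py --runner DataflowRunner --project my-project \
#     --temp_location gs://my-bucket/temp \
#     --template_location gs://my-bucket/templates/my_template
options = PipelineOptions()
opts = options.view_as(MyTemplateOptions)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText(opts.input)
     | "Write" >> beam.io.WriteToText(opts.output))
```

The staged template file can then be launched from the Dataflow UI, gcloud, or the REST API.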
8 votes, 2 answers

Stream BigQuery table into Google Pub/Sub

I have a Google BigQuery table and I want to stream the entire table into a Pub/Sub topic. What would be the easiest/fastest way to do it? Thank you in advance.
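One straightforward approach is a Dataflow batch job that reads the table and publishes each row as a JSON message via the Pub/Sub client library. A hedged sketch (table and topic names are placeholders; ReadFromBigQuery assumes a reasonably recent Beam release, older ones used beam.io.BigQuerySource):

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical table and topic.
TABLE = "my-project:my_dataset.my_table"
TOPIC = "projects/my-project/topics/my-topic"

class PublishRow(beam.DoFn):
    """Publishes each BigQuery row as a JSON-encoded Pub/Sub message."""
    def setup(self):
        from google.cloud import pubsub_v1
        self._publisher = pubsub_v1.PublisherClient()

    def process(self, row):
        # publish() returns a future; a production pipeline would batch and
        # wait on these before letting the bundle complete.
        self._publisher.publish(TOPIC, json.dumps(row, default=str).encode("utf-8"))

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "ReadTable" >> beam.io.ReadFromBigQuery(table=TABLE)
     | "Publish" >> beam.ParDo(PublishRow()))
```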
8 votes, 3 answers

Reading CSV header with Dataflow

I have a CSV file, and I don't know the column names ahead of time. I need to output the data in JSON after some transformations in Google Dataflow. What's the best way to take the header row and propagate the labels through all the rows? For…
Maximilian
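Since the column names are only known at runtime, one workable pattern is to read just the header line at pipeline-construction time and pass it into the transform that turns each row into JSON. A hedged sketch (paths are placeholders; it assumes a single input file whose header fits in the first chunk read):

```python
import csv
import json
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical paths.
INPUT = "gs://my-bucket/input/data.csv"
OUTPUT = "gs://my-bucket/output/data"

# Read only the header before building the pipeline (assumes the header line
# fits in the first 64 KiB of the file).
with FileSystems.open(INPUT) as f:
    first_line = f.read(64 * 1024).decode("utf-8").splitlines()[0]
header = next(csv.reader([first_line]))

def to_json(line, columns):
    values = next(csv.reader([line]))
    return json.dumps(dict(zip(columns, values)))

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "ReadRows" >> beam.io.ReadFromText(INPUT, skip_header_lines=1)
     | "ToJson" >> beam.Map(to_json, columns=header)
     | "Write" >> beam.io.WriteToText(OUTPUT, file_name_suffix=".json"))
```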
8 votes, 1 answer

Permissioning Dataflow to read a BigQuery table that is pointing to Drive?

BigQuery can read from Google Drive as a federated source. I want to read a BigQuery table that points to a Drive document into my Dataflow pipeline. Hooking up BigQuery to the file in Drive works perfectly fine. But…
Graham Polley
8 votes, 5 answers

Google Cloud Storage: Output path does not exist or is not writeable

I am trying to follow this simple Dataflow example from the Google Cloud site. I have successfully installed the Dataflow pipeline plugin and the gcloud SDK (as well as Python 2.7). I have also set up a project on Google Cloud and enabled billing and all…
8 votes, 3 answers

Access HTTP service running in GKE from Google Dataflow

I have an HTTP service running on a Google Container Engine cluster (behind a kubernetes service). My goal is to access that service from a Dataflow job running on the same GCP project using a fixed name (in the same way services can be reached from…
Thomas
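A Kubernetes service name is only resolvable inside the cluster, so the usual workaround is to expose the service on an internal load balancer and run the Dataflow workers in the same VPC via the network/subnetwork worker options, then call the internal address from a DoFn. A hedged sketch (the address, endpoint and network names are placeholders):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical internal load-balancer address of the GKE service.
SERVICE_URL = "http://10.128.0.42:8080/lookup"

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    network="my-vpc",  # same VPC as the GKE cluster
    subnetwork="regions/us-central1/subnetworks/my-subnet",
)

class CallService(beam.DoFn):
    def setup(self):
        import requests
        self._session = requests.Session()

    def process(self, element):
        resp = self._session.get(SERVICE_URL, params={"q": element}, timeout=30)
        yield resp.text

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(["a", "b"])
     | beam.ParDo(CallService())
     | beam.Map(print))
```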
8 votes, 1 answer

Complex join with google dataflow

I'm a newbie trying to understand how we might rewrite a batch ETL process in Google Dataflow. I've read some of the docs and run a few examples. I'm proposing that the new ETL process would be driven by business events (i.e. a source PCollection)…
Mike Smith
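Joins in Beam are typically expressed by keying each PCollection on the join key and applying CoGroupByKey, which groups matching elements from all inputs. A hedged sketch with toy in-memory data standing in for the real source PCollections:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    # Hypothetical sources; in a real pipeline these might come from
    # BigQuery, files or Pub/Sub.
    events = p | "Events" >> beam.Create([
        ("cust-1", {"event": "order", "amount": 40}),
        ("cust-2", {"event": "refund", "amount": 15}),
    ])
    customers = p | "Customers" >> beam.Create([
        ("cust-1", {"name": "Alice"}),
        ("cust-2", {"name": "Bob"}),
    ])

    joined = (
        {"events": events, "customers": customers}
        | "Join" >> beam.CoGroupByKey()
        # Each element is (key, {"events": [...], "customers": [...]}).
        | "Merge" >> beam.MapTuple(lambda key, groups: {
            "customer_id": key,
            "customers": list(groups["customers"]),
            "events": list(groups["events"]),
        })
    )
    joined | "Print" >> beam.Map(print)
```

If one side is small reference data, passing it as a side input instead of doing a CoGroupByKey is often simpler and cheaper.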
8 votes, 1 answer

Writing Output of a Dataflow Pipeline to a Partitioned Destination

We have a single streaming event source with thousands of events per second; these events are all marked with an ID identifying which of our tens of thousands of customers the event belongs to. We'd like to use this event source to populate a data…
Narek
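If the partitioned destination is BigQuery, the Python SDK's WriteToBigQuery accepts a callable for the table argument, so each element can route itself to a per-customer table without declaring thousands of sinks. A hedged sketch (project, dataset, subscription and the event layout are assumptions):

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical project/dataset; each event is assumed to carry a customer_id.
def table_for(row):
    return "my-project:events.customer_%s" % row["customer_id"]

def to_row(message_bytes):
    event = json.loads(message_bytes.decode("utf-8"))
    return {"customer_id": event["customer_id"], "payload": json.dumps(event)}

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(
           subscription="projects/my-project/subscriptions/events")
     | "Parse" >> beam.Map(to_row)
     | "Write" >> beam.io.WriteToBigQuery(
           table=table_for,
           schema="customer_id:STRING,payload:STRING",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```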
8 votes, 2 answers

Insert PubSub messages into BigQuery through Google Cloud Dataflow

I would like to insert Pub/Sub message data coming from a topic into a BigQuery table using Google Cloud Dataflow. Everything works great, but in the BigQuery table I can see unreadable strings like " ߈���". This is my…
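Unreadable values at the BigQuery end usually mean the raw Pub/Sub payload bytes were written straight into a STRING field (or the publisher sent compressed or binary data). A hedged sketch of decoding and parsing the payload before the write (topic, table and schema are placeholders):

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_message(message_bytes):
    # Pub/Sub delivers the payload as bytes; decode explicitly before
    # treating it as text, otherwise raw bytes end up in the STRING column.
    return json.loads(message_bytes.decode("utf-8"))

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(
           topic="projects/my-project/topics/my-topic")
     | "Parse" >> beam.Map(parse_message)
     | "Write" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.my_table",
           schema="field_a:STRING,field_b:INTEGER"))
```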
8 votes, 1 answer

Read files from a PCollection of GCS filenames in Pipeline?

I have a streaming pipeline hooked up to pub/sub that publishes filenames of GCS files. From there I want to read each file and parse out the events on each line (the events are what I ultimately want to process). Can I use TextIO? Can you use it in…
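Plain ReadFromText needs its file pattern at pipeline-construction time, but the Python SDK also provides ReadAllFromText, which takes a PCollection of file names as input and fits this streaming pattern. A hedged sketch (the subscription name is a placeholder):

```python
import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     # Hypothetical subscription whose messages are GCS paths, e.g. gs://bucket/file.txt
     | "FileNames" >> beam.io.ReadFromPubSub(
           subscription="projects/my-project/subscriptions/gcs-files")
     | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
     | "ReadLines" >> ReadAllFromText()
     | "Parse" >> beam.Map(print))  # replace print with per-line event parsing
```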
8 votes, 3 answers

How to run Google Cloud Dataflow job from App Engine?

After reading the Cloud Dataflow docs, I am still not sure how I can run my Dataflow job from App Engine. Is it possible? Does it matter whether my backend is written in Python or Java? Thanks!
deemson
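It is possible, and the App Engine language mostly does not matter: a common pattern is to stage the pipeline as a Dataflow template and have the App Engine handler launch it through the Dataflow REST API. A hedged Python sketch of that call (template path, job name and parameters are placeholders):

```python
# Sketch using google-api-python-client and Application Default Credentials.
from googleapiclient.discovery import build

def launch_dataflow_job():
    dataflow = build("dataflow", "v1b3")
    request = dataflow.projects().templates().launch(
        projectId="my-project",
        gcsPath="gs://my-bucket/templates/my_template",
        body={
            "jobName": "job-launched-from-appengine",
            "parameters": {"input": "gs://my-bucket/input/*.csv"},
            "environment": {"tempLocation": "gs://my-bucket/temp"},
        },
    )
    return request.execute()
```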
7 votes, 2 answers

Optimising GCP costs for a memory-intensive Dataflow Pipeline

We want to reduce the cost of running a specific Apache Beam pipeline (Python SDK) on GCP Dataflow. We have built a memory-intensive Apache Beam pipeline, which requires approximately 8.5 GB of RAM on each executor. A large machine…
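One common lever for a memory-bound pipeline is a custom machine type that provides the required RAM per vCPU without paying for the next standard machine size, set through the worker options. A hedged sketch (the custom shape and worker cap are illustrative, not a recommendation):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical values. A custom machine type is written custom-<vCPUs>-<memory MB>,
# here 2 vCPUs with 13 GB of RAM, so memory is not rounded up to a larger
# standard machine.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    machine_type="custom-2-13312",
    max_num_workers=20,
)

with beam.Pipeline(options=options) as p:
    p | beam.Create(range(10)) | beam.Map(lambda x: x * x) | beam.Map(print)
```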