Questions tagged [google-cloud-dataflow]

Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.

Cloud Dataflow is part of the Google Cloud Platform.

5328 questions
13 votes · 1 answer

Using SSH Key on Dataflow workers to pull private library

I'm setting up a Dataflow job, and for this job the workers need access to a private Bitbucket repository to install a library to process the data. In order to grant access to the Dataflow workers, I have set up a pair of SSH keys (public & private).…
Sven.DG

13 votes · 1 answer

How to import a CSV file into a BigQuery table without any column names or schema?

I'm currently writing a Java utility to import a few CSV files from GCS into BigQuery. I can easily achieve this with bq load, but I wanted to do it using a Dataflow job. So I'm using Dataflow's Pipeline and a ParDo transform (returns TableRow to apply…
Vijin Paulraj

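A minimal sketch of that Pipeline + ParDo approach in the Beam Java SDK; the bucket, table, and two-column STRING schema below are hypothetical placeholders, not details from the question:

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import java.util.Arrays;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    public class CsvToBigQuery {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // The CSV has no header row, so the schema is declared in code.
        TableSchema schema = new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("field1").setType("STRING"),
            new TableFieldSchema().setName("field2").setType("STRING")));

        p.apply(TextIO.read().from("gs://my-bucket/input/*.csv"))
            .apply(ParDo.of(new DoFn<String, TableRow>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                String[] parts = c.element().split(",", -1);  // -1 keeps trailing empty fields
                c.output(new TableRow()
                    .set("field1", parts[0])
                    .set("field2", parts[1]));
              }
            }))
            .apply(BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")
                .withSchema(schema)
                .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND));

        p.run();
      }
    }

The naive split(",") is only safe for CSVs without quoted fields; a real CSV parser is the safer choice otherwise.
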
13 votes · 1 answer

When does Dataflow acknowledge a message of batched items from PubSubIO?

There has been a question on this topic; the answer said "The acknowledgement will be made once the message is durably persisted somewhere in the Dataflow pipeline." Conceptually, that makes sense, but I am not sure how Dataflow is capable of…
M Song

13 votes · 1 answer

How to combine streaming data with large history data set in Dataflow/Beam

I am investigating processing logs from web user sessions via Google Dataflow/Apache Beam and need to combine the user's logs as they come in (streaming) with the history of a user's session from the last month. I have looked at the following…
Florian

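One pattern that often comes up for this (a sketch only, with hypothetical names, not necessarily the accepted answer): materialize the bounded history as a map-valued side input keyed by user, and look it up per streaming element:

    import java.util.Map;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionView;

    public class JoinWithHistory {
      // events: streaming logs keyed by user; history: bounded last-month sessions keyed by user.
      static PCollection<String> join(
          PCollection<KV<String, String>> events, PCollection<KV<String, String>> history) {
        // The bounded history becomes an in-memory map side input (it must fit on the workers).
        final PCollectionView<Map<String, String>> historyByUser = history.apply(View.asMap());
        return events.apply(ParDo.of(new DoFn<KV<String, String>, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            String past = c.sideInput(historyByUser).get(c.element().getKey());
            c.output(c.element().getValue() + "|" + (past == null ? "" : past));
          }
        }).withSideInputs(historyByUser));
      }
    }

View.asMap requires unique keys; View.asMultimap handles several history rows per user, and a CoGroupByKey join or an external store are the usual escalations when the history is too large for worker memory.
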
13 votes · 0 answers

How can I specify the number of workers for my Dataflow?

I have an Apache Beam pipeline that loads a large import file of around 90GB. I've written the pipeline in the Apache Beam Java SDK. Using the default settings for PipelineOptionsFactory, my job takes quite a while to complete. How can I control,…
Alex Harvey

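Assuming the Dataflow runner, worker counts are controlled through DataflowPipelineOptions (or the equivalent --numWorkers/--maxNumWorkers command-line flags); a minimal sketch:

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class WorkerCountExample {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
        options.setNumWorkers(10);     // workers the job starts with
        options.setMaxNumWorkers(50);  // ceiling for autoscaling
        Pipeline p = Pipeline.create(options);
        // ... build the pipeline here ...
        p.run();
      }
    }
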
12 votes · 3 answers

Dataflow Pipeline - "Processing stuck in step for at least

The Dataflow pipelines developed by my team suddenly started getting stuck and stopped processing our events. Their worker logs became full of warning messages saying that one specific step got stuck. The peculiar thing is that the steps that are…
Caio Riva

12 votes · 2 answers

How do I restart a cancelled Cloud Dataflow streaming job?

I've created a standard PubSub to BigQuery dataflow. However, in order to ensure I wasn't going to run up a huge bill while offline, I cancelled the dataflow. From the GCP console, there doesn't seem to be an option to restart it - is this…
Paul Michaels

12 votes · 1 answer

Apache Beam in Dataflow Large Side Input

This is most similar to this question. I am creating a pipeline in Dataflow 2.x that takes streaming input from a Pubsub queue. Every single message that comes in needs to be streamed through a very large dataset that comes from Google BigQuery and…
Taylor

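If the BigQuery data fits in worker memory, one sketch of the side-input variant looks like this (project, table, subscription, and field names are all placeholders):

    import com.google.api.services.bigquery.model.TableRow;
    import java.util.Map;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollectionView;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class LargeSideInput {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Bounded BigQuery read, turned into an in-memory map side input.
        PCollectionView<Map<String, String>> lookup = p
            .apply(BigQueryIO.readTableRows().from("my-project:my_dataset.lookup_table"))
            .apply(MapElements
                .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
                .via((TableRow r) -> KV.of((String) r.get("key"), (String) r.get("value"))))
            .apply(View.asMap());

        // Each streaming Pub/Sub message is matched against the side input.
        p.apply(PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-sub"))
            .apply(ParDo.of(new DoFn<String, String>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                String match = c.sideInput(lookup).get(c.element());
                if (match != null) {
                  c.output(c.element() + "," + match);
                }
              }
            }).withSideInputs(lookup));

        p.run();
      }
    }

The side input here is read once at job start; datasets too large for memory usually push the answer toward a CoGroupByKey join or an external key-value store instead.
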
12 votes · 1 answer

import apache_beam metaclass conflict

When I try to import Apache Beam I get the following error:

    >>> import apache_beam
    Traceback (most recent call last):
      File "", line 1, in
      File "/home/toor/pfff/local/lib/python2.7/site-packages/apache_beam/__init__.py", line 78,…

12 votes · 1 answer

Source Vs PTransform

I am new to the project, and I am trying to create a connector between Dataflow and a database. The documentation clearly states that I should use a Source and a Sink, but I see a lot of people directly using a PTransform associated with a PInput or…
pibafe

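For context, most Beam connectors expose a composite PTransform as their public entry point and keep any Source machinery internal; a toy sketch (MyDbIO and its placeholder read are hypothetical):

    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.values.PBegin;
    import org.apache.beam.sdk.values.PCollection;

    // Users call p.apply(new MyDbIO("SELECT ...")) and never see the internals,
    // which stay free to change from a Source to ParDo-based or splittable reads.
    public class MyDbIO extends PTransform<PBegin, PCollection<String>> {
      private final String query;

      public MyDbIO(String query) {
        this.query = query;
      }

      @Override
      public PCollection<String> expand(PBegin input) {
        // Placeholder read; a real connector would expand into its actual
        // read implementation here.
        return input.apply(Create.of("row-1 for " + query, "row-2 for " + query));
      }
    }
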
11 votes · 4 answers

What's the difference between "serverless" and "fully managed"?

According to Google Cloud documentation, Cloud Dataflow is serverless while Cloud Firestore is fully managed. If serverless means that the infrastructure and resources are managed by the cloud provider, then what's the difference between these two…

11 votes · 3 answers

Google Cloud Dataflow v/s Google Cloud Data Fusion

I recently saw that there is a new tool in GCP known as Data Fusion, and looking at it, it seems like an easier way of creating ETL pipelines as compared to Dataflow. So can we assume that it is a replacement for Dataflow?
rish0097

11 votes · 2 answers

Network default is not accessible to Dataflow Service account

Having issues starting a Dataflow job (2018-07-16_04_25_02-6605099454046602382) in a project without a local VPC network, where I get this error: "Workflow failed. Causes: Network default is not accessible to Dataflow Service account." There is a shared…
Brodin

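For reference, the worker network is a pipeline option on the Dataflow runner; a sketch of pointing workers at a Shared VPC subnetwork instead of the missing default network (all names and URLs are placeholders):

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class NetworkOptionsExample {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
        // options.setNetwork("my-network");  // alternative: a network in the same project
        // For Shared VPC, give the full subnetwork URL in the host project
        // (the Dataflow service account also needs access to it).
        options.setSubnetwork("https://www.googleapis.com/compute/v1/projects/"
            + "host-project/regions/us-central1/subnetworks/my-subnet");
      }
    }
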
11 votes · 5 answers

How to integrate Google Cloud SQL with Google Big Query

I am designing a solution in which Google Cloud SQL will be used to store all data from the regular functioning of the app (kind of OLTP data). The data is expected to grow over time to a pretty large size. The data itself is relational in nature and…
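If Dataflow ends up in that design, one hedged sketch of the copy step uses Beam's JdbcIO against the Cloud SQL MySQL socket factory and writes to BigQuery; the connection string, tables, and columns are placeholders, and the target table is assumed to already exist:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class CloudSqlToBigQuery {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(JdbcIO.<TableRow>read()
                .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                    "com.mysql.jdbc.Driver",
                    "jdbc:mysql://google/mydb?cloudSqlInstance=my-project:us-central1:my-instance"
                        + "&socketFactory=com.google.cloud.sql.mysql.SocketFactory"))
                .withQuery("SELECT id, name FROM customers")
                .withRowMapper((JdbcIO.RowMapper<TableRow>) rs ->
                    new TableRow().set("id", rs.getLong("id")).set("name", rs.getString("name")))
                .withCoder(TableRowJsonCoder.of()))
            .apply(BigQueryIO.writeTableRows()
                .to("my-project:analytics.customers")
                // Table assumed to exist; full refresh on every run.
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE));

        p.run();
      }
    }

Depending on requirements, scheduled exports or BigQuery's federated queries against Cloud SQL can replace the pipeline entirely.
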
11 votes · 1 answer

Apache Beam - Integration test with unbounded PCollection

We are building an integration test for an Apache Beam pipeline and are running into some issues. See below for context... Details about our pipeline:
  • We use PubsubIO as our data source (unbounded PCollection)
  • Intermediate transforms include a…
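One way to make such a test deterministic (a sketch, assuming the pipeline logic can be applied to any input PCollection) is to swap PubsubIO for TestStream, which replays elements and watermark advances under the DirectRunner; the transform under test here is a stand-in windowed count:

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.testing.PAssert;
    import org.apache.beam.sdk.testing.TestPipeline;
    import org.apache.beam.sdk.testing.TestStream;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TimestampedValue;
    import org.joda.time.Duration;
    import org.joda.time.Instant;
    import org.junit.Rule;
    import org.junit.Test;

    public class UnboundedPipelineTest {
      @Rule public final transient TestPipeline p = TestPipeline.create();

      @Test
      public void countsElementsPerWindow() {
        // Unbounded-style input with explicit timestamps and watermark control.
        TestStream<String> input = TestStream.create(StringUtf8Coder.of())
            .addElements(TimestampedValue.of("a", new Instant(0)),
                         TimestampedValue.of("b", new Instant(0)))
            .advanceWatermarkTo(new Instant(0).plus(Duration.standardMinutes(1)))
            .addElements(TimestampedValue.of("c",
                new Instant(0).plus(Duration.standardMinutes(2))))
            .advanceWatermarkToInfinity();

        PCollection<Long> counts = p.apply(input)
            .apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))))
            .apply(Count.globally().withoutDefaults());

        // All three elements land in the first five-minute window.
        PAssert.that(counts).containsInAnyOrder(3L);
        p.run().waitUntilFinish();
      }
    }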