Questions tagged [google-cloud-dataflow]

Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.

Cloud Dataflow is part of the Google Cloud Platform.

5328 questions
13 votes · 1 answer

Using SSH Key on Dataflow workers to pull private library

I'm setting up a Dataflow job, and for this job the workers need access to a private Bitbucket repository to install a library to process the data. In order to grant access to the Dataflow workers, I have set up a pair of SSH keys (public & private).…
Sven.DG

13 votes · 1 answer

How to import a CSV file into a BigQuery table without any column names or schema?

I'm currently writing a Java utility to import a few CSV files from GCS into BigQuery. I can easily achieve this with bq load, but I wanted to do it using a Dataflow job. So I'm using Dataflow's Pipeline and a ParDo transform (returns TableRow to apply…
Vijin Paulraj

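A minimal sketch of that Pipeline + ParDo approach in the Beam Java SDK; the bucket, table, and two-column STRING schema below are hypothetical placeholders, not details from the question:

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import java.util.Arrays;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    public class CsvToBigQuery {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // The CSV has no header row, so the schema is declared in code.
        TableSchema schema = new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("field1").setType("STRING"),
            new TableFieldSchema().setName("field2").setType("STRING")));

        p.apply(TextIO.read().from("gs://my-bucket/input/*.csv"))
            .apply(ParDo.of(new DoFn<String, TableRow>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                String[] parts = c.element().split(",", -1);  // -1 keeps trailing empty fields
                c.output(new TableRow()
                    .set("field1", parts[0])
                    .set("field2", parts[1]));
              }
            }))
            .apply(BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")
                .withSchema(schema)
                .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND));

        p.run();
      }
    }

The naive split(",") is only safe for CSVs without quoted fields; a real CSV parser is the safer choice otherwise.
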
13 votes · 1 answer

When does Dataflow acknowledge a message of batched items from PubSubIO?

There has been a question on this topic; the answer said "The acknowledgement will be made once the message is durably persisted somewhere in the Dataflow pipeline." Conceptually, that makes sense, but I am not sure how Dataflow is capable of…
M Song

13 votes · 1 answer

How to combine streaming data with large history data set in Dataflow/Beam

I am investigating processing logs from web user sessions via Google Dataflow/Apache Beam and need to combine the user's logs as they come in (streaming) with the history of a user's session from the last month. I have looked at the following…
Florian

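One pattern that often comes up for this (a sketch only, with hypothetical names, not necessarily the accepted answer): materialize the bounded history as a map-valued side input keyed by user, and look it up per streaming element:

    import java.util.Map;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionView;

    public class JoinWithHistory {
      // events: streaming logs keyed by user; history: bounded last-month sessions keyed by user.
      static PCollection<String> join(
          PCollection<KV<String, String>> events, PCollection<KV<String, String>> history) {
        // The bounded history becomes an in-memory map side input (it must fit on the workers).
        final PCollectionView<Map<String, String>> historyByUser = history.apply(View.asMap());
        return events.apply(ParDo.of(new DoFn<KV<String, String>, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            String past = c.sideInput(historyByUser).get(c.element().getKey());
            c.output(c.element().getValue() + "|" + (past == null ? "" : past));
          }
        }).withSideInputs(historyByUser));
      }
    }

View.asMap requires unique keys; View.asMultimap handles several history rows per user, and a CoGroupByKey join or an external store are the usual escalations when the history is too large for worker memory.
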
13 votes · 0 answers

How can I specify the number of workers for my Dataflow?

I have an Apache Beam pipeline that loads a large import file of around 90GB. I've written the pipeline in the Apache Beam Java SDK. Using the default settings for PipelineOptionsFactory, my job takes quite a while to complete. How can I control,…
Alex Harvey

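Assuming the Dataflow runner, worker counts are controlled through DataflowPipelineOptions (or the equivalent --numWorkers/--maxNumWorkers command-line flags); a minimal sketch:

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class WorkerCountExample {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
        options.setNumWorkers(10);     // workers the job starts with
        options.setMaxNumWorkers(50);  // ceiling for autoscaling
        Pipeline p = Pipeline.create(options);
        // ... build the pipeline here ...
        p.run();
      }
    }
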
12 votes · 3 answers

Dataflow Pipeline - "Processing stuck in step for at least

The Dataflow pipelines developed by my team suddenly started getting stuck and stopped processing our events. Their worker logs became full of warning messages saying that one specific step got stuck. The peculiar thing is that the steps that are…
Caio Riva

12 votes · 2 answers

How do I restart a cancelled Cloud Dataflow streaming job?

I've created a standard PubSub to BigQuery dataflow. However, in order to ensure I wasn't going to run up a huge bill while offline, I cancelled the dataflow. From the GCP console, there doesn't seem to be an option to restart it - is this…
Paul Michaels

12 votes · 1 answer

Apache Beam in Dataflow Large Side Input

This is most similar to this question. I am creating a pipeline in Dataflow 2.x that takes streaming input from a Pubsub queue. Every single message that comes in needs to be streamed through a very large dataset that comes from Google BigQuery and…
Taylor

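If the BigQuery data fits in worker memory, one sketch of the side-input variant looks like this (project, table, subscription, and field names are all placeholders):

    import com.google.api.services.bigquery.model.TableRow;
    import java.util.Map;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollectionView;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class LargeSideInput {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Bounded BigQuery read, turned into an in-memory map side input.
        PCollectionView<Map<String, String>> lookup = p
            .apply(BigQueryIO.readTableRows().from("my-project:my_dataset.lookup_table"))
            .apply(MapElements
                .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
                .via((TableRow r) -> KV.of((String) r.get("key"), (String) r.get("value"))))
            .apply(View.asMap());

        // Each streaming Pub/Sub message is matched against the side input.
        p.apply(PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-sub"))
            .apply(ParDo.of(new DoFn<String, String>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                String match = c.sideInput(lookup).get(c.element());
                if (match != null) {
                  c.output(c.element() + "," + match);
                }
              }
            }).withSideInputs(lookup));

        p.run();
      }
    }

The side input here is read once at job start; datasets too large for memory usually push the answer toward a CoGroupByKey join or an external key-value store instead.
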
12 votes · 1 answer

import apache_beam metaclass conflict

When I try to import Apache Beam I get the following error:

    >>> import apache_beam
    Traceback (most recent call last):
      File "", line 1, in
      File "/home/toor/pfff/local/lib/python2.7/site-packages/apache_beam/__init__.py", line 78,…

12 votes · 1 answer

Source Vs PTransform

I am new to the project, and I am trying to create a connector between Dataflow and a database. The documentation clearly states that I should use a Source and a Sink, but I see a lot of people directly using a PTransform associated with a PInput or…
pibafe

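For context, most Beam connectors expose a composite PTransform as their public entry point and keep any Source machinery internal; a toy sketch (MyDbIO and its placeholder read are hypothetical):

    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.values.PBegin;
    import org.apache.beam.sdk.values.PCollection;

    // Users call p.apply(new MyDbIO("SELECT ...")) and never see the internals,
    // which stay free to change from a Source to ParDo-based or splittable reads.
    public class MyDbIO extends PTransform<PBegin, PCollection<String>> {
      private final String query;

      public MyDbIO(String query) {
        this.query = query;
      }

      @Override
      public PCollection<String> expand(PBegin input) {
        // Placeholder read; a real connector would expand into its actual
        // read implementation here.
        return input.apply(Create.of("row-1 for " + query, "row-2 for " + query));
      }
    }
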
11 votes · 4 answers

What's the difference between "serverless" and "fully managed"?

According to Google Cloud documentation, Cloud Dataflow is serverless while Cloud Firestore is fully managed. If serverless means that the infrastructure and resources are managed by the cloud provider, then what's the difference between these two…

11 votes · 3 answers

Google Cloud Dataflow v/s Google Cloud Data Fusion

I recently saw that there is a new tool in GCP known as Data Fusion, and looking at it, it seems like an easier way of creating ETL pipelines as compared to Dataflow. So can we assume that it is a replacement for Dataflow?
rish0097

11 votes · 2 answers

Network default is not accessible to Dataflow Service account

Having issues starting a Dataflow job (2018-07-16_04_25_02-6605099454046602382) in a project without a local VPC network, where I get this error: "Workflow failed. Causes: Network default is not accessible to Dataflow Service account." There is a shared…
Brodin

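For reference, the worker network is a pipeline option on the Dataflow runner; a sketch of pointing workers at a Shared VPC subnetwork instead of the missing default network (all names and URLs are placeholders):

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class NetworkOptionsExample {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
        // options.setNetwork("my-network");  // alternative: a network in the same project
        // For Shared VPC, give the full subnetwork URL in the host project
        // (the Dataflow service account also needs access to it).
        options.setSubnetwork("https://www.googleapis.com/compute/v1/projects/"
            + "host-project/regions/us-central1/subnetworks/my-subnet");
      }
    }
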
11 votes · 5 answers

How to integrate Google Cloud SQL with Google Big Query

I am designing a solution in which Google Cloud SQL will be used to store all data from the regular functioning of the app (kind of OLTP data). The data is expected to grow over time to a pretty large size. The data itself is relational in nature and…
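If Dataflow ends up in that design, one hedged sketch of the copy step uses Beam's JdbcIO against the Cloud SQL MySQL socket factory and writes to BigQuery; the connection string, tables, and columns are placeholders, and the target table is assumed to already exist:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class CloudSqlToBigQuery {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(JdbcIO.<TableRow>read()
                .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                    "com.mysql.jdbc.Driver",
                    "jdbc:mysql://google/mydb?cloudSqlInstance=my-project:us-central1:my-instance"
                        + "&socketFactory=com.google.cloud.sql.mysql.SocketFactory"))
                .withQuery("SELECT id, name FROM customers")
                .withRowMapper((JdbcIO.RowMapper<TableRow>) rs ->
                    new TableRow().set("id", rs.getLong("id")).set("name", rs.getString("name")))
                .withCoder(TableRowJsonCoder.of()))
            .apply(BigQueryIO.writeTableRows()
                .to("my-project:analytics.customers")
                // Table assumed to exist; full refresh on every run.
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE));

        p.run();
      }
    }

Depending on requirements, scheduled exports or BigQuery's federated queries against Cloud SQL can replace the pipeline entirely.
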
11 votes · 1 answer

Apache Beam - Integration test with unbounded PCollection

We are building an integration test for an Apache Beam pipeline and are running into some issues. See below for context... Details about our pipeline:
  • We use PubsubIO as our data source (unbounded PCollection)
  • Intermediate transforms include a…
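One way to make such a test deterministic (a sketch, assuming the pipeline logic can be applied to any input PCollection) is to swap PubsubIO for TestStream, which replays elements and watermark advances under the DirectRunner; the transform under test here is a stand-in windowed count:

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.testing.PAssert;
    import org.apache.beam.sdk.testing.TestPipeline;
    import org.apache.beam.sdk.testing.TestStream;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TimestampedValue;
    import org.joda.time.Duration;
    import org.joda.time.Instant;
    import org.junit.Rule;
    import org.junit.Test;

    public class UnboundedPipelineTest {
      @Rule public final transient TestPipeline p = TestPipeline.create();

      @Test
      public void countsElementsPerWindow() {
        // Unbounded-style input with explicit timestamps and watermark control.
        TestStream<String> input = TestStream.create(StringUtf8Coder.of())
            .addElements(TimestampedValue.of("a", new Instant(0)),
                         TimestampedValue.of("b", new Instant(0)))
            .advanceWatermarkTo(new Instant(0).plus(Duration.standardMinutes(1)))
            .addElements(TimestampedValue.of("c",
                new Instant(0).plus(Duration.standardMinutes(2))))
            .advanceWatermarkToInfinity();

        PCollection<Long> counts = p.apply(input)
            .apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))))
            .apply(Count.globally().withoutDefaults());

        // All three elements land in the first five-minute window.
        PAssert.that(counts).containsInAnyOrder(3L);
        p.run().waitUntilFinish();
      }
    }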