Questions tagged [google-cloud-dataflow]

Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes. Cloud Dataflow is part of the Google Cloud Platform.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.

Some useful questions and answers to look at:

5328 questions
11 votes, 2 answers

Writing different values to different BigQuery tables in Apache Beam

Suppose I have a PCollection<Foo> and I want to write it to multiple BigQuery tables, choosing a potentially different table for each Foo. How can I do this using the Apache Beam BigQueryIO API?
jkff
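In the Python SDK, one common approach is to pass a callable (rather than a fixed table spec) as the `table` argument of `beam.io.WriteToBigQuery`; Beam invokes it per element, so each record can land in a different table. A minimal sketch of such a routing function, shown without Beam so it runs standalone (the `events` dataset and the `kind` field are hypothetical):

```python
# Hypothetical per-element routing function: returns a BigQuery table
# spec based on a field of the element.
def route_to_table(element, project="my-project"):
    """Return a fully qualified table spec chosen from the element's kind."""
    # `kind` is an assumed field on the element dict for this sketch.
    return f"{project}:events.{element['kind']}_events"

# In a real pipeline this would be used roughly as:
#   rows | beam.io.WriteToBigQuery(table=route_to_table, ...)
rows = [{"kind": "click", "user": "a"}, {"kind": "view", "user": "b"}]
tables = [route_to_table(r) for r in rows]
```

The same idea exists in the Java SDK via `BigQueryIO.write().to(...)` with a `DynamicDestinations` or a serializable function.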
11 votes, 4 answers

FTP to Google Storage

Some files get uploaded on a daily basis to an FTP server and I need those files under Google Cloud Storage. I don't want to bug the users that upload the files to install any additional software and just let them keep using their FTP client. Is…
11 votes, 2 answers

Partition data coming from CSV so I can process larger batches rather than individual lines

I am just getting started with Google Cloud Dataflow. I have written a simple flow that reads a CSV file from cloud storage. One of the steps involves calling a web service to enrich results. The web service in question performs much better when…
Jeffrey Ellin
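Beam has built-in transforms for exactly this (`BatchElements` in the Python SDK, `GroupIntoBatches` for keyed collections); the underlying idea is simply to buffer elements into fixed-size chunks before the expensive call. A dependency-free sketch of that buffering, assuming a batch size of 3:

```python
from itertools import islice

def batched(iterable, size):
    """Yield lists of up to `size` elements, like Beam's BatchElements does."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Each batch could then be sent to the enrichment web service in one call
# instead of one request per CSV line.
lines = [f"row-{i}" for i in range(7)]
batches = list(batched(lines, 3))
```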
11 votes, 2 answers

How to fix Dataflow unable to serialize my DoFn?

When I run my Dataflow pipeline, I get the exception below complaining that my DoFn can't be serialized. How do I fix this? Here's the stack trace: Caused by: java.lang.IllegalArgumentException: unable to serialize…
Jeremy
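The usual cause, in both the Java and Python SDKs, is that the DoFn holds a field the serializer cannot handle, such as a network client captured at construction time. The standard fix is to create such resources in a lifecycle method (`setup` in Python, `@Setup` in Java) rather than in the constructor, so they are built on the worker after deserialization. A Python-flavored sketch; `ExpensiveClient` is a hypothetical stand-in for any unserializable resource, and the class is kept plain so the example runs without Beam:

```python
import pickle

class ExpensiveClient:
    """Stand-in for an unpicklable resource (socket, DB handle, ...)."""
    pass

class EnrichFn:
    # In Beam this would extend apache_beam.DoFn.
    def __init__(self):
        # Keep only picklable configuration here; no live client.
        self.client = None

    def setup(self):
        # Beam calls setup() on the worker, after deserialization,
        # so the client never needs to be serialized.
        self.client = ExpensiveClient()

    def process(self, element):
        yield element

fn = EnrichFn()
blob = pickle.dumps(fn)  # succeeds: the instance carries no live client
fn.setup()               # client created lazily on the "worker"
```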
10 votes, 2 answers

Error with installing apache-beam[gcp] on mac zsh terminal - “zsh: no matches found: apache-beam[gcp]”

I am using zsh, and I have installed gcloud in order to interact with GCP via local terminal on my Mac. I am encountering this error “zsh: no matches found: apache-beam[gcp]”. However, when I run the command directly on the bash terminal on the GCP…
10 votes, 5 answers

Kotlin Iterable not supported in Apache Beam?

Apache beam seems to be refusing to recognise Kotlin's Iterable. Here is a sample code: @ProcessElement fun processElement( @Element input: KV>, receiver: OutputReceiver ) { val output = input.key + "|" +…
marcoseu
  • 3,892
  • 2
  • 16
  • 35
10 votes, 2 answers

Google Cloud Dataflow stuck with repeated error 'Error syncing pod...failed to "StartContainer" for "sdk" with CrashLoopBackOff'

SDK: Apache Beam SDK for Go 0.5.0. Our Golang job has been running fine on Google Cloud Dataflow for weeks. We haven't made any updates to the job itself and the SDK version seems to be the same as it has been. Last night it failed, and I'm not sure…
Tim
10 votes, 1 answer

Read Files from multiple folders in Apache Beam and map outputs to filenames

Working on reading files from multiple folders and then outputting the file contents along with the file name, like (filecontents, filename), to BigQuery in Apache Beam using the Python SDK and a Dataflow runner. Originally thought I could create a PCollection…
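In the Python SDK this is commonly done with `apache_beam.io.fileio.MatchFiles` and `ReadMatches`, whose `ReadableFile` elements carry the file's metadata (including its path) alongside its contents. The core pairing step, sketched without Beam using temporary files so it runs standalone:

```python
import glob
import os
import tempfile

# Create two sample files standing in for the input folders.
tmpdir = tempfile.mkdtemp()
for name, text in [("a.txt", "alpha"), ("b.txt", "beta")]:
    with open(os.path.join(tmpdir, name), "w") as f:
        f.write(text)

def read_with_name(path):
    """Return (contents, filename), the shape the question asks for."""
    with open(path) as f:
        return (f.read(), os.path.basename(path))

pairs = sorted(read_with_name(p)
               for p in glob.glob(os.path.join(tmpdir, "*.txt")))
# In Beam, roughly:
#   p | fileio.MatchFiles(pattern) | fileio.ReadMatches()
#     | beam.Map(lambda rf: (rf.read_utf8(), rf.metadata.path))
```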
10 votes, 3 answers

Start CloudSQL Proxy on Python Dataflow / Apache Beam

I am currently working on an ETL Dataflow job (using the Apache Beam Python SDK) which queries data from CloudSQL (with psycopg2 and a custom ParDo) and writes it to BigQuery. My goal is to create a Dataflow template which I can start from a…
10 votes, 1 answer

Windowing with Apache Beam - Fixed Windows Don't Seem to be Closing?

We are attempting to use fixed windows on an Apache Beam pipeline (using DirectRunner). Our flow is as follows: Pull data from pub/sub Deserialize JSON into Java object Window events w/ fixed windows of 5 seconds Using a custom CombineFn, combine…
Chris Staikos
10 votes, 2 answers

ClassNotFound exception when attempting to use DataflowRunner

I'm trying to launch a Dataflow job on GCP using Apache Beam 0.6.0. I am compiling an uber jar using the shade plugin because I cannot launch the job using "mvn exec:java". I'm including this dependency:
10 votes, 1 answer

What does object of type '_UnwindowedValues' has no len() mean?

I'm using Dataflow 0.5.5 Python. Ran into the following error in very simple code: print(len(row_list)) row_list is a list. Exactly the same code, same data and same pipeline runs perfectly fine on DirectRunner, but throws the following exception…
foxwendy
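On the Dataflow runner, the values produced by a GroupByKey arrive as a lazy iterable (`_UnwindowedValues`) rather than a list, so `len()` is not defined on them; DirectRunner happens to hand back a real list, which is why the same code only fails remotely. Materializing the iterable first restores `len()`, at the cost of holding it all in memory. A sketch with a minimal stand-in for the runner's lazy type:

```python
class _UnwindowedValues:
    """Minimal stand-in for Dataflow's lazy grouped-values iterable."""
    def __init__(self, values):
        self._values = values
    def __iter__(self):
        return iter(self._values)
    # Deliberately no __len__, just like on the Dataflow runner.

row_iterable = _UnwindowedValues([1, 2, 3])
row_list = list(row_iterable)  # materialize before taking len()
count = len(row_list)
```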
10 votes, 3 answers

How to convert csv into a dictionary in apache beam dataflow

I would like to read a csv file and write it to BigQuery using apache beam dataflow. In order to do this I need to present the data to BigQuery in the form of a dictionary. How can I transform the data using apache beam in order to do this? My input…
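The standard-library `csv` module already handles the parsing (including quoted fields), and a dict per row is exactly the shape `WriteToBigQuery` expects; a small per-line function can then be applied with `beam.Map`. A sketch with assumed column names:

```python
import csv
import io

COLUMNS = ["name", "age", "city"]  # assumed schema for this sketch

def csv_line_to_dict(line, columns=COLUMNS):
    """Parse one CSV line into a dict keyed by column name."""
    reader = csv.reader(io.StringIO(line))
    values = next(reader)
    return dict(zip(columns, values))

# In a pipeline this would be applied per element, roughly:
#   lines | beam.Map(csv_line_to_dict) | beam.io.WriteToBigQuery(...)
row = csv_line_to_dict("alice,30,london")
```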
10 votes, 6 answers

Creating/Writing to Partitioned BigQuery table via Google Cloud Dataflow

I wanted to take advantage of the new BigQuery functionality of time partitioned tables, but am unsure this is currently possible in the 1.6 version of the Dataflow SDK. Looking at the BigQuery JSON API, to create a day partitioned table one needs…
ptf
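Newer SDK versions can set the table's `timePartitioning` at creation time, but a widely used workaround in older SDKs is to write to a partition decorator, `table$YYYYMMDD`, computed per element. A sketch of the decorator helper (the dataset and table names are placeholders):

```python
from datetime import date

def partition_decorator(table_spec, day):
    """Append BigQuery's day-partition decorator ($YYYYMMDD) to a table spec."""
    return f"{table_spec}${day.strftime('%Y%m%d')}"

# Could be combined with a per-element table callable in WriteToBigQuery
# so each record is routed to its day's partition.
target = partition_decorator("mydataset.events", date(2016, 8, 1))
```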
10 votes, 1 answer

Architecture of complex Dataflow jobs

We are building rather complex Dataflow jobs that compute models from a streaming source. In particular, we have two models that share a bunch of metrics and that are computed off roughly the same data source. The jobs perform joins on slightly…
Thomas