Questions tagged [google-cloud-dataflow]

Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.

Cloud Dataflow is part of the Google Cloud Platform.

5328 questions
1 vote · 1 answer

How to set FileIO writeDynamic name with input fields?

I'm using Dataflow to load some CSVs to Google Cloud Storage, and I need to save some CSV files into different directories based on data values (like uuid, region, etc.). How can I do this? Currently I'm able to add the key (from KV) to the path but I…
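A minimal Python-SDK sketch of the dynamic-destination idea (the question itself targets FileIO.writeDynamic in the Java SDK): fileio.WriteToFiles takes a destination callable that routes each element, and the destination ends up in the file name via destination_prefix_naming. The bucket path, column layout, and sample rows below are made up.

```python
import apache_beam as beam
from apache_beam.io import fileio

# Each element is one pre-rendered CSV line; the second column is the region.
lines = ["id1,us,10", "id2,eu,20", "id3,eu,30"]

with beam.Pipeline() as p:
    (p
     | "CreateCsvLines" >> beam.Create(lines)
     | "WriteByRegion" >> fileio.WriteToFiles(
           path="gs://my-bucket/output",                 # hypothetical base output path
           destination=lambda line: line.split(",")[1],  # route each line by its region column
           sink=lambda dest: fileio.TextSink(),          # write elements as plain text lines
           file_naming=fileio.destination_prefix_naming(".csv")))
```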
1 vote · 1 answer

Dataflow Pipeline workers stall when passing extra arguments in PipelineOptions

I have a Dataflow job defined in Apache Beam that works fine normally but breaks when I attempt to include all of my custom command line options in the PipelineOptions that I pass to beam.Pipeline(options=pipeline_options). It fails after the graph…
Anthony Naddeo · 2,497 · 25 · 28
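For context, the usual way to carry custom flags in the Python SDK is to register them on a PipelineOptions subclass rather than injecting arbitrary keys into the options; a sketch, with hypothetical option names:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class MyJobOptions(PipelineOptions):
    """Custom flags registered here are parsed alongside the standard Beam/Dataflow flags."""
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument("--input_subscription", help="Pub/Sub subscription to read")
        parser.add_argument("--output_table", help="BigQuery table to write")

pipeline_options = PipelineOptions()              # parses sys.argv by default
my_options = pipeline_options.view_as(MyJobOptions)

with beam.Pipeline(options=pipeline_options) as p:
    # The custom values are read at graph-construction time.
    _ = (p | "Probe" >> beam.Create([my_options.input_subscription, my_options.output_table])
           | "Print" >> beam.Map(print))
```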
1 vote · 1 answer

Why is Apache Beam `DoFn.setup()` called more than once after worker startup?

I am currently experimenting with a streaming Dataflow pipeline (in Python). I read a stream of data which I'd like to write into a PostgreSQL Cloud SQL instance. To do so, I am looking for a proper place to create the database connection. As I am writing the…
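A sketch of the common pattern of opening the connection in DoFn.setup() and closing it in teardown(); setup() runs once per DoFn instance, and a worker process can host (and recreate) several instances, which is one reason it can fire more than once per worker. The DSN and table below are hypothetical.

```python
import apache_beam as beam

class WriteToPostgres(beam.DoFn):
    def __init__(self, dsn):
        self._dsn = dsn      # connection string, e.g. taken from pipeline options
        self._conn = None

    def setup(self):
        # Imported lazily so the dependency only needs to exist on the workers.
        import psycopg2
        self._conn = psycopg2.connect(self._dsn)

    def process(self, element):
        with self._conn.cursor() as cur:
            cur.execute("INSERT INTO events (payload) VALUES (%s)", (element,))
        self._conn.commit()
        yield element

    def teardown(self):
        if self._conn is not None:
            self._conn.close()
```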
1 vote · 0 answers

BrokenPipeError between Dataflow and BigQuery using Python Apache Beam

We have a pipeline that uses Apache Beam on Dataflow. We have the same configuration in multiple regions and it works perfectly, but in europe-west1 we have connection problems between Dataflow and BigQuery. Here is the exception we're…
1 vote · 1 answer

Pros and cons of Google Dataflow vs. Cloud Run for pulling data from an HTTP endpoint

This is a design-approach question where we are trying to pick the best option between Apache Beam / Google Dataflow and Cloud Run to pull data from HTTP endpoints (source) and push it downstream to Google BigQuery (sink). Traditionally we…
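For reference, the Beam side of such a design can be as small as a ParDo that calls the endpoint plus a BigQuery sink; this is only a sketch with a made-up endpoint and table, and no retry or paging logic:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class FetchEndpoint(beam.DoFn):
    def process(self, url):
        import requests                                   # resolved on the worker
        for record in requests.get(url, timeout=30).json():
            yield record                                  # assumes a JSON array of dicts

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "Urls" >> beam.Create(["https://api.example.com/items"])   # hypothetical endpoint
     | "Fetch" >> beam.ParDo(FetchEndpoint())
     | "ToBigQuery" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.items",                          # hypothetical table
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
```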
1 vote · 1 answer

Write to dynamic collection in MongoDB, with the Apache Beam SDK in Python

I have a question about the coll parameter in apache_beam.io.WriteToMongoDB: Pcoll | "Write to Mongo" >> apache_beam.io.WriteToMongoDB( uri='someUri', db='someDb', coll='someColl', batch_size=10 …
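WriteToMongoDB takes a single fixed coll, so one workaround sketch is to split the PCollection by the collection-deciding field and attach one write per collection; the field, collection names, and URI below are hypothetical:

```python
import apache_beam as beam

COLLECTIONS = ["orders", "users"]          # hypothetical target collections

with beam.Pipeline() as p:
    docs = p | "CreateDocs" >> beam.Create([
        {"kind": "orders", "id": 1},
        {"kind": "users", "id": 2},
    ])
    # Route each document to the partition matching its target collection.
    parts = docs | "SplitByKind" >> beam.Partition(
        lambda doc, n: COLLECTIONS.index(doc["kind"]), len(COLLECTIONS))
    for name, part in zip(COLLECTIONS, parts):
        _ = part | f"WriteTo_{name}" >> beam.io.WriteToMongoDB(
            uri="mongodb://localhost:27017", db="someDb", coll=name, batch_size=10)
```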
1 vote · 2 answers

GCS and Java - Concatenate dynamic bucket name and file name in TextIO

I want to write a file to a GCS bucket. The bucket path and file name are dynamically provided in two different pipeline options. How can I concatenate those in TextIO to write the file to the GCS bucket? I tried doing this but no…
pas · 109 · 1 · 13
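The question targets the Java SDK's TextIO; in the Python SDK the same idea reduces to reading both options and joining them when the write transform is built, as in this sketch with hypothetical option names:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class OutputOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument("--bucket_path", help="e.g. gs://my-bucket/exports")
        parser.add_argument("--file_name", help="e.g. report")

pipeline_options = PipelineOptions()
out = pipeline_options.view_as(OutputOptions)

with beam.Pipeline(options=pipeline_options) as p:
    (p
     | "Create" >> beam.Create(["a", "b", "c"])
     | "Write" >> beam.io.WriteToText(
           file_path_prefix=f"{out.bucket_path}/{out.file_name}",  # concatenated at graph-construction time
           file_name_suffix=".txt"))
```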
1 vote · 1 answer

Poor performance when Dataflow writes to Datastore?

Lately I have updated my Dataflow Apache Beam pipeline to the latest version; my pipeline writes a huge amount of data. Before the Apache Beam version update from 2.27 to 2.41 the pipeline took about 8 min to finish executing, while after the update it…
1 vote · 0 answers

Dataflow job created from a template failing with error: Workflow failed

I am trying to get started with Cloud Dataflow by running the simple WordCount template, but the job is failing without any error reason. In past iterations, I did see informative errors on Dataflow jobs, which I was able to fix, but now my jobs…
rg687 · 23 · 3
1 vote · 0 answers

How to properly parallelize a Dataflow job over 11M files stored on GCS using fileio.MatchFiles(file_pattern = 'gs://bucket/**/*')

I have a beginner question. I have ~11 million files stored in a GCS bucket with the following structure: yyyy/mm/dd. I have 6 years of data with ~1500 files per year. lines = (p | "ReadInputData" >>…
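One common sketch for spreading the reads of that many matched files across workers is to break fusion with a Reshuffle between matching and reading; the file pattern is taken from the question, and the line parsing is simplified:

```python
import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    lines = (p
        | "MatchFiles" >> fileio.MatchFiles("gs://bucket/**/*")
        | "Reshuffle" >> beam.Reshuffle()          # redistribute file metadata across workers
        | "ReadMatches" >> fileio.ReadMatches()
        | "ReadLines" >> beam.FlatMap(lambda f: f.read_utf8().splitlines()))
```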
1 vote · 1 answer

No module named 'sentry_sdk' in Python GCP Dataflow uploaded by Terraform

I am deploying a GCP Dataflow job with Terraform. My Terraform builds, deploys, and runs the Dataflow job: resource "null_resource" "create_env" { provisioner "local-exec" { command = "python3 -m venv venv && venv/bin/pip install wheel 'apache-beam[gcp]'…
jereczq22 · 29 · 5
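Packages outside apache-beam[gcp], such as sentry_sdk, have to be shipped to the Dataflow workers explicitly; a sketch of the usual options, with placeholder project, region, and bucket:

```python
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                     # placeholder project
    region="europe-west1",
    temp_location="gs://my-bucket/temp",      # placeholder bucket
)
# Either list sentry_sdk in a requirements file that is staged to the workers...
options.view_as(SetupOptions).requirements_file = "requirements.txt"
# ...or package the job with a setup.py that declares it:
# options.view_as(SetupOptions).setup_file = "./setup.py"
```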
1 vote · 1 answer

Apache Beam Streaming unable to write to BigQuery column-based partition

I'm currently building a streaming pipeline using the Java SDK and trying to write to a BigQuery partitioned table using the BigQueryIO write/writeTableRows. I explored a couple of patterns but none of them succeeded; a few of them are below. Using…
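The question uses the Java SDK's BigQueryIO; as a rough Python-SDK analogue, column-based partitioning can be requested through additional_bq_parameters so the sink creates the partitioned table. Table, schema, and field names below are made up:

```python
import apache_beam as beam

write_to_bq = beam.io.WriteToBigQuery(
    "my-project:my_dataset.events",                        # hypothetical table
    schema="event_id:STRING,event_ts:TIMESTAMP",
    additional_bq_parameters={
        "timePartitioning": {"type": "DAY", "field": "event_ts"}},
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
```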
1 vote · 2 answers

Dataflow streaming job processes the same element many times at the same time

Short description: Dataflow is processing the same input element many times, even at the same time in parallel (so this is not the built-in fail-retry mechanism of Dataflow, because the previous processing didn't fail). Long description: Pipeline gets pubsub…
Pav3k · 869 · 4 · 10
1 vote · 1 answer

GCP Data Ingestion Architecture

I am going to start working on GCP data ingestion to BigQuery from CSVs and data lakes, and I am looking for your advice on the technologies or architecture that I can use. I am new to GCP but I have a good understanding regarding Data…
CarlRoy · 11 · 1
1 vote · 1 answer

How does Dataflow authenticate the worker service account?

I created a service account in project A to use as the worker service account for Dataflow. I specify the worker service account in Dataflow's options. I've looked for a Dataflow option to specify service account keys for the worker service account,…
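For reference, the worker service account is passed by e-mail, not by key: Dataflow attaches it to the worker VMs, and code on the workers obtains tokens from the VM metadata server, which is why there is no option for a key file. A sketch, with placeholder account and project names:

```python
from apache_beam.options.pipeline_options import GoogleCloudOptions, PipelineOptions

options = PipelineOptions(runner="DataflowRunner",
                          project="project-b",            # placeholder project running the job
                          region="us-central1",
                          temp_location="gs://my-bucket/temp")
options.view_as(GoogleCloudOptions).service_account_email = (
    "dataflow-worker@project-a.iam.gserviceaccount.com")  # placeholder worker service account
```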