Questions tagged [google-cloud-dataflow]


Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.

Cloud Dataflow is part of the Google Cloud Platform.


5328 questions

6 votes • 4 answers

GCP Dataflow with Python: "AttributeError: Can't get attribute '_JsonSink' on module 'dataflow_worker.start'"

I am new to GCP Dataflow. I am trying to read text files (each a one-line JSON string) from GCP Cloud Storage, parse them as JSON, split them based on the values of a certain field, and output them to GCP Cloud Storage (as JSON-string text files). Here is my code. However, I…
han shih • 389 • 1 • 5 • 13
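
The AttributeError itself usually means the custom _JsonSink class only exists in the job's __main__ module, so the worker cannot unpickle it; passing --save_main_session (or moving the class into a module shipped with the job) is the common fix. The underlying task can also be done without a custom sink at all. A minimal sketch, assuming a hypothetical routing field 'type' and made-up bucket paths:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # Pass --runner, --project, --temp_location, etc. on the command line;
    # --save_main_session helps when classes/functions live only in __main__.
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        records = (
            p
            | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*.json')  # hypothetical path
            | 'Parse' >> beam.Map(json.loads)
        )
        # Route records on a hypothetical 'type' field instead of a custom sink.
        parts = records | 'Split' >> beam.Partition(
            lambda record, n: 0 if record.get('type') == 'a' else 1, 2)
        parts[0] | 'SerializeA' >> beam.Map(json.dumps) | 'WriteA' >> beam.io.WriteToText('gs://my-bucket/output/type_a')
        parts[1] | 'SerializeB' >> beam.Map(json.dumps) | 'WriteB' >> beam.io.WriteToText('gs://my-bucket/output/type_b')


if __name__ == '__main__':
    run()
```
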
6 votes • 1 answer

What actually manages watermarks in Beam?

Beam's big power comes from its advanced windowing capabilities, but it's also a bit confusing. Having seen some oddities in local tests (I use RabbitMQ for an input source) where messages were not always getting acked, and fixed windows that were…
drobert • 1,230 • 8 • 21
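
For context on questions like this: the watermark is maintained per step by the runner, driven by whatever estimate the unbounded source (here the RabbitMQ connector) reports; user code never sets it directly and only reacts to it through event-time windowing and triggers. A sketch of that interaction, using a hypothetical Pub/Sub subscription since that is the usual source on Dataflow:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import trigger

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        # The source reports a watermark estimate; the runner propagates it.
        | 'Read' >> beam.io.ReadFromPubSub(
            subscription='projects/my-project/subscriptions/my-sub')  # hypothetical
        | 'Window' >> beam.WindowInto(
            window.FixedWindows(60),
            # Fire when the watermark passes the end of the window, with
            # early (processing-time) and late (per-element) firings.
            trigger=trigger.AfterWatermark(
                early=trigger.AfterProcessingTime(10),
                late=trigger.AfterCount(1)),
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
            allowed_lateness=300)
        | 'KeyByNothing' >> beam.Map(lambda msg: (None, 1))
        | 'CountPerWindow' >> beam.CombinePerKey(sum)
        | 'Print' >> beam.Map(print)
    )
```
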
6 votes • 1 answer

How can I debug why my Dataflow job is stuck?

I have a Dataflow job that is not making progress - or it is making very slow progress, and I do not know why. How can I start looking into why the job is slow / stuck?
Pablo • 10,425 • 1 • 44 • 67
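
Besides the Dataflow UI (per-step wall time) and the worker logs in Cloud Logging, one cheap pipeline-side trick is to wrap the suspect DoFn's work with timing logs so slow or hanging elements become visible. A sketch; do_expensive_work is a hypothetical stand-in for the real processing:

```python
import logging
import time

import apache_beam as beam


def do_expensive_work(element):
    # Stand-in for the real per-element work that might be slow or stuck.
    return element


class TimedDoFn(beam.DoFn):
    """Logs how long each element takes, so stragglers show up in the
    Dataflow worker logs rather than just as a stalled step in the UI."""

    def process(self, element):
        start = time.time()
        logging.info('start %r', element)
        result = do_expensive_work(element)
        logging.info('done %r in %.1fs', element, time.time() - start)
        yield result


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    with beam.Pipeline() as p:
        p | 'Create' >> beam.Create([1, 2, 3]) | 'Timed' >> beam.ParDo(TimedDoFn())
```
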
6 votes • 2 answers

AttributeError: 'module' object has no attribute 'ensure_str'

I am trying to transfer data from one BigQuery table to another through Beam; however, the following error comes up: WARNING:root:Retry with exponential backoff: waiting for 4.12307941111 seconds before retrying get_query_location because we caught exception:…
zangw • 43,869 • 19 • 177 • 214
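
ensure_str was added in six 1.12, so this error usually indicates an older six on the workers (pinning six in the job's requirements file is the usual remedy); the copy itself is a short pipeline. A sketch using the newer ReadFromBigQuery transform, with made-up table names and the destination table assumed to exist:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        # Needs --temp_location pointing at a GCS bucket when run on Dataflow.
        | 'Read' >> beam.io.ReadFromBigQuery(
            table='my-project:source_dataset.source_table')  # hypothetical
        | 'Write' >> beam.io.WriteToBigQuery(
            'my-project:dest_dataset.dest_table',             # hypothetical, assumed to exist
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
    )
```
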
6 votes • 3 answers

Why can't a custom Python object be used with a ParDo Fn?

I'm new to using Apache Beam in Python with the Dataflow runner. I'm interested in creating a batch pipeline that publishes to Google Cloud Pub/Sub. I had tinkered with the Beam Python APIs and found a solution. However, during my explorations, I…
dekauliya • 1,303 • 2 • 15 • 26
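
The usual cause is that the custom class is not importable or picklable on the workers (for example, defined inside a function, or only in __main__ without --save_main_session), since Beam falls back to pickling unknown element types. A minimal working sketch with a module-level class:

```python
import apache_beam as beam


class Event(object):
    """Module-level value type: picklable, so Beam's fallback pickle coder
    can ship it between workers."""

    def __init__(self, user, score):
        self.user = user
        self.score = score


class ToEvent(beam.DoFn):
    def process(self, element):
        user, score = element
        yield Event(user, score)


with beam.Pipeline() as p:
    (
        p
        | 'Create' >> beam.Create([('alice', 3), ('bob', 5)])
        | 'MakeEvents' >> beam.ParDo(ToEvent())
        | 'Format' >> beam.Map(lambda e: '%s:%d' % (e.user, e.score))
        | 'Print' >> beam.Map(print)
    )
```
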
6 votes • 1 answer

How to send and filter structured logs from a Dataflow job

The goal is to store audit logging from different apps/jobs and be able to aggregate it by some IDs. We chose BigQuery for that purpose, so we need to get structured information from the logs into BigQuery. We successfully use apps…
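
The job-side half of this can be sketched: emit each audit record as one structured log line from a DoFn; a Cloud Logging sink (filtered on the Dataflow resource type and a marker string) can then export those entries to a BigQuery dataset. The field names and the AUDIT marker below are made up:

```python
import json
import logging

import apache_beam as beam


class AuditFn(beam.DoFn):
    def process(self, element):
        # One structured line per element. On Dataflow these lines land in
        # Cloud Logging (resource.type="dataflow_step"), from where a log
        # sink filtered on the "AUDIT" marker can route them into BigQuery.
        logging.info('AUDIT %s', json.dumps({'id': element['id'], 'action': 'processed'}))
        yield element


with beam.Pipeline() as p:
    (
        p
        | 'Create' >> beam.Create([{'id': 1}, {'id': 2}])
        | 'Audit' >> beam.ParDo(AuditFn())
    )
```
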
6 votes • 1 answer

Prevent fusion in Apache Beam / Dataflow streaming (Python) pipelines to remove a pipeline bottleneck

We are currently working on a streaming pipeline on Apache Beam with the DataflowRunner. We read messages from Pub/Sub, do some processing on them, and afterwards window them into sliding windows (currently the window size is 3 seconds and…
Sven.DG • 295 • 1 • 13
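
The usual fix is to force a materialization (a shuffle) between the windowing and the expensive step so the two are no longer fused into one stage; beam.Reshuffle() is the idiomatic way to do that in the Python SDK. A sketch of where it would sit, with a made-up subscription and a stand-in DoFn:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


class ExpensiveFn(beam.DoFn):
    """Stand-in for the slow step that is bottlenecked by fusion."""

    def process(self, element):
        yield element


options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | 'Read' >> beam.io.ReadFromPubSub(
            subscription='projects/my-project/subscriptions/my-sub')  # hypothetical
        | 'Window' >> beam.WindowInto(window.SlidingWindows(3, 1))
        # Reshuffle materializes the PCollection here, so the expensive ParDo
        # below is no longer fused to the upstream steps and can be scaled
        # and parallelized on its own.
        | 'BreakFusion' >> beam.Reshuffle()
        | 'Process' >> beam.ParDo(ExpensiveFn())
    )
```
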
6 votes • 2 answers

Exception Handling in Apache Beam pipelines using Python

I'm building a simple pipeline using Apache Beam in Python (on GCP Dataflow) to read from Pub/Sub and write to BigQuery, but I can't handle exceptions in the pipeline to create alternative flows. On a simple WriteToBigQuery example: output = json_output |…
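
A common pattern for this is a dead-letter output: catch the exception inside a DoFn and route the bad element to a tagged side output instead of letting the bundle fail, then write the side output somewhere for inspection (before WriteToBigQuery, for example). A small runnable sketch:

```python
import json
import logging

import apache_beam as beam
from apache_beam import pvalue


class ParseFn(beam.DoFn):
    """Routes bad records to a 'dead letter' side output instead of
    failing the whole pipeline."""

    def process(self, element):
        try:
            yield json.loads(element)
        except Exception as err:  # broad on purpose: anything bad goes aside
            logging.warning('Failed to parse %r: %s', element, err)
            yield pvalue.TaggedOutput('errors', element)


with beam.Pipeline() as p:
    results = (
        p
        | 'Create' >> beam.Create(['{"a": 1}', 'not json'])
        | 'Parse' >> beam.ParDo(ParseFn()).with_outputs('errors', main='parsed')
    )
    results.parsed | 'Good' >> beam.Map(print)
    results.errors | 'Bad' >> beam.Map(lambda e: print('dead letter:', e))
```
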
6 votes • 1 answer

How to solve Duplicate values exception when I create PCollectionView&lt;Map&lt;…,…&gt;&gt;

I'm setting up a slow-changing lookup Map in my Apache Beam pipeline. It continuously updates the lookup map. For each key in the lookup map, I retrieve the latest value in the global window with accumulating mode. But it always hits an exception:…
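
The question is about the Java SDK's map-valued PCollectionView, which rejects duplicate keys; the same situation in the Python SDK (AsDict) can be sketched by first collapsing each key to its latest value so the side input sees exactly one value per key:

```python
import apache_beam as beam
from apache_beam import pvalue
from apache_beam.transforms import combiners

with beam.Pipeline() as p:
    # Two values for key 'a'; a map-style side input would reject this.
    updates = p | 'Updates' >> beam.Create([('a', 1), ('a', 2), ('b', 3)])

    # Keep only the most recent value per key (by element timestamp) so the
    # dictionary view is well defined.
    latest = updates | 'LatestPerKey' >> combiners.Latest.PerKey()

    lookups = p | 'Keys' >> beam.Create(['a', 'b'])
    (
        lookups
        | 'Join' >> beam.Map(lambda k, d: (k, d.get(k)), d=pvalue.AsDict(latest))
        | 'Print' >> beam.Map(print)
    )
```
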
6 votes • 1 answer

Why do I need to shuffle my PCollection for it to autoscale on Cloud Dataflow?

Context: I am reading a file from Google Cloud Storage in Beam using a process that looks something like this: data = pipeline | beam.Create(['gs://my/file.pkl']) | beam.ParDo(LoadFileDoFn), where LoadFileDoFn loads the file and creates a Python list of…
bstr • 63 • 3
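
Everything fused to the single-element Create runs as one stage on one worker, so Dataflow has nothing to rebalance. Materializing the fanned-out PCollection (with beam.Reshuffle(), or the manual key/group/ungroup spelled out below) breaks that fusion. A sketch using the shape from the question, with the file-loading DoFn replaced by a stand-in:

```python
import random

import apache_beam as beam


class LoadFileDoFn(beam.DoFn):
    """Stand-in for the question's DoFn: fans one path out into many
    elements (the real one unpickles a list from GCS)."""

    def process(self, path):
        for i in range(1000):
            yield (path, i)


with beam.Pipeline() as p:
    (
        p
        | 'Create' >> beam.Create(['gs://my/file.pkl'])
        | 'FanOut' >> beam.ParDo(LoadFileDoFn())
        # Manual fusion break: key, group, ungroup. Without this (or
        # beam.Reshuffle()), the heavy step below stays fused to the
        # single-element Create and runs on one worker.
        | 'AddRandomKey' >> beam.Map(lambda x: (random.randint(0, 255), x))
        | 'Group' >> beam.GroupByKey()
        | 'Ungroup' >> beam.FlatMap(lambda kv: kv[1])
        | 'HeavyStep' >> beam.Map(lambda element: element)  # stand-in
    )
```
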
6 votes • 2 answers

Google Cloud Dataflow jobs failing with error 'Failed to retrieve staged files: failed to retrieve worker in 3 attempts: bad MD5...'

SDK: Apache Beam SDK for Go 0.5.0. We are running Apache Beam Go SDK jobs in Google Cloud Dataflow. They had been working fine until recently, when they intermittently stopped working (no changes were made to code or config). The error that occurs…
Tim • 2,667 • 4 • 32 • 39
6 votes • 2 answers

Google Dataflow "No filesystem found for scheme gs"

I'm trying to execute a Google Dataflow application, but it throws this exception: java.lang.IllegalArgumentException: No filesystem found for scheme gs at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:459) at…
6 votes • 2 answers

Writing to text files in Apache Beam / Dataflow Python streaming

I have a very basic Python Dataflow job that reads some data from Pub/Sub, applies a FixedWindow and writes to Google Cloud Storage. transformed = ... transformed | beam.io.WriteToText(known_args.output) The output is written to the location…
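
The classic WriteToText sink does not handle unbounded input; in newer Beam releases the window-aware fileio.WriteToFiles transform is the usual replacement for streaming writes to GCS. A sketch with a made-up subscription and bucket path:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | 'Read' >> beam.io.ReadFromPubSub(
            subscription='projects/my-project/subscriptions/my-sub')  # hypothetical
        | 'Decode' >> beam.Map(lambda b: b.decode('utf-8'))
        | 'Window' >> beam.WindowInto(window.FixedWindows(60))
        # Unlike WriteToText, fileio.WriteToFiles is window/pane aware, so it
        # can keep emitting files continuously in streaming mode.
        | 'Write' >> fileio.WriteToFiles(
            path='gs://my-bucket/output/',            # hypothetical
            sink=lambda dest: fileio.TextSink())
    )
```
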
6 votes • 2 answers

Write BigQuery results to GCS in CSV format using Apache Beam

I am pretty new to Apache Beam, and I am trying to write a pipeline to extract data from Google BigQuery and write it to GCS in CSV format using Python. Using beam.io.Read(beam.io.BigQuerySource()) I am able to read the data…
Hari • 111 • 1 • 9
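
A sketch of one way to do this with the current Python SDK: read with ReadFromBigQuery, format each row dict as a CSV line, and write with WriteToText using a .csv suffix. Table, bucket, and column names below are made up:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def to_csv_line(row):
    # Rows come back as dicts; order the (hypothetical) columns explicitly
    # so every line has the same layout.
    return ','.join(str(row[col]) for col in ('user_id', 'country', 'score'))


with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | 'Read' >> beam.io.ReadFromBigQuery(
            query='SELECT user_id, country, score FROM `my-project.my_dataset.my_table`',
            use_standard_sql=True)
        | 'ToCsv' >> beam.Map(to_csv_line)
        | 'Write' >> beam.io.WriteToText(
            'gs://my-bucket/export/results',
            file_name_suffix='.csv',
            header='user_id,country,score')
    )
```
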
6 votes • 0 answers

Pass JVM arguments to Google Cloud Dataflow

We are running our Apache Beam code on Google Cloud Dataflow, and we need to pass some JVM arguments to our program. We found links related to Execution Parameters, but nothing related to JVM arguments. How can we pass JVM arguments to Google Cloud…
SANN3 • 9,459 • 6 • 61 • 97