Questions tagged [google-cloud-dataflow]


Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.

Cloud Dataflow is part of the Google Cloud Platform.


5328 questions

6 votes • 4 answers

GCP Dataflow with Python: "AttributeError: Can't get attribute '_JsonSink' on module 'dataflow_worker.start'"

I am new to GCP Dataflow. I am trying to read text files (each a one-line JSON string) from GCP Cloud Storage, parse them as JSON, split them based on the values of a certain field, and output them to GCP Cloud Storage (as JSON-string text files). Here is my code. However, I…
han shih • 389 • 1 • 5 • 13
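
The AttributeError itself usually means the custom _JsonSink class only exists in the job's __main__ module, so the worker cannot unpickle it; passing --save_main_session (or moving the class into a module shipped with the job) is the common fix. The underlying task can also be done without a custom sink at all. A minimal sketch, assuming a hypothetical routing field 'type' and made-up bucket paths:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # Pass --runner, --project, --temp_location, etc. on the command line;
    # --save_main_session helps when classes/functions live only in __main__.
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        records = (
            p
            | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*.json')  # hypothetical path
            | 'Parse' >> beam.Map(json.loads)
        )
        # Route records on a hypothetical 'type' field instead of a custom sink.
        parts = records | 'Split' >> beam.Partition(
            lambda record, n: 0 if record.get('type') == 'a' else 1, 2)
        parts[0] | 'SerializeA' >> beam.Map(json.dumps) | 'WriteA' >> beam.io.WriteToText('gs://my-bucket/output/type_a')
        parts[1] | 'SerializeB' >> beam.Map(json.dumps) | 'WriteB' >> beam.io.WriteToText('gs://my-bucket/output/type_b')


if __name__ == '__main__':
    run()
```
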
6 votes • 1 answer

What actually manages watermarks in Beam?

Beam's big power comes from its advanced windowing capabilities, but it's also a bit confusing. Having seen some oddities in local tests (I use RabbitMQ for an input source) where messages were not always getting acked, and fixed windows that were…
drobert • 1,230 • 8 • 21
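
For context on questions like this: the watermark is maintained per step by the runner, driven by whatever estimate the unbounded source (here the RabbitMQ connector) reports; user code never sets it directly and only reacts to it through event-time windowing and triggers. A sketch of that interaction, using a hypothetical Pub/Sub subscription since that is the usual source on Dataflow:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import trigger

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        # The source reports a watermark estimate; the runner propagates it.
        | 'Read' >> beam.io.ReadFromPubSub(
            subscription='projects/my-project/subscriptions/my-sub')  # hypothetical
        | 'Window' >> beam.WindowInto(
            window.FixedWindows(60),
            # Fire when the watermark passes the end of the window, with
            # early (processing-time) and late (per-element) firings.
            trigger=trigger.AfterWatermark(
                early=trigger.AfterProcessingTime(10),
                late=trigger.AfterCount(1)),
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
            allowed_lateness=300)
        | 'KeyByNothing' >> beam.Map(lambda msg: (None, 1))
        | 'CountPerWindow' >> beam.CombinePerKey(sum)
        | 'Print' >> beam.Map(print)
    )
```
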
6 votes • 1 answer

How can I debug why my Dataflow job is stuck?

I have a Dataflow job that is not making progress - or it is making very slow progress, and I do not know why. How can I start looking into why the job is slow / stuck?
Pablo • 10,425 • 1 • 44 • 67
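
Besides the Dataflow UI (per-step wall time) and the worker logs in Cloud Logging, one cheap pipeline-side trick is to wrap the suspect DoFn's work with timing logs so slow or hanging elements become visible. A sketch; do_expensive_work is a hypothetical stand-in for the real processing:

```python
import logging
import time

import apache_beam as beam


def do_expensive_work(element):
    # Stand-in for the real per-element work that might be slow or stuck.
    return element


class TimedDoFn(beam.DoFn):
    """Logs how long each element takes, so stragglers show up in the
    Dataflow worker logs rather than just as a stalled step in the UI."""

    def process(self, element):
        start = time.time()
        logging.info('start %r', element)
        result = do_expensive_work(element)
        logging.info('done %r in %.1fs', element, time.time() - start)
        yield result


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    with beam.Pipeline() as p:
        p | 'Create' >> beam.Create([1, 2, 3]) | 'Timed' >> beam.ParDo(TimedDoFn())
```
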
6 votes • 2 answers

AttributeError: 'module' object has no attribute 'ensure_str'

I am trying to transfer data from one BigQuery table to another through Beam; however, the following error comes up: WARNING:root:Retry with exponential backoff: waiting for 4.12307941111 seconds before retrying get_query_location because we caught exception:…
zangw • 43,869 • 19 • 177 • 214
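
ensure_str was added in six 1.12, so this error usually indicates an older six on the workers (pinning six in the job's requirements file is the usual remedy); the copy itself is a short pipeline. A sketch using the newer ReadFromBigQuery transform, with made-up table names and the destination table assumed to exist:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        # Needs --temp_location pointing at a GCS bucket when run on Dataflow.
        | 'Read' >> beam.io.ReadFromBigQuery(
            table='my-project:source_dataset.source_table')  # hypothetical
        | 'Write' >> beam.io.WriteToBigQuery(
            'my-project:dest_dataset.dest_table',             # hypothetical, assumed to exist
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
    )
```
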
6 votes • 3 answers

Why can't a custom Python object be used with a ParDo Fn?

I'm new to using Apache Beam in Python with the Dataflow runner. I'm interested in creating a batch pipeline that publishes to Google Cloud Pub/Sub. I had tinkered with the Beam Python APIs and found a solution. However, during my explorations, I…
dekauliya • 1,303 • 2 • 15 • 26
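
The usual cause is that the custom class is not importable or picklable on the workers (for example, defined inside a function, or only in __main__ without --save_main_session), since Beam falls back to pickling unknown element types. A minimal working sketch with a module-level class:

```python
import apache_beam as beam


class Event(object):
    """Module-level value type: picklable, so Beam's fallback pickle coder
    can ship it between workers."""

    def __init__(self, user, score):
        self.user = user
        self.score = score


class ToEvent(beam.DoFn):
    def process(self, element):
        user, score = element
        yield Event(user, score)


with beam.Pipeline() as p:
    (
        p
        | 'Create' >> beam.Create([('alice', 3), ('bob', 5)])
        | 'MakeEvents' >> beam.ParDo(ToEvent())
        | 'Format' >> beam.Map(lambda e: '%s:%d' % (e.user, e.score))
        | 'Print' >> beam.Map(print)
    )
```
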
6 votes • 1 answer

How to send and filter structured logs from a Dataflow job

The goal is to store audit logging from different apps/jobs and be able to aggregate it by some IDs. We chose BigQuery for that purpose, so we need to get structured information from the logs into BigQuery. We successfully use apps…
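
The job-side half of this can be sketched: emit each audit record as one structured log line from a DoFn; a Cloud Logging sink (filtered on the Dataflow resource type and a marker string) can then export those entries to a BigQuery dataset. The field names and the AUDIT marker below are made up:

```python
import json
import logging

import apache_beam as beam


class AuditFn(beam.DoFn):
    def process(self, element):
        # One structured line per element. On Dataflow these lines land in
        # Cloud Logging (resource.type="dataflow_step"), from where a log
        # sink filtered on the "AUDIT" marker can route them into BigQuery.
        logging.info('AUDIT %s', json.dumps({'id': element['id'], 'action': 'processed'}))
        yield element


with beam.Pipeline() as p:
    (
        p
        | 'Create' >> beam.Create([{'id': 1}, {'id': 2}])
        | 'Audit' >> beam.ParDo(AuditFn())
    )
```
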
6 votes • 1 answer

Prevent fusion in Apache Beam / Dataflow streaming (Python) pipelines to remove a pipeline bottleneck

We are currently working on a streaming pipeline on Apache Beam with the DataflowRunner. We read messages from Pub/Sub, do some processing on them, and afterwards window them into sliding windows (currently the window size is 3 seconds and…
Sven.DG • 295 • 1 • 13
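
The usual fix is to force a materialization (a shuffle) between the windowing and the expensive step so the two are no longer fused into one stage; beam.Reshuffle() is the idiomatic way to do that in the Python SDK. A sketch of where it would sit, with a made-up subscription and a stand-in DoFn:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


class ExpensiveFn(beam.DoFn):
    """Stand-in for the slow step that is bottlenecked by fusion."""

    def process(self, element):
        yield element


options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | 'Read' >> beam.io.ReadFromPubSub(
            subscription='projects/my-project/subscriptions/my-sub')  # hypothetical
        | 'Window' >> beam.WindowInto(window.SlidingWindows(3, 1))
        # Reshuffle materializes the PCollection here, so the expensive ParDo
        # below is no longer fused to the upstream steps and can be scaled
        # and parallelized on its own.
        | 'BreakFusion' >> beam.Reshuffle()
        | 'Process' >> beam.ParDo(ExpensiveFn())
    )
```
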
6 votes • 2 answers

Exception Handling in Apache Beam pipelines using Python

I'm building a simple pipeline using Apache Beam in Python (on GCP Dataflow) to read from Pub/Sub and write to BigQuery, but I can't handle exceptions in the pipeline to create alternative flows. On a simple WriteToBigQuery example: output = json_output |…
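
A common pattern for this is a dead-letter output: catch the exception inside a DoFn and route the bad element to a tagged side output instead of letting the bundle fail, then write the side output somewhere for inspection (before WriteToBigQuery, for example). A small runnable sketch:

```python
import json
import logging

import apache_beam as beam
from apache_beam import pvalue


class ParseFn(beam.DoFn):
    """Routes bad records to a 'dead letter' side output instead of
    failing the whole pipeline."""

    def process(self, element):
        try:
            yield json.loads(element)
        except Exception as err:  # broad on purpose: anything bad goes aside
            logging.warning('Failed to parse %r: %s', element, err)
            yield pvalue.TaggedOutput('errors', element)


with beam.Pipeline() as p:
    results = (
        p
        | 'Create' >> beam.Create(['{"a": 1}', 'not json'])
        | 'Parse' >> beam.ParDo(ParseFn()).with_outputs('errors', main='parsed')
    )
    results.parsed | 'Good' >> beam.Map(print)
    results.errors | 'Bad' >> beam.Map(lambda e: print('dead letter:', e))
```
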
6 votes • 1 answer

How to solve Duplicate values exception when I create PCollectionView&lt;Map&lt;…,…&gt;&gt;

I'm setting up a slow-changing lookup Map in my Apache Beam pipeline. It continuously updates the lookup map. For each key in the lookup map, I retrieve the latest value in the global window with accumulating mode. But it always hits an exception:…
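
The question is about the Java SDK's map-valued PCollectionView, which rejects duplicate keys; the same situation in the Python SDK (AsDict) can be sketched by first collapsing each key to its latest value so the side input sees exactly one value per key:

```python
import apache_beam as beam
from apache_beam import pvalue
from apache_beam.transforms import combiners

with beam.Pipeline() as p:
    # Two values for key 'a'; a map-style side input would reject this.
    updates = p | 'Updates' >> beam.Create([('a', 1), ('a', 2), ('b', 3)])

    # Keep only the most recent value per key (by element timestamp) so the
    # dictionary view is well defined.
    latest = updates | 'LatestPerKey' >> combiners.Latest.PerKey()

    lookups = p | 'Keys' >> beam.Create(['a', 'b'])
    (
        lookups
        | 'Join' >> beam.Map(lambda k, d: (k, d.get(k)), d=pvalue.AsDict(latest))
        | 'Print' >> beam.Map(print)
    )
```
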
6 votes • 1 answer

Why do I need to shuffle my PCollection for it to autoscale on Cloud Dataflow?

Context: I am reading a file from Google Cloud Storage in Beam using a process that looks something like this: data = pipeline | beam.Create(['gs://my/file.pkl']) | beam.ParDo(LoadFileDoFn), where LoadFileDoFn loads the file and creates a Python list of…
bstr • 63 • 3
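
Everything fused to the single-element Create runs as one stage on one worker, so Dataflow has nothing to rebalance. Materializing the fanned-out PCollection (with beam.Reshuffle(), or the manual key/group/ungroup spelled out below) breaks that fusion. A sketch using the shape from the question, with the file-loading DoFn replaced by a stand-in:

```python
import random

import apache_beam as beam


class LoadFileDoFn(beam.DoFn):
    """Stand-in for the question's DoFn: fans one path out into many
    elements (the real one unpickles a list from GCS)."""

    def process(self, path):
        for i in range(1000):
            yield (path, i)


with beam.Pipeline() as p:
    (
        p
        | 'Create' >> beam.Create(['gs://my/file.pkl'])
        | 'FanOut' >> beam.ParDo(LoadFileDoFn())
        # Manual fusion break: key, group, ungroup. Without this (or
        # beam.Reshuffle()), the heavy step below stays fused to the
        # single-element Create and runs on one worker.
        | 'AddRandomKey' >> beam.Map(lambda x: (random.randint(0, 255), x))
        | 'Group' >> beam.GroupByKey()
        | 'Ungroup' >> beam.FlatMap(lambda kv: kv[1])
        | 'HeavyStep' >> beam.Map(lambda element: element)  # stand-in
    )
```
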
6 votes • 2 answers

Google Cloud Dataflow jobs failing with error 'Failed to retrieve staged files: failed to retrieve worker in 3 attempts: bad MD5...'

SDK: Apache Beam SDK for Go 0.5.0. We are running Apache Beam Go SDK jobs in Google Cloud Dataflow. They had been working fine until recently, when they intermittently stopped working (no changes were made to code or config). The error that occurs…
Tim • 2,667 • 4 • 32 • 39
6 votes • 2 answers

Google Dataflow "No filesystem found for scheme gs"

I'm trying to execute a Google Dataflow application, but it throws this exception: java.lang.IllegalArgumentException: No filesystem found for scheme gs at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:459) at…
6 votes • 2 answers

Writing to text files in Apache Beam / Dataflow Python streaming

I have a very basic Python Dataflow job that reads some data from Pub/Sub, applies a FixedWindow and writes to Google Cloud Storage. transformed = ... transformed | beam.io.WriteToText(known_args.output) The output is written to the location…
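
The classic WriteToText sink does not handle unbounded input; in newer Beam releases the window-aware fileio.WriteToFiles transform is the usual replacement for streaming writes to GCS. A sketch with a made-up subscription and bucket path:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | 'Read' >> beam.io.ReadFromPubSub(
            subscription='projects/my-project/subscriptions/my-sub')  # hypothetical
        | 'Decode' >> beam.Map(lambda b: b.decode('utf-8'))
        | 'Window' >> beam.WindowInto(window.FixedWindows(60))
        # Unlike WriteToText, fileio.WriteToFiles is window/pane aware, so it
        # can keep emitting files continuously in streaming mode.
        | 'Write' >> fileio.WriteToFiles(
            path='gs://my-bucket/output/',            # hypothetical
            sink=lambda dest: fileio.TextSink())
    )
```
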
6 votes • 2 answers

Write BigQuery results to GCS in CSV format using Apache Beam

I am pretty new to Apache Beam, and I am trying to write a pipeline to extract data from Google BigQuery and write it to GCS in CSV format using Python. Using beam.io.Read(beam.io.BigQuerySource()) I am able to read the data…
Hari • 111 • 1 • 9
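
A sketch of one way to do this with the current Python SDK: read with ReadFromBigQuery, format each row dict as a CSV line, and write with WriteToText using a .csv suffix. Table, bucket, and column names below are made up:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def to_csv_line(row):
    # Rows come back as dicts; order the (hypothetical) columns explicitly
    # so every line has the same layout.
    return ','.join(str(row[col]) for col in ('user_id', 'country', 'score'))


with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | 'Read' >> beam.io.ReadFromBigQuery(
            query='SELECT user_id, country, score FROM `my-project.my_dataset.my_table`',
            use_standard_sql=True)
        | 'ToCsv' >> beam.Map(to_csv_line)
        | 'Write' >> beam.io.WriteToText(
            'gs://my-bucket/export/results',
            file_name_suffix='.csv',
            header='user_id,country,score')
    )
```
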
6 votes • 0 answers

Pass JVM arguments to Google Cloud Dataflow

We are running our Apache Beam code on Google Cloud Dataflow, and we need to pass some JVM arguments to our program. We found links related to Execution Parameters, but nothing related to JVM arguments. How can we pass JVM arguments to Google Cloud…
SANN3 • 9,459 • 6 • 61 • 97