Questions tagged [google-cloud-dataflow]

Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes. Cloud Dataflow is part of the Google Cloud Platform.


The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.


Some useful questions and answers to look at, from the 5,328 questions under this tag:
7 votes, 1 answer

How to create groups of N elements from a PCollection Apache Beam Python

I am trying to accomplish something like this: Batch PCollection in Beam/Dataflow. The answer in the linked question is in Java, whereas the language I'm working with is Python, so I need help building a similar construction. Specifically, I have…
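
A hedged Python sketch of one way to do this: recent Beam Python SDKs ship a built-in BatchElements transform, so a hand-rolled batching DoFn is often unnecessary.

```python
import apache_beam as beam

# BatchElements groups individual elements into lists; batch sizes are
# best-effort and may come out smaller at bundle boundaries.
with beam.Pipeline() as p:
    (p
     | beam.Create(range(10))
     | beam.BatchElements(min_batch_size=3, max_batch_size=3)
     | beam.Map(print))  # each output element is a list of up to 3 items
```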
7 votes, 1 answer

Writing nested schema to BigQuery from Dataflow (Python)

I have a Dataflow job that writes to BigQuery. It works well for a non-nested schema, but fails for a nested one. Here is my Dataflow pipeline: pipeline_options = PipelineOptions() p = beam.Pipeline(options=pipeline_options) …
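
One common fix, sketched here with hypothetical table and field names: build the schema as a TableSchema object rather than the 'name:TYPE,...' string, since the string form cannot express nested RECORD fields.

```python
import apache_beam as beam
from apache_beam.io.gcp.internal.clients import bigquery

# Top-level STRING field plus a nested RECORD field.
table_schema = bigquery.TableSchema()

name_field = bigquery.TableFieldSchema()
name_field.name = 'name'
name_field.type = 'STRING'
table_schema.fields.append(name_field)

address = bigquery.TableFieldSchema()
address.name = 'address'
address.type = 'RECORD'
city = bigquery.TableFieldSchema()
city.name = 'city'
city.type = 'STRING'
address.fields.append(city)
table_schema.fields.append(address)

rows = [{'name': 'Ada', 'address': {'city': 'London'}}]

with beam.Pipeline() as p:
    (p
     | beam.Create(rows)
     | beam.io.WriteToBigQuery('my_project:my_dataset.my_table',  # hypothetical
                               schema=table_schema))
```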
7 votes, 1 answer

apache_beam.transforms.util.Reshuffle() not available for GCP Dataflow

I have upgraded to the latest apache_beam[gcp] package via pip install --upgrade apache_beam[gcp]. However, I noticed that Reshuffle() does not appear in the [gcp] distribution. Does this mean that I will not be able to use Reshuffle() in any…
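
For reference, once Reshuffle() is present in the installed version, usage is minimal; a hedged sketch with hypothetical paths:

```python
import apache_beam as beam

# Reshuffle() checkpoints the collection and breaks fusion, letting the
# runner rebalance downstream work across workers.
with beam.Pipeline() as p:
    (p
     | beam.Create(['gs://bucket/a', 'gs://bucket/b'])
     | beam.Reshuffle()
     | beam.Map(lambda path: path.upper()))
```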
7 votes, 3 answers

How to read BigQuery table using python pipeline code in GCP Dataflow

Could someone please share the syntax to read/write a BigQuery table in a pipeline written in Python for GCP Dataflow?
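
A hedged sketch against recent SDKs (older releases used beam.io.BigQuerySource and BigQuerySink instead); the project, dataset, and table names are placeholders:

```python
import apache_beam as beam

# Reading needs a GCS temp_location in the pipeline options for the
# BigQuery export; rows arrive as Python dicts.
with beam.Pipeline() as p:
    rows = p | beam.io.ReadFromBigQuery(
        query='SELECT name, value FROM `my_project.my_dataset.my_table`',
        use_standard_sql=True)

    (rows
     | beam.Map(lambda row: {'name': row['name'], 'value': row['value']})
     | beam.io.WriteToBigQuery(
           'my_project:my_dataset.output_table',
           schema='name:STRING,value:INTEGER',
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```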
7 votes, 1 answer

Explain Cost of Google Cloud PubSub when used with Cloud Dataflow

The documentation on Pub/Sub pricing is very minimal. Can someone explain the costs for the scenario below? Size of data per event = 0.5 KB; size of data per day = 1 TB. There is only one publisher app, and there are two Dataflow pipeline…
mmziyad
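
The arithmetic can at least be sketched; the per-TiB rate below is a placeholder, so check the current Pub/Sub pricing page before trusting the total:

```python
# Back-of-the-envelope estimate with placeholder pricing.
PRICE_PER_TIB_USD = 40.0             # hypothetical rate, not current pricing
published_tib_per_month = 1.0 * 30   # ~1 TB/day published

# Pub/Sub bills publish traffic and each subscription's delivery traffic
# separately, so one publisher plus two subscribing pipelines is roughly
# 3x the raw volume. Very small messages may also be billed at a
# per-request minimum (historically 1 KB), which matters for 0.5 KB events.
billed_tib = published_tib_per_month * (1 + 2)
print('~$%.0f per month' % (billed_tib * PRICE_PER_TIB_USD))
```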
7 votes, 2 answers

Watching for new files matching a filepattern in Apache Beam

I have a directory on GCS or another supported filesystem to which new files are being written by an external process. I would like to write an Apache Beam streaming pipeline that continuously watches this directory for new files and reads and…
jkff
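
In the Java SDK the accepted approach is FileIO.match().continuously(...); a hedged sketch of the later Python equivalent, with a hypothetical bucket and pattern:

```python
import apache_beam as beam
from apache_beam.io.fileio import MatchContinuously, ReadMatches

# Run with --streaming; MatchContinuously re-scans the pattern every
# `interval` seconds and emits only files it has not seen before.
with beam.Pipeline() as p:
    (p
     | MatchContinuously('gs://my-bucket/incoming/*.csv', interval=60)
     | ReadMatches()
     | beam.Map(lambda f: f.read_utf8())
     | beam.Map(print))
```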
7 votes, 2 answers

Google Cloud Dataflow Worker Threading

Say we have one worker with 4 CPU cores. How is parallelism configured on Dataflow worker machines? Do we parallelize beyond the number of cores?
user_1357
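
Thread counts are largely managed by the service (roughly one work item per vCPU on batch workers, more threads on streaming workers), so in practice you steer parallelism through worker options. A hedged sketch with placeholder values:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Bigger machine types mean more cores (and thus more parallel work
# items) per worker; max_num_workers bounds horizontal scaling.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',          # hypothetical project
    region='us-central1',
    machine_type='n1-standard-4',  # 4 vCPUs per worker
    max_num_workers=10,
)
```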
7 votes, 2 answers

java.lang.IllegalStateException: Unable to return a default Coder in dataflow 2.X

I have a simple pipeline on the Dataflow 2.1 SDK which reads data from Pub/Sub and then applies a DoFn to it: PCollection e = streamData.apply("ToE", ParDo.of(new MyDoFNClass())); I am getting the error below on this…
PUG
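
That question is about the Java SDK, where the usual fix is registering a Coder (or annotating the class with @DefaultCoder). The same idea in the Python SDK, with a hypothetical element type, looks like this:

```python
from apache_beam import coders

class MyEvent(object):
    def __init__(self, user, score):
        self.user = user
        self.score = score

class MyEventCoder(coders.Coder):
    """Hypothetical coder defining how MyEvent is (de)serialized."""
    def encode(self, e):
        return ('%s|%d' % (e.user, e.score)).encode('utf-8')

    def decode(self, b):
        user, score = b.decode('utf-8').split('|')
        return MyEvent(user, int(score))

    def is_deterministic(self):
        return True

# Make MyEventCoder the default coder for MyEvent pipeline-wide.
coders.registry.register_coder(MyEvent, MyEventCoder)
```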
7 votes, 1 answer

How do you test Beam pipeline (Google Dataflow) in Python?

I am having trouble understanding how we are supposed to test our pipeline using the Google Dataflow (Apache Beam) Python SDK…
codebrotherone
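
A minimal, commonly used pattern with the SDK's own testing utilities; the transform under test here is a stand-in:

```python
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

def test_double():
    # TestPipeline runs on the DirectRunner; assert_that checks the
    # final PCollection contents when the pipeline exits the block.
    with TestPipeline() as p:
        output = (p
                  | beam.Create([1, 2, 3])
                  | beam.Map(lambda x: x * 2))
        assert_that(output, equal_to([2, 4, 6]))
```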
7 votes, 2 answers

Invalid DateTime error while trying to insert datetime value into BigQuery from Dataflow

We wrote Google Dataflow code that inserts a value into a BigQuery table whose column is of type DATETIME. The logic ran fine most of the time, but suddenly we get an Invalid DateTime issue. Exception: java.lang.RuntimeException:…
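
A frequent cause is the value's format rather than the pipeline itself: BigQuery's DATETIME type is civil time with no time-zone suffix, so a trailing 'Z' or UTC offset is rejected (zoned instants belong in a TIMESTAMP column). A hedged Python illustration of an accepted value:

```python
import datetime

# DATETIME accepts 'YYYY-MM-DD[T| ]HH:MM:SS[.ffffff]' with no zone suffix.
now = datetime.datetime.utcnow()
row = {'event_time': now.strftime('%Y-%m-%dT%H:%M:%S.%f')}
```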
7 votes, 2 answers

Logs for Beam application in Google cloud dataflow

I have a Beam application that runs successfully locally with the DirectRunner and gives me all the log output from my code on my local console. But when I tried running it in the Google Cloud Dataflow environment, I only see those log…
bignano
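
For the Python SDK the usual advice is to log through the standard logging module rather than print(); those records then surface in the job's worker logs in Cloud Logging (Stackdriver). A minimal sketch:

```python
import logging
import apache_beam as beam

class LogDoFn(beam.DoFn):
    def process(self, element):
        # Shows up in the Dataflow job's worker logs.
        logging.info('processing element: %s', element)
        yield element

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
```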
7 votes, 1 answer

Read a file from GCS in Apache Beam

I need to read a file from a GCS bucket. I know I'll have to use the GCS API/client libraries, but I cannot find any example related to it. I have been referring to this link in the GCS documentation: GCS Client Libraries. But I couldn't really make a…
rish0097
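
Two hedged options in the Python SDK (bucket and object names are hypothetical, and apache_beam[gcp] must be installed): read through an IO transform inside the pipeline, or open the object directly via Beam's FileSystems API, with no separate GCS client library needed:

```python
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

# Option 1: as a pipeline source.
with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromText('gs://my-bucket/my-file.txt')

# Option 2: direct access, e.g. from driver code or inside a DoFn.
with FileSystems.open('gs://my-bucket/my-file.txt') as f:
    data = f.read()
```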
7 votes, 3 answers

Connecting to Cloud SQL from Dataflow Job

I'm struggling to use JdbcIO with Apache Beam 2.0 (Java) to connect to a Cloud SQL instance from Dataflow within the same project. I'm getting the following error: java.sql.SQLException: Cannot create PoolableConnectionFactory (Communications link…
Jimmy
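
JdbcIO is Java-only; on the Python side a common stand-in is a plain DoFn holding a client connection, sketched here with PyMySQL and a hypothetical host, credentials, and table (this also assumes the workers can reach the instance, e.g. via an authorized network or a Cloud SQL connector):

```python
import apache_beam as beam

class WriteToCloudSql(beam.DoFn):
    def setup(self):
        import pymysql  # assumed dependency, shipped via requirements.txt
        self.conn = pymysql.connect(host='10.0.0.3',    # hypothetical IP
                                    user='beam',
                                    password='secret',  # placeholder
                                    database='mydb')

    def process(self, row):
        with self.conn.cursor() as cur:
            cur.execute('INSERT INTO events (payload) VALUES (%s)', (row,))
        self.conn.commit()

    def teardown(self):
        self.conn.close()
```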
7 votes, 2 answers

Apache Beam: Unable to find registrar for gs

Beam uses both Google's auto/value and auto/service tools. I want to run a pipeline with the Dataflow runner, and the data is stored on Google Cloud Storage. I've added the dependencies: org.apache.beam
7 votes, 2 answers

Buffer and flush Apache Beam streaming data

I have a streaming job whose initial run will have to process a large amount of data. One of the DoFns calls a remote service that supports batch requests, so when working with bounded collections I use the following approach: private static final class…
robosoul
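
The hand-rolled Java answers use BagState plus timers; the Python SDK's GroupIntoBatches wraps that same state-and-timers mechanism. A hedged sketch (note the input must be keyed):

```python
import apache_beam as beam
from apache_beam.transforms.util import GroupIntoBatches

# Each output is (key, [buffered values]), flushed when batch_size is
# reached or, in streaming, when the window expires.
with beam.Pipeline() as p:
    (p
     | beam.Create([('k', i) for i in range(10)])
     | GroupIntoBatches(batch_size=4)
     | beam.Map(print))
```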