Questions tagged [google-cloud-dataflow]

Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.

Cloud Dataflow is part of the Google Cloud Platform.

5328 questions
1 vote, 1 answer

Sharing Beam State across different DoFns

Is Beam State shared across different DoFns? Let's say I have 2 DoFns: StatefulDoFn1: { myState.write(1) } StatefulDoFn2: { myState.read() ; do something ... output } And then the pipeline in pseudocode: pipeline =…
StaticBug • 557 • 5 • 15
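
A key point for the question above: Beam state is scoped to a single DoFn (and to the current key and window), so one stateful DoFn cannot read state written by another; data is usually passed between DoFns through the PCollection itself or via side inputs. A minimal Python-SDK sketch of per-key state inside one DoFn, assuming a keyed input PCollection:

```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec


class CountPerKey(beam.DoFn):
    # A state cell named 'count'; it is private to this DoFn and scoped per key.
    COUNT = ReadModifyWriteStateSpec('count', VarIntCoder())

    def process(self, element, count=beam.DoFn.StateParam(COUNT)):
        key, _ = element
        current = (count.read() or 0) + 1
        count.write(current)
        yield key, current


with beam.Pipeline() as p:
    (p
     | beam.Create([('a', 1), ('a', 2), ('b', 3)])
     | beam.ParDo(CountPerKey())
     | beam.Map(print))
```
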
1 vote, 1 answer

How to convert a non-templated beam job to templated job and run it on GCP Dataflow runner?

I was able to run the non-templated Beam job directly on the GCP Dataflow runner using the command below: java -jar --runner=DataFlowRunner --gcpTempLocation=gs://some/gcs/location --stagingLocation=gs://some/gcs/location/stage…
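
The question is about the Java SDK, where a classic template is built by adding --templateLocation to the run command. As a rough Python-SDK sketch of the same idea (paths are placeholders and --input_path is a hypothetical option), the pipeline defers runtime inputs to value providers and is staged with --template_location:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Resolved when the template is *run*, not when it is built.
        parser.add_value_provider_argument('--input_path', type=str)


def run():
    opts = MyOptions()
    with beam.Pipeline(options=opts) as p:
        (p
         | beam.io.ReadFromText(opts.input_path)
         | beam.Map(len)
         | beam.Map(print))


# Building the classic template (placeholder paths):
#   python my_pipeline.py \
#     --runner=DataflowRunner --project=<project> --region=<region> \
#     --temp_location=gs://<bucket>/tmp \
#     --template_location=gs://<bucket>/templates/my_template
```

Flex Templates are the newer alternative and avoid most of the ValueProvider restrictions that classic templates impose.
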
1 vote, 1 answer

Can not sink to BigQuery using Dataflow Apache Beam

I have 2 CSV files: expeditions- 2010s.csv and peaks.csv, with the join key 'peak_id'. I'm using a notebook with Apache Beam in Dataflow to join them. Here is my code: def read_csv_file(readable_file): import apache_beam as beam import…
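
A minimal sketch of the BigQuery sink half of such a pipeline in the Python SDK; the bucket, table, and column names are placeholders, and the join itself is sketched under the CoGroupByKey question further down:

```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery, BigQueryDisposition


def parse_peak(line):
    # Hypothetical column layout; adapt to the real peaks.csv header.
    fields = line.split(',')
    return {'peak_id': fields[0], 'peak_name': fields[1]}


with beam.Pipeline() as p:
    (p
     | 'ReadPeaks' >> beam.io.ReadFromText('gs://my-bucket/peaks.csv',
                                           skip_header_lines=1)
     | 'Parse' >> beam.Map(parse_peak)
     | 'WriteToBQ' >> WriteToBigQuery(
         'my-project:my_dataset.peaks',
         schema='peak_id:STRING,peak_name:STRING',
         write_disposition=BigQueryDisposition.WRITE_APPEND,
         create_disposition=BigQueryDisposition.CREATE_IF_NEEDED))
```
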
1 vote, 1 answer

Apache Beam: IllegalStateException - Value only available at runtime after upgrading to beam 2.41.0

I upgraded my Apache Beam version from 2.34.0 to 2.41.0 and am getting the following error when trying to build the template. The error: Exception in thread "main" java.lang.IllegalStateException: Value only available at runtime, but accessed from a…
jurl • 2,504 • 1 • 17 • 20
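
That IllegalStateException usually means a ValueProvider's get() was called while the pipeline graph (or template) was being constructed rather than at run time. A Python-SDK sketch of the safe pattern, using a hypothetical --prefix option and deferring get() to process():

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class Opts(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--prefix', type=str, default='')


class AddPrefix(beam.DoFn):
    def __init__(self, prefix):
        # Keep the ValueProvider itself; calling .get() before the pipeline
        # actually runs is what triggers this class of error.
        self._prefix = prefix

    def process(self, element):
        yield self._prefix.get() + element   # get() is safe at run time


opts = Opts()
with beam.Pipeline(options=opts) as p:
    (p
     | beam.Create(['a', 'b'])
     | beam.ParDo(AddPrefix(opts.prefix))
     | beam.Map(print))
```
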
1 vote, 2 answers

Google Dataflow store to specific Partition using BigQuery Storage Write API

I want to store data in BigQuery using specific partitions. The partitions are ingestion-time based. I want to use a range of partitions spanning two years. I use the partition alias destination project-id:data-set.table-id$partition-date. I…
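
A hedged sketch of writing to one ingestion-time partition from the Python SDK by appending the $YYYYMMDD decorator to a placeholder table name; whether a given write method (file loads, streaming inserts, Storage Write API) honors the decorator depends on the Beam version, so verify before relying on it:

```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery, BigQueryDisposition

# Placeholder table; the $YYYYMMDD suffix targets one ingestion-time partition.
TABLE = 'my-project:my_dataset.events$20230101'

with beam.Pipeline() as p:
    (p
     | beam.Create([{'user': 'alice', 'clicks': 3}])
     | WriteToBigQuery(
         TABLE,
         schema='user:STRING,clicks:INTEGER',
         write_disposition=BigQueryDisposition.WRITE_APPEND,
         # method=WriteToBigQuery.Method.STORAGE_WRITE_API  # decorator support
         # for this method varies by Beam version; check before relying on it.
     ))
```
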
1 vote, 1 answer

Avoid session shutdown on BigQuery Storage API with Dataflow

I am implementing an ETL job that migrates a non-partitioned BigQuery table to a partitioned one. To do so I use the Storage API from BigQuery. This creates a number of sessions to pull data from. In order to route the BigQuery writes to the right…
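
For the read side, a minimal Python-SDK sketch of going through the BigQuery Storage Read API (the table name is a placeholder; the service decides how many read streams/sessions to open):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromBigQuery(
         table='my-project:my_dataset.source_table',          # placeholder
         method=beam.io.ReadFromBigQuery.Method.DIRECT_READ)  # Storage Read API
     | beam.Map(print))
```
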
1 vote, 1 answer

What does the pipeline "state" mean in DataFlow?

I am a beginner in Dataflow. There is a concept I'm not sure I understand: the "state". When talking about the pipeline state, does it mean the data in the pipeline? For example, when taking a Dataflow snapshot, the documentation says…
user1021712 • 333 • 1 • 3 • 10
1 vote, 1 answer

How to override Google DataFlow logging with logback?

We deployed Dataflow pipelines in Google Cloud, developed using Apache Beam. Dataflow logging doesn't include the transaction id, which is needed for tracing a transaction through the pipeline. Any logging pattern used in the logback is being…
1 vote, 2 answers

Left join with CoGroupByKey sink to BigQuery using Dataflow

I would like to join files (expeditions- 2010s.csv and peaks.csv) using the join key "peakid" with CoGroupByKey. However, there is an error when I sink it to BigQuery: RuntimeError: BigQuery job…
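
A minimal left-join sketch with CoGroupByKey in the Python SDK, with toy in-memory data standing in for the two CSV files; the BigQuery write itself follows the same WriteToBigQuery pattern sketched earlier:

```python
import apache_beam as beam


def left_join(kv):
    peakid, grouped = kv
    # Emit every expedition row; pad with None when no matching peak exists.
    peaks = list(grouped['peaks']) or [{'peak_name': None}]
    for exp in grouped['expeditions']:
        for peak in peaks:
            yield {'peakid': peakid, **exp, **peak}


with beam.Pipeline() as p:
    expeditions = p | 'Expeditions' >> beam.Create(
        [('EVER', {'year': 2015}), ('AMAD', {'year': 2016})])
    peaks = p | 'Peaks' >> beam.Create([('EVER', {'peak_name': 'Everest'})])

    ({'expeditions': expeditions, 'peaks': peaks}
     | beam.CoGroupByKey()
     | beam.FlatMap(left_join)
     | beam.Map(print))
```
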
1 vote, 1 answer

Apache Beam: Reading from Kafka starting at the initial offset rather than the latest

I am trying to write a simple Beam pipeline that starts consuming data from the earliest offsets existing in the partitions of each Kafka Topic. I have not been able to figure out how to consume data from the earliest possible offsets in a topic.
Pablo • 10,425 • 1 • 44 • 67
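
With the Python SDK's cross-language Kafka transform, the usual approach is to set the standard consumer property auto.offset.reset to earliest (it applies when the consumer group has no committed offsets yet). Broker address and topic below are placeholders:

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka

with beam.Pipeline() as p:
    (p
     | ReadFromKafka(
         consumer_config={
             'bootstrap.servers': 'broker:9092',   # placeholder
             # Start from the earliest offsets when the consumer group
             # has no committed offsets yet.
             'auto.offset.reset': 'earliest',
         },
         topics=['my-topic'])                      # placeholder
     | beam.Map(print))
```
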
1 vote, 2 answers

Google cloud dataflow job creation error: "Cannot set worker pool zone. Please check whether the worker_region experiments flag is valid"

I'm trying to create a Dataflow job to index a BigQuery table into Elasticsearch with the Node package google-cloud/dataflow.v1beta3. The job works fine when it's created and launched from the Google Cloud console, but I get the following error when…
PierreM • 21 • 3
1 vote, 1 answer

Effect of using Apache Beam schemas

What is the use of specifying Beam schemas in our code when reading a source? How does it make our pipeline more efficient?
tru • 55 • 5
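
Schemas tell the runner the field structure of each element, which enables field-level transforms (Select, GroupBy by field name, SqlTransform) and efficient row coding. A small Python sketch using a NamedTuple-backed schema; the type and field names are illustrative:

```python
import typing

import apache_beam as beam
from apache_beam import coders


class Purchase(typing.NamedTuple):
    user: str
    amount: float


# Registering a RowCoder gives the PCollection a schema the runner understands.
coders.registry.register_coder(Purchase, coders.RowCoder)

with beam.Pipeline() as p:
    purchases = p | beam.Create(
        [Purchase('alice', 9.5), Purchase('bob', 3.0)]).with_output_types(Purchase)

    # Field-level grouping and aggregation only work on schema'd elements.
    (purchases
     | beam.GroupBy('user').aggregate_field('amount', sum, 'total_amount')
     | beam.Map(print))
```
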
1 vote, 1 answer

How to save logs from a C++ binary in Beam Python?

I have a C++ binary that uses glog. I run that binary within Beam Python on Cloud Dataflow. I want to save the C++ binary's stdout, stderr, and any log files for later inspection. What's the best way to do that? This guide gives an example for Beam Java.…
bill • 650 • 8 • 17
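
One hedged approach: invoke the binary with subprocess inside a DoFn, capture stdout/stderr, re-log them so they appear in the Dataflow worker logs, and optionally copy them to GCS via Beam's FileSystems. The binary path and bucket below are placeholders:

```python
import logging
import subprocess
import uuid

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class RunNativeBinary(beam.DoFn):
    def process(self, element):
        # Run the C++ binary and capture both output streams.
        result = subprocess.run(
            ['/usr/local/bin/my_tool', str(element)],   # placeholder path
            capture_output=True, text=True)

        # Re-logging makes the output show up in the Dataflow worker logs.
        logging.info('my_tool stdout: %s', result.stdout)
        if result.stderr:
            logging.warning('my_tool stderr: %s', result.stderr)

        # Optionally persist the raw output to GCS for later inspection.
        path = 'gs://my-bucket/native-logs/%s.log' % uuid.uuid4()   # placeholder
        f = FileSystems.create(path)
        f.write((result.stdout + result.stderr).encode('utf-8'))
        f.close()

        yield element
```
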
1 vote, 2 answers

GroupByKey to fill values and then ungroup apache beam

I have CSV files with missing values within groups formed by primary keys (in every group, only one record has a value for a certain field, and I need that field to be populated for all records of the group). I'm processing the entire file with Apache…
pa-nguyen • 417 • 1 • 5 • 16
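
A minimal sketch of the group / fill / re-emit pattern in the Python SDK; the records and the 'city' field are hypothetical stand-ins for the real columns:

```python
import apache_beam as beam


def fill_group(kv):
    key, records = kv
    records = list(records)
    # Copy the single populated value in the group onto every record.
    fill = next((r['city'] for r in records if r.get('city')), None)
    for r in records:
        yield {**r, 'city': r.get('city') or fill}


with beam.Pipeline() as p:
    (p
     | beam.Create([
         {'id': 1, 'city': 'Hanoi'},
         {'id': 1, 'city': None},
         {'id': 2, 'city': 'Hue'},
     ])
     | beam.Map(lambda r: (r['id'], r))    # key by the primary key
     | beam.GroupByKey()
     | beam.FlatMap(fill_group)            # ungroup: emit individual rows again
     | beam.Map(print))
```
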
1 vote, 1 answer

Web Crawler using Cloud Dataflow

I would like to crawl 3 million web pages a day. Due to the variety of web content (HTML, PDF, etc.), I need to use Selenium, Playwright, etc. I noticed that to use Selenium one has to build a custom container for Google Dataflow. Is it a good choice to use…