Questions tagged [google-cloud-dataflow]

Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.

Cloud Dataflow is part of the Google Cloud Platform.

5328 questions
1 vote, 1 answer

Sharing Beam State across different DoFns

Is Beam State shared across different DoFns? Let's say I have 2 DoFns: StatefulDoFn1: { myState.write(1) } StatefulDoFn2: { myState.read() ; do something ... output } And then the pipeline in pseudocode: pipeline =…
StaticBug • 557 • 5 • 15
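
A key point for the question above: Beam state is scoped to a single DoFn (and to the current key and window), so one stateful DoFn cannot read state written by another; data is usually passed between DoFns through the PCollection itself or via side inputs. A minimal Python-SDK sketch of per-key state inside one DoFn, assuming a keyed input PCollection:

```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec


class CountPerKey(beam.DoFn):
    # A state cell named 'count'; it is private to this DoFn and scoped per key.
    COUNT = ReadModifyWriteStateSpec('count', VarIntCoder())

    def process(self, element, count=beam.DoFn.StateParam(COUNT)):
        key, _ = element
        current = (count.read() or 0) + 1
        count.write(current)
        yield key, current


with beam.Pipeline() as p:
    (p
     | beam.Create([('a', 1), ('a', 2), ('b', 3)])
     | beam.ParDo(CountPerKey())
     | beam.Map(print))
```
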
1 vote, 1 answer

How to convert a non-templated beam job to templated job and run it on GCP Dataflow runner?

I was able to run the non-templated Beam job directly on the GCP Dataflow runner using the command below: java -jar --runner=DataFlowRunner --gcpTempLocation=gs://some/gcs/location --stagingLocation=gs://some/gcs/location/stage…
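
The question is about the Java SDK, where a classic template is built by adding --templateLocation to the run command. As a rough Python-SDK sketch of the same idea (paths are placeholders and --input_path is a hypothetical option), the pipeline defers runtime inputs to value providers and is staged with --template_location:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Resolved when the template is *run*, not when it is built.
        parser.add_value_provider_argument('--input_path', type=str)


def run():
    opts = MyOptions()
    with beam.Pipeline(options=opts) as p:
        (p
         | beam.io.ReadFromText(opts.input_path)
         | beam.Map(len)
         | beam.Map(print))


# Building the classic template (placeholder paths):
#   python my_pipeline.py \
#     --runner=DataflowRunner --project=<project> --region=<region> \
#     --temp_location=gs://<bucket>/tmp \
#     --template_location=gs://<bucket>/templates/my_template
```

Flex Templates are the newer alternative and avoid most of the ValueProvider restrictions that classic templates impose.
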
1 vote, 1 answer

Can not sink to BigQuery using Dataflow Apache Beam

I have 2 CSV files: expeditions- 2010s.csv and peaks.csv, with the join key 'peak_id'. I'm using a notebook with Apache Beam in Dataflow to join them. Here is my code: def read_csv_file(readable_file): import apache_beam as beam import…
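
A minimal sketch of the BigQuery sink half of such a pipeline in the Python SDK; the bucket, table, and column names are placeholders, and the join itself is sketched under the CoGroupByKey question further down:

```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery, BigQueryDisposition


def parse_peak(line):
    # Hypothetical column layout; adapt to the real peaks.csv header.
    fields = line.split(',')
    return {'peak_id': fields[0], 'peak_name': fields[1]}


with beam.Pipeline() as p:
    (p
     | 'ReadPeaks' >> beam.io.ReadFromText('gs://my-bucket/peaks.csv',
                                           skip_header_lines=1)
     | 'Parse' >> beam.Map(parse_peak)
     | 'WriteToBQ' >> WriteToBigQuery(
         'my-project:my_dataset.peaks',
         schema='peak_id:STRING,peak_name:STRING',
         write_disposition=BigQueryDisposition.WRITE_APPEND,
         create_disposition=BigQueryDisposition.CREATE_IF_NEEDED))
```
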
1 vote, 1 answer

Apache Beam: IllegalStateException - Value only available at runtime after upgrading to beam 2.41.0

I upgraded my Apache Beam version from 2.34.0 to 2.41.0 and am getting the following error when trying to build the template. The error: Exception in thread "main" java.lang.IllegalStateException: Value only available at runtime, but accessed from a…
jurl • 2,504 • 1 • 17 • 20
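
That IllegalStateException usually means a ValueProvider's get() was called while the pipeline graph (or template) was being constructed rather than at run time. A Python-SDK sketch of the safe pattern, using a hypothetical --prefix option and deferring get() to process():

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class Opts(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--prefix', type=str, default='')


class AddPrefix(beam.DoFn):
    def __init__(self, prefix):
        # Keep the ValueProvider itself; calling .get() before the pipeline
        # actually runs is what triggers this class of error.
        self._prefix = prefix

    def process(self, element):
        yield self._prefix.get() + element   # get() is safe at run time


opts = Opts()
with beam.Pipeline(options=opts) as p:
    (p
     | beam.Create(['a', 'b'])
     | beam.ParDo(AddPrefix(opts.prefix))
     | beam.Map(print))
```
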
1 vote, 2 answers

Google Dataflow store to specific Partition using BigQuery Storage Write API

I want to store data in BigQuery using specific partitions. The partitions are ingestion-time based. I want to use a range of partitions spanning two years. I use the partition alias destination project-id:data-set.table-id$partition-date. I…
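
A hedged sketch of writing to one ingestion-time partition from the Python SDK by appending the $YYYYMMDD decorator to a placeholder table name; whether a given write method (file loads, streaming inserts, Storage Write API) honors the decorator depends on the Beam version, so verify before relying on it:

```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery, BigQueryDisposition

# Placeholder table; the $YYYYMMDD suffix targets one ingestion-time partition.
TABLE = 'my-project:my_dataset.events$20230101'

with beam.Pipeline() as p:
    (p
     | beam.Create([{'user': 'alice', 'clicks': 3}])
     | WriteToBigQuery(
         TABLE,
         schema='user:STRING,clicks:INTEGER',
         write_disposition=BigQueryDisposition.WRITE_APPEND,
         # method=WriteToBigQuery.Method.STORAGE_WRITE_API  # decorator support
         # for this method varies by Beam version; check before relying on it.
     ))
```
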
1 vote, 1 answer

Avoid session shutdown on BigQuery Storage API with Dataflow

I am implementing an ETL job that migrates a non-partitioned BigQuery table to a partitioned one. To do so I use the Storage API from BigQuery. This creates a number of sessions to pull data from. In order to route the BigQuery writes to the right…
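
For the read side, a minimal Python-SDK sketch of going through the BigQuery Storage Read API (the table name is a placeholder; the service decides how many read streams/sessions to open):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromBigQuery(
         table='my-project:my_dataset.source_table',          # placeholder
         method=beam.io.ReadFromBigQuery.Method.DIRECT_READ)  # Storage Read API
     | beam.Map(print))
```
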
1 vote, 1 answer

What does the pipeline "state" mean in DataFlow?

I am a beginner in Dataflow. There is a concept I'm not sure I understand: the "state". When talking about the pipeline state, does it mean the data in the pipeline? For example, when taking a Dataflow snapshot, the documentation says…
user1021712 • 333 • 1 • 3 • 10
1 vote, 1 answer

How to override Google DataFlow logging with logback?

We deployed Dataflow pipelines in Google Cloud, developed using Apache Beam. Dataflow logging doesn't include the transaction id, which is needed for tracing a transaction through the pipeline. Any logging pattern used in the logback is being…
1 vote, 2 answers

Left join with CoGroupByKey sink to BigQuery using Dataflow

I would like to join files (expeditions- 2010s.csv and peaks.csv) using the join key "peakid" with CoGroupByKey. However, there is an error when I sink it to BigQuery: RuntimeError: BigQuery job…
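
A minimal left-join sketch with CoGroupByKey in the Python SDK, with toy in-memory data standing in for the two CSV files; the BigQuery write itself follows the same WriteToBigQuery pattern sketched earlier:

```python
import apache_beam as beam


def left_join(kv):
    peakid, grouped = kv
    # Emit every expedition row; pad with None when no matching peak exists.
    peaks = list(grouped['peaks']) or [{'peak_name': None}]
    for exp in grouped['expeditions']:
        for peak in peaks:
            yield {'peakid': peakid, **exp, **peak}


with beam.Pipeline() as p:
    expeditions = p | 'Expeditions' >> beam.Create(
        [('EVER', {'year': 2015}), ('AMAD', {'year': 2016})])
    peaks = p | 'Peaks' >> beam.Create([('EVER', {'peak_name': 'Everest'})])

    ({'expeditions': expeditions, 'peaks': peaks}
     | beam.CoGroupByKey()
     | beam.FlatMap(left_join)
     | beam.Map(print))
```
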
1 vote, 1 answer

Apache Beam: Reading from Kafka starting at the initial offset rather than the latest

I am trying to write a simple Beam pipeline that starts consuming data from the earliest offsets existing in the partitions of each Kafka Topic. I have not been able to figure out how to consume data from the earliest possible offsets in a topic.
Pablo • 10,425 • 1 • 44 • 67
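
With the Python SDK's cross-language Kafka transform, the usual approach is to set the standard consumer property auto.offset.reset to earliest (it applies when the consumer group has no committed offsets yet). Broker address and topic below are placeholders:

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka

with beam.Pipeline() as p:
    (p
     | ReadFromKafka(
         consumer_config={
             'bootstrap.servers': 'broker:9092',   # placeholder
             # Start from the earliest offsets when the consumer group
             # has no committed offsets yet.
             'auto.offset.reset': 'earliest',
         },
         topics=['my-topic'])                      # placeholder
     | beam.Map(print))
```
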
1 vote, 2 answers

Google cloud dataflow job creation error: "Cannot set worker pool zone. Please check whether the worker_region experiments flag is valid"

I'm trying to create a Dataflow job to index a BigQuery table into Elasticsearch with the Node package google-cloud/dataflow.v1beta3. The job works fine when it's created and launched from the Google Cloud console, but I get the following error when…
PierreM • 21 • 3
1 vote, 1 answer

Effect of using Apache Beam schemas

What is the use of specifying Beam schemas in our code when reading a source? How does it make our pipeline more efficient?
tru • 55 • 5
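
Schemas tell the runner the field structure of each element, which enables field-level transforms (Select, GroupBy by field name, SqlTransform) and efficient row coding. A small Python sketch using a NamedTuple-backed schema; the type and field names are illustrative:

```python
import typing

import apache_beam as beam
from apache_beam import coders


class Purchase(typing.NamedTuple):
    user: str
    amount: float


# Registering a RowCoder gives the PCollection a schema the runner understands.
coders.registry.register_coder(Purchase, coders.RowCoder)

with beam.Pipeline() as p:
    purchases = p | beam.Create(
        [Purchase('alice', 9.5), Purchase('bob', 3.0)]).with_output_types(Purchase)

    # Field-level grouping and aggregation only work on schema'd elements.
    (purchases
     | beam.GroupBy('user').aggregate_field('amount', sum, 'total_amount')
     | beam.Map(print))
```
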
1 vote, 1 answer

How to save logs from a C++ binary in Beam Python?

I have a C++ binary that uses glog. I run that binary within Beam Python on Cloud Dataflow. I want to save the C++ binary's stdout, stderr, and any log files for later inspection. What's the best way to do that? This guide gives an example for Beam Java.…
bill • 650 • 8 • 17
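
One hedged approach: invoke the binary with subprocess inside a DoFn, capture stdout/stderr, re-log them so they appear in the Dataflow worker logs, and optionally copy them to GCS via Beam's FileSystems. The binary path and bucket below are placeholders:

```python
import logging
import subprocess
import uuid

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class RunNativeBinary(beam.DoFn):
    def process(self, element):
        # Run the C++ binary and capture both output streams.
        result = subprocess.run(
            ['/usr/local/bin/my_tool', str(element)],   # placeholder path
            capture_output=True, text=True)

        # Re-logging makes the output show up in the Dataflow worker logs.
        logging.info('my_tool stdout: %s', result.stdout)
        if result.stderr:
            logging.warning('my_tool stderr: %s', result.stderr)

        # Optionally persist the raw output to GCS for later inspection.
        path = 'gs://my-bucket/native-logs/%s.log' % uuid.uuid4()   # placeholder
        f = FileSystems.create(path)
        f.write((result.stdout + result.stderr).encode('utf-8'))
        f.close()

        yield element
```
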
1 vote, 2 answers

GroupByKey to fill values and then ungroup apache beam

I have CSV files with missing values within groups formed by primary keys (in every group, only one record has a value for a certain field, and I need that field to be populated for all records of the group). I'm processing the entire file with Apache…
pa-nguyen • 417 • 1 • 5 • 16
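
A minimal sketch of the group / fill / re-emit pattern in the Python SDK; the records and the 'city' field are hypothetical stand-ins for the real columns:

```python
import apache_beam as beam


def fill_group(kv):
    key, records = kv
    records = list(records)
    # Copy the single populated value in the group onto every record.
    fill = next((r['city'] for r in records if r.get('city')), None)
    for r in records:
        yield {**r, 'city': r.get('city') or fill}


with beam.Pipeline() as p:
    (p
     | beam.Create([
         {'id': 1, 'city': 'Hanoi'},
         {'id': 1, 'city': None},
         {'id': 2, 'city': 'Hue'},
     ])
     | beam.Map(lambda r: (r['id'], r))    # key by the primary key
     | beam.GroupByKey()
     | beam.FlatMap(fill_group)            # ungroup: emit individual rows again
     | beam.Map(print))
```
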
1 vote, 1 answer

Web Crawler using Cloud Dataflow

I would like to crawl 3 million web pages a day. Due to the variety of web content (HTML, PDF, etc.), I need to use Selenium, Playwright, etc. I noticed that to use Selenium one has to build a custom container for Google Dataflow. Is it a good choice to use…