Questions tagged [google-cloud-dataflow]

Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes. Cloud Dataflow is part of the Google Cloud Platform.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.

5328 questions
1 vote • 1 answer

Dataflow flex template job attempts to launch second job (for pipeline) with same job_name

I am trying to launch a Dataflow flex template. As part of the build and deploy process, I am pre-building a custom SDK container image to reduce worker start-up time. I have attempted this in these ways: When no sdk_container_image is specified…
ddjanke
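
For reference, a minimal sketch (not from the question) of how a prebuilt SDK container image and an explicit, unique job name can be passed to a Beam Python pipeline through standard pipeline options; the project, region, bucket and image URI are placeholders:

    # Sketch only: passing a prebuilt SDK container image and a unique job
    # name through standard Beam pipeline options. All resource names are
    # placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                  # placeholder
        region="us-central1",                  # placeholder
        temp_location="gs://my-bucket/tmp",    # placeholder
        job_name="my-flex-template-job",       # must be unique among running jobs
        sdk_container_image="gcr.io/my-project/beam-sdk:latest",  # prebuilt image
    )

    with beam.Pipeline(options=options) as p:
        p | beam.Create(["placeholder"]) | beam.Map(print)
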
1 vote • 1 answer

What is the difference between using org.apache.hadoop.hbase.client vs com.google.cloud.bigtable.data.v2 on Dataflow (GCP)?

Is there a difference in performance, stability, or maybe long-term support? I mean, is it necessary to migrate from the HBase API to the Apache Beam Bigtable connector?
1 vote • 1 answer

Apache Beam Dataflow job issue - Runnable workflow has no steps specified

I am trying to create a Dataflow job with a custom template and getting the error "Runnable workflow has no steps specified." The log has no info except this. Am I missing any steps? I have created a virtual env and executed the code below. I'm…
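
That error typically indicates the submitted job graph contains no transforms. As a minimal sketch (assuming the Python SDK), a pipeline needs at least one step applied before it is run:

    # Minimal sketch: the pipeline must have at least one transform applied
    # before run() is called, otherwise Dataflow sees an empty workflow.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions()  # add runner/project/template options as needed

    with beam.Pipeline(options=options) as p:
        (p
         | "Create" >> beam.Create(["hello", "world"])
         | "Upper" >> beam.Map(str.upper)
         | "Print" >> beam.Map(print))
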
1 vote • 1 answer

Using FinishBundle in Apache Beam Go SDK

I am attempting to use FinishBundle() to batch requests in beam on dataflow. These requests are fetching information and emitting it for further processing downstream in the pipeline, a la: func BatchRpcFn { client RpcClient bufferRequest…
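
The Go snippet above is truncated; as a rough illustration of the same bundling pattern, here is a sketch using the Python SDK's start_bundle/finish_bundle hooks, where rpc_client and its fetch_many method are hypothetical stand-ins for the RPC client in the question:

    # Sketch of bundle-level batching (Python SDK equivalent of FinishBundle).
    # rpc_client.fetch_many() is a hypothetical stand-in for the real RPC call.
    import apache_beam as beam
    from apache_beam.transforms.window import GlobalWindow
    from apache_beam.utils.timestamp import MIN_TIMESTAMP
    from apache_beam.utils.windowed_value import WindowedValue

    class BatchRpcFn(beam.DoFn):
        MAX_BUFFER = 50

        def __init__(self, rpc_client):
            self._client = rpc_client
            self._buffer = []

        def start_bundle(self):
            self._buffer = []

        def process(self, element):
            self._buffer.append(element)
            if len(self._buffer) >= self.MAX_BUFFER:
                yield from self._flush()

        def finish_bundle(self):
            # Flush whatever is still buffered when the bundle ends; results
            # emitted here must be wrapped as WindowedValues.
            for result in self._flush():
                yield WindowedValue(result, MIN_TIMESTAMP, [GlobalWindow()])

        def _flush(self):
            results = self._client.fetch_many(self._buffer)  # hypothetical RPC
            self._buffer = []
            return results
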
1 vote • 2 answers

How to migrate files from on-prem to GCS?

I want to build an ETL pipeline that: reads files from the on-prem filesystem; writes the files into a Cloud Storage bucket. Is it possible to import the files (regularly, every day) directly with the Storage Transfer Service? Let's suppose I want to…
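
Besides Storage Transfer Service, one common alternative is a small scheduled script that uploads the files with the google-cloud-storage client. A sketch, with placeholder bucket and directory names:

    # Sketch: uploading local files to Cloud Storage with google-cloud-storage.
    # Bucket name and source directory are placeholders.
    from pathlib import Path
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-landing-bucket")       # placeholder

    for path in Path("/data/exports").glob("*.csv"):  # placeholder directory
        blob = bucket.blob(f"incoming/{path.name}")
        blob.upload_from_filename(str(path))
        print(f"uploaded {path} -> gs://{bucket.name}/{blob.name}")
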
1 vote • 1 answer

Can I collect data in an Apache Beam pipeline every 5 minutes and perform analysis on that data collectively after an hour?

I currently have an Apache Beam pipeline that writes Pub/Sub messages to BigQuery and GCS in real time, and my next goal is to pull the messages from Pub/Sub at an interval of every 5 minutes and collectively perform analysis on those multiple…
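
One way to express this in Beam is an hourly window with early firings roughly every 5 minutes. A sketch with placeholder topic and aggregation, assuming the analysis can be expressed as a combine:

    # Sketch: 1-hour fixed windows that also fire early about every 5 minutes.
    # Topic name and the aggregation are placeholders.
    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows
    from apache_beam.transforms.trigger import (AccumulationMode,
                                                AfterProcessingTime,
                                                AfterWatermark)

    def hourly_counts(p):
        return (p
                | beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
                | beam.WindowInto(
                    FixedWindows(60 * 60),
                    trigger=AfterWatermark(early=AfterProcessingTime(5 * 60)),
                    accumulation_mode=AccumulationMode.ACCUMULATING)
                | beam.CombineGlobally(
                    beam.combiners.CountCombineFn()).without_defaults())
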
1 vote • 1 answer

How to append different Pub/Sub objects and flatten them to write them all together into BigQuery as a single JSON?

I wanted to write three attributes (data, attributes and publish time) of a Pub/Sub message to BigQuery and wanted them printed in a flattened way so that all elements are written in a single row, for…
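
A sketch of the flattening step, assuming the pipeline reads with ReadFromPubSub(with_attributes=True) so each element is a PubsubMessage, and assuming a Beam SDK version recent enough to expose publish_time on the message; the output field names are illustrative:

    # Sketch: flattening one Pub/Sub message into a single dict (one BigQuery row).
    # Assumes elements are PubsubMessage objects (with_attributes=True); the
    # availability of publish_time depends on the Beam SDK version.
    import apache_beam as beam

    class FlattenMessage(beam.DoFn):
        def process(self, msg):
            row = {"data": msg.data.decode("utf-8")}
            row.update(msg.attributes or {})          # one key per attribute
            publish_time = getattr(msg, "publish_time", None)
            if publish_time is not None:
                row["publish_time"] = str(publish_time)
            yield row
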
1 vote • 1 answer

How to process my Pub/Sub message object and write all objects into BigQuery in Apache Beam using Python?

I am trying to write all the elements of a Pub/Sub message (data, attributes, messageId and publish_time) to BigQuery using Apache Beam and want the data to look like: data attr key publishTime data attr key publishTime I am currently using…
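
Once each message has been turned into a flat dict whose keys match the table schema, the rows can be streamed with WriteToBigQuery. A sketch with placeholder table and schema:

    # Sketch: streaming flattened dicts into BigQuery. Table and schema are
    # placeholders and must match the dict keys produced upstream.
    import apache_beam as beam

    def write_rows(rows):
        return rows | beam.io.WriteToBigQuery(
            table="my-project:my_dataset.pubsub_messages",   # placeholder
            schema="data:STRING,attr:STRING,key:STRING,publishTime:TIMESTAMP",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
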
1 vote • 0 answers

How to save files with the previous name as a prefix after ImageDataGenerator

Using Keras's ImageDataGenerator. Suppose my folder structure is like: a - 1.jpg 2.jpg 3.jpg, b - 5.jpg 6.jpg 7.jpg. I am doing the augmentation like below: for i in range(20): for label in LABELS: # "LABELS" is the folder name…
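
One way to keep the original file name is to augment each image individually and pass its name as save_prefix. A sketch where the directories, LABELS and augmentation parameters are placeholders:

    # Sketch: augmenting images one at a time so the original file name can be
    # used as save_prefix. Paths, LABELS and augmentation settings are placeholders.
    import os
    import numpy as np
    from tensorflow.keras.preprocessing.image import (ImageDataGenerator,
                                                      img_to_array, load_img)

    LABELS = ["a", "b"]
    datagen = ImageDataGenerator(rotation_range=20, horizontal_flip=True)

    for label in LABELS:
        src_dir = os.path.join("dataset", label)
        out_dir = os.path.join("augmented", label)
        os.makedirs(out_dir, exist_ok=True)
        for fname in os.listdir(src_dir):
            x = np.expand_dims(img_to_array(load_img(os.path.join(src_dir, fname))), 0)
            flow = datagen.flow(x, batch_size=1,
                                save_to_dir=out_dir,
                                save_prefix=os.path.splitext(fname)[0],  # original name
                                save_format="jpg")
            for _ in range(20):       # 20 augmented copies per source image
                next(flow)
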
1 vote • 0 answers

Unable to start Google Cloud Profiler due to error: Unable to find the job id or job name from env var

We followed the Cloud Profiler documentation to enable the Cloud Profiler for our Dataflow jobs and the Profiler is failing to start. The issue is, Cloud Profiler needs JOB_NAME and JOB_ID environment vars to start but the worker VM has only the…
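
For context, the profiler is normally switched on through a Dataflow service option. A sketch of the pipeline options (placeholder project and region), which does not by itself address the missing JOB_NAME/JOB_ID variables described above:

    # Sketch: enabling Cloud Profiler for a Dataflow pipeline via the
    # documented service option. Project and region are placeholders.
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",      # placeholder
        region="us-central1",      # placeholder
        dataflow_service_options=["enable_google_cloud_profiler"])
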
1 vote • 1 answer

How to get response from HTTP Sink Plugin in Cloud Data Fusion?

Expertise: I am new to Cloud Data Fusion. What I am trying to achieve: create a data pipeline in Google Cloud Data Fusion that reads a file from GCS, calls an HTTP endpoint with the parsed data from GCS, and saves the response received from HTTP in the GCS…
1 vote • 1 answer

Write To BigQuery retry on transform or stage?

Let's say that we have a Dataflow streaming pipeline which reads data from Pub/Sub, transforms it in a ParDo and writes to BQ. Also let's assume that the Dataflow optimizer squashes all of those steps into a single stage. My question is: if the final step -…
Pav3k
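
Not an answer to the fusion question itself, but the retry behaviour of streaming inserts can at least be bounded on the transform. A sketch with placeholder table and schema:

    # Sketch: limiting streaming-insert retries on WriteToBigQuery so a bad
    # row does not retry forever. Table and schema are placeholders.
    import apache_beam as beam
    from apache_beam.io.gcp.bigquery_tools import RetryStrategy

    def write_events(rows):
        return rows | beam.io.WriteToBigQuery(
            table="my-project:my_dataset.events",        # placeholder
            schema="id:STRING,payload:STRING",           # placeholder
            insert_retry_strategy=RetryStrategy.RETRY_ON_TRANSIENT_ERROR)
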
1 vote • 2 answers

Triggering an alert when multiple dataflow jobs run in parallel in GCP

I am using Google Cloud Dataflow to execute some resource-intensive Dataflow jobs, and at any given time my system must execute no more than 2 jobs in parallel. Since each job is quite resource intensive, I am looking for a way to trigger an alert…
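
One possible approach is a small scheduled check that counts active jobs through the Dataflow REST API and raises the alert itself. A sketch where the project, region and threshold are placeholders and the notification action is left as a stub:

    # Sketch: counting active Dataflow jobs via the Dataflow REST API
    # (google-api-python-client). Project, region and threshold are placeholders.
    from googleapiclient.discovery import build

    dataflow = build("dataflow", "v1b3")
    response = dataflow.projects().locations().jobs().list(
        projectId="my-project", location="us-central1", filter="ACTIVE").execute()

    active_jobs = response.get("jobs", [])
    if len(active_jobs) > 2:
        # Replace with a real notification (Pub/Sub, email, Cloud Monitoring, ...).
        print(f"ALERT: {len(active_jobs)} Dataflow jobs running in parallel")
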
1 vote • 1 answer

Data enrichment for multiple pubsub topics

I have 2 topics: player_info_topic, example message: {"id": 1, "name": "Sandy"}, and seating_arrangement_topic, example message: {"id": 1, "seat": 2}. Is there a way to match these messages in GCP (Cloud Dataflow maybe) and then publish to another Pub/Sub…
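
In Beam this is typically a keyed join: key both streams by id, window them, and apply CoGroupByKey. A sketch with placeholder topic names and a 60-second window:

    # Sketch: joining two Pub/Sub streams on "id" with fixed windows and
    # CoGroupByKey. Topic names and window size are placeholders.
    import json
    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows

    def key_by_id(raw):
        msg = json.loads(raw.decode("utf-8"))
        return msg["id"], msg

    def join_streams(p):
        players = (p
                   | "ReadPlayers" >> beam.io.ReadFromPubSub(
                       topic="projects/my-project/topics/player_info_topic")
                   | "KeyPlayers" >> beam.Map(key_by_id)
                   | "WindowPlayers" >> beam.WindowInto(FixedWindows(60)))
        seats = (p
                 | "ReadSeats" >> beam.io.ReadFromPubSub(
                     topic="projects/my-project/topics/seating_arrangement_topic")
                 | "KeySeats" >> beam.Map(key_by_id)
                 | "WindowSeats" >> beam.WindowInto(FixedWindows(60)))
        # Elements come out as (id, {"player": [...], "seat": [...]}).
        return {"player": players, "seat": seats} | beam.CoGroupByKey()
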
1 vote • 1 answer

apache_beam, read data from GCS buckets during pipeline

I have a Pub/Sub topic which gets a message as soon as a file is created in the bucket; with the streaming pipeline I am able to get the object path. The created file is AVRO. Now in my pipeline I want to read all the content of the different files,…
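
A sketch of that pattern: parse each notification into a gs:// path and feed the paths into ReadAllFromAvro, which reads the referenced files as part of the streaming pipeline. The topic name is a placeholder and the bucket/name fields assume standard Cloud Storage notification payloads:

    # Sketch: turning Cloud Storage notification messages into AVRO reads.
    # Assumes standard GCS notification payloads with "bucket" and "name";
    # the topic name is a placeholder.
    import json
    import apache_beam as beam
    from apache_beam.io.avroio import ReadAllFromAvro

    def to_gcs_path(raw):
        event = json.loads(raw.decode("utf-8"))
        return f"gs://{event['bucket']}/{event['name']}"

    def read_new_files(p):
        return (p
                | beam.io.ReadFromPubSub(
                    topic="projects/my-project/topics/gcs-notifications")
                | beam.Map(to_gcs_path)
                | ReadAllFromAvro())   # yields the records of each referenced file
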