Questions tagged [google-cloud-dataflow]

Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes. Cloud Dataflow is part of the Google Cloud Platform.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.

5328 questions
1 vote • 1 answer

Dataflow flex template job attempts to launch second job (for pipeline) with same job_name

I am trying to launch a Dataflow flex template. As part of the build and deploy process, I am pre-building a custom SDK container image to reduce worker start-up time. I have attempted this in these ways: When no sdk_container_image is specified…
ddjanke
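
For reference, a minimal sketch (not from the question) of how a prebuilt SDK container image and an explicit, unique job name can be passed to a Beam Python pipeline through standard pipeline options; the project, region, bucket and image URI are placeholders:

    # Sketch only: passing a prebuilt SDK container image and a unique job
    # name through standard Beam pipeline options. All resource names are
    # placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                  # placeholder
        region="us-central1",                  # placeholder
        temp_location="gs://my-bucket/tmp",    # placeholder
        job_name="my-flex-template-job",       # must be unique among running jobs
        sdk_container_image="gcr.io/my-project/beam-sdk:latest",  # prebuilt image
    )

    with beam.Pipeline(options=options) as p:
        p | beam.Create(["placeholder"]) | beam.Map(print)
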
1 vote • 1 answer

What is the difference between using org.apache.hadoop.hbase.client vs com.google.cloud.bigtable.data.v2 on Dataflow (GCP)?

Is there a difference in performance, stability, or maybe long-term support? I mean, is it necessary to migrate from the HBase API to the Apache Beam Bigtable connector?
1 vote • 1 answer

Apache Beam Dataflow job issue - Runnable workflow has no steps specified

I am trying to create a Dataflow job with a custom template and getting the error "Runnable workflow has no steps specified." The log has no info except this. Am I missing any steps? I have created a virtual env and executed the code below. I'm…
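
That error typically indicates the submitted job graph contains no transforms. As a minimal sketch (assuming the Python SDK), a pipeline needs at least one step applied before it is run:

    # Minimal sketch: the pipeline must have at least one transform applied
    # before run() is called, otherwise Dataflow sees an empty workflow.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions()  # add runner/project/template options as needed

    with beam.Pipeline(options=options) as p:
        (p
         | "Create" >> beam.Create(["hello", "world"])
         | "Upper" >> beam.Map(str.upper)
         | "Print" >> beam.Map(print))
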
1 vote • 1 answer

Using FinishBundle in Apache Beam Go SDK

I am attempting to use FinishBundle() to batch requests in beam on dataflow. These requests are fetching information and emitting it for further processing downstream in the pipeline, a la: func BatchRpcFn { client RpcClient bufferRequest…
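
The Go snippet above is truncated; as a rough illustration of the same bundling pattern, here is a sketch using the Python SDK's start_bundle/finish_bundle hooks, where rpc_client and its fetch_many method are hypothetical stand-ins for the RPC client in the question:

    # Sketch of bundle-level batching (Python SDK equivalent of FinishBundle).
    # rpc_client.fetch_many() is a hypothetical stand-in for the real RPC call.
    import apache_beam as beam
    from apache_beam.transforms.window import GlobalWindow
    from apache_beam.utils.timestamp import MIN_TIMESTAMP
    from apache_beam.utils.windowed_value import WindowedValue

    class BatchRpcFn(beam.DoFn):
        MAX_BUFFER = 50

        def __init__(self, rpc_client):
            self._client = rpc_client
            self._buffer = []

        def start_bundle(self):
            self._buffer = []

        def process(self, element):
            self._buffer.append(element)
            if len(self._buffer) >= self.MAX_BUFFER:
                yield from self._flush()

        def finish_bundle(self):
            # Flush whatever is still buffered when the bundle ends; results
            # emitted here must be wrapped as WindowedValues.
            for result in self._flush():
                yield WindowedValue(result, MIN_TIMESTAMP, [GlobalWindow()])

        def _flush(self):
            results = self._client.fetch_many(self._buffer)  # hypothetical RPC
            self._buffer = []
            return results
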
1 vote • 2 answers

How to migrate files from on-prem to GCS?

I want to build an ETL pipeline that: reads files from the on-prem filesystem; writes the files into a Cloud Storage bucket. Is it possible to import the files (regularly, every day) directly with the Storage Transfer Service? Let's suppose I want to…
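
Besides Storage Transfer Service, one common alternative is a small scheduled script that uploads the files with the google-cloud-storage client. A sketch, with placeholder bucket and directory names:

    # Sketch: uploading local files to Cloud Storage with google-cloud-storage.
    # Bucket name and source directory are placeholders.
    from pathlib import Path
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-landing-bucket")       # placeholder

    for path in Path("/data/exports").glob("*.csv"):  # placeholder directory
        blob = bucket.blob(f"incoming/{path.name}")
        blob.upload_from_filename(str(path))
        print(f"uploaded {path} -> gs://{bucket.name}/{blob.name}")
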
1 vote • 1 answer

Can I collect data in an Apache Beam pipeline every 5 minutes and perform analysis on that data collectively after an hour?

I currently have an Apache Beam pipeline that writes Pub/Sub messages to BigQuery and GCS in real time, and my next goal is to pull the messages from Pub/Sub at an interval of every 5 minutes and collectively perform analysis on those multiple…
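
One way to express this in Beam is an hourly window with early firings roughly every 5 minutes. A sketch with placeholder topic and aggregation, assuming the analysis can be expressed as a combine:

    # Sketch: 1-hour fixed windows that also fire early about every 5 minutes.
    # Topic name and the aggregation are placeholders.
    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows
    from apache_beam.transforms.trigger import (AccumulationMode,
                                                AfterProcessingTime,
                                                AfterWatermark)

    def hourly_counts(p):
        return (p
                | beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
                | beam.WindowInto(
                    FixedWindows(60 * 60),
                    trigger=AfterWatermark(early=AfterProcessingTime(5 * 60)),
                    accumulation_mode=AccumulationMode.ACCUMULATING)
                | beam.CombineGlobally(
                    beam.combiners.CountCombineFn()).without_defaults())
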
1 vote • 1 answer

How to append different Pub/Sub objects and flatten them to write them all together into BigQuery as a single JSON?

I wanted to write three attributes (data, attributes and publish time) of a Pub/Sub message to BigQuery and wanted them printed in a flattened way so that all elements are written in a single row, for…
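
A sketch of the flattening step, assuming the pipeline reads with ReadFromPubSub(with_attributes=True) so each element is a PubsubMessage, and assuming a Beam SDK version recent enough to expose publish_time on the message; the output field names are illustrative:

    # Sketch: flattening one Pub/Sub message into a single dict (one BigQuery row).
    # Assumes elements are PubsubMessage objects (with_attributes=True); the
    # availability of publish_time depends on the Beam SDK version.
    import apache_beam as beam

    class FlattenMessage(beam.DoFn):
        def process(self, msg):
            row = {"data": msg.data.decode("utf-8")}
            row.update(msg.attributes or {})          # one key per attribute
            publish_time = getattr(msg, "publish_time", None)
            if publish_time is not None:
                row["publish_time"] = str(publish_time)
            yield row
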
1 vote • 1 answer

How to process my Pub/Sub message object and write all objects into BigQuery in Apache Beam using Python?

I am trying to write all the elements of a Pub/Sub message (data, attributes, messageId and publish_time) to BigQuery using Apache Beam and want the data to look like: data attr key publishTime data attr key publishTime I am currently using…
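
Once each message has been turned into a flat dict whose keys match the table schema, the rows can be streamed with WriteToBigQuery. A sketch with placeholder table and schema:

    # Sketch: streaming flattened dicts into BigQuery. Table and schema are
    # placeholders and must match the dict keys produced upstream.
    import apache_beam as beam

    def write_rows(rows):
        return rows | beam.io.WriteToBigQuery(
            table="my-project:my_dataset.pubsub_messages",   # placeholder
            schema="data:STRING,attr:STRING,key:STRING,publishTime:TIMESTAMP",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
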
1 vote • 0 answers

How to save files with the previous name as a prefix after ImageDataGenerator

Using Keras's ImageDataGenerator. Suppose my folder structure is like: a - 1.jpg 2.jpg 3.jpg, b - 5.jpg 6.jpg 7.jpg. I am doing the augmentation like below: for i in range(20): for label in LABELS: # "LABELS" is the folder name…
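
One way to keep the original file name is to augment each image individually and pass its name as save_prefix. A sketch where the directories, LABELS and augmentation parameters are placeholders:

    # Sketch: augmenting images one at a time so the original file name can be
    # used as save_prefix. Paths, LABELS and augmentation settings are placeholders.
    import os
    import numpy as np
    from tensorflow.keras.preprocessing.image import (ImageDataGenerator,
                                                      img_to_array, load_img)

    LABELS = ["a", "b"]
    datagen = ImageDataGenerator(rotation_range=20, horizontal_flip=True)

    for label in LABELS:
        src_dir = os.path.join("dataset", label)
        out_dir = os.path.join("augmented", label)
        os.makedirs(out_dir, exist_ok=True)
        for fname in os.listdir(src_dir):
            x = np.expand_dims(img_to_array(load_img(os.path.join(src_dir, fname))), 0)
            flow = datagen.flow(x, batch_size=1,
                                save_to_dir=out_dir,
                                save_prefix=os.path.splitext(fname)[0],  # original name
                                save_format="jpg")
            for _ in range(20):       # 20 augmented copies per source image
                next(flow)
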
1 vote • 0 answers

Unable to start Google Cloud Profiler due to error: Unable to find the job id or job name from env var

We followed the Cloud Profiler documentation to enable the Cloud Profiler for our Dataflow jobs and the Profiler is failing to start. The issue is, Cloud Profiler needs JOB_NAME and JOB_ID environment vars to start but the worker VM has only the…
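
For context, the profiler is normally switched on through a Dataflow service option. A sketch of the pipeline options (placeholder project and region), which does not by itself address the missing JOB_NAME/JOB_ID variables described above:

    # Sketch: enabling Cloud Profiler for a Dataflow pipeline via the
    # documented service option. Project and region are placeholders.
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",      # placeholder
        region="us-central1",      # placeholder
        dataflow_service_options=["enable_google_cloud_profiler"])
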
1 vote • 1 answer

How to get response from HTTP Sink Plugin in Cloud Data Fusion?

Expertise: I am new to Cloud Data Fusion. What I am trying to achieve: create a data pipeline in Google Cloud Data Fusion that reads a file from GCS, calls an HTTP endpoint with the parsed data from GCS, and saves the response received from HTTP in the GCS…
1 vote • 1 answer

Write To BigQuery retry on transform or stage?

Let's say that we have a Dataflow streaming pipeline which reads data from Pub/Sub, transforms it in a ParDo and writes to BQ. Also let's assume that the Dataflow optimizer squashes all of those steps into a single stage. My question is: if the final step -…
Pav3k
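
Not an answer to the fusion question itself, but the retry behaviour of streaming inserts can at least be bounded on the transform. A sketch with placeholder table and schema:

    # Sketch: limiting streaming-insert retries on WriteToBigQuery so a bad
    # row does not retry forever. Table and schema are placeholders.
    import apache_beam as beam
    from apache_beam.io.gcp.bigquery_tools import RetryStrategy

    def write_events(rows):
        return rows | beam.io.WriteToBigQuery(
            table="my-project:my_dataset.events",        # placeholder
            schema="id:STRING,payload:STRING",           # placeholder
            insert_retry_strategy=RetryStrategy.RETRY_ON_TRANSIENT_ERROR)
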
1 vote • 2 answers

Triggering an alert when multiple dataflow jobs run in parallel in GCP

I am using Google Cloud Dataflow to execute some resource-intensive Dataflow jobs, and at any given time my system must execute no more than 2 jobs in parallel. Since each job is quite resource intensive, I am looking for a way to trigger an alert…
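
One possible approach is a small scheduled check that counts active jobs through the Dataflow REST API and raises the alert itself. A sketch where the project, region and threshold are placeholders and the notification action is left as a stub:

    # Sketch: counting active Dataflow jobs via the Dataflow REST API
    # (google-api-python-client). Project, region and threshold are placeholders.
    from googleapiclient.discovery import build

    dataflow = build("dataflow", "v1b3")
    response = dataflow.projects().locations().jobs().list(
        projectId="my-project", location="us-central1", filter="ACTIVE").execute()

    active_jobs = response.get("jobs", [])
    if len(active_jobs) > 2:
        # Replace with a real notification (Pub/Sub, email, Cloud Monitoring, ...).
        print(f"ALERT: {len(active_jobs)} Dataflow jobs running in parallel")
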
1 vote • 1 answer

Data enrichment for multiple pubsub topics

I have 2 topics: player_info_topic, example message: {"id": 1, "name": "Sandy"}, and seating_arrangement_topic, example message: {"id": 1, "seat": 2}. Is there a way to match these messages in GCP (Cloud Dataflow maybe) and then publish to another Pub/Sub…
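
In Beam this is typically a keyed join: key both streams by id, window them, and apply CoGroupByKey. A sketch with placeholder topic names and a 60-second window:

    # Sketch: joining two Pub/Sub streams on "id" with fixed windows and
    # CoGroupByKey. Topic names and window size are placeholders.
    import json
    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows

    def key_by_id(raw):
        msg = json.loads(raw.decode("utf-8"))
        return msg["id"], msg

    def join_streams(p):
        players = (p
                   | "ReadPlayers" >> beam.io.ReadFromPubSub(
                       topic="projects/my-project/topics/player_info_topic")
                   | "KeyPlayers" >> beam.Map(key_by_id)
                   | "WindowPlayers" >> beam.WindowInto(FixedWindows(60)))
        seats = (p
                 | "ReadSeats" >> beam.io.ReadFromPubSub(
                     topic="projects/my-project/topics/seating_arrangement_topic")
                 | "KeySeats" >> beam.Map(key_by_id)
                 | "WindowSeats" >> beam.WindowInto(FixedWindows(60)))
        # Elements come out as (id, {"player": [...], "seat": [...]}).
        return {"player": players, "seat": seats} | beam.CoGroupByKey()
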
1 vote • 1 answer

apache_beam, read data from GCS buckets during pipeline

I have a Pub/Sub topic which gets a message as soon as a file is created in the bucket; with the streaming pipeline I am able to get the object path. The created file is AVRO. Now in my pipeline I want to read all the content of the different files,…
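
A sketch of that pattern: parse each notification into a gs:// path and feed the paths into ReadAllFromAvro, which reads the referenced files as part of the streaming pipeline. The topic name is a placeholder and the bucket/name fields assume standard Cloud Storage notification payloads:

    # Sketch: turning Cloud Storage notification messages into AVRO reads.
    # Assumes standard GCS notification payloads with "bucket" and "name";
    # the topic name is a placeholder.
    import json
    import apache_beam as beam
    from apache_beam.io.avroio import ReadAllFromAvro

    def to_gcs_path(raw):
        event = json.loads(raw.decode("utf-8"))
        return f"gs://{event['bucket']}/{event['name']}"

    def read_new_files(p):
        return (p
                | beam.io.ReadFromPubSub(
                    topic="projects/my-project/topics/gcs-notifications")
                | beam.Map(to_gcs_path)
                | ReadAllFromAvro())   # yields the records of each referenced file
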