Questions tagged [google-dataflow]
49 questions
1 vote, 1 answer
Delete a file from Google Storage from a Dataflow job
I have a Dataflow pipeline built with Apache Beam in Python 3.7 where I process a file and then have to delete it. The file comes from a Google Storage bucket, and the problem is that when I use the DataflowRunner my job doesn't work because…

Felipe Sierra
- 143
- 2
- 12
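A likely cause is attempting the delete at pipeline-construction time, which runs on the submitting machine rather than on the Dataflow workers. A minimal sketch of deleting from inside a DoFn instead, assuming the google-cloud-storage client is available and using a hypothetical gs:// path:

import apache_beam as beam

class ProcessAndDelete(beam.DoFn):
    """Processes a file, then deletes it from GCS on the worker."""
    def process(self, gcs_path):
        from google.cloud import storage  # import on the worker, not the driver
        # ... process the file contents here ...
        bucket_name, blob_name = gcs_path[len("gs://"):].split("/", 1)
        storage.Client().bucket(bucket_name).blob(blob_name).delete()
        yield gcs_path

with beam.Pipeline() as p:
    (p
     | beam.Create(["gs://my-bucket/incoming/data.csv"])  # hypothetical path
     | beam.ParDo(ProcessAndDelete()))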
1 vote, 1 answer
How to deploy Google Cloud Dataflow with connection to PostgreSQL (beam-nuggets) from Google Cloud Functions
I'm trying to create an ETL pipeline in GCP which will read part of the data from PostgreSQL and put it in a form suitable for BigQuery. I was able to perform this task by deploying Dataflow from my computer, but I failed to make it dynamic, so it will read the last…

0Pat
- 341
- 3
- 9
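One way to make the deployment dynamic is to stage the pipeline as a classic Dataflow template and launch it from the Cloud Function through the Dataflow v1b3 REST API. A sketch, with the project, region, template path and parameters all hypothetical:

from googleapiclient.discovery import build

def launch_dataflow(request):
    """HTTP-triggered Cloud Function that launches a staged Dataflow template."""
    dataflow = build("dataflow", "v1b3")
    dataflow.projects().locations().templates().launch(
        projectId="my-project",
        location="europe-west1",
        gcsPath="gs://my-bucket/templates/etl-template",
        body={
            "jobName": "postgres-to-bq",
            # Runtime parameters let each invocation read a different slice.
            "parameters": {"last_updated": request.args.get("since", "")},
        },
    ).execute()
    return "Job launched"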
1 vote, 0 answers
How to auto-scale google dataflow (streaming) pipeline?
We have a streaming pipeline running in Google Dataflow. It pulls Pub/Sub messages and saves them into BigQuery. For some reason, in the last few days we have had a backlog; the system lag shows 9-15 hours. I followed the document here, and added the following…

Krishna Sunuwar
- 2,915
- 16
- 24
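For reference, streaming autoscaling on Dataflow is controlled by pipeline options rather than by anything in the pipeline graph. A minimal sketch of the relevant Beam Python options, with illustrative values:

from apache_beam.options.pipeline_options import PipelineOptions

# Values are illustrative; autoscaling only raises worker count up to the cap.
options = PipelineOptions(
    streaming=True,
    autoscaling_algorithm="THROUGHPUT_BASED",
    max_num_workers=20,            # upper bound autoscaling can reach
    enable_streaming_engine=True,  # generally recommended for streaming jobs
)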
0 votes, 1 answer
Unsupported schema specified for Pubsub source in CREATE TABLE
Following a link I found on Google, I'm trying to do a sample setup that publishes a message to Pub/Sub and loads it into a BigQuery table using Dataflow SQL.
But when I create the Dataflow job I get the error below:
Invalid/unsupported arguments for…

Vanaja Jayaraman
- 753
- 3
- 18
0 votes, 1 answer
Transform a large jsonl file with unknown json properties into csv using apache beam google dataflow and java
How can I convert a large jsonl file with unknown JSON properties into CSV using Apache Beam, Google Dataflow and Java?
Here is my scenario:
A large jsonl file is in Google Storage.
The JSON properties are unknown, so Apache Beam's Schema cannot be…

Ash
- 2,095
- 1
- 19
- 17
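The question asks for Java, but the usual two-pass idea (discover the union of keys, then emit rows against that column set) is compact to show with Beam's Python SDK. Bucket paths are hypothetical and no CSV quoting is done:

import json
import apache_beam as beam

class UnionKeys(beam.CombineFn):
    """Accumulates the union of all JSON property names seen."""
    def create_accumulator(self):
        return set()
    def add_input(self, acc, keys):
        return acc | set(keys)
    def merge_accumulators(self, accs):
        return set().union(*accs)
    def extract_output(self, acc):
        return sorted(acc)

with beam.Pipeline() as p:
    records = (p
               | beam.io.ReadFromText("gs://my-bucket/input.jsonl")  # hypothetical
               | beam.Map(json.loads))
    # Pass 1: discover the full column set across all records.
    columns = (records
               | beam.Map(lambda r: list(r.keys()))
               | beam.CombineGlobally(UnionKeys()))
    # Pass 2: one CSV row per record, with the column set as a side input.
    (records
     | beam.Map(lambda r, cols: ",".join(str(r.get(c, "")) for c in cols),
                cols=beam.pvalue.AsSingleton(columns))
     | beam.io.WriteToText("gs://my-bucket/output", file_name_suffix=".csv"))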
0 votes, 1 answer
How does Google Dataflow determine the watermark for various sources?
I was just reviewing the documentation to understand how Google Dataflow handles watermarks, and it only offers the very vague statement:
The data source determines the watermark
It seems you can add more flexibility through withAllowedLateness, but what…

Dennis Jaheruddin
- 21,208
- 8
- 66
- 122
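For context, withAllowedLateness does not move the watermark (the source estimates that, e.g. from Pub/Sub message timestamps); it only keeps windows open after the watermark passes so that late data can still be handled. A minimal sketch of the analogous Python-SDK option:

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.utils.timestamp import Duration

with beam.Pipeline() as p:
    events = (p
              | beam.Create([("user", 1)])  # stand-in data
              | beam.Map(lambda kv: window.TimestampedValue(kv, 0)))
    # Allowed lateness keeps each window open 10 extra minutes past the
    # watermark; it does not change how the watermark itself advances.
    windowed = events | beam.WindowInto(
        window.FixedWindows(60),
        allowed_lateness=Duration(seconds=600))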
0 votes, 0 answers
How to read a JSON file from a GCP bucket using Java
I am trying to read a JSON file and map it to a Gson object on the fly. I tried reading it with FileReader, which is not working, and also tried multiple other ways with no luck. Can someone help me read it from a GCP bucket?
public static Response[] getJsonData() {
…

Ashok
- 13
- 7
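The likely root cause is that java.io.FileReader only reads local files, while a GCS object has to go through the storage client. Keeping Python for consistency across these sketches, the equivalent approach (bucket and object names hypothetical) looks like:

import json
from google.cloud import storage

def get_json_data(bucket_name, blob_name):
    """Downloads a JSON object from GCS and parses it (the Java analogue
    would use the google-cloud-storage client rather than FileReader)."""
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    return json.loads(blob.download_as_text())

responses = get_json_data("my-bucket", "responses.json")  # hypothetical names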
0 votes, 1 answer
Using gcloud SDK to download metrics for Google Dataflow
How can I use the GCP command-line interface to download metrics such as utilization, autoscaling and backlog via gcloud?

Baiqing
- 1,223
- 2
- 9
- 21
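gcloud does expose per-job counters directly, via: gcloud dataflow metrics list JOB_ID --region=REGION. Time series such as system lag, utilization and backlog live in Cloud Monitoring instead; a sketch of querying one of them with the monitoring client, assuming a hypothetical project and treating the metric name as illustrative:

import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}})
# List the last hour of a Dataflow job metric (metric name illustrative).
series = client.list_time_series(
    request={
        "name": "projects/my-project",  # hypothetical project
        "filter": 'metric.type = "dataflow.googleapis.com/job/system_lag"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    })
for ts in series:
    print(ts.metric.type, len(ts.points))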
0 votes, 0 answers
Google Dataflow Exception in the Reshuffle Step after 3 days of processing
This is the only exception in the logs, and all Dataflow workers shut down after 3.5 days of processing. The job gets through more than half of the load. What does this error mean? Not sure if it is a memory issue that might get solved after…

Simant Luitel
- 33
- 1
- 5
0 votes, 1 answer
Why can't Datastream read UPDATE binary log events in Google Cloud Datastream?
Hi, I have a question. I am using Datastream to BigQuery following the guide here: https://cloud.google.com/datastream/docs/implementing-datastream-dataflow-analytics.
But when I start the stream, I only see data with change_type INSERT. There is no…
0 votes, 1 answer
PubSub streaming job is not working in Local runner
I'm trying the following example from the official Google website.
import java.io.IOException;
import org.apache.beam.examples.common.WriteOneFilePerWindow;
import org.apache.beam.sdk.Pipeline;
import…

Balasubramanian Naagarajan
- 338
- 3
- 17
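A common reason the local (Direct) runner stalls on this example is running the unbounded Pub/Sub source without streaming mode enabled (the --streaming flag in the Java example). The Python-SDK equivalent, with a hypothetical subscription:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # unbounded sources need this

with beam.Pipeline(options=options) as p:
    (p
     | beam.io.ReadFromPubSub(
         subscription="projects/my-project/subscriptions/my-sub")  # hypothetical
     | beam.Map(print))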
0 votes, 2 answers
Including a custom PTransform causes missing dependencies in the Dataflow job in GCP
I was trying to create a composite PTransform as follows (Python):
class LimitVolume(beam.PTransform):
    def __init__(self, daily_window, daily_limit):
        super().__init__()
        self.daily_window = daily_window
        self.daily_limit =…

Marina
- 3,894
- 9
- 34
- 41
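When a custom PTransform is defined in the launcher script or in a local package, the Dataflow workers won't have it unless it is shipped to them. A sketch of the two usual options, assuming a setup.py exists for the local-package case:

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions()
# Pickle the main session so module-level definitions (such as a custom
# PTransform declared in the launcher script) also exist on the workers.
options.view_as(SetupOptions).save_main_session = True
# For code spread across local packages, ship it as a source distribution.
options.view_as(SetupOptions).setup_file = "./setup.py"  # assumes a setup.py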
0 votes, 1 answer
Write to BigQuery table is failing in Dataflow pipeline
I am developing a Dataflow pipeline which reads a protobuf file from Google Cloud Storage, parses it, and tries to write to a BigQuery table. It works fine when the number of rows is around 20k, but when the number of rows is around 200k it fails…

ravi
- 391
- 4
- 12
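The excerpt doesn't show the actual error, but a common culprit at this scale is hitting streaming-insert size or quota limits, and switching the write method to batch file loads is one mitigation. A sketch with a hypothetical table and schema:

import apache_beam as beam

with beam.Pipeline() as p:
    rows = p | beam.Create([{"id": 1, "name": "example"}])  # stand-in rows
    # File loads avoid the per-request limits of streaming inserts, which
    # can start to bite as volume grows from ~20k to ~200k rows.
    rows | beam.io.WriteToBigQuery(
        "my-project:my_dataset.my_table",  # hypothetical table spec
        schema="id:INTEGER,name:STRING",
        method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)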
0 votes, 1 answer
"finish_bundle" method executing multiple times: Apache beam, Google Dataflow
I am trying to create JSON files in batches of 100 records each, using an Apache Beam pipeline run as a Google Dataflow job.
I am reading records from BigQuery and trying to create JSON files each containing 100 records, i.e. batch_size = 100.
So I expect 7 JSON…

Gopinath S
- 101
- 1
- 14
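finish_bundle fires once per bundle, and the runner (not the pipeline) decides bundle boundaries, so per-bundle batching yields an unpredictable number of files. BatchElements expresses the fixed batch size directly; a sketch with stand-in data and a hypothetical output path:

import json
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create([{"id": i} for i in range(650)])  # stand-in for the BQ read
     # Group into batches of exactly 100 (a final partial batch may remain).
     | beam.BatchElements(min_batch_size=100, max_batch_size=100)
     | beam.Map(json.dumps)
     | beam.io.WriteToText("gs://my-bucket/batches/out",  # hypothetical path
                           file_name_suffix=".json"))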
0 votes, 1 answer
GroupByKey always holds everything in RAM, causing OOM
I'm writing pipeline code that will be used in both batch and streaming mode with Dataflow, and I'm having OOM issues with GroupByKey when operating in batch mode. The code below shows the issue: when I have a large file, GroupByKey appears…

Francisco Delmar Kurpiel
- 376
- 3
- 14
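When the per-key logic is an aggregation, replacing GroupByKey with CombinePerKey lets the runner combine values incrementally (including before the shuffle) instead of materializing every value for a key in memory. A minimal sketch:

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create([("user1", 3), ("user2", 5), ("user1", 7)])
     # Combines incrementally per key; the full value list for a key is
     # never held in RAM, unlike GroupByKey plus an in-memory reduction.
     | beam.CombinePerKey(sum)
     | beam.Map(print))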