Questions tagged [google-dataflow]

49 questions
1
vote
1 answer

Delete a file from Google Storage from within a Dataflow job

I have a Dataflow pipeline built with Apache Beam in Python 3.7 where I process a file and then have to delete it. The file comes from a Google Storage bucket, and the problem is that when I use the DataflowRunner my job doesn't work because…
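One common approach is to delete the file from inside a DoFn after processing, using Beam's FileSystems API, which resolves gs:// paths. A minimal sketch (the DoFn name and plumbing are illustrative, not from the question):

    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems

    class DeleteFileFn(beam.DoFn):  # hypothetical name
        def process(self, file_path):
            # FileSystems.delete takes a list of paths and handles gs://
            # URLs when the GCS extras (apache-beam[gcp]) are installed.
            FileSystems.delete([file_path])
            yield file_path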
1
vote
1 answer

How to deploy Google Cloud Dataflow with a connection to PostgreSQL (beam-nuggets) from Google Cloud Functions

I'm trying to create an ETL pipeline in GCP which will read part of the data from PostgreSQL and put it into a suitable form for BigQuery. I was able to perform this task by deploying Dataflow from my computer, but I failed to make it dynamic, so it will read the last…
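One way to make this dynamic is to stage the pipeline as a Dataflow template and launch it from the Cloud Function via the Dataflow REST API. A sketch using google-api-python-client, where the project, region, and template path are placeholder assumptions:

    from googleapiclient.discovery import build

    def launch_etl(event, context):  # hypothetical Cloud Function entry point
        service = build('dataflow', 'v1b3')
        service.projects().locations().templates().launch(
            projectId='my-project',                           # assumption
            location='europe-west1',                          # assumption
            gcsPath='gs://my-bucket/templates/etl-template',  # assumption
            body={'jobName': 'postgres-to-bq', 'parameters': {}},
        ).execute()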
1
vote
0 answers

How to auto-scale a Google Dataflow (streaming) pipeline?

We have a streaming pipeline running in Google Dataflow. It pulls Pub/Sub messages and saves them into BigQuery. For some reason, in the last few days we have a backlog. System lag shows 9-15 hours. I followed the document here, and added the following…
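For reference, streaming autoscaling is driven by pipeline options; in the Python SDK the relevant flags look roughly like this (the worker cap is an assumption, and streaming autoscaling also requires Streaming Engine):

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        streaming=True,
        enable_streaming_engine=True,             # required for streaming autoscaling
        autoscaling_algorithm='THROUGHPUT_BASED',
        max_num_workers=20,                       # assumption: adjust to your quota
    )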
0
votes
1 answer

Unsupported schema specified for Pubsub source in CREATE TABLE

Following a link I found on Google, I'm trying to do a sample setup that publishes a message to Pub/Sub and loads it into a BigQuery table using Dataflow SQL. But when I create the Dataflow job I get the error below: Invalid/unsupported arguments for…
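Dataflow SQL reads the topic's schema from Data Catalog, so the usual fix is to attach a schema (which must include an event_timestamp TIMESTAMP column) to the topic before creating the job. A hedged sketch; the exact lookup-entry quoting is from memory, so verify against the Dataflow SQL docs:

    gcloud data-catalog entries update \
      --lookup-entry='pubsub.topic.`my-project`.`my-topic`' \
      --schema-from-file=schema.yaml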
0
votes
1 answer

Transform a large JSONL file with unknown JSON properties into CSV using Apache Beam, Google Dataflow and Java

How do I convert a large JSONL file with unknown JSON properties into CSV using Apache Beam, Google Dataflow and Java? Here is my scenario: a large JSONL file is in Google Storage, and its JSON properties are unknown, so Apache Beam's Schema cannot be…
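Since the key set is unknown up front, one approach is two passes over the data: compute the union of all keys as a side input, then format each record against that header. The question asks for Java; a compact Python sketch of the same idea (paths are placeholders):

    import csv, io, json
    import apache_beam as beam

    with beam.Pipeline() as p:
        records = (p
            | beam.io.ReadFromText('gs://my-bucket/input.jsonl')  # assumption
            | beam.Map(json.loads))
        # Pass 1: the union of all JSON keys becomes the CSV header.
        header = (records
            | beam.FlatMap(lambda rec: rec.keys())
            | beam.Distinct()
            | beam.combiners.ToList())

        def to_csv_row(rec, header):
            fields = sorted(header)
            buf = io.StringIO()
            csv.DictWriter(buf, fieldnames=fields).writerow(
                {k: rec.get(k, '') for k in fields})
            return buf.getvalue().rstrip('\r\n')

        # Pass 2: format every record against the full header.
        (records
            | beam.Map(to_csv_row, header=beam.pvalue.AsSingleton(header))
            | beam.io.WriteToText('gs://my-bucket/out', file_name_suffix='.csv'))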
0
votes
1 answer

How does Google Dataflow determine the watermark for various sources?

I was just reviewing the documentation to understand how Google Dataflow handles watermarks, and it only offers the very vague statement "The data source determines the watermark". It seems you can add more flexibility through withAllowedLateness, but what…
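For context, allowed lateness does not move the watermark; it controls how long a window keeps accepting elements the watermark has already passed. In the Python SDK the equivalent of withAllowedLateness looks roughly like this (window size and lateness are illustrative):

    import apache_beam as beam
    from apache_beam import window
    from apache_beam.transforms.trigger import AfterWatermark, AccumulationMode
    from apache_beam.utils.timestamp import Duration

    windowed = events | beam.WindowInto(          # 'events' is a placeholder PCollection
        window.FixedWindows(60),
        trigger=AfterWatermark(),
        accumulation_mode=AccumulationMode.DISCARDING,
        allowed_lateness=Duration(seconds=3600),  # assumption: tolerate 1 hour of lateness
    )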
0
votes
0 answers

How to read a JSON file from a GCP bucket using Java

I am trying to read a JSON file and map it to a Gson object on the fly. I tried reading it with a FileReader, which is not working, and tried multiple other ways with no luck. Can someone help me read from a GCP bucket? public static Response[] getJsonData() { …
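The question is in Java, but the underlying pattern is the same in any client: download the blob's contents as text, then hand the string to the JSON parser instead of using a FileReader on a local path. A minimal Python sketch of that pattern (bucket and object names are placeholders):

    import json
    from google.cloud import storage  # assumption: google-cloud-storage is installed

    def get_json_data(bucket_name, blob_name):  # hypothetical helper
        client = storage.Client()
        text = client.bucket(bucket_name).blob(blob_name).download_as_text()
        return json.loads(text)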
0
votes
1 answer

Using gcloud SDK to download metrics for Google Dataflow

How can I use the command-line interface for GCP to download metrics such as utilization, autoscaling and backlog via gcloud?
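A starting point, assuming the job ID and region are known; gcloud exposes per-job metrics directly:

    # Lists the metrics Dataflow reports for a job, including backlog
    # and autoscaling-related values (the region is an assumption):
    gcloud dataflow metrics list JOB_ID --region=us-central1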
0
votes
0 answers

Google Dataflow Exception in the Reshuffle Step after 3 days of processing

This is the only exception in the logs, and all Dataflow workers shut down after 3.5 days of processing. The job gets through more than half of the load. What does this error mean? I'm not sure if it is a memory issue that might get solved after…
0
votes
1 answer

Datastream cannot read UPDATE binary log events in Google Cloud Datastream

Hi, I have a question. I am using Datastream to BigQuery following the guide below: https://cloud.google.com/datastream/docs/implementing-datastream-dataflow-analytics. But when I start the stream, I only see data whose change_type is INSERT. There is no…
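One thing worth ruling out, assuming a MySQL source: Datastream can only capture UPDATE and DELETE events when the source uses row-based binary logging, so the binlog format is the first thing to check:

    -- Datastream needs row-based binary logging to see UPDATE/DELETE events.
    SHOW VARIABLES LIKE 'binlog_format';  -- expected: ROW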
0
votes
1 answer

PubSub streaming job is not working in the local runner

I'm trying the following example from the official Google website. import java.io.IOException; import org.apache.beam.examples.common.WriteOneFilePerWindow; import org.apache.beam.sdk.Pipeline; import…
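A common cause with an unbounded Pub/Sub source is running without streaming mode enabled; it must be switched on for the local runner too. In the Python SDK that looks like the sketch below (the Java equivalent is the --streaming flag / StreamingOptions):

    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    options = PipelineOptions()
    # An unbounded source such as Pub/Sub needs streaming mode,
    # including on a local/direct runner.
    options.view_as(StandardOptions).streaming = True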
0
votes
2 answers

Including a custom PTransform causes dependencies to not be found in the Dataflow job in GCP

I was trying to create a composite PTransform as follows (Python): class LimitVolume(beam.PTransform): def __init__(self, daily_window, daily_limit): super().__init__() self.daily_window = daily_window self.daily_limit =…
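Missing-dependency errors on the workers are usually a packaging problem rather than a PTransform problem; shipping the main session (or a setup.py) to the workers is the standard fix. A sketch:

    from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

    options = PipelineOptions()
    # Pickle names defined in __main__ so they exist on the workers...
    options.view_as(SetupOptions).save_main_session = True
    # ...or ship local modules and their dependencies as a package instead:
    # options.view_as(SetupOptions).setup_file = './setup.py'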
0
votes
1 answer

Write to BigQuery table is failing in Dataflow pipeline

I am developing a Dataflow pipeline which reads a protobuf file from Google Cloud Storage, parses it, and tries to write to a BigQuery table. It works fine when the number of rows is around 20k, but fails when the number of rows is around 200k.…
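One mitigation to try when streaming inserts fall over at higher volume is to route the write through BigQuery load jobs instead. A sketch in the Python SDK (the table name is a placeholder; if the pipeline is Java, the equivalent is Method.FILE_LOADS on BigQueryIO):

    import apache_beam as beam

    rows | beam.io.WriteToBigQuery(  # 'rows' is a placeholder PCollection of dicts
        'my-project:my_dataset.my_table',                  # assumption
        method=beam.io.WriteToBigQuery.Method.FILE_LOADS,  # load jobs, not streaming inserts
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    )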
0
votes
1 answer

"finish_bundle" method executing multiple times: Apache beam, Google Dataflow

I am trying to create JSON files in batches of 100 records each using an Apache Beam pipeline run as a Google Dataflow job. I am reading records from BigQuery and trying to create JSON files, each having 100 records, i.e. batch_size = 100. So I expect 7 JSON…
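This behavior is expected: finish_bundle runs once per bundle, and the runner decides where bundle boundaries fall, so it fires more than once per job. For fixed-size batches, a batching transform is the reliable route; a sketch:

    import apache_beam as beam

    batches = (records                        # 'records' is a placeholder PCollection
        | beam.WithKeys(lambda _: 0)          # GroupIntoBatches requires keyed input
        | beam.GroupIntoBatches(100)          # batches of 100, plus a final remainder
        | beam.MapTuple(lambda _key, batch: batch))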
0
votes
1 answer

GroupByKey always holds everything in RAM, causing OOM

I'm writing pipeline code that will be used in both batch and streaming mode with Dataflow, and I'm having OOM issues when using GroupByKey in batch mode. The code below shows the issue: when I have a large file, GroupByKey appears…
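When the per-key result can be built up incrementally, replacing GroupByKey with a combiner keeps the runner from materializing every value for a key in memory. A minimal sketch, assuming (key, number) pairs:

    import apache_beam as beam

    # CombinePerKey aggregates incrementally (with partial combines lifted
    # onto the workers) instead of collecting all values for a key first.
    totals = pairs | beam.CombinePerKey(sum)  # 'pairs' is a placeholder PCollection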