Questions tagged [google-dataflow]

49 questions
0
votes
1 answer

How to load nested JSON data into a BigQuery table with Google Cloud Dataflow's Pub/Sub Topic -> BigQuery template

I am trying to store messages sent from an IoT device in a BigQuery table. The cloud architecture is as follows: Local Device -> json_message -> mqtt_client -> GC IoT device -> Device Registry -> Pub/Sub Topic -> Dataflow with Pub/Sub Topic to…
0
votes
1 answer

Query in Firebase realtime database

What would the indexing and query be for this in Firebase Realtime Database?
0
votes
0 answers

Pub/Sub unacked messages are not showing up when using Dataflow as subscriber

I have a Dataflow pipeline that reads messages from a subscription. It works fine when messages arrive in the correct format, but it throws an error when they do not. I decided to use the dead letter topic when there is…
alrou
0
votes
1 answer

Convert the date format from DD-MM-YYYY to YYYY-MM-DD in BigQuery using the 'Text Files on Cloud Storage to BigQuery' Dataflow template on GCP

I am new to GCP and would appreciate some help with my issue. I am creating a CSV file, a JSON file, and a JavaScript file and uploading them to a GCP bucket. I am then using the 'Text Files on Cloud Storage to BigQuery' Dataflow template to populate the data into…
0
votes
1 answer

Backfill Beam pipeline with historical data

I have a Google Cloud Dataflow pipeline (written with the Apache Beam SDK) that, in its normal mode of operation, handles event data published to Cloud Pub/Sub. In order to bring the pipeline state up to date, and to create the correct outputs,…
Raman
0
votes
1 answer

How to process two batch files simultaneously with Dataflow on GCP

I want to process two files from GCP in Dataflow at the same time. I think it would be possible if the second file came in as a side input. However, in that case, I think it would be processed every time, not just once. E.g., how to read and…
Quack
0
votes
1 answer

Migrating from Google App Engine Mapreduce to Apache Beam

I have been a long-time user of Google App Engine's Mapreduce library for processing data in the Google Datastore. Google no longer supports it and it doesn't work at all in Python 3. I'm trying to migrate our older Mapreduce jobs to Google's…
speedplane
0
votes
1 answer

beam.Create() with list of dicts is extremely slow compared to a list of strings

I am using Dataflow to process a Shapefile with about 4 million features (about 2GB total) and load the geometries into BigQuery, so before my pipeline starts, I extract the shapefile features into a list, and initialize the pipeline using…
Travis Webb
0
votes
1 answer

Error 401 with Cloud Scheduler while passing Dataflow template as URL via POST request

I have created a custom template for Dataflow batch jobs. Now I need to run it every 5 minutes using Cloud Scheduler. The template is stored in Cloud Storage, but I get a 401 error whenever I pass the URI of the template in my POST request from…
0
votes
2 answers

Running a Dataflow batch job using flexRSGoal

I found this article about running a Dataflow batch job on preemptible machines. I tried to use this feature with this script: gcloud beta dataflow jobs run $JOB_NAME \ --gcs-location gs://.../Datastore_to_Datastore_Delete \ …
0
votes
1 answer

Apache Beam - BigQuery Upsert

I have a Dataflow job that splits a single file into x number of records (tables). These flow into BigQuery with no problem. What I found, though, was that there was no way to then execute another stage in the pipeline following the results. For example #…
YetiBoy
0
votes
1 answer

Error when creating Google Dataflow template file

I'm trying to schedule a Dataflow job that ends after a set amount of time using a template. I'm able to do this successfully from the command line, but when I try to do it with Google Cloud Scheduler I run into an error when I create my…
0
votes
0 answers

Time limit possible for Google's Dataflow?

I've managed to use Google Cloud Scheduler to schedule a Dataflow pipeline run, but I also want the pipeline to run for at most an hour. Is it possible to schedule an end time for Dataflow? Edit: I've created a pipeline that would wait a certain…
0
votes
0 answers

Error in SQL Launcher (java.lang.NullPointerException) in Google Dataflow SQL

I am trying to read data from a Pub/Sub topic using Google Dataflow SQL and am getting a "NullPointerException" error. Could anyone guide me on what I am doing wrong? Below is the SQL query. I also tried selecting only a few columns. Same error is…
0
votes
1 answer

Using schema update option in beam.io.WriteToBigQuery

I am loading a bunch of log files into BigQuery using Apache Beam on Dataflow. The file format can change over time as new columns are added to the files. I see the schema update option ALLOW_FIELD_ADDITION. Does anyone know how to use it? This is how my…
rens