Questions tagged [google-cloud-dataflow]

Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes. Cloud Dataflow is part of the Google Cloud Platform.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.

5328 questions
10 votes, 1 answer

Validating rows before inserting into BigQuery from Dataflow

According to "How do we set maximum_bad_records when loading a Bigquery table from dataflow?", there is currently no way to set the maxBadRecords configuration when loading data into BigQuery from Dataflow. The suggestion is to validate the rows in…
Theo (131,503)
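Since maxBadRecords isn't available from Dataflow, the usual pattern is to validate each row in a ParDo with a tagged side output and send only the rows that pass to BigQuery. A minimal Python sketch, assuming a hypothetical schema, validation rule, table name, and dead-letter path:

```python
import json

import apache_beam as beam

class ValidateRow(beam.DoFn):
    """Route rows that fail validation to a side output instead of BigQuery."""
    INVALID = 'invalid'

    def process(self, row):
        # Hypothetical rule: 'user_id' must be present and 'age' must be an int.
        if row.get('user_id') and isinstance(row.get('age'), int):
            yield row
        else:
            yield beam.pvalue.TaggedOutput(self.INVALID, row)

with beam.Pipeline() as p:
    rows = p | 'Read' >> beam.Create([
        {'user_id': 'u1', 'age': 33},
        {'user_id': None, 'age': 'not-a-number'},   # will be rejected
    ])
    results = rows | 'Validate' >> beam.ParDo(ValidateRow()).with_outputs(
        ValidateRow.INVALID, main='valid')

    results.valid | 'WriteToBQ' >> beam.io.WriteToBigQuery(
        'my_project:my_dataset.my_table',             # hypothetical table
        schema='user_id:STRING,age:INTEGER')
    (results[ValidateRow.INVALID]
     | 'ToJson' >> beam.Map(json.dumps)
     | 'WriteRejects' >> beam.io.WriteToText('gs://my-bucket/rejected/rows'))  # hypothetical path
```

The rejected rows can then be inspected or re-loaded separately instead of failing the whole job.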
10 votes, 3 answers

How to calculate the cost of a Google Dataflow job?

My company is evaluating whether we can use Google Dataflow. I have run a Dataflow job on Google Cloud Platform. The console shows 5 hr 25 min in the "Reserved CPU Time" field on the right. Worker configuration: n1-standard-4. Starting 8 workers... How to…
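For a rough estimate, Dataflow batch jobs are billed per vCPU-hour, per GB of memory per hour, and per GB of persistent disk per hour. Treating the console's "Reserved CPU Time" as aggregate vCPU time, a back-of-envelope calculation looks like the sketch below; the hourly rates and the 250 GB default disk size are placeholders to check against the current pricing page, not authoritative numbers.

```python
# Back-of-envelope Dataflow batch cost estimate.
# The hourly rates below are placeholders -- check the current Dataflow
# pricing page for your region before relying on any of these numbers.
VCPU_HOUR = 0.056       # $ per vCPU hour (illustrative)
MEM_GB_HOUR = 0.003557  # $ per GB of memory per hour (illustrative)
PD_GB_HOUR = 0.000054   # $ per GB of standard persistent disk per hour (illustrative)

# n1-standard-4: 4 vCPUs, 15 GB memory; assumed 250 GB default batch disk.
vcpus, mem_gb, disk_gb = 4, 15, 250

# "Reserved CPU Time" of 5 hr 25 min is total vCPU time across all workers,
# so divide by the vCPU count per machine to get machine-hours.
reserved_vcpu_hours = 5 + 25 / 60
machine_hours = reserved_vcpu_hours / vcpus

cost = (reserved_vcpu_hours * VCPU_HOUR
        + machine_hours * mem_gb * MEM_GB_HOUR
        + machine_hours * disk_gb * PD_GB_HOUR)
print(f"Estimated cost: ${cost:.2f}")
```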
10 votes, 2 answers

gsutil - is it possible to list only folders?

Is it possible to list only the folders in a bucket using the gsutil tool? I can't see anything listed here. For example, I'd like to list only the folders in this bucket:
Graham Polley (14,393)
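GCS has no real folders, so listings work on prefixes: with gsutil itself, something like `gsutil ls -d gs://your-bucket/*/` is the commonly suggested way to show only the top-level "directories". The same effect with the google-cloud-storage Python client uses a delimiter-based listing; a sketch with a hypothetical bucket name:

```python
from google.cloud import storage

def list_top_level_folders(bucket_name):
    """Return the top-level 'folder' prefixes of a bucket.

    GCS has no real folders, so this relies on a delimiter-based listing:
    anything before the first '/' comes back as a prefix rather than a blob.
    """
    client = storage.Client()
    blobs = client.list_blobs(bucket_name, delimiter='/')
    # The prefixes set is only populated once the iterator has been consumed.
    for _ in blobs:
        pass
    return sorted(blobs.prefixes)

print(list_top_level_folders('my-example-bucket'))  # hypothetical bucket
```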
9 votes, 2 answers

How to trigger Cloud Dataflow pipeline job from Cloud Function in Java?

I have a requirement to trigger a Cloud Dataflow pipeline from Cloud Functions, but the Cloud Function must be written in Java. The trigger for the Cloud Function is Google Cloud Storage's finalize/create event, i.e., when a file is uploaded to a…
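The question asks for Java, but the underlying mechanism is the same in any language: the Cloud Function launches a Dataflow template through the `projects.locations.templates.launch` API. A sketch of that idea in Python, with hypothetical project, region, bucket, and template path:

```python
from googleapiclient.discovery import build

def launch_dataflow(event, context):
    """Background Cloud Function triggered by a GCS finalize event.

    Launches a Dataflow template for the uploaded file. Project, region,
    bucket and template path below are hypothetical.
    """
    file_path = f"gs://{event['bucket']}/{event['name']}"

    dataflow = build('dataflow', 'v1b3', cache_discovery=False)
    request = dataflow.projects().locations().templates().launch(
        projectId='my-project',
        location='us-central1',
        gcsPath='gs://my-bucket/templates/my_template',
        body={
            # Job names must be lowercase letters, digits and hyphens.
            'jobName': 'gcs-triggered-' + event['name'].replace('/', '-').lower(),
            'parameters': {'inputFile': file_path},
        },
    )
    response = request.execute()
    print(response)
```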
9 votes, 1 answer

Can Google Data Fusion do the same data cleaning as DataPrep?

I want to run a machine learning model on some data. Before training the model I need to process the data, so I have been reading about ways to do it. The first option is to create a Dataflow pipeline to upload it to BigQuery or Google Cloud Storage,…
9 votes, 4 answers

Is it possible to load a pretrained Pytorch model from a GCS bucket URL without first persisting locally?

I'm asking this in the context of Google Dataflow, but also generally. Using PyTorch, I can reference a local directory containing multiple files that comprise a pretrained model. I happen to be working with a Roberta model, but the interface is…
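For a single serialized checkpoint this is straightforward, because torch.load accepts any file-like object, so the bytes can stay in memory; a multi-file Hugging Face-style model directory would need each file fetched separately. A sketch with hypothetical bucket and object names:

```python
import io

import torch
from google.cloud import storage

def load_checkpoint_from_gcs(bucket_name, blob_name):
    """Load a single serialized PyTorch checkpoint straight from GCS into memory."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    data = blob.download_as_bytes()           # no local file is written
    return torch.load(io.BytesIO(data), map_location='cpu')

state_dict = load_checkpoint_from_gcs('my-models', 'roberta/pytorch_model.bin')  # hypothetical
```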
9 votes, 2 answers

Dataflow/Apache Beam - how to access the current filename when passing in a pattern?

I have seen this question answered before on Stack Overflow (https://stackoverflow.com/questions/29983621/how-to-get-filename-when-using-file-pattern-match-in-google-cloud-dataflow), but not since Apache Beam added splittable DoFn functionality…
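In the Python SDK, the fileio transforms (built on splittable DoFn) expose the matched file's metadata alongside its contents, which covers the filename use case. A sketch, with a hypothetical file pattern:

```python
import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    files = (
        p
        | 'Match' >> fileio.MatchFiles('gs://my-bucket/input/*.csv')  # hypothetical pattern
        | 'Read' >> fileio.ReadMatches()                              # yields ReadableFile objects
    )
    # Each element carries the file contents *and* the path it came from.
    lines_with_source = files | 'Tag lines' >> beam.FlatMap(
        lambda f: [(f.metadata.path, line)
                   for line in f.read_utf8().splitlines()])
    lines_with_source | beam.Map(print)
```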
9 votes, 1 answer

Dataflow, loading a file with a customer supplied encryption key

When trying to load a GCS file using a CSEK, I get a Dataflow error: [ERROR] The target object is encrypted by a customer-supplied encryption key. I was going to try to AES-decrypt on the Dataflow side, but I see I can't even get the file without…
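The built-in GCS source doesn't pass a customer-supplied key, but the google-cloud-storage client can fetch the object when given the raw AES-256 key, which is one way to get the bytes before (or inside) the pipeline. A sketch with hypothetical names; the key value shown is just a placeholder:

```python
import base64

from google.cloud import storage

def read_csek_object(bucket_name, blob_name, b64_key):
    """Download a GCS object protected by a customer-supplied encryption key.

    The client adds the x-goog-encryption-* headers automatically when the
    blob is constructed with encryption_key (the raw 32-byte AES-256 key).
    """
    key = base64.b64decode(b64_key)
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name, encryption_key=key)
    return blob.download_as_bytes()

data = read_csek_object('my-bucket', 'secret/file.csv', 'BASE64_ENCODED_KEY')  # hypothetical
```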
9 votes, 3 answers

worker_machine_type tag not working in Google Cloud Dataflow with Python

I am using Apache Beam in Python with Google Cloud Dataflow (2.3.0). When specifying the worker_machine_type parameter as e.g. n1-highmem-2 or custom-1-6656, Dataflow runs the job but always uses the standard machine type n1-standard-1 for every…
dumkar (735)
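For reference, this is how the machine type is normally set from the Python SDK. In recent SDKs --worker_machine_type and --machine_type are aliases for the same option; some older releases (around 2.3.0) reportedly only honored --machine_type, which may explain the fallback to n1-standard-1. Project, region, and bucket below are hypothetical:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',            # hypothetical
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
    machine_type='n1-highmem-2',     # or worker_machine_type on newer SDKs
)

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * x)
```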
9 votes, 2 answers

Java/Dataflow - Unable to use ClassLoader to detect classpath elements

I'm guessing this is more of a general Java/Eclipse question, but I'm not a Java guy and this isn't clicking for me. Stack trace at the…
bmark (95)
9 votes, 2 answers

Coder issues with Apache Beam and CombineFn

We are building a pipeline using Apache Beam and DirectRunner as the runner. We are currently attempting a simple pipeline whereby we: pull data from Google Cloud Pub/Sub (currently using the emulator to run locally), deserialize into a Java…
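The coder problem in that question is Java-specific (in the Java SDK the usual fix is to register a coder or override CombineFn.getAccumulatorCoder). For comparison, this is the general CombineFn shape in Python, where the accumulator is a plain tuple that the default coders already handle; a small mean example:

```python
import apache_beam as beam

class MeanFn(beam.CombineFn):
    """Streaming-friendly mean: the accumulator is a (sum, count) tuple."""

    def create_accumulator(self):
        return 0.0, 0

    def add_input(self, acc, value):
        total, count = acc
        return total + value, count + 1

    def merge_accumulators(self, accumulators):
        totals, counts = zip(*accumulators)
        return sum(totals), sum(counts)

    def extract_output(self, acc):
        total, count = acc
        return total / count if count else float('nan')

with beam.Pipeline() as p:
    (p
     | beam.Create([3.0, 4.0, 5.0])
     | beam.CombineGlobally(MeanFn())
     | beam.Map(print))
```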
9 votes, 5 answers

How do I write to multiple files in Apache Beam?

Let me simplify my case. I'm using Apache Beam 0.6.0. My final processed result is a PCollection of key-value pairs, and I want to write the values to different files corresponding to their keys. For example, let's say the result consists of (key1,…
abcdabcd987 (2,043)
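One way to do this (a sketch, not necessarily the accepted answer) is to group by key and write each group with the FileSystems API; newer SDKs also ship fileio.WriteToFiles with dynamic destinations for the same purpose. The output directory below is hypothetical:

```python
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class WriteOneFilePerKey(beam.DoFn):
    """After a GroupByKey, write each key's values to its own file."""

    def __init__(self, output_dir):
        self._output_dir = output_dir

    def process(self, element):
        key, values = element
        path = FileSystems.join(self._output_dir, '%s.txt' % key)
        with FileSystems.create(path) as handle:
            for value in values:
                handle.write(('%s\n' % value).encode('utf-8'))
        yield path

with beam.Pipeline() as p:
    (p
     | beam.Create([('key1', 'a'), ('key1', 'b'), ('key2', 'c')])
     | beam.GroupByKey()
     | beam.ParDo(WriteOneFilePerKey('gs://my-bucket/output'))   # hypothetical directory
     | beam.Map(print))
```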
9 votes, 2 answers

How to use transactional DatastoreIO

I’m using DatastoreIO from my streaming Dataflow pipeline and getting an error when writing an entity with the same key. 2016-12-10T22:51:04.385Z: Error: (af00222cfd901860): Exception: com.google.datastore.v1.client.DatastoreException: A…
9 votes, 1 answer

How to make the environment variables reach Dataflow workers as environment variables in the Python SDK

I am writing a custom sink with the Python SDK and trying to store data in AWS S3. Connecting to S3 requires credentials (a secret key), but it is not good to put them in code for security reasons. I would like to make the environment variables reach Dataflow workers…
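Environment variables set on the launching machine do not propagate to Dataflow workers. A common workaround is to pass the values as custom pipeline options (or, better, fetch them from a secret store) and export them on each worker, for example in a DoFn's start_bundle. A sketch with hypothetical option names:

```python
import os

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class S3Options(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Hypothetical custom options; prefer Secret Manager/KMS over plain flags.
        parser.add_argument('--aws_access_key_id')
        parser.add_argument('--aws_secret_access_key')

class ExportCredentials(beam.DoFn):
    """Expose the credentials as environment variables on each worker."""

    def __init__(self, access_key, secret_key):
        self._access_key = access_key
        self._secret_key = secret_key

    def start_bundle(self):
        os.environ['AWS_ACCESS_KEY_ID'] = self._access_key
        os.environ['AWS_SECRET_ACCESS_KEY'] = self._secret_key

    def process(self, element):
        yield element  # downstream code (e.g. boto3) now sees the env vars

options = PipelineOptions()
s3_opts = options.view_as(S3Options)

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(['record'])
     | beam.ParDo(ExportCredentials(s3_opts.aws_access_key_id,
                                    s3_opts.aws_secret_access_key)))
```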
9 votes, 1 answer

How to get the cartesian product of two PCollections

I'm very new to using Google Cloud Dataflow. I would like to get the Cartesian product of two PCollections. For example, if I have two PCollections (1, 2) and ("hello", "world"), their Cartesian product is ((1, "hello"), (1, "world"), (2, "hello"),…
Youness Bennani (335)
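When one of the two collections is small enough to fit in memory, the usual approach is to turn it into a side input and pair it with every element of the other collection. A sketch:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    numbers = p | 'Numbers' >> beam.Create([1, 2])
    words = p | 'Words' >> beam.Create(['hello', 'world'])

    # Materialise one side as a side input and pair it with every element of
    # the other; fine as long as the side-input collection fits in memory.
    product = numbers | 'Cross' >> beam.FlatMap(
        lambda n, ws: [(n, w) for w in ws],
        ws=beam.pvalue.AsList(words))

    product | beam.Map(print)   # (1, 'hello'), (1, 'world'), (2, 'hello'), (2, 'world')
```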