Questions tagged [google-cloud-dataflow]

Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.

Cloud Dataflow is part of the Google Cloud Platform.

5328 questions
6 votes, 2 answers

How can I install a python package onto Google Dataflow and import it into my pipeline?

My folder structure is as follows: Project/ --Pipeline.py --setup.py --dist/ --ResumeParserDependencies-0.1.tar.gz --Dependencies/ --Module1.py --Module2.py --Module3.py My setup.py file looks like this: from…
Melissa Guo
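A minimal sketch of the packaging approach usually suggested for this: declare the local modules and PyPI dependencies in setup.py and hand that file to the pipeline. The package name, version, and the nltk dependency below are placeholders, not taken from the question.

```python
# setup.py sketch, sitting next to Pipeline.py
import setuptools

setuptools.setup(
    name="resume-parser-dependencies",   # placeholder name
    version="0.1",
    # Picks up the local Dependencies/ package, provided it contains an __init__.py.
    packages=setuptools.find_packages(),
    install_requires=[
        "nltk",                          # hypothetical PyPI dependency
    ],
)
```

The pipeline is then launched with --setup_file ./setup.py (or setup_file via SetupOptions), so each Dataflow worker installs the same package before running the DoFns, and the modules can be imported normally, e.g. from Dependencies import Module1.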
6 votes, 3 answers

Start Kubernetes Pod with memory depending on the size of the data job

Is there a way to dynamically scale the memory size of a Pod based on the size of the data job (my use case)? Currently we have Jobs and Pods defined with fixed memory amounts, but we don't know in advance how big the data will be for a given time-slice…
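This one is Kubernetes rather than Dataflow: the memory of a running Pod cannot be resized, so one common workaround is to template the Job manifest at submission time and derive the memory request from the input size. A rough Python sketch, where the image name and the sizing rule are assumptions; nothing here resizes a Pod in place, it only picks the request before the Job starts.

```python
import subprocess

def memory_mib(input_bytes, mib_per_gib=512, floor_mib=1024):
    """Assumed sizing rule: roughly 512 MiB of memory per GiB of input, 1 GiB minimum."""
    gib = input_bytes / (1024 ** 3)
    return max(floor_mib, int(gib * mib_per_gib))

JOB_TEMPLATE = """\
apiVersion: batch/v1
kind: Job
metadata:
  name: data-job
spec:
  template:
    spec:
      containers:
      - name: worker
        image: gcr.io/my-project/worker:latest   # placeholder image
        resources:
          requests:
            memory: "{mem}Mi"
          limits:
            memory: "{mem}Mi"
      restartPolicy: Never
"""

def submit(input_bytes):
    # Render the manifest with the computed memory request and hand it to kubectl.
    manifest = JOB_TEMPLATE.format(mem=memory_mib(input_bytes))
    subprocess.run(["kubectl", "apply", "-f", "-"], input=manifest.encode(), check=True)
```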
6 votes, 2 answers

SlidingWindows for slow data (big intervals) on Apache Beam

I am working with the Chicago Traffic Tracker dataset, where new data is published every 15 minutes. When new data is available, it represents records that are 10-15 minutes behind "real time" (for example, look at _last_updt). For example, at 00:20, I…
tyron
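For data that only lands every 15 minutes, the windowing itself is straightforward in the Python SDK; the subtlety is in the event timestamps and lateness. A small sketch with toy records (the field layout is made up, not the Chicago Traffic Tracker schema):

```python
import apache_beam as beam
from apache_beam.transforms import window

# Toy (epoch_seconds, segment_id, speed) records standing in for the dataset.
records = [(0, "seg1", 30.0), (900, "seg1", 28.5), (1800, "seg1", 31.2)]

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(records)
        # Use the record's own event time, not processing time.
        | "Stamp" >> beam.Map(lambda r: window.TimestampedValue((r[1], r[2]), r[0]))
        # One-hour windows sliding every 15 minutes, matching the publish cadence.
        | "Slide" >> beam.WindowInto(window.SlidingWindows(size=60 * 60, period=15 * 60))
        | "MeanSpeed" >> beam.combiners.Mean.PerKey()
        | beam.Map(print)
    )
```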
6 votes, 0 answers

Best practices for HTTP calls in Cloud Dataflow - Java

What are the best practices for making HTTP calls from a DoFn in a pipeline that will be running on Google Cloud Dataflow (Java)? I mean, in pure Java without Beam I would need to think about things like async calls, or at least multithreading. think…
foxwendy
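The question is about the Java SDK, but the usual shape of the answer is SDK-agnostic: build the HTTP client once per DoFn instance and let the runner's parallelism stand in for hand-rolled threads. A Python sketch, with a hypothetical endpoint:

```python
import apache_beam as beam
import requests  # assumed dependency for this sketch

class CallApiFn(beam.DoFn):
    """Calls an external API, reusing one HTTP session per DoFn instance."""

    def setup(self):
        # Runs once per DoFn instance, not once per element.
        self.session = requests.Session()

    def process(self, element):
        resp = self.session.get(
            "https://api.example.com/enrich",   # placeholder endpoint
            params={"id": element},
            timeout=10,
        )
        resp.raise_for_status()
        yield resp.json()

    def teardown(self):
        self.session.close()
```

Applied with beam.ParDo(CallApiFn()); Dataflow runs many instances of the DoFn in parallel, which is usually where the concurrency should come from rather than threads inside process().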
6 votes, 3 answers

Dataflow job run fails when the templateLocation argument is set

The Dataflow job fails with the exception below when I pass the staging, temp, and output GCS bucket locations as parameters. Java code: final String[] used = Arrays.copyOf(args, args.length + 1); used[used.length - 1] = "--project=OVERWRITTEN"; final T options…
Mohammed Niaz
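The stack trace itself comes from how the Java args array is rebuilt, so this is only context: the staging, temp, and template locations are ordinary pipeline options. A Python-SDK sketch with placeholder project and bucket names:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project and bucket names.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=us-central1",
    "--staging_location=gs://my-bucket/staging",
    "--temp_location=gs://my-bucket/temp",
    # Only set when building a classic template instead of running the job directly.
    "--template_location=gs://my-bucket/templates/my-template",
])
```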
6 votes, 1 answer

Issues with Stateful processing in Apache Beam

I've read both Beam's stateful processing and timely processing articles and have had issues implementing the functions per se. The problem I am trying to solve is something similar to this: generating a sequential index for every line. Since I…
Haris Nadeem
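In the Python SDK a stateful per-key counter looks roughly like the sketch below. The single fixed key is an assumption that forces every element through one state cell, which gives a global sequence at the cost of parallelism; the ordering follows processing order, not input order.

```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.combiners import CountCombineFn
from apache_beam.transforms.userstate import CombiningValueStateSpec

class IndexFn(beam.DoFn):
    """Attaches an increasing index to each element of a key (sketch only)."""
    COUNT = CombiningValueStateSpec("count", VarIntCoder(), CountCombineFn())

    def process(self, element, count=beam.DoFn.StateParam(COUNT)):
        _, value = element
        index = count.read()   # 0 the first time this key is seen
        count.add(1)
        yield (index, value)

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(["first line", "second line", "third line"])
        | "SingleKey" >> beam.Map(lambda line: ("all", line))  # stateful DoFns need keyed input
        | beam.ParDo(IndexFn())
        | beam.Map(print)
    )
```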
6 votes, 1 answer

How to use google-cloud-storage directly in an Apache Beam project

We are working on an Apache Beam project (version 2.4.0) where we also want to work with a bucket directly through the google-cloud-storage API. However, combining some of the Beam dependencies with cloud storage leads to a hard-to-solve dependency…
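One way to sidestep the dependency clash, at least for plain reads and writes, is to go through Beam's own filesystem layer instead of adding a second GCS client (the Java SDK has an analogous FileSystems class). A Python sketch of the idea; the object paths are placeholders:

```python
from apache_beam.io.filesystems import FileSystems

# Read an object through Beam's GCS support, with no extra google-cloud-storage dependency.
with FileSystems.open("gs://my-bucket/config/lookup.json") as f:
    data = f.read()

# Write an object the same way.
with FileSystems.create("gs://my-bucket/output/result.txt") as f:
    f.write(b"hello from Beam\n")
```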
6 votes, 3 answers

Is there a way to read a multi-line csv file in Apache Beam using the ReadFromText transform (Python)?

Is there a way to read a multi-line CSV file using the ReadFromText transform in Python? I have a file that contains one line. I am trying to make Apache Beam read the input as one line, but cannot get it to work. def print_each_line(line): print…
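ReadFromText splits on newlines, so quoted fields that contain line breaks are torn apart. One workaround (a sketch, not necessarily what the accepted answer does) is to read whole files with fileio and let the csv module handle the quoting; each file then has to fit in a worker's memory.

```python
import csv
import io

import apache_beam as beam
from apache_beam.io import fileio

def parse_csv(readable_file):
    # Read the whole file, then let csv.reader deal with quoted newlines.
    text = readable_file.read_utf8()
    for row in csv.reader(io.StringIO(text)):
        yield row

with beam.Pipeline() as p:
    _ = (
        p
        | fileio.MatchFiles("gs://my-bucket/input/*.csv")   # placeholder pattern
        | fileio.ReadMatches()
        | beam.FlatMap(parse_csv)
        | beam.Map(print)
    )
```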
6 votes, 3 answers

Testing Dataflow with DirectRunner gives lots of verifyUnmodifiedThrowingCheckedExceptions warnings

I was testing my Dataflow pipeline using the DirectRunner on my Mac and got lots of "WARNING" messages like this. How can I get rid of them? There are so many that I cannot even see my debug messages. Thanks. Apr 05, 2018 2:14:48 PM…
DEWEI SUN
6 votes, 2 answers

Invalid GCS URI used for staging location

When starting a Dataflow job (v2.4.0) via a jar with all dependencies included, instead of using the provided GCS path it seems that a gs:/ folder is created locally, and because of this the Dataflow workers try to access…
bjorndv
6 votes, 2 answers

BigQueryIO - Can't use DynamicDestination with CREATE_IF_NEEDED for unbounded PCollection and FILE_LOADS

My workflow: Kafka -> Dataflow streaming -> BigQuery. Given that low latency isn't important in my case, I use FILE_LOADS to reduce costs. I'm using BigQueryIO.Write with a DynamicDestination (one new table every hour, with the current…
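The question is about BigQueryIO in the Java SDK; for orientation, the Python SDK's WriteToBigQuery exposes roughly the same knobs (a callable table, FILE_LOADS, a triggering frequency for unbounded input). A sketch with placeholder names, not a claim that it avoids the CREATE_IF_NEEDED limitation being asked about:

```python
import apache_beam as beam

def hourly_table(row):
    # Placeholder routing: one table per hour, derived from a field on the row.
    return "my-project:my_dataset.events_" + row["event_hour"]

write = beam.io.WriteToBigQuery(
    table=hourly_table,
    schema="user:STRING,event_hour:STRING,value:FLOAT",   # needed for CREATE_IF_NEEDED
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
    triggering_frequency=3600,   # seconds between load jobs on a streaming pipeline
)
# Applied as: events | "ToBQ" >> write
```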
6 votes, 3 answers

How to use Pandas in Apache Beam?

How do I use Pandas in Apache Beam? I cannot perform a left join on multiple columns, and PCollections do not support SQL queries. Even the Apache Beam documentation is not well organized. I checked but couldn't find any kind of Pandas implementation…
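When both sides are small enough to hold in memory, one pragmatic option is to do the multi-column left join with pandas inside the pipeline, materializing one side as a side input. The column names below are made up:

```python
import apache_beam as beam
import pandas as pd

def left_join(rows, lookup_rows):
    # Both arguments are lists of dicts; pandas does the multi-column left join.
    joined = pd.DataFrame(rows).merge(
        pd.DataFrame(lookup_rows), how="left", on=["user_id", "region"])
    return joined.to_dict("records")

with beam.Pipeline() as p:
    orders = p | "Orders" >> beam.Create([{"user_id": 1, "region": "us", "amount": 9.5}])
    users = p | "Users" >> beam.Create([{"user_id": 1, "region": "us", "name": "alice"}])

    _ = (
        orders
        | beam.combiners.ToList()                                    # one list per pipeline
        | beam.FlatMap(left_join, lookup_rows=beam.pvalue.AsList(users))
        | beam.Map(print)
    )
```

For data that does not fit in memory, CoGroupByKey on a composite key is the scalable route; newer SDK releases also ship a DataFrame API under apache_beam.dataframe.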
6 votes, 1 answer

Refusing to split GroupedShuffleRangeTracker proposed split position is out of range

I am sporadically getting the following errors: W Refusing to split at '\x00\x00\x00\x15\xbc\x19)b\x00\x01': proposed split position is out of range ['\x00\x00\x00\x15\x00\xff\x00\xff\x00\xff\x00\xff\x00\x01', …
de1
6 votes, 1 answer

Datastore poor performance with Apache Beam & Dataflow

I'm having huge performance issues with Datastore write speed. Most of the time it stays under 100 elements/s. I was able to achieve speeds of around 2600 elements/s when benchmarking the write speed on my local machine using the datastore…
6 votes, 1 answer

How to catch exceptions thrown by BigQueryIO.Write and rescue the data that failed to be written?

I want to read data from Cloud Pub/Sub and write it to BigQuery with Cloud Dataflow. Each record contains a table ID indicating where it should be saved. There are various reasons why writing to BigQuery can fail: the table ID format is wrong, the dataset…
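The question targets the Java SDK. For comparison, with the Python SDK and the default streaming inserts, the rough equivalent is to stop retrying non-transient failures and pull the rejected rows out of WriteToBigQuery's result, then route them to a dead-letter sink; the table names below are placeholders.

```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryWriteFn
from apache_beam.io.gcp.bigquery_tools import RetryStrategy

def write_with_dead_letter(events):
    """events: PCollection of dicts, each carrying its destination in 'table_id' (sketch)."""
    result = events | "Write" >> beam.io.WriteToBigQuery(
        table=lambda row: row["table_id"],        # per-element destination, as in the question
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        insert_retry_strategy=RetryStrategy.RETRY_NEVER,
    )

    # Rows BigQuery rejected (bad table ID, missing dataset, schema mismatch, ...)
    # come back as (destination, row) pairs instead of crashing the pipeline.
    failed = result[BigQueryWriteFn.FAILED_ROWS]
    return (
        failed
        | beam.Map(lambda kv: {"destination": str(kv[0]), "row": str(kv[1])})
        | "DeadLetter" >> beam.io.WriteToBigQuery(
            "my-project:logs.bq_dead_letter",
            schema="destination:STRING,row:STRING")
    )
```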