Questions tagged [google-cloud-dataflow]

Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes.

The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam.

Scio is a Scala API for Apache Beam and Google Cloud Dataflow.

Cloud Dataflow is part of the Google Cloud Platform.

5328 questions
6 votes, 2 answers

How can I install a python package onto Google Dataflow and import it into my pipeline?

My folder structure is as follows: Project/ --Pipeline.py --setup.py --dist/ --ResumeParserDependencies-0.1.tar.gz --Dependencies/ --Module1.py --Module2.py --Module3.py My setup.py file looks like this: from…
Melissa Guo
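A minimal sketch of the packaging approach usually suggested for this: declare the local modules and PyPI dependencies in setup.py and hand that file to the pipeline. The package name, version, and the nltk dependency below are placeholders, not taken from the question.

```python
# setup.py sketch, sitting next to Pipeline.py
import setuptools

setuptools.setup(
    name="resume-parser-dependencies",   # placeholder name
    version="0.1",
    # Picks up the local Dependencies/ package, provided it contains an __init__.py.
    packages=setuptools.find_packages(),
    install_requires=[
        "nltk",                          # hypothetical PyPI dependency
    ],
)
```

The pipeline is then launched with --setup_file ./setup.py (or setup_file via SetupOptions), so each Dataflow worker installs the same package before running the DoFns, and the modules can be imported normally, e.g. from Dependencies import Module1.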
6 votes, 3 answers

Start Kubernetes Pod with memory depending on the size of the data job

Is there a way to dynamically scale the memory size of a Pod based on the size of the data job (my use case)? Currently we have Jobs and Pods defined with fixed memory amounts, but we don't know in advance how big the data will be for a given time-slice…
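This one is Kubernetes rather than Dataflow: the memory of a running Pod cannot be resized, so one common workaround is to template the Job manifest at submission time and derive the memory request from the input size. A rough Python sketch, where the image name and the sizing rule are assumptions; nothing here resizes a Pod in place, it only picks the request before the Job starts.

```python
import subprocess

def memory_mib(input_bytes, mib_per_gib=512, floor_mib=1024):
    """Assumed sizing rule: roughly 512 MiB of memory per GiB of input, 1 GiB minimum."""
    gib = input_bytes / (1024 ** 3)
    return max(floor_mib, int(gib * mib_per_gib))

JOB_TEMPLATE = """\
apiVersion: batch/v1
kind: Job
metadata:
  name: data-job
spec:
  template:
    spec:
      containers:
      - name: worker
        image: gcr.io/my-project/worker:latest   # placeholder image
        resources:
          requests:
            memory: "{mem}Mi"
          limits:
            memory: "{mem}Mi"
      restartPolicy: Never
"""

def submit(input_bytes):
    # Render the manifest with the computed memory request and hand it to kubectl.
    manifest = JOB_TEMPLATE.format(mem=memory_mib(input_bytes))
    subprocess.run(["kubectl", "apply", "-f", "-"], input=manifest.encode(), check=True)
```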
6 votes, 2 answers

SlidingWindows for slow data (big intervals) on Apache Beam

I am working with the Chicago Traffic Tracker dataset, where new data is published every 15 minutes. When new data is available, it represents records that are 10-15 minutes behind "real time" (for example, look at _last_updt). For example, at 00:20, I…
tyron
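For data that only lands every 15 minutes, the windowing itself is straightforward in the Python SDK; the subtlety is in the event timestamps and lateness. A small sketch with toy records (the field layout is made up, not the Chicago Traffic Tracker schema):

```python
import apache_beam as beam
from apache_beam.transforms import window

# Toy (epoch_seconds, segment_id, speed) records standing in for the dataset.
records = [(0, "seg1", 30.0), (900, "seg1", 28.5), (1800, "seg1", 31.2)]

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(records)
        # Use the record's own event time, not processing time.
        | "Stamp" >> beam.Map(lambda r: window.TimestampedValue((r[1], r[2]), r[0]))
        # One-hour windows sliding every 15 minutes, matching the publish cadence.
        | "Slide" >> beam.WindowInto(window.SlidingWindows(size=60 * 60, period=15 * 60))
        | "MeanSpeed" >> beam.combiners.Mean.PerKey()
        | beam.Map(print)
    )
```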
6 votes, 0 answers

Best practices for HTTP calls in Cloud Dataflow - Java

What are the best practices for making HTTP calls from a DoFn in a pipeline that will be running on Google Cloud Dataflow (Java)? I mean, in pure Java without Beam I would need to think about things like async calls, or at least multithreading. think…
foxwendy
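The question is about the Java SDK, but the usual shape of the answer is SDK-agnostic: build the HTTP client once per DoFn instance and let the runner's parallelism stand in for hand-rolled threads. A Python sketch, with a hypothetical endpoint:

```python
import apache_beam as beam
import requests  # assumed dependency for this sketch

class CallApiFn(beam.DoFn):
    """Calls an external API, reusing one HTTP session per DoFn instance."""

    def setup(self):
        # Runs once per DoFn instance, not once per element.
        self.session = requests.Session()

    def process(self, element):
        resp = self.session.get(
            "https://api.example.com/enrich",   # placeholder endpoint
            params={"id": element},
            timeout=10,
        )
        resp.raise_for_status()
        yield resp.json()

    def teardown(self):
        self.session.close()
```

Applied with beam.ParDo(CallApiFn()); Dataflow runs many instances of the DoFn in parallel, which is usually where the concurrency should come from rather than threads inside process().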
6 votes, 3 answers

Dataflow job run fails when the templateLocation argument is set

The Dataflow job fails with the exception below when I pass the staging, temp, and output GCS bucket locations as parameters. Java code: final String[] used = Arrays.copyOf(args, args.length + 1); used[used.length - 1] = "--project=OVERWRITTEN"; final T options…
Mohammed Niaz
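The stack trace itself comes from how the Java args array is rebuilt, so this is only context: the staging, temp, and template locations are ordinary pipeline options. A Python-SDK sketch with placeholder project and bucket names:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project and bucket names.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=us-central1",
    "--staging_location=gs://my-bucket/staging",
    "--temp_location=gs://my-bucket/temp",
    # Only set when building a classic template instead of running the job directly.
    "--template_location=gs://my-bucket/templates/my-template",
])
```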
6 votes, 1 answer

Issues with Stateful processing in Apache Beam

I've read both Beam's stateful processing and timely processing articles and have had issues implementing the functions per se. The problem I am trying to solve is something similar to this: generating a sequential index for every line. Since I…
Haris Nadeem
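In the Python SDK a stateful per-key counter looks roughly like the sketch below. The single fixed key is an assumption that forces every element through one state cell, which gives a global sequence at the cost of parallelism; the ordering follows processing order, not input order.

```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.combiners import CountCombineFn
from apache_beam.transforms.userstate import CombiningValueStateSpec

class IndexFn(beam.DoFn):
    """Attaches an increasing index to each element of a key (sketch only)."""
    COUNT = CombiningValueStateSpec("count", VarIntCoder(), CountCombineFn())

    def process(self, element, count=beam.DoFn.StateParam(COUNT)):
        _, value = element
        index = count.read()   # 0 the first time this key is seen
        count.add(1)
        yield (index, value)

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(["first line", "second line", "third line"])
        | "SingleKey" >> beam.Map(lambda line: ("all", line))  # stateful DoFns need keyed input
        | beam.ParDo(IndexFn())
        | beam.Map(print)
    )
```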
6 votes, 1 answer

How to use google-cloud-storage directly in an Apache Beam project

We are working on an Apache Beam project (version 2.4.0) where we also want to work with a bucket directly through the google-cloud-storage API. However, combining some of the Beam dependencies with cloud storage leads to a hard-to-solve dependency…
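One way to sidestep the dependency clash, at least for plain reads and writes, is to go through Beam's own filesystem layer instead of adding a second GCS client (the Java SDK has an analogous FileSystems class). A Python sketch of the idea; the object paths are placeholders:

```python
from apache_beam.io.filesystems import FileSystems

# Read an object through Beam's GCS support, with no extra google-cloud-storage dependency.
with FileSystems.open("gs://my-bucket/config/lookup.json") as f:
    data = f.read()

# Write an object the same way.
with FileSystems.create("gs://my-bucket/output/result.txt") as f:
    f.write(b"hello from Beam\n")
```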
6 votes, 3 answers

Is there a way to read a multi-line csv file in Apache Beam using the ReadFromText transform (Python)?

Is there a way to read a multi-line CSV file using the ReadFromText transform in Python? I have a file that contains one line. I am trying to make Apache Beam read the input as one line, but cannot get it to work. def print_each_line(line): print…
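ReadFromText splits on newlines, so quoted fields that contain line breaks are torn apart. One workaround (a sketch, not necessarily what the accepted answer does) is to read whole files with fileio and let the csv module handle the quoting; each file then has to fit in a worker's memory.

```python
import csv
import io

import apache_beam as beam
from apache_beam.io import fileio

def parse_csv(readable_file):
    # Read the whole file, then let csv.reader deal with quoted newlines.
    text = readable_file.read_utf8()
    for row in csv.reader(io.StringIO(text)):
        yield row

with beam.Pipeline() as p:
    _ = (
        p
        | fileio.MatchFiles("gs://my-bucket/input/*.csv")   # placeholder pattern
        | fileio.ReadMatches()
        | beam.FlatMap(parse_csv)
        | beam.Map(print)
    )
```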
6 votes, 3 answers

Testing Dataflow with DirectRunner gives lots of verifyUnmodifiedThrowingCheckedExceptions warnings

I was testing my Dataflow pipeline using the DirectRunner on my Mac and got lots of "WARNING" messages like this. How can I get rid of them? There are so many that I cannot even see my debug messages. Thanks. Apr 05, 2018 2:14:48 PM…
DEWEI SUN
6 votes, 2 answers

Invalid GCS URI used for staging location

When starting a Dataflow job (v2.4.0) via a jar with all dependencies included, instead of using the provided GCS path it seems that a gs:/ folder is created locally, and because of this the Dataflow workers try to access…
bjorndv
6 votes, 2 answers

BigQueryIO - Can't use DynamicDestination with CREATE_IF_NEEDED for unbounded PCollection and FILE_LOADS

My workflow: Kafka -> Dataflow streaming -> BigQuery. Given that low latency isn't important in my case, I use FILE_LOADS to reduce costs. I'm using BigQueryIO.Write with a DynamicDestination (one new table every hour, with the current…
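The question is about BigQueryIO in the Java SDK; for orientation, the Python SDK's WriteToBigQuery exposes roughly the same knobs (a callable table, FILE_LOADS, a triggering frequency for unbounded input). A sketch with placeholder names, not a claim that it avoids the CREATE_IF_NEEDED limitation being asked about:

```python
import apache_beam as beam

def hourly_table(row):
    # Placeholder routing: one table per hour, derived from a field on the row.
    return "my-project:my_dataset.events_" + row["event_hour"]

write = beam.io.WriteToBigQuery(
    table=hourly_table,
    schema="user:STRING,event_hour:STRING,value:FLOAT",   # needed for CREATE_IF_NEEDED
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
    triggering_frequency=3600,   # seconds between load jobs on a streaming pipeline
)
# Applied as: events | "ToBQ" >> write
```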
6 votes, 3 answers

How to use Pandas in Apache Beam?

How do I use Pandas in Apache Beam? I cannot perform a left join on multiple columns, and PCollections do not support SQL queries. Even the Apache Beam documentation is not well organized. I checked but couldn't find any kind of Pandas implementation…
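When both sides are small enough to hold in memory, one pragmatic option is to do the multi-column left join with pandas inside the pipeline, materializing one side as a side input. The column names below are made up:

```python
import apache_beam as beam
import pandas as pd

def left_join(rows, lookup_rows):
    # Both arguments are lists of dicts; pandas does the multi-column left join.
    joined = pd.DataFrame(rows).merge(
        pd.DataFrame(lookup_rows), how="left", on=["user_id", "region"])
    return joined.to_dict("records")

with beam.Pipeline() as p:
    orders = p | "Orders" >> beam.Create([{"user_id": 1, "region": "us", "amount": 9.5}])
    users = p | "Users" >> beam.Create([{"user_id": 1, "region": "us", "name": "alice"}])

    _ = (
        orders
        | beam.combiners.ToList()                                    # one list per pipeline
        | beam.FlatMap(left_join, lookup_rows=beam.pvalue.AsList(users))
        | beam.Map(print)
    )
```

For data that does not fit in memory, CoGroupByKey on a composite key is the scalable route; newer SDK releases also ship a DataFrame API under apache_beam.dataframe.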
6 votes, 1 answer

Refusing to split GroupedShuffleRangeTracker proposed split position is out of range

I am sporadically getting the following errors: W Refusing to split at '\x00\x00\x00\x15\xbc\x19)b\x00\x01': proposed split position is out of range ['\x00\x00\x00\x15\x00\xff\x00\xff\x00\xff\x00\xff\x00\x01', …
de1
6 votes, 1 answer

Datastore poor performance with Apache Beam & Dataflow

I'm having huge performance issues with Datastore write speed. Most of the time it stays under 100 elements/s. I was able to achieve speeds of around 2600 elements/s when benchmarking the write speed on my local machine using the datastore…
6 votes, 1 answer

How to catch exceptions thrown by BigQueryIO.Write and rescue the data that failed to be written?

I want to read data from Cloud Pub/Sub and write it to BigQuery with Cloud Dataflow. Each record contains a table ID indicating where it should be saved. There are various reasons why writing to BigQuery can fail: the table ID format is wrong, the dataset…
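The question targets the Java SDK. For comparison, with the Python SDK and the default streaming inserts, the rough equivalent is to stop retrying non-transient failures and pull the rejected rows out of WriteToBigQuery's result, then route them to a dead-letter sink; the table names below are placeholders.

```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryWriteFn
from apache_beam.io.gcp.bigquery_tools import RetryStrategy

def write_with_dead_letter(events):
    """events: PCollection of dicts, each carrying its destination in 'table_id' (sketch)."""
    result = events | "Write" >> beam.io.WriteToBigQuery(
        table=lambda row: row["table_id"],        # per-element destination, as in the question
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        insert_retry_strategy=RetryStrategy.RETRY_NEVER,
    )

    # Rows BigQuery rejected (bad table ID, missing dataset, schema mismatch, ...)
    # come back as (destination, row) pairs instead of crashing the pipeline.
    failed = result[BigQueryWriteFn.FAILED_ROWS]
    return (
        failed
        | beam.Map(lambda kv: {"destination": str(kv[0]), "row": str(kv[1])})
        | "DeadLetter" >> beam.io.WriteToBigQuery(
            "my-project:logs.bq_dead_letter",
            schema="destination:STRING,row:STRING")
    )
```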