Questions tagged [apache-beam]

Apache Beam is a unified SDK for batch and stream processing. It lets you specify large-scale data processing workflows with a Beam-specific DSL. Beam workflows can be executed on different runtimes such as Apache Flink, Apache Spark, or Google Cloud Dataflow (a cloud service).

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

The programming model behind Beam evolved at Google and was originally known as the “Dataflow Model”. Beam pipelines can be executed on different runtimes like Apache Flink, Apache Spark, or Google Cloud Dataflow.
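For illustration, here is a minimal pipeline in the Python SDK (a sketch; the runner is selected through pipeline options rather than code changes):

```python
import apache_beam as beam

# A minimal Beam pipeline in the Python SDK. The same code can run on the
# local DirectRunner or on Flink, Spark, or Dataflow by changing only the
# pipeline options, not the pipeline itself.
with beam.Pipeline() as pipeline:
    (pipeline
     | 'Create' >> beam.Create(['hello', 'beam'])
     | 'Upper' >> beam.Map(str.upper)
     | 'Print' >> beam.Map(print))
```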

4676 questions
14
votes
2 answers

Google dataflow streaming pipeline is not distributing workload over several workers after windowing

I'm trying to set up a Dataflow streaming pipeline in Python. I have quite some experience with batch pipelines. Our basic architecture looks like this: the first step does some basic processing and takes about 2 seconds per message to get to…
Brecht Coghe
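A common cause of this symptom is step fusion: Dataflow fuses adjacent steps, which can keep a slow step from spreading across workers. A hedged sketch of the usual mitigation (not the asker's code; the topic name and processing function are hypothetical):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

def expensive_fn(message):
    # Stand-in for the ~2 s per-message processing described in the question.
    return message

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/events')
     | 'Window' >> beam.WindowInto(FixedWindows(60))
     # Reshuffle breaks fusion so the expensive step can be parallelised
     # across workers instead of being fused to the (cheap) read step.
     | 'BreakFusion' >> beam.Reshuffle()
     | 'Process' >> beam.Map(expensive_fn))
```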
13
votes
2 answers

Converting tokens to word vectors effectively with TensorFlow Transform

I would like to use TensorFlow Transform to convert tokens to word vectors during my training, validation and inference phase. I followed this StackOverflow post and implemented the initial conversion from tokens to vectors. The conversion works as…
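The usual approach in TensorFlow Transform is to build the token-to-id mapping inside the preprocessing_fn, so the same mapping applies at training, validation, and serving time. A sketch (the input key and output name are hypothetical):

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # Compute a vocabulary over the training corpus and map each token to an
    # integer id; the ids can then index an embedding table in the model.
    # The mapping is baked into the exported transform graph, so inference
    # sees exactly the same vocabulary as training.
    token_ids = tft.compute_and_apply_vocabulary(inputs['tokens'])
    return {'token_ids': token_ids}
```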
13
votes
3 answers

How to read large CSV with Beam?

I'm trying to figure out how to use Apache Beam to read large CSV files. By "large" I mean several gigabytes (so that it would be impractical to read the entire CSV into memory at once). So far, I've tried the following options: Use TextIO.read():…
Kricket
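One memory-safe pattern is to read the file line by line with ReadFromText and parse each line individually; a sketch (Python SDK, hypothetical path):

```python
import csv
import apache_beam as beam

def parse_line(line):
    # Parse one line at a time so a multi-gigabyte file never has to fit in
    # memory; note this simple approach breaks if quoted fields contain
    # embedded newlines.
    return next(csv.reader([line]))

with beam.Pipeline() as p:
    rows = (p
            | 'Read' >> beam.io.ReadFromText('gs://my-bucket/big.csv',
                                             skip_header_lines=1)
            | 'Parse' >> beam.Map(parse_line))
```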
13
votes
1 answer

How to combine streaming data with large history data set in Dataflow/Beam

I am investigating processing logs from web user sessions via Google Dataflow/Apache Beam and need to combine the user's logs as they come in (streaming) with the history of a user's session from the last month. I have looked at the following…
Florian
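One common pattern (a sketch, not necessarily the accepted answer): load the month of history once as a dict side input keyed by user id, then enrich each streaming event with it. All names are made up, and a side input loaded at startup will not reflect later changes to the history table:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def user_id_of(event):
    # Hypothetical: extract the user id from the raw Pub/Sub payload.
    return event.decode('utf-8').split(',')[0]

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    history = (p
               | 'ReadHistory' >> beam.io.ReadFromBigQuery(
                     query='SELECT user_id, session FROM `proj.logs.history`',
                     use_standard_sql=True)
               | 'KeyByUser' >> beam.Map(lambda row: (row['user_id'], row)))

    (p
     | 'ReadEvents' >> beam.io.ReadFromPubSub(
           subscription='projects/proj/subscriptions/events')
     | 'Enrich' >> beam.Map(
           lambda event, hist: (event, hist.get(user_id_of(event))),
           hist=beam.pvalue.AsDict(history)))
```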
13
votes
0 answers

How can I specify the number of workers for my Dataflow?

I have an Apache Beam pipeline that loads a large import file of around 90GB. I've written the pipeline in the Apache Beam Java SDK. Using the default settings for PipelineOptionsFactory, my job takes quite a while to complete. How can I control,…
Alex Harvey
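The worker count is controlled through pipeline options rather than pipeline code. A sketch with the Python spellings (the Java SDK exposes the same knobs, e.g. --numWorkers and --maxNumWorkers on DataflowPipelineOptions; project and region values are placeholders):

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    num_workers=10,                 # initial number of workers
    max_num_workers=50,             # ceiling for autoscaling
    autoscaling_algorithm='THROUGHPUT_BASED',
)
```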
12
votes
3 answers

Dataflow Pipeline - "Processing stuck in step for at least

The Dataflow pipelines developed by my team suddenly started getting stuck and stopped processing our events. Their worker logs became full of warning messages saying that one specific step got stuck. The peculiar thing is that the steps that are…
Caio Riva
12
votes
2 answers

Apache Beam over Apache Kafka Stream processing

What are the differences between Apache Beam and Apache Kafka with respect to stream processing? I am trying to grasp the technical and programmatic differences as well. Please help me understand by sharing your experience.
Stella
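In practice the two are complementary rather than competing: Kafka is a durable message transport (with its own processing library, Kafka Streams), while Beam is a processing model executed by a runner, and they are often combined. A sketch (Python SDK; Beam's Kafka connector is a cross-language transform, and broker/topic names are made up):

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka

with beam.Pipeline() as p:
    (p
     | 'ReadKafka' >> ReadFromKafka(
           consumer_config={'bootstrap.servers': 'localhost:9092'},
           topics=['events'])
     | 'Values' >> beam.Map(lambda kv: kv[1])  # elements are (key, value)
     | 'Print' >> beam.Map(print))
```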
12
votes
3 answers

Apache Beam Counter/Metrics not available in Flink WebUI

I'm using Flink 1.4.1 and Beam 2.3.0, and would like to know whether it is possible to have metrics available in the Flink WebUI (or anywhere at all), as in the Dataflow WebUI. I've used a counter like: import org.apache.beam.sdk.metrics.Counter; import…
robosoul
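The question's snippet uses the Java Counter; for reference, the same Beam metrics API in the Python SDK looks like the sketch below (namespace and names are arbitrary). Whether the values surface in the Flink WebUI depends on the runner's metrics integration in the versions involved:

```python
import apache_beam as beam
from apache_beam.metrics import Metrics

class CountingFn(beam.DoFn):
    def __init__(self):
        # Namespace + name identify the counter in the runner's metric system.
        self.elements = Metrics.counter('pipeline', 'processed_elements')

    def process(self, element):
        self.elements.inc()
        yield element
```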
12
votes
1 answer

Apache Beam in Dataflow Large Side Input

This is most similar to this question. I am creating a pipeline in Dataflow 2.x that takes streaming input from a Pubsub queue. Every single message that comes in needs to be streamed through a very large dataset that comes from Google BigQuery and…
Taylor
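When the dataset is too large to materialise as a side input, one common workaround (a sketch under that assumption; table, schema, and field names are hypothetical) is to do parameterised point lookups from inside a DoFn, reusing one client per worker:

```python
import apache_beam as beam
from google.cloud import bigquery

class LookupFn(beam.DoFn):
    def setup(self):
        # Created once per worker and reused across bundles.
        self.client = bigquery.Client()

    def process(self, key):
        job = self.client.query(
            'SELECT value FROM `my-project.my_dataset.big_table` '
            'WHERE key = @key',
            job_config=bigquery.QueryJobConfig(query_parameters=[
                bigquery.ScalarQueryParameter('key', 'STRING', key)]))
        for row in job:
            yield (key, row.value)
```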
12
votes
1 answer

import apache_beam metaclass conflict

When I try to import apache beam I get the following error. >>> import apache_beam Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/toor/pfff/local/lib/python2.7/site-packages/apache_beam/__init__.py", line 78,…
11
votes
1 answer

Apache Beam - Integration test with unbounded PCollection

We are building an integration test for an Apache Beam pipeline and are running into some issues. See below for context... Details about our pipeline: We use PubsubIO as our data source (unbounded PCollection) Intermediate transforms include a…
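One standard technique for this situation is to swap the unbounded source for a TestStream, which emits elements and advances the watermark deterministically so the pipeline terminates and assertions can run. A minimal sketch in the Python SDK (the Java SDK has an equivalent TestStream):

```python
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.test_stream import TestStream
from apache_beam.testing.util import assert_that, equal_to

with TestPipeline() as p:
    events = (TestStream()
              .add_elements(['a', 'b'])
              .advance_watermark_to_infinity())  # lets the pipeline finish
    result = p | events | beam.Map(str.upper)
    assert_that(result, equal_to(['A', 'B']))
```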
11
votes
2 answers

Writing different values to different BigQuery tables in Apache Beam

Suppose I have a PCollection<Foo> and I want to write it to multiple BigQuery tables, choosing a potentially different table for each Foo. How can I do this using the Apache Beam BigQueryIO API?
jkff
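In the Python SDK, WriteToBigQuery accepts a callable that picks the destination table per element (the Java SDK's counterpart is DynamicDestinations). A sketch with made-up table names and schema:

```python
import apache_beam as beam

def route(foo):
    # Hypothetical routing rule: choose the table from a field of the element.
    return 'my-project:my_dataset.foo_%s' % foo['kind']

with beam.Pipeline() as p:
    (p
     | 'Create' >> beam.Create([{'kind': 'a', 'value': 1},
                                {'kind': 'b', 'value': 2}])
     | 'Write' >> beam.io.WriteToBigQuery(
           table=route,
           schema='kind:STRING,value:INTEGER',
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```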
10
votes
2 answers

Error with installing apache-beam[gcp] on mac zsh terminal - “zsh: no matches found: apache-beam[gcp]”

I am using zsh, and I have installed gcloud in order to interact with GCP via local terminal on my Mac. I am encountering this error “zsh: no matches found: apache-beam[gcp]”. However, when I run the command directly on the bash terminal on the GCP…
10
votes
5 answers

Kotlin Iterable not supported in Apache Beam?

Apache Beam seems to be refusing to recognise Kotlin's Iterable. Here is a sample code: @ProcessElement fun processElement( @Element input: KV<String, Iterable<String>>, receiver: OutputReceiver<String> ) { val output = input.key + "|" +…
marcoseu
10
votes
2 answers

Google Cloud Dataflow stuck with repeated error 'Error syncing pod...failed to "StartContainer" for "sdk" with CrashLoopBackOff'

SDK: Apache Beam SDK for Go 0.5.0 Our Golang job has been running fine on Google Cloud Dataflow for weeks. We haven't made any updates to the job itself, and the SDK version seems to be the same as it has been. Last night it failed, and I'm not sure…
Tim