Questions tagged [apache-beam]

Apache Beam is a unified SDK for batch and stream processing. It lets you specify large-scale data processing workflows with a Beam-specific DSL. Beam workflows can be executed on different runtimes such as Apache Flink, Apache Spark, or Google Cloud Dataflow (a cloud service).

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

The programming model behind Beam evolved at Google and was originally known as the “Dataflow Model”. Beam pipelines can be executed on different runtimes like Apache Flink, Apache Spark, or Google Cloud Dataflow.
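For illustration, here is a minimal pipeline in the Python SDK (a sketch; the runner is selected through pipeline options rather than code changes):

```python
import apache_beam as beam

# A minimal Beam pipeline in the Python SDK. The same code can run on the
# local DirectRunner or on Flink, Spark, or Dataflow by changing only the
# pipeline options, not the pipeline itself.
with beam.Pipeline() as pipeline:
    (pipeline
     | 'Create' >> beam.Create(['hello', 'beam'])
     | 'Upper' >> beam.Map(str.upper)
     | 'Print' >> beam.Map(print))
```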

4676 questions
14
votes
2 answers

Google dataflow streaming pipeline is not distributing workload over several workers after windowing

I'm trying to set up a Dataflow streaming pipeline in Python. I have quite some experience with batch pipelines. Our basic architecture looks like this: the first step does some basic processing and takes about 2 seconds per message to get to…
Brecht Coghe
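A common cause of this symptom is step fusion: Dataflow fuses adjacent steps, which can keep a slow step from spreading across workers. A hedged sketch of the usual mitigation (not the asker's code; the topic name and processing function are hypothetical):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

def expensive_fn(message):
    # Stand-in for the ~2 s per-message processing described in the question.
    return message

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/events')
     | 'Window' >> beam.WindowInto(FixedWindows(60))
     # Reshuffle breaks fusion so the expensive step can be parallelised
     # across workers instead of being fused to the (cheap) read step.
     | 'BreakFusion' >> beam.Reshuffle()
     | 'Process' >> beam.Map(expensive_fn))
```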
13
votes
2 answers

Converting tokens to word vectors effectively with TensorFlow Transform

I would like to use TensorFlow Transform to convert tokens to word vectors during my training, validation and inference phase. I followed this StackOverflow post and implemented the initial conversion from tokens to vectors. The conversion works as…
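The usual approach in TensorFlow Transform is to build the token-to-id mapping inside the preprocessing_fn, so the same mapping applies at training, validation, and serving time. A sketch (the input key and output name are hypothetical):

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # Compute a vocabulary over the training corpus and map each token to an
    # integer id; the ids can then index an embedding table in the model.
    # The mapping is baked into the exported transform graph, so inference
    # sees exactly the same vocabulary as training.
    token_ids = tft.compute_and_apply_vocabulary(inputs['tokens'])
    return {'token_ids': token_ids}
```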
13
votes
3 answers

How to read large CSV with Beam?

I'm trying to figure out how to use Apache Beam to read large CSV files. By "large" I mean several gigabytes (so that it would be impractical to read the entire CSV into memory at once). So far, I've tried the following options: Use TextIO.read():…
Kricket
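One memory-safe pattern is to read the file line by line with ReadFromText and parse each line individually; a sketch (Python SDK, hypothetical path):

```python
import csv
import apache_beam as beam

def parse_line(line):
    # Parse one line at a time so a multi-gigabyte file never has to fit in
    # memory; note this simple approach breaks if quoted fields contain
    # embedded newlines.
    return next(csv.reader([line]))

with beam.Pipeline() as p:
    rows = (p
            | 'Read' >> beam.io.ReadFromText('gs://my-bucket/big.csv',
                                             skip_header_lines=1)
            | 'Parse' >> beam.Map(parse_line))
```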
13
votes
1 answer

How to combine streaming data with large history data set in Dataflow/Beam

I am investigating processing logs from web user sessions via Google Dataflow/Apache Beam and need to combine the user's logs as they come in (streaming) with the history of a user's session from the last month. I have looked at the following…
Florian
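One common pattern (a sketch, not necessarily the accepted answer): load the month of history once as a dict side input keyed by user id, then enrich each streaming event with it. All names are made up, and a side input loaded at startup will not reflect later changes to the history table:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def user_id_of(event):
    # Hypothetical: extract the user id from the raw Pub/Sub payload.
    return event.decode('utf-8').split(',')[0]

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    history = (p
               | 'ReadHistory' >> beam.io.ReadFromBigQuery(
                     query='SELECT user_id, session FROM `proj.logs.history`',
                     use_standard_sql=True)
               | 'KeyByUser' >> beam.Map(lambda row: (row['user_id'], row)))

    (p
     | 'ReadEvents' >> beam.io.ReadFromPubSub(
           subscription='projects/proj/subscriptions/events')
     | 'Enrich' >> beam.Map(
           lambda event, hist: (event, hist.get(user_id_of(event))),
           hist=beam.pvalue.AsDict(history)))
```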
13
votes
0 answers

How can I specify the number of workers for my Dataflow?

I have an Apache Beam pipeline that loads a large import file of around 90GB. I've written the pipeline in the Apache Beam Java SDK. Using the default settings for PipelineOptionsFactory, my job takes quite a while to complete. How can I control,…
Alex Harvey
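The worker count is controlled through pipeline options rather than pipeline code. A sketch with the Python spellings (the Java SDK exposes the same knobs, e.g. --numWorkers and --maxNumWorkers on DataflowPipelineOptions; project and region values are placeholders):

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    num_workers=10,                 # initial number of workers
    max_num_workers=50,             # ceiling for autoscaling
    autoscaling_algorithm='THROUGHPUT_BASED',
)
```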
12
votes
3 answers

Dataflow Pipeline - "Processing stuck in step for at least

The Dataflow pipelines developed by my team suddenly started getting stuck and stopped processing our events. Their worker logs became full of warning messages saying that one specific step got stuck. The peculiar thing is that the steps that are…
Caio Riva
12
votes
2 answers

Apache Beam over Apache Kafka Stream processing

What are the differences between Apache Beam and Apache Kafka with respect to stream processing? I am trying to grasp the technical and programmatic differences as well. Please help me understand by sharing your experience.
Stella
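In practice the two are complementary rather than competing: Kafka is a durable message transport (with its own processing library, Kafka Streams), while Beam is a processing model executed by a runner, and they are often combined. A sketch (Python SDK; Beam's Kafka connector is a cross-language transform, and broker/topic names are made up):

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka

with beam.Pipeline() as p:
    (p
     | 'ReadKafka' >> ReadFromKafka(
           consumer_config={'bootstrap.servers': 'localhost:9092'},
           topics=['events'])
     | 'Values' >> beam.Map(lambda kv: kv[1])  # elements are (key, value)
     | 'Print' >> beam.Map(print))
```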
12
votes
3 answers

Apache Beam Counter/Metrics not available in Flink WebUI

I'm using Flink 1.4.1 and Beam 2.3.0, and would like to know whether it is possible to have metrics available in the Flink WebUI (or anywhere at all), as in the Dataflow WebUI. I've used a counter like: import org.apache.beam.sdk.metrics.Counter; import…
robosoul
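The question's snippet uses the Java Counter; for reference, the same Beam metrics API in the Python SDK looks like the sketch below (namespace and names are arbitrary). Whether the values surface in the Flink WebUI depends on the runner's metrics integration in the versions involved:

```python
import apache_beam as beam
from apache_beam.metrics import Metrics

class CountingFn(beam.DoFn):
    def __init__(self):
        # Namespace + name identify the counter in the runner's metric system.
        self.elements = Metrics.counter('pipeline', 'processed_elements')

    def process(self, element):
        self.elements.inc()
        yield element
```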
12
votes
1 answer

Apache Beam in Dataflow Large Side Input

This is most similar to this question. I am creating a pipeline in Dataflow 2.x that takes streaming input from a Pubsub queue. Every single message that comes in needs to be streamed through a very large dataset that comes from Google BigQuery and…
Taylor
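When the dataset is too large to materialise as a side input, one common workaround (a sketch under that assumption; table, schema, and field names are hypothetical) is to do parameterised point lookups from inside a DoFn, reusing one client per worker:

```python
import apache_beam as beam
from google.cloud import bigquery

class LookupFn(beam.DoFn):
    def setup(self):
        # Created once per worker and reused across bundles.
        self.client = bigquery.Client()

    def process(self, key):
        job = self.client.query(
            'SELECT value FROM `my-project.my_dataset.big_table` '
            'WHERE key = @key',
            job_config=bigquery.QueryJobConfig(query_parameters=[
                bigquery.ScalarQueryParameter('key', 'STRING', key)]))
        for row in job:
            yield (key, row.value)
```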
12
votes
1 answer

import apache_beam metaclass conflict

When I try to import apache beam I get the following error. >>> import apache_beam Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/toor/pfff/local/lib/python2.7/site-packages/apache_beam/__init__.py", line 78,…
11
votes
1 answer

Apache Beam - Integration test with unbounded PCollection

We are building an integration test for an Apache Beam pipeline and are running into some issues. See below for context... Details about our pipeline: We use PubsubIO as our data source (unbounded PCollection) Intermediate transforms include a…
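One standard technique for this situation is to swap the unbounded source for a TestStream, which emits elements and advances the watermark deterministically so the pipeline terminates and assertions can run. A minimal sketch in the Python SDK (the Java SDK has an equivalent TestStream):

```python
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.test_stream import TestStream
from apache_beam.testing.util import assert_that, equal_to

with TestPipeline() as p:
    events = (TestStream()
              .add_elements(['a', 'b'])
              .advance_watermark_to_infinity())  # lets the pipeline finish
    result = p | events | beam.Map(str.upper)
    assert_that(result, equal_to(['A', 'B']))
```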
11
votes
2 answers

Writing different values to different BigQuery tables in Apache Beam

Suppose I have a PCollection<Foo> and I want to write it to multiple BigQuery tables, choosing a potentially different table for each Foo. How can I do this using the Apache Beam BigQueryIO API?
jkff
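In the Python SDK, WriteToBigQuery accepts a callable that picks the destination table per element (the Java SDK's counterpart is DynamicDestinations). A sketch with made-up table names and schema:

```python
import apache_beam as beam

def route(foo):
    # Hypothetical routing rule: choose the table from a field of the element.
    return 'my-project:my_dataset.foo_%s' % foo['kind']

with beam.Pipeline() as p:
    (p
     | 'Create' >> beam.Create([{'kind': 'a', 'value': 1},
                                {'kind': 'b', 'value': 2}])
     | 'Write' >> beam.io.WriteToBigQuery(
           table=route,
           schema='kind:STRING,value:INTEGER',
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```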
10
votes
2 answers

Error with installing apache-beam[gcp] on mac zsh terminal - “zsh: no matches found: apache-beam[gcp]”

I am using zsh, and I have installed gcloud in order to interact with GCP via local terminal on my Mac. I am encountering this error “zsh: no matches found: apache-beam[gcp]”. However, when I run the command directly on the bash terminal on the GCP…
10
votes
5 answers

Kotlin Iterable not supported in Apache Beam?

Apache Beam seems to be refusing to recognise Kotlin's Iterable. Here is a sample code: @ProcessElement fun processElement( @Element input: KV<String, Iterable<String>>, receiver: OutputReceiver<String> ) { val output = input.key + "|" +…
marcoseu
10
votes
2 answers

Google Cloud Dataflow stuck with repeated error 'Error syncing pod...failed to "StartContainer" for "sdk" with CrashLoopBackOff'

SDK: Apache Beam SDK for Go 0.5.0 Our Golang job has been running fine on Google Cloud Dataflow for weeks. We haven't made any updates to the job itself, and the SDK version seems to be the same as it has been. Last night it failed, and I'm not sure…
Tim