Questions tagged [apache-beam]

Apache Beam is a unified SDK for batch and stream processing. It lets users specify large-scale data processing workflows with a Beam-specific DSL. Beam workflows can be executed on different runtimes such as Apache Flink, Apache Spark, or Google Cloud Dataflow (a cloud service).

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

The programming model behind Beam evolved at Google and was originally known as the “Dataflow Model”. Beam pipelines can be executed on different runtimes like Apache Flink, Apache Spark, or Google Cloud Dataflow.

4676 questions
10
votes
1 answer

Read Files from multiple folders in Apache Beam and map outputs to filenames

Working on reading files from multiple folders and then outputting the file contents along with the file name, like (filecontents, filename), to BigQuery in Apache Beam using the Python SDK and a Dataflow runner. Originally thought I could create a PCollection…
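A common answer to this pattern pairs each matched file's path with its contents before writing. The helper below is a minimal stdlib-only sketch of that pairing step; the column names and the Beam wiring shown in comments (fileio.MatchFiles / fileio.ReadMatches) are assumptions about how it would typically be used, not the asker's code.

```python
import os

def to_bq_row(path, contents):
    """Pair file contents with the bare filename as a dict ready for
    a BigQuery sink. Column names 'filename'/'contents' are illustrative."""
    return {"filename": os.path.basename(path), "contents": contents}

# In a Beam pipeline this logic would typically sit after file matching:
#   p | fileio.MatchFiles("gs://bucket/*/out.txt")
#     | fileio.ReadMatches()
#     | beam.Map(lambda f: to_bq_row(f.metadata.path, f.read_utf8()))
#     | beam.io.WriteToBigQuery(...)

rows = [to_bq_row(p, c) for p, c in [
    ("gs://bucket/folder_a/one.txt", "hello"),
    ("gs://bucket/folder_b/two.txt", "world"),
]]
print(rows[0])  # {'filename': 'one.txt', 'contents': 'hello'}
```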
10
votes
3 answers

Start CloudSQL Proxy on Python Dataflow / Apache Beam

I am currently working on an ETL Dataflow job (using the Apache Beam Python SDK) which queries data from CloudSQL (with psycopg2 and a custom ParDo) and writes it to BigQuery. My goal is to create a Dataflow template which I can start from a…
10
votes
1 answer

Windowing with Apache Beam - Fixed Windows Don't Seem to be Closing?

We are attempting to use fixed windows on an Apache Beam pipeline (using DirectRunner). Our flow is as follows: pull data from Pub/Sub; deserialize JSON into a Java object; window events with fixed windows of 5 seconds; using a custom CombineFn, combine…
Chris Staikos
  • 1,150
  • 10
  • 24
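The usual subtlety behind "windows not closing" is that assigning an element to a fixed window and emitting that window's result are separate events: the result only fires once the watermark passes the window's end. The sketch below mirrors the fixed-window assignment arithmetic in plain Python (integer-second timestamps for clarity); it is an illustration of the model, not Beam's implementation.

```python
def fixed_window(ts, size=5, offset=0):
    """Return the (start, end) of the fixed window containing timestamp
    `ts`, mirroring the arithmetic of Beam's FixedWindows assignment."""
    start = ts - ((ts - offset) % size)
    return (start, start + size)

# Assignment alone does not emit results: with DirectRunner + Pub/Sub,
# a window only "closes" when the watermark passes its end. If no new
# data arrives to advance the watermark, the last window's output can
# appear to hang indefinitely.
print(fixed_window(12))  # (10, 15)
print(fixed_window(10))  # (10, 15) -- start is inclusive, end exclusive
```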
10
votes
2 answers

ClassNotFound exception when attempting to use DataflowRunner

I'm trying to launch a Dataflow job on GCP using Apache Beam 0.6.0. I am compiling an uber jar using the shade plugin because I cannot launch the job using "mvn exec:java". I'm including this dependency:
10
votes
1 answer

What does object of type '_UnwindowedValues' has no len() mean?

I'm using Dataflow 0.5.5 (Python). I ran into the following error in very simple code: print(len(row_list)), where row_list is a list. Exactly the same code, same data, and same pipeline runs perfectly fine on DirectRunner, but throws the following exception…
foxwendy
  • 2,819
  • 2
  • 28
  • 50
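The root cause in questions like this is that grouped values on the Dataflow runner arrive as a lazy iterable (the `_UnwindowedValues` type in the error) rather than a materialized list, so `len()` is not supported. A portable fix is to count by iterating; the snippet below demonstrates the failure mode with a plain generator as a stand-in for Dataflow's internal type.

```python
def safe_count(values):
    """Count elements of any iterable without requiring len(); works for
    lists (DirectRunner) and lazy iterables (Dataflow) alike."""
    return sum(1 for _ in values)

def grouped_values():
    # A generator is a stand-in for _UnwindowedValues: it has no len().
    return (x for x in [1, 2, 3])

try:
    len(grouped_values())
except TypeError as e:
    print(e)  # object of type 'generator' has no len()

print(safe_count(grouped_values()))  # 3
```

Materializing with `list(values)` also works, at the cost of holding the whole group in memory.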
10
votes
3 answers

How to convert csv into a dictionary in apache beam dataflow

I would like to read a CSV file and write it to BigQuery using Apache Beam on Dataflow. To do this I need to present the data to BigQuery in the form of a dictionary. How can I transform the data using Apache Beam to achieve this? My input…
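A per-line parse with the stdlib csv module is the usual answer here, since it handles quoted fields that a bare `line.split(',')` would break. The helper below is a minimal sketch; the field names and the Beam wiring in the comment are illustrative assumptions.

```python
import csv
import io

def csv_line_to_dict(line, fieldnames):
    """Parse one CSV line into a dict keyed by the given column names.
    Correctly handles quoted fields, unlike line.split(',')."""
    reader = csv.DictReader(io.StringIO(line), fieldnames=fieldnames)
    return next(reader)

# In Beam this would run per element, e.g.:
#   lines | beam.Map(csv_line_to_dict, fieldnames=["name", "city"])
#         | beam.io.WriteToBigQuery(table, schema=...)

row = csv_line_to_dict('Alice,"New York"', ["name", "city"])
print(row)  # {'name': 'Alice', 'city': 'New York'}
```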
9
votes
2 answers

Dockerized Apache Beam returns "No id provided"

I've hit a problem with dockerized Apache Beam. When trying to run the container I am getting a "No id provided." message and nothing more. Here's the code and files: Dockerfile FROM apache/beam_python3.8_sdk:latest RUN apt update RUN apt install -y…
Jakub Czaplicki
  • 1,787
  • 2
  • 28
  • 50
9
votes
0 answers

Apache Beam Python SDK ReadFromKafka does not receive data

I'm trying out a simple example of reading data off a Kafka topic into Apache Beam. Here's the relevant snippet: with beam.Pipeline(options=pipeline_options) as pipeline: _ = ( pipeline | 'Read from Kafka' >> ReadFromKafka( …
sumeetkm
  • 189
  • 1
  • 7
9
votes
1 answer

Apply TensorFlow Transform to transform/scale features in production

Overview I followed the following guide to write TFRecords, where I used tf.Transform to preprocess my features. Now, I would like to deploy my model, for which I need to apply this preprocessing function to real live data. My Approach First, suppose…
9
votes
2 answers

Dataflow/apache beam - how to access current filename when passing in pattern?

I have seen this question answered before on Stack Overflow (https://stackoverflow.com/questions/29983621/how-to-get-filename-when-using-file-pattern-match-in-google-cloud-dataflow), but not since Apache Beam added Splittable DoFn functionality…
9
votes
1 answer

Dataflow, loading a file with a customer supplied encryption key

When trying to load a GCS file using a CSEK, I get a Dataflow error: [ERROR] The target object is encrypted by a customer-supplied encryption key. I was going to try to AES-decrypt on the Dataflow side, but I see I can't even get the file without…
9
votes
1 answer

Apache Beam: What is the difference between DoFn and SimpleFunction?

While reading about processing streaming elements in Apache Beam using Java, I came across DoFn and then SimpleFunction. Both of these look similar to me and I find it difficult to understand the…
kaxil
  • 17,706
  • 2
  • 59
  • 78
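The core distinction the answers to this question draw: a SimpleFunction (used with MapElements in the Java SDK) is a typed wrapper around a strictly one-input, one-output transform, while a DoFn used with ParDo can emit zero, one, or many outputs per input and gets lifecycle hooks and per-element context. The plain-Python analogy below illustrates just the cardinality difference; the function names here are illustrative, not Beam's API.

```python
def map_elements(fn, elements):
    """MapElements/SimpleFunction style: exactly one output per input."""
    return [fn(x) for x in elements]

def par_do(process, elements):
    """ParDo/DoFn style: process() may yield zero, one, or many outputs
    per input element (fan-out, filtering, side outputs, etc.)."""
    out = []
    for x in elements:
        out.extend(process(x))
    return out

def split_words(line):
    # A DoFn-like process method can fan out: 1 input -> n outputs.
    yield from line.split()

print(map_elements(str.upper, ["a b"]))  # ['A B']
print(par_do(split_words, ["a b", "c"]))  # ['a', 'b', 'c']
```

Beyond cardinality, a real DoFn also offers setup/teardown, access to element timestamps and windows, and side outputs, none of which a SimpleFunction exposes.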
9
votes
3 answers

worker_machine_type tag not working in Google Cloud Dataflow with python

I am using Apache Beam in Python with Google Cloud Dataflow (2.3.0). When specifying the worker_machine_type parameter as e.g. n1-highmem-2 or custom-1-6656, Dataflow runs the job but always uses the standard machine type n1-standard-1 for every…
dumkar
  • 735
  • 1
  • 5
  • 15
9
votes
1 answer

At what stage does Dataflow/Apache Beam ack a pub/sub message?

I have a Dataflow streaming job with a Pub/Sub subscription as an unbounded source. I want to know at what stage Dataflow acks the incoming Pub/Sub message. It appears to me that the message is lost if an exception is thrown during any stage of…
Kakaji
  • 1,421
  • 2
  • 15
  • 23
9
votes
1 answer

Apache Beam BigQueryIO write slow

My Beam pipeline is writing to an unpartitioned BigQuery target table. The PCollection consists of millions of TableRows. BigQueryIO apparently first creates a temp file for every single record in the BigQueryWriteTemp folder if I run it with…
jimmy
  • 4,471
  • 3
  • 22
  • 28