Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.


539 questions
2 votes, 1 answer

Apache Beam I/O Transforms

The Apache Beam documentation Authoring I/O Transforms - Overview states: Reading and writing data in Beam is a parallel task, and using ParDos, GroupByKeys, etc… is usually sufficient. Rarely, you will need the more specialized Source and Sink…
SSG
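
A hedged, minimal sketch of the ParDo-based approach the quoted overview describes: a DoFn that opens a connection per instance and writes each element directly, with no custom Sink involved. ExternalClient and its methods are hypothetical stand-ins for whatever system is being written to.

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    // A ParDo-based "sink": connect once per DoFn instance, write each
    // element, and close on teardown. ExternalClient is hypothetical.
    class WriteToExternalFn extends DoFn<String, Void> {
      private transient ExternalClient client;

      @Setup
      public void setup() {
        client = ExternalClient.connect("service-host:1234"); // hypothetical
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        client.write(c.element()); // hypothetical per-element write
      }

      @Teardown
      public void teardown() {
        client.close();
      }
    }

    // Usage: records.apply(ParDo.of(new WriteToExternalFn()));
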
2 votes, 1 answer

Is there a way to perform a Redis GET command with the built-in Apache beam Redis I/O transform?

My use case for Google Cloud Dataflow is to use Redis as a cache during the pipeline, since the transformation to be applied depends on some cached data. This would mean performing Redis GET commands. The documentation for the official, built-in Redis…
Matt Welke
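
RedisIO's built-in read is driven by key patterns rather than per-element lookups, so the usual workaround is a plain DoFn that issues GETs through a Redis client. A sketch assuming the Jedis client is on the classpath; host and port are placeholders.

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import redis.clients.jedis.Jedis;

    // Looks up each incoming key in Redis and emits (key, value) pairs.
    class RedisGetFn extends DoFn<String, KV<String, String>> {
      private transient Jedis jedis;

      @Setup
      public void setup() {
        jedis = new Jedis("redis-host", 6379); // placeholder host/port
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        String value = jedis.get(c.element()); // a plain Redis GET
        if (value != null) {
          c.output(KV.of(c.element(), value));
        }
      }

      @Teardown
      public void teardown() {
        jedis.close();
      }
    }
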
2 votes, 0 answers

Controlling parallelism in ParDo Transform while writing to DB

I am currently in the process of developing a pipeline using Apache Beam with Flink as an execution engine. As a part of the process I read data from Kafka and perform a bunch of transformations that involve joins, aggregations as well as lookups to…
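
One common pattern for capping write parallelism, sketched here under assumptions not taken from the question: records is a PCollection<String>, and WriteBatchToDbFn is an assumed DoFn that writes one group's elements sequentially. Keying onto a small fixed number of shards bounds how many bundles can write concurrently; on an unbounded Kafka source a Window.into(...) must precede the GroupByKey.

    import java.util.concurrent.ThreadLocalRandom;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.values.TypeDescriptors;

    final int numShards = 4; // at most 4 groups, hence at most 4 concurrent writers

    records
        .apply(WithKeys.of((String r) -> ThreadLocalRandom.current().nextInt(numShards))
            .withKeyType(TypeDescriptors.integers()))
        .apply(GroupByKey.create())
        .apply(ParDo.of(new WriteBatchToDbFn())); // assumed: writes each Iterable<String>
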
2 votes, 1 answer

Using defaultNaming for dynamic windowed writes in Apache Beam

I am following along with the answer to this post and the documentation in order to perform a dynamic windowed write on my data at the end of a pipeline. Here is what I have so far: static void applyWindowedWrite(PCollection stream) { …
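
For reference, a dynamic windowed write with defaultNaming has roughly this shape; a sketch assuming String elements routed by an assumed destinationOf function, with placeholder paths and prefixes.

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.values.PCollection;

    static void applyWindowedWrite(PCollection<String> stream) {
      stream.apply(
          FileIO.<String, String>writeDynamic()
              .by(contents -> destinationOf(contents)) // assumed routing function
              .via(TextIO.sink())
              .to("gs://some-bucket/output") // placeholder path
              .withDestinationCoder(StringUtf8Coder.of())
              .withNumShards(1)
              // defaultNaming builds window/pane information into each file
              // name, which windowed writes require.
              .withNaming(key -> FileIO.Write.defaultNaming("file-" + key, ".txt")));
    }
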
2 votes, 1 answer

Apache Beam - BigQueryIO read Projection

I have a Dataflow pipeline that reads from a BigQuery table. However, when reading the data, there is no other option than to read all records with the read(SerializableFunction) or the readTableRows() methods. I was wondering, when using these…
Robin Trietsch
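
Until column projection is exposed directly, the usual workaround is to push the projection into a query so BigQuery only materializes the needed columns; a sketch with placeholder table and column names. Newer SDKs also offer withSelectedFields when reading via the BigQuery Storage API.

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.values.PCollection;

    // Read only two columns instead of whole TableRows.
    PCollection<TableRow> rows = pipeline.apply(
        BigQueryIO.readTableRows()
            .fromQuery("SELECT user_id, event_ts FROM `my-project.my_dataset.my_table`")
            .usingStandardSql());
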
2 votes, 1 answer

Is it possible to read a text file sequentially?

I'm using beam.io.ReadFromText to process data from textual files. Parsing the files is more complex than reading by lines (there is some state that needs to be carried and changed from line to line). Can I make Beam read my file with only one…
Zach Moshe
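
The question uses the Python SDK, but the pattern is the same in either SDK: match the file and hand it to a single DoFn as one element, so one worker parses it start to finish and can carry state across lines. A Java sketch; parseStatefully is an assumed parser returning an Iterable<String>.

    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    pipeline
        .apply(FileIO.match().filepattern("gs://bucket/input.txt"))
        .apply(FileIO.readMatches())
        .apply(ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) throws Exception {
            // The whole file arrives as one element, so parsing is sequential.
            String contents = c.element().readFullyAsUTF8String();
            for (String record : parseStatefully(contents)) { // assumed parser
              c.output(record);
            }
          }
        }));
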
2 votes, 2 answers

Write to dynamic destination to cloud storage in dataflow in Python

I was trying to read from a big file in cloud storage and shard its records according to a given field. I'm planning to Read | Map(lambda x: (x[key field], x)) | GroupByKey | Write to a file named after the key field. However I couldn't find a way to…
2 votes, 1 answer

How can I pass runtime parameters to BigtableIO in Java?

According to this page, runtime parameters are not supported for BigtableIO (only for BigQuery, PubSub and Text). Is there a possible workaround or example to do so without reimplementing the class? Actually I was using CloudBigtableIO from…
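
Later SDK releases added ValueProvider overloads to BigtableIO, which covers the template-parameter case without reimplementing the class; a sketch assuming one of those releases, with placeholder project and instance ids.

    import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.ValueProvider;

    // A template option that carries the table id at runtime.
    public interface MyOptions extends PipelineOptions {
      ValueProvider<String> getBigtableTableId();
      void setBigtableTableId(ValueProvider<String> value);
    }

    // mutations is a PCollection<KV<ByteString, Iterable<Mutation>>>.
    mutations.apply(
        BigtableIO.write()
            .withProjectId("my-project")   // placeholder
            .withInstanceId("my-instance") // placeholder
            .withTableId(options.getBigtableTableId())); // runtime parameter
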
2 votes, 1 answer

What are the type parameters for FileBasedSink?

I'm migrating a custom sink extending FileBasedSink from version 2.0.0 to 2.2.0. The class has changed and added two extra type parameters: UserT and DestinationT: @Experimental(value=FILESYSTEM) public abstract class…
Paweł Szczur
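
For orientation: in 2.2.0 the class is FileBasedSink<UserT, DestinationT, OutputT>, where UserT is the element type the pipeline hands to the sink, DestinationT keys dynamic destinations, and OutputT is what actually gets written to files. A sketch of a single-destination declaration; MyRecord is a stand-in, and the constructor shape is an assumption based on the 2.2.0 API.

    import org.apache.beam.sdk.io.FileBasedSink;
    import org.apache.beam.sdk.io.fs.ResourceId;
    import org.apache.beam.sdk.options.ValueProvider;

    // DestinationT = Void because this sink writes to a single destination,
    // so no routing key is needed; OutputT = String is the formatted value.
    abstract class MyCustomSink extends FileBasedSink<MyRecord, Void, String> {
      MyCustomSink(
          ValueProvider<ResourceId> tempDirectory,
          DynamicDestinations<MyRecord, Void, String> destinations) {
        super(tempDirectory, destinations);
        // createWriteOperation() still has to be implemented, as before.
      }
    }
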
2 votes, 1 answer

Chaining another transform after DataStoreIO.Write

I am creating a Google Dataflow pipeline using the Apache Beam Java SDK. I have a few transforms there, and I finally create a collection of entities (PCollection<Entity>). I need to write this into Google Datastore and then perform another…
Venky
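
DatastoreIO.v1().write() returns PDone, so nothing can be applied after it directly. A common workaround, sketched here with writeEntity as a hypothetical helper around the Datastore client: do the write inside a DoFn and re-emit each element, so downstream transforms run only for entities that have been written.

    import com.google.datastore.v1.Entity;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;

    PCollection<Entity> written = entities.apply("WriteToDatastore",
        ParDo.of(new DoFn<Entity, Entity>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            writeEntity(c.element()); // hypothetical Datastore client call
            c.output(c.element());    // re-emit so the next transform can run
          }
        }));
    // written.apply(...) now runs strictly after each entity's write.
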
2 votes, 0 answers

Dataflow worker unable to connect to Kafka through Cloud VPN

I have issues connecting a KafkaIO source to brokers available only through a Cloud VPN tunnel. The tunnel is set up to allow traffic from a specific subnetwork (secure) and routes are set up and working for compute engines in that…
2 votes, 3 answers

Apache Beam with Dataflow - Nullpointer when reading from BigQuery

I am running a job on Google Dataflow, written with Apache Beam, that reads from a BigQuery table and from files, transforms the data, and writes it into other BigQuery tables. The job "usually" succeeds, but sometimes I am randomly getting nullpointer…
2 votes, 2 answers

Creating Date Partitioned Tables programmatically

Is there a way to create a date-partitioned table using Apache Beam's BigQueryIO? In other words, is there a way to use a partition decorator for a table which is not created yet? I know that I can create a table first and then use partition…
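
For reference, later BigQueryIO versions can attach partitioning to a table the write itself creates, which removes the need for decorators on a pre-created table; a sketch with a placeholder table name and an assumed tableSchema.

    import com.google.api.services.bigquery.model.TimePartitioning;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;

    rows.apply(
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table") // placeholder
            .withSchema(tableSchema)              // assumed TableSchema
            .withTimePartitioning(new TimePartitioning().setType("DAY"))
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));
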
2 votes, 1 answer

Apache Beam exception when running wordcount example

I think I followed every step in the document, but I still ran into this exception. (The only difference is that I run this from Eclipse J2EE, but I wouldn't expect that to really matter, would it?) Code: (I didn't write this, it's right from the beam…
foxwendy
1 vote, 1 answer

Apache Beam version upgrade fails the ETL Pipeline

I am currently using Apache Beam version 2.39.0, and it is showing me errors on Dataflow: This version of the SDK is deprecated and will eventually no longer be supported. I am trying to upgrade from 2.39.0 to 2.48.0, and when I run mvn…