Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.


539 questions
2 votes, 1 answer

Apache Beam I/O Transforms

The Apache Beam documentation Authoring I/O Transforms - Overview states: Reading and writing data in Beam is a parallel task, and using ParDos, GroupByKeys, etc… is usually sufficient. Rarely, you will need the more specialized Source and Sink…
SSG
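
A hedged, minimal sketch of the ParDo-based approach the quoted overview describes: a DoFn that opens a connection per instance and writes each element directly, with no custom Sink involved. ExternalClient and its methods are hypothetical stand-ins for whatever system is being written to.

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    // A ParDo-based "sink": connect once per DoFn instance, write each
    // element, and close on teardown. ExternalClient is hypothetical.
    class WriteToExternalFn extends DoFn<String, Void> {
      private transient ExternalClient client;

      @Setup
      public void setup() {
        client = ExternalClient.connect("service-host:1234"); // hypothetical
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        client.write(c.element()); // hypothetical per-element write
      }

      @Teardown
      public void teardown() {
        client.close();
      }
    }

    // Usage: records.apply(ParDo.of(new WriteToExternalFn()));
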
2 votes, 1 answer

Is there a way to perform a Redis GET command with the built-in Apache beam Redis I/O transform?

My use case for Google Cloud Dataflow is to use Redis as a cache during the pipeline, since the transformation to be applied depends on some cached data. This would mean performing Redis GET commands. The documentation for the official, built-in Redis…
Matt Welke
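
RedisIO's built-in read is driven by key patterns rather than per-element lookups, so the usual workaround is a plain DoFn that issues GETs through a Redis client. A sketch assuming the Jedis client is on the classpath; host and port are placeholders.

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import redis.clients.jedis.Jedis;

    // Looks up each incoming key in Redis and emits (key, value) pairs.
    class RedisGetFn extends DoFn<String, KV<String, String>> {
      private transient Jedis jedis;

      @Setup
      public void setup() {
        jedis = new Jedis("redis-host", 6379); // placeholder host/port
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        String value = jedis.get(c.element()); // a plain Redis GET
        if (value != null) {
          c.output(KV.of(c.element(), value));
        }
      }

      @Teardown
      public void teardown() {
        jedis.close();
      }
    }
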
2 votes, 0 answers

Controlling parallelism in ParDo Transform while writing to DB

I am currently in the process of developing a pipeline using Apache Beam with Flink as an execution engine. As a part of the process I read data from Kafka and perform a bunch of transformations that involve joins, aggregations as well as lookups to…
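
One common pattern for capping write parallelism, sketched here under assumptions not taken from the question: records is a PCollection<String>, and WriteBatchToDbFn is an assumed DoFn that writes one group's elements sequentially. Keying onto a small fixed number of shards bounds how many bundles can write concurrently; on an unbounded Kafka source a Window.into(...) must precede the GroupByKey.

    import java.util.concurrent.ThreadLocalRandom;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.values.TypeDescriptors;

    final int numShards = 4; // at most 4 groups, hence at most 4 concurrent writers

    records
        .apply(WithKeys.of((String r) -> ThreadLocalRandom.current().nextInt(numShards))
            .withKeyType(TypeDescriptors.integers()))
        .apply(GroupByKey.create())
        .apply(ParDo.of(new WriteBatchToDbFn())); // assumed: writes each Iterable<String>
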
2 votes, 1 answer

Using defaultNaming for dynamic windowed writes in Apache Beam

I am following along with the answer to this post and the documentation in order to perform a dynamic windowed write on my data at the end of a pipeline. Here is what I have so far: static void applyWindowedWrite(PCollection stream) { …
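
For reference, a dynamic windowed write with defaultNaming has roughly this shape; a sketch assuming String elements routed by an assumed destinationOf function, with placeholder paths and prefixes.

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.values.PCollection;

    static void applyWindowedWrite(PCollection<String> stream) {
      stream.apply(
          FileIO.<String, String>writeDynamic()
              .by(contents -> destinationOf(contents)) // assumed routing function
              .via(TextIO.sink())
              .to("gs://some-bucket/output") // placeholder path
              .withDestinationCoder(StringUtf8Coder.of())
              .withNumShards(1)
              // defaultNaming builds window/pane information into each file
              // name, which windowed writes require.
              .withNaming(key -> FileIO.Write.defaultNaming("file-" + key, ".txt")));
    }
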
2 votes, 1 answer

Apache Beam - BigQueryIO read Projection

I have a Dataflow pipeline that reads from a BigQuery table. However, when reading the data, there is no other option than to read all records with the read(SerializableFunction) or the readTableRows() methods. I was wondering, when using these…
Robin Trietsch
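
Until column projection is exposed directly, the usual workaround is to push the projection into a query so BigQuery only materializes the needed columns; a sketch with placeholder table and column names. Newer SDKs also offer withSelectedFields when reading via the BigQuery Storage API.

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.values.PCollection;

    // Read only two columns instead of whole TableRows.
    PCollection<TableRow> rows = pipeline.apply(
        BigQueryIO.readTableRows()
            .fromQuery("SELECT user_id, event_ts FROM `my-project.my_dataset.my_table`")
            .usingStandardSql());
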
2 votes, 1 answer

Is it possible to read a text file sequentially?

I'm using beam.io.ReadFromText to process data from textual files. Parsing the files is more complex than reading by lines (there is some state that needs to be carried and changed from line to line). Can I make Beam read my file with only one…
Zach Moshe
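
The question uses the Python SDK, but the pattern is the same in either SDK: match the file and hand it to a single DoFn as one element, so one worker parses it start to finish and can carry state across lines. A Java sketch; parseStatefully is an assumed parser returning an Iterable<String>.

    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    pipeline
        .apply(FileIO.match().filepattern("gs://bucket/input.txt"))
        .apply(FileIO.readMatches())
        .apply(ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) throws Exception {
            // The whole file arrives as one element, so parsing is sequential.
            String contents = c.element().readFullyAsUTF8String();
            for (String record : parseStatefully(contents)) { // assumed parser
              c.output(record);
            }
          }
        }));
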
2 votes, 2 answers

Write to dynamic destination to cloud storage in dataflow in Python

I was trying to read from a big file in cloud storage and shard its records according to a given field. I'm planning to Read | Map(lambda x: (x[key field], x)) | GroupByKey | Write to a file named after the key field. However I couldn't find a way to…
2 votes, 1 answer

How can I pass runtime parameters to BigtableIO in Java?

According to this page, runtime parameters are not supported for BigtableIO (only for BigQuery, PubSub and Text). Is there a possible workaround or example to do so without reimplementing the class? Actually I was using CloudBigtableIO from…
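
Later SDK releases added ValueProvider overloads to BigtableIO, which covers the template-parameter case without reimplementing the class; a sketch assuming one of those releases, with placeholder project and instance ids.

    import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.ValueProvider;

    // A template option that carries the table id at runtime.
    public interface MyOptions extends PipelineOptions {
      ValueProvider<String> getBigtableTableId();
      void setBigtableTableId(ValueProvider<String> value);
    }

    // mutations is a PCollection<KV<ByteString, Iterable<Mutation>>>.
    mutations.apply(
        BigtableIO.write()
            .withProjectId("my-project")   // placeholder
            .withInstanceId("my-instance") // placeholder
            .withTableId(options.getBigtableTableId())); // runtime parameter
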
2 votes, 1 answer

What are the type parameters for FileBasedSink?

I'm migrating a custom sink extending FileBasedSink from version 2.0.0 to 2.2.0. The class has changed and added two extra type parameters: UserT and DestinationT: @Experimental(value=FILESYSTEM) public abstract class…
Paweł Szczur
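
For orientation: in 2.2.0 the class is FileBasedSink<UserT, DestinationT, OutputT>, where UserT is the element type the pipeline hands to the sink, DestinationT keys dynamic destinations, and OutputT is what actually gets written to files. A sketch of a single-destination declaration; MyRecord is a stand-in, and the constructor shape is an assumption based on the 2.2.0 API.

    import org.apache.beam.sdk.io.FileBasedSink;
    import org.apache.beam.sdk.io.fs.ResourceId;
    import org.apache.beam.sdk.options.ValueProvider;

    // DestinationT = Void because this sink writes to a single destination,
    // so no routing key is needed; OutputT = String is the formatted value.
    abstract class MyCustomSink extends FileBasedSink<MyRecord, Void, String> {
      MyCustomSink(
          ValueProvider<ResourceId> tempDirectory,
          DynamicDestinations<MyRecord, Void, String> destinations) {
        super(tempDirectory, destinations);
        // createWriteOperation() still has to be implemented, as before.
      }
    }
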
2 votes, 1 answer

Chaining another transform after DataStoreIO.Write

I am creating a Google Dataflow pipeline using the Apache Beam Java SDK. I have a few transforms there, and I finally create a collection of entities (PCollection<Entity>). I need to write this into Google Datastore and then perform another…
Venky
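
DatastoreIO.v1().write() returns PDone, so nothing can be applied after it directly. A common workaround, sketched here with writeEntity as a hypothetical helper around the Datastore client: do the write inside a DoFn and re-emit each element, so downstream transforms run only for entities that have been written.

    import com.google.datastore.v1.Entity;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;

    PCollection<Entity> written = entities.apply("WriteToDatastore",
        ParDo.of(new DoFn<Entity, Entity>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            writeEntity(c.element()); // hypothetical Datastore client call
            c.output(c.element());    // re-emit so the next transform can run
          }
        }));
    // written.apply(...) now runs strictly after each entity's write.
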
2 votes, 0 answers

Dataflow worker unable to connect to Kafka through Cloud VPN

I have issues connecting a KafkaIO source to brokers available only through a Cloud VPN tunnel. The tunnel is set up to allow traffic from a specific subnetwork (secure) and routes are set up and working for compute engines in that…
2 votes, 3 answers

Apache Beam with Dataflow - Nullpointer when reading from BigQuery

I am running a job on Google Dataflow, written with Apache Beam, that reads from a BigQuery table and from files, transforms the data, and writes it into other BigQuery tables. The job "usually" succeeds, but sometimes I am randomly getting nullpointer…
2 votes, 2 answers

Creating Date Partitioned Tables programmatically

Is there a way to create a date-partitioned table using Apache Beam's BigQueryIO? In other words, is there a way to use a partition decorator for a table which is not created yet? I know that I can create a table first and then use partition…
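
For reference, later BigQueryIO versions can attach partitioning to a table the write itself creates, which removes the need for decorators on a pre-created table; a sketch with a placeholder table name and an assumed tableSchema.

    import com.google.api.services.bigquery.model.TimePartitioning;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;

    rows.apply(
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table") // placeholder
            .withSchema(tableSchema)              // assumed TableSchema
            .withTimePartitioning(new TimePartitioning().setType("DAY"))
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));
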
2 votes, 1 answer

Apache Beam exception when running wordcount example

I think I followed every step in the document, but I still ran into this exception. (The only difference is that I run this from Eclipse J2EE, but I wouldn't expect that to really matter, would it?) Code: (I didn't write this, it's right from the beam…
foxwendy
1 vote, 1 answer

Apache Beam version upgrade fails the ETL Pipeline

I am currently using Apache Beam version 2.39.0, and it is showing me errors on Dataflow: This version of the SDK is deprecated and will eventually no longer be supported. I am trying to upgrade from 2.39.0 to 2.48.0, and when I run mvn…