Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.
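
For illustration, a minimal sketch of both sides of Beam I/O in the Python SDK, with one read transform and one write transform; the file paths are placeholders:

    import apache_beam as beam

    # Read lines from a source and write them to a sink; both steps are "Beam I/O".
    # The paths below are placeholders, not real locations.
    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText("gs://example-bucket/input*.txt")
            | "Write" >> beam.io.WriteToText("gs://example-bucket/output", file_name_suffix=".txt")
        )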

539 questions
0
votes
2 answers

Apache Beam - Flink runner - FileIO.write - issues in S3 writes

I am currently working on a Beam pipeline (2.23) (Flink runner 1.8) where we read JSON events from Kafka and write the output in Parquet format to S3. We write to S3 every 10 minutes. We have observed that our pipeline sometimes stops writing to…
infiniti
  • 61
  • 6
0
votes
1 answer

Apache Beam / Google Cloud Dataflow BigQuery reader failing from the second run

We have a Dataflow pipeline built using Apache Beam and deployed on GCP Dataflow infrastructure. The Dataflow instance runs perfectly the first time and creates the partitioned table as expected, but from the second run onwards it wipes out the results from the dataset,…
Vijay Mohan
  • 1,056
  • 14
  • 34
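
One setting worth checking for this symptom is the BigQuery write disposition. A minimal sketch using the Python SDK's WriteToBigQuery (the table spec and schema are placeholders, and the question's pipeline may be written with a different SDK):

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Create rows" >> beam.Create([{"id": 1, "name": "example"}])
            | "Write" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.my_table",  # placeholder table spec
                schema="id:INTEGER,name:STRING",
                # WRITE_APPEND keeps existing rows across runs;
                # WRITE_TRUNCATE replaces the table contents on every run.
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )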
0
votes
2 answers

Unable to read Pub/Sub messages with Apache Beam (Python SDK)

I'm trying to stream messages from a Pub/Sub topic with the Beam programming framework (Python SDK) and write them out to the console. This is my code (with apache-beam==2.27.0): import apache_beam as beam from apache_beam.options.pipeline_options…
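
A minimal sketch of this read-and-print pattern with the Python SDK's ReadFromPubSub (the topic name is a placeholder):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True  # Pub/Sub is an unbounded source

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/my-topic")  # placeholder
            | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))  # Pub/Sub payloads arrive as bytes
            | "Print" >> beam.Map(print)
        )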
0
votes
1 answer

Is there a way I can consume Google Pub/Sub messages using synchronous pull in an Apache Beam job

I have already gone through the client library provided by Google in the doc below. The given client library only polls messages from Pub/Sub, but it will not poll continuously unless we create an UnboundedSource…
0
votes
1 answer

Is a Source that has an unknown but limited number of elements considered a BoundedSource or an UnboundedSource?

Is a Source that has an unknown but limited number of elements considered a BoundedSource or an UnboundedSource? If I were able to implement both BoundedSource and UnboundedSource, which one is "better"? By "better" I mean which would offer more options or better…
spam
  • 1,853
  • 2
  • 13
  • 33
0
votes
1 answer

Is there withFormatFunction equivalent in Apache Beam Python SDK?

I'm passing a PCollection of dictionaries to the WriteToBigQuery class. However, some fields of the dictionaries aren't meant to be written to BigQuery tables. They're important for deciding the table name for the element (in streaming mode). This is done by…
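
The Python WriteToBigQuery does accept a callable for its table argument, which covers the dynamic-table-name half of the question; a sketch with hypothetical field and table names:

    import apache_beam as beam

    def pick_table(element):
        # Hypothetical routing: choose the destination table from a field of the element.
        return "my-project:my_dataset.events_%s" % element["event_type"]

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Create" >> beam.Create([{"event_type": "click", "user_id": "a"}])
            | "Write" >> beam.io.WriteToBigQuery(
                table=pick_table,
                schema="event_type:STRING,user_id:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )

Whether the routing-only fields can also be dropped from the written row, as withFormatFunction allows in the Java SDK, is exactly what the question asks; the sketch above only shows the callable table argument.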
0
votes
0 answers

Apache Beam Python fileio.WriteToFiles oversharding

I'm using fileio.WriteToFiles in a streaming Python pipeline. I explicitly specified an expected shard count as follows: fileio.WriteToFiles( path=..., file_naming=fileio.default_file_naming(prefix="output", suffix=".txt"), …
Jiayuan Ma
  • 1,891
  • 2
  • 13
  • 25
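
For reference, a sketch of the shards parameter on fileio.WriteToFiles in the Python SDK (the path is a placeholder); in streaming mode the number of output files also depends on windows and panes, which may be related to the oversharding the question reports:

    import apache_beam as beam
    from apache_beam.io import fileio

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Create" >> beam.Create(["line-1", "line-2"])
            | "Write" >> fileio.WriteToFiles(
                path="/tmp/output_dir",  # placeholder path
                file_naming=fileio.default_file_naming(prefix="output", suffix=".txt"),
                shards=1,  # requested shard count per destination
            )
        )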
0
votes
1 answer

Why Does Apache Beam BigQueryIO Use Same JobId per Run?

I'm running a batch Dataflow job which reads all the rows from a BigQuery table, converts them into JSON strings, and then writes the strings out to a PubSub topic. This template will be reused with the same or different parameters and should always…
0
votes
1 answer

Apache Beam - what are the limits of the Deduplication function

I have a Google Dataflow pipeline built using Apache Beam. The application receives about 50M records every day; to ignore duplicate records, we are planning to use the Deduplication function provided by the Beam framework. The documentation doesn't state…
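
For context, a sketch of the deduplication transform as exposed in the Python SDK; the module path and duration parameters here are my reading of the API and should be checked against the SDK version in use:

    import apache_beam as beam
    from apache_beam.transforms.deduplicate import Deduplicate
    from apache_beam.utils.timestamp import Duration

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Create" >> beam.Create(["a", "b", "a"])
            # Assumed parameter: keep deduplication state for 10 minutes of processing
            # time, so duplicates arriving after that window would not be caught.
            | "Dedup" >> Deduplicate(processing_time_duration=Duration(seconds=10 * 60))
            | "Print" >> beam.Map(print)
        )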
0
votes
1 answer

Is there a way to completely swap out the way serialization is handled with Apache Beam?

I'm using Kotlin with Apache Beam and I have a set of DTOs that reference each other and all serialize fine with any Kotlinx Serialization encoder. When I try to use them with Beam I end up having issues because it's looking for all objects,…
0
votes
1 answer

Unable to access PCollection outside the with block

for table_name, key_pair in relation_repl_key.items(): try: with beam.Pipeline(options=PipelineOptions()) as p: PCollection = p | "Reading from source database" >> relational_db.ReadFromDB( source_config=source_config, …
Souvik Dey
  • 653
  • 1
  • 9
  • 18
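
For context, the usual pattern is to keep every transform, including the output write, inside the with block, because the pipeline only executes when that block exits; a minimal sketch with the question's relational_db read replaced by a hypothetical in-memory source:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions()) as p:
        rows = (
            p
            | "Read rows" >> beam.Create([{"id": 1}, {"id": 2}])  # stand-in for the DB read
            | "Format" >> beam.Map(str)
        )
        # Materialize results inside the pipeline; a PCollection is not a regular
        # Python collection and cannot be iterated after the with block ends.
        rows | "Write" >> beam.io.WriteToText("/tmp/rows", file_name_suffix=".txt")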
0
votes
2 answers

Apache Beam GroupByKey outputting duplicate elements with PubSubIO

We need to group Pub/Sub messages by one of the fields from the messages. We used a fixed window of 15 minutes to group these messages. When run on Dataflow, the GroupByKey used for grouping the messages is introducing too many duplicate elements, another…
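
For reference, a minimal sketch of the fixed-window-plus-GroupByKey shape described above, written with the Python SDK (the question's pipeline may use another SDK; the subscription name and key extraction are hypothetical):

    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/my-sub")  # placeholder
            | "Key" >> beam.Map(lambda msg: ("some-field-value", msg))    # hypothetical key extraction
            | "Window" >> beam.WindowInto(FixedWindows(15 * 60))          # 15-minute fixed windows
            | "Group" >> beam.GroupByKey()
            | "Print" >> beam.Map(print)
        )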
0
votes
1 answer

Apache Beam logs messages with the wrong tags

Error logs don't show up in the GCP console. Warning logs are logged as info (so I've been using them to log info messages). E.g., test = "hello debug world"; logging.warning("%s", test) # will log as an info message in the GCP Dataflow console. Info logs don't…
0
votes
1 answer

How to create a tar.gz file using Apache Beam

I used the code below to create a tar.gz file; a .gz file was created but the tar file was not available. How can I achieve this? PCollection lines = pipeline.apply("To read from file", TextIO.read().from(
sathiya raj
  • 35
  • 1
  • 5
0
votes
1 answer

Apache Beam Streaming pipeline with sequential batches

What I am trying to do: consume JSON messages from a Pub/Sub subscription using an Apache Beam streaming pipeline and the Dataflow runner, and unmarshal the payload strings into objects. Assume 'messageId' is the unique ID of an incoming message. Ex: msgid1, msgid2,…