Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.
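
For illustration, a minimal sketch of both sides of Beam I/O in the Python SDK, with one read transform and one write transform; the file paths are placeholders:

    import apache_beam as beam

    # Read lines from a source and write them to a sink; both steps are "Beam I/O".
    # The paths below are placeholders, not real locations.
    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText("gs://example-bucket/input*.txt")
            | "Write" >> beam.io.WriteToText("gs://example-bucket/output", file_name_suffix=".txt")
        )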

539 questions
0
votes
2 answers

Apache Beam - Flink runner - FileIO.write - issues in S3 writes

I am currently working on a Beam pipeline (2.23) (Flink runner 1.8) where we read JSON events from Kafka and write the output in Parquet format to S3. We write to S3 every 10 minutes. We have observed that our pipeline sometimes stops writing to…
infiniti
  • 61
  • 6
0
votes
1 answer

Apache Beam / Google Cloud Dataflow BigQuery reader failing from the second run

We have a Dataflow pipeline built using Apache Beam and deployed on GCP Dataflow infrastructure. The Dataflow instance runs perfectly the first time and creates the partitioned table as expected, but from the second run onwards it wipes out the results from the dataset,…
Vijay Mohan
  • 1,056
  • 14
  • 34
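
One setting worth checking for this symptom is the BigQuery write disposition. A minimal sketch using the Python SDK's WriteToBigQuery (the table spec and schema are placeholders, and the question's pipeline may be written with a different SDK):

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Create rows" >> beam.Create([{"id": 1, "name": "example"}])
            | "Write" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.my_table",  # placeholder table spec
                schema="id:INTEGER,name:STRING",
                # WRITE_APPEND keeps existing rows across runs;
                # WRITE_TRUNCATE replaces the table contents on every run.
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )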
0
votes
2 answers

Unable to read Pub/Sub messages with Apache Beam (Python SDK)

I'm trying to stream messages from a Pub/Sub topic with the Beam programming framework (Python SDK) and write them out to the console. This is my code (with apache-beam==2.27.0): import apache_beam as beam from apache_beam.options.pipeline_options…
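
A minimal sketch of this read-and-print pattern with the Python SDK's ReadFromPubSub (the topic name is a placeholder):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True  # Pub/Sub is an unbounded source

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/my-topic")  # placeholder
            | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))  # Pub/Sub payloads arrive as bytes
            | "Print" >> beam.Map(print)
        )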
0
votes
1 answer

Is there a way I can consume Google Pub/Sub messages using synchronous pull in an Apache Beam job

I have already gone through the client library provided by Google in the doc below. The given client library only polls messages from Pub/Sub, but it will not poll continuously unless we create an UnboundedSource…
0
votes
1 answer

Is a Source that has an unknown but limited number of elements considered a BoundedSource or an UnboundedSource?

Is a Source that has an unknown but limited number of elements considered a BoundedSource or an UnboundedSource? If I were able to implement both BoundedSource and UnboundedSource, which one is "better"? By "better" I mean which would offer more options or better…
spam
  • 1,853
  • 2
  • 13
  • 33
0
votes
1 answer

Is there withFormatFunction equivalent in Apache Beam Python SDK?

I'm passing a PCollection of dictionaries to the WriteToBigQuery class. However, some fields of the dictionaries aren't meant to be written to BigQuery tables. They're important for deciding the table name for the element (in streaming mode). This is done by…
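
The Python WriteToBigQuery does accept a callable for its table argument, which covers the dynamic-table-name half of the question; a sketch with hypothetical field and table names:

    import apache_beam as beam

    def pick_table(element):
        # Hypothetical routing: choose the destination table from a field of the element.
        return "my-project:my_dataset.events_%s" % element["event_type"]

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Create" >> beam.Create([{"event_type": "click", "user_id": "a"}])
            | "Write" >> beam.io.WriteToBigQuery(
                table=pick_table,
                schema="event_type:STRING,user_id:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )

Whether the routing-only fields can also be dropped from the written row, as withFormatFunction allows in the Java SDK, is exactly what the question asks; the sketch above only shows the callable table argument.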
0
votes
0 answers

Apache Beam Python fileio.WriteToFiles oversharding

I'm using fileio.WriteToFiles in a streaming Python pipeline. I explicitly specified an expected shard count as follows: fileio.WriteToFiles( path=..., file_naming=fileio.default_file_naming(prefix="output", suffix=".txt"), …
Jiayuan Ma
  • 1,891
  • 2
  • 13
  • 25
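
For reference, a sketch of the shards parameter on fileio.WriteToFiles in the Python SDK (the path is a placeholder); in streaming mode the number of output files also depends on windows and panes, which may be related to the oversharding the question reports:

    import apache_beam as beam
    from apache_beam.io import fileio

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Create" >> beam.Create(["line-1", "line-2"])
            | "Write" >> fileio.WriteToFiles(
                path="/tmp/output_dir",  # placeholder path
                file_naming=fileio.default_file_naming(prefix="output", suffix=".txt"),
                shards=1,  # requested shard count per destination
            )
        )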
0
votes
1 answer

Why Does Apache Beam BigQueryIO Use Same JobId per Run?

I'm running a batch Dataflow job which reads all the rows from a BigQuery table, converts them into JSON strings, and then writes the strings out to a PubSub topic. This template will be reused with the same or different parameters and should always…
0
votes
1 answer

Apache Beam - what are the limits of the Deduplication function

I have a Google Dataflow pipeline built using Apache Beam. The application receives about 50M records every day; to ignore duplicate records, we are planning to use the Deduplication function provided by the Beam framework. The documentation doesn't state…
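
For context, a sketch of the deduplication transform as exposed in the Python SDK; the module path and duration parameters here are my reading of the API and should be checked against the SDK version in use:

    import apache_beam as beam
    from apache_beam.transforms.deduplicate import Deduplicate
    from apache_beam.utils.timestamp import Duration

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Create" >> beam.Create(["a", "b", "a"])
            # Assumed parameter: keep deduplication state for 10 minutes of processing
            # time, so duplicates arriving after that window would not be caught.
            | "Dedup" >> Deduplicate(processing_time_duration=Duration(seconds=10 * 60))
            | "Print" >> beam.Map(print)
        )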
0
votes
1 answer

Is there a way to completely swap out the way serialization is handled with Apache Beam?

I'm using Kotlin with Apache Beam and I have a set of DTOs that reference each other and all serialize fine with any Kotlinx Serialization encoder. When I try to use them with Beam I end up having issues because it's looking for all objects,…
0
votes
1 answer

Unable to access PCollection outside the with block

for table_name, key_pair in relation_repl_key.items(): try: with beam.Pipeline(options=PipelineOptions()) as p: PCollection = p | "Reading from source database" >> relational_db.ReadFromDB( source_config=source_config, …
Souvik Dey
  • 653
  • 1
  • 9
  • 18
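
For context, the usual pattern is to keep every transform, including the output write, inside the with block, because the pipeline only executes when that block exits; a minimal sketch with the question's relational_db read replaced by a hypothetical in-memory source:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions()) as p:
        rows = (
            p
            | "Read rows" >> beam.Create([{"id": 1}, {"id": 2}])  # stand-in for the DB read
            | "Format" >> beam.Map(str)
        )
        # Materialize results inside the pipeline; a PCollection is not a regular
        # Python collection and cannot be iterated after the with block ends.
        rows | "Write" >> beam.io.WriteToText("/tmp/rows", file_name_suffix=".txt")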
0
votes
2 answers

Apache Beam GroupByKey outputting duplicate elements with PubSubIO

We need to group Pub/Sub messages by one of the fields from the messages. We used a fixed window of 15 minutes to group these messages. When run on Dataflow, the GroupByKey used for grouping the messages is introducing too many duplicate elements, another…
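
For reference, a minimal sketch of the fixed-window-plus-GroupByKey shape described above, written with the Python SDK (the question's pipeline may use another SDK; the subscription name and key extraction are hypothetical):

    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/my-sub")  # placeholder
            | "Key" >> beam.Map(lambda msg: ("some-field-value", msg))    # hypothetical key extraction
            | "Window" >> beam.WindowInto(FixedWindows(15 * 60))          # 15-minute fixed windows
            | "Group" >> beam.GroupByKey()
            | "Print" >> beam.Map(print)
        )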
0
votes
1 answer

Apache Beam logs messages with the wrong tags

Error logs don't show up in the GCP console. Warning logs are logged as info (so I've been using them to log info messages). E.g., test = "hello debug world"; logging.warning("%s", test) # will log as an info message in the GCP Dataflow console. Info logs don't…
0
votes
1 answer

How to create a tar.gz file using Apache Beam

I used the code below to create a tar.gz file; a .gz file was created but the tar file was not available. How can I achieve this? PCollection lines = pipeline.apply("To read from file", TextIO.read().from(
sathiya raj
  • 35
  • 1
  • 5
0
votes
1 answer

Apache Beam Streaming pipeline with sequential batches

What I am trying to do: consume JSON messages from a Pub/Sub subscription using an Apache Beam streaming pipeline and the Dataflow runner, and unmarshal the payload strings into objects. Assume 'messageId' is the unique ID of an incoming message. Ex: msgid1, msgid2,…