Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.

539 questions
0 votes, 1 answer

Beam - Error while branching PCollections

I have a pipeline that reads data from Kafka. It splits the incoming data into processing and rejected outputs. Data from Kafka is read into a custom class MyData, and output is produced as KV. Define two TupleTags with MyData. private…
user238021
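The usual branching pattern, sketched here with the Python SDK's tagged outputs rather than the Java TupleTag API the question uses; the record shape and the 'valid' flag are assumptions:

    import apache_beam as beam
    from apache_beam import pvalue

    class SplitRecords(beam.DoFn):
        """Routes each record to the main ('processed') output or the 'rejected' output."""
        def process(self, record):
            if record.get('valid'):          # assumed validation flag
                yield record                 # main output: records to process
            else:
                yield pvalue.TaggedOutput('rejected', record)

    with beam.Pipeline() as p:
        results = (p
                   | beam.Create([{'valid': True}, {'valid': False}])
                   | beam.ParDo(SplitRecords()).with_outputs('rejected', main='processed'))
        processed, rejected = results.processed, results.rejected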
0 votes, 1 answer

Keeping min/max value in a BigTable cell

I have a problem where it would be very helpful if I was able to send a ReadModifyWrite request to BigTable where it only overwrites the value if the new value is bigger/smaller than the existing value. Is this somehow possible? Note: I thought of a…
0 votes, 0 answers

beam.io.ReadFromPubSub - ImportError: No module named iam.v1

I have a simple Beam pipeline in which I am reading data from Pub/Sub and writing to a file. I am running it on the direct runner; the code is as follows: pubsub_data = ( p | 'Read from pub sub' >>…
Faizan Saeed
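That ImportError usually means the pipeline's GCP dependencies are missing rather than anything in the pipeline code; a minimal sketch, assuming apache-beam[gcp] is installed and using a hypothetical subscription path:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    # Assumes: pip install "apache-beam[gcp]"  (the iam.v1 module ships with those extras).
    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True   # Pub/Sub sources are unbounded

    with beam.Pipeline(options=options) as p:
        (p
         | 'Read from pub sub' >> beam.io.ReadFromPubSub(
               subscription='projects/<project>/subscriptions/<sub>')  # hypothetical path
         | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
         | 'Print' >> beam.Map(print))                  # stand-in for the file write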
0 votes, 0 answers

validate parameter in ReadFromText does not work on GCP

When running the flow locally and passing a non-existent local address as input (i.e. the URL of a non-existent file), the validate parameter of ReadFromText works perfectly and raises an error. However, the flow does not throw any error when…
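For reference, a minimal sketch of the check in question: with validate=True, ReadFromText verifies at pipeline-construction time that the pattern matches an existing file, which requires the GCS filesystem (apache-beam[gcp]) to be importable wherever the pipeline graph is built. The path is hypothetical:

    import apache_beam as beam

    with beam.Pipeline() as p:
        # validate=True checks, while the pipeline is being constructed,
        # that the pattern matches at least one existing file.
        lines = p | beam.io.ReadFromText(
            'gs://my-bucket/does-not-exist.txt',   # hypothetical, non-existent path
            validate=True)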
0 votes, 1 answer

How to get/add GCS File user-defined metadata using Apache Beam library [org.apache.beam.sdk.io.*]

I'm setting up a Dataflow pipeline in which one of the actions is to get/add the metadata (user-provided metadata) of a GCS file. In a standalone Java app I used the method below, from the StorageObject class, to get the metadata, but I'm not finding…
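Beam's file-matching metadata (FileSystems/MatchResult) does not expose user-defined GCS metadata, so a common workaround is a DoFn that calls the Cloud Storage client directly. A Python sketch (the question itself uses the Java SDK); the path handling is an assumption:

    import apache_beam as beam

    class GetGcsMetadata(beam.DoFn):
        """Fetches the user-defined metadata for a gs://bucket/object path."""
        def process(self, gcs_path):
            from google.cloud import storage                 # resolved on the worker
            bucket_name, blob_name = gcs_path[len('gs://'):].split('/', 1)
            blob = storage.Client().bucket(bucket_name).get_blob(blob_name)
            # blob.metadata is the dict of user-provided metadata, or None
            yield (gcs_path, blob.metadata if blob else None)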
0 votes, 0 answers

How to specify multiple input files for Apache Beam ReadFromTextIO

I'm using the apache-beam Python 2.12.0 SDK. I'm having issues when using the special character * for beam.io.ReadFromText: gs://mybucket/learning/pack_operation/20190524_1_0_/extracted-*.json This is the input to my Beam job with…
MassyB
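ReadFromText takes a single file pattern; when one glob is not enough, each pattern can be read separately and the results flattened. A sketch with hypothetical bucket paths:

    import apache_beam as beam

    with beam.Pipeline() as p:
        # A single glob pattern, as in the question.
        single = p | 'ReadGlob' >> beam.io.ReadFromText(
            'gs://mybucket/learning/pack_operation/20190524_1_0_/extracted-*.json')

        # Several patterns, read separately and flattened into one PCollection.
        patterns = ['gs://mybucket/a/part-*.json', 'gs://mybucket/b/part-*.json']  # hypothetical
        merged = ([p | 'Read%d' % i >> beam.io.ReadFromText(pat)
                   for i, pat in enumerate(patterns)]
                  | beam.Flatten())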
0 votes, 0 answers

Google Cloud Dataflow windowed data and grouping

I have a pipeline that's getting a stream of events from PubSub, applying a 1h window and then writing them to a file on Google Cloud Storage. Recently I realised sometimes there are way too many events coming in a 1h window so I also added a…
rgngl
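A sketch of the windowing shape described above; newer Python SDKs provide beam.io.fileio.WriteToFiles for writing windowed, unbounded data to GCS. The subscription and output path are hypothetical, and WriteToFiles is one possible sink, not necessarily the one the question uses:

    import apache_beam as beam
    from apache_beam.io import fileio
    from apache_beam.transforms import window
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (p
         | beam.io.ReadFromPubSub(subscription='projects/<project>/subscriptions/<sub>')
         | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
         | 'Window' >> beam.WindowInto(window.FixedWindows(60 * 60))   # 1-hour windows
         | 'Write' >> fileio.WriteToFiles(path='gs://my-bucket/events/'))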
0 votes, 1 answer

How do I explicitly set a direct runner with other command line arguments?

I wrote this pipeline, but when I run it as a JAR it cannot find the direct runner, even when I have it specified in my build.gradle and when I try to pass the parameter --runner=direct or --runner=Directrunner. Below is my code and my build.gradle…
dmc94
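In the Java SDK the direct runner is only found if beam-runners-direct-java is on the runtime classpath (for a fat JAR, bundled into it), and the conventional value for the flag is --runner=DirectRunner. For comparison, a sketch of the same flag in the Python SDK:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Pass the runner as 'DirectRunner'; any other flags can go in the same list.
    options = PipelineOptions(['--runner=DirectRunner'])

    with beam.Pipeline(options=options) as p:
        p | beam.Create([1, 2, 3]) | beam.Map(print)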
0 votes, 2 answers

Unable to read date format columns (int96 type) from avro-parquet schema in Apache Beam

I am facing the following exception when reading a Parquet file that has a date column. I am using beam-sdks-java-io* 2.11.0 and parquet*-1.10. Please help me with the same; thank you in advance. Caused by: java.lang.IllegalArgumentException: INT96 not…
Nikhil_Java
0 votes, 1 answer

ReadFromDatastore operation timing out on 200k+ entity read with no inequality filters, no data making it into pipeline

I'm using the Google Cloud Dataflow Python SDK to read in 200k+ entities from Datastore using the ReadFromDatastore() function on a query without any filters. def make_example_entity_query(): """ make an unfiltered query on the…
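With the newer v1new Datastore connector (the older v1 module has a similar num_splits parameter), an unfiltered read can be asked for more splits explicitly, which can help large reads; the project and kind below are hypothetical:

    import apache_beam as beam
    from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
    from apache_beam.io.gcp.datastore.v1new.types import Query

    def make_example_entity_query(project):
        # Unfiltered query over a single kind ('ExampleEntity' is hypothetical).
        return Query(kind='ExampleEntity', project=project)

    with beam.Pipeline() as p:
        entities = (p
                    | 'Read' >> ReadFromDatastore(
                        make_example_entity_query('my-project'),   # hypothetical project id
                        num_splits=32))                            # request more parallel splits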
0 votes, 2 answers

Beam model contract for finalization of CheckpointMarks

I'm working on a pipeline that reads messages from Kafka using KafkaIO, and I'm looking at the commitOffsetsInFinalize() option and the KafkaCheckpointMark class. I want to achieve at-least-once message delivery semantics and want to be sure that offsets…
0 votes, 1 answer

How to write to GCS with a ParDo and a DoFn in Apache Beam

Using apache_beam.io.filesystems.FileSystems, how do I write to GCS with a ParDo and a DoFn? I am already getting output in CSV format from a ParDo; do I need to write another ParDo to write it to GCS, or can I directly import a module to write it…
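A DoFn can write through apache_beam.io.filesystems.FileSystems directly, so a separate ParDo just for the write is one workable pattern. A minimal sketch, assuming each element is already a complete CSV string; the output prefix and class name are hypothetical:

    import uuid
    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems

    class WriteCsvToGcs(beam.DoFn):
        """Writes each incoming CSV string to its own object under a GCS prefix."""
        def __init__(self, output_prefix):
            self.output_prefix = output_prefix   # e.g. 'gs://my-bucket/output' (hypothetical)

        def process(self, csv_string):
            path = FileSystems.join(self.output_prefix, uuid.uuid4().hex + '.csv')
            writer = FileSystems.create(path, mime_type='text/csv')
            writer.write(csv_string.encode('utf-8'))
            writer.close()
            yield path                           # pass the written path downstream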
0 votes, 1 answer

Google Dataflow spending hours estimating input size

I'm fairly new to Google Dataflow and I am finding that the service spends several hours estimating the input file size before actually processing data, and will often do several recounts for large input collections before failing. I'm using Apache…
sewardth
0 votes, 1 answer

Apache BEAM implementing an UnboundedSource - how does BEAM decide how many readers are created?

I am implementing an UnboundedReader in order to use a custom data source (based on a company-internal, subscription based Java API). When I execute a pipeline I notice that multiple instances of UnboundedReader are created. How does BEAM decide how…
alex.tashev
0 votes, 1 answer

How to create rolling windows with Apache Beam? Not sliding or fixed but a rolling window

Say I want to calculate the average of a certain metric over the last 10 mins, after each minute, and compare it to the average of the same metric over the last 20 mins, after each minute. I need 2 windows (not 10 sliding windows vs 20 sliding windows)…
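In Beam terms, "the average over the last 10 minutes, recomputed every minute" is a sliding window of size 10 min and period 1 min, so one option (even though the question hopes to avoid it) is two sliding-window branches over the same input. A sketch with a stand-in bounded source:

    import apache_beam as beam
    from apache_beam.transforms import window

    with beam.Pipeline() as p:
        metric = p | beam.Create([1.0, 2.0, 3.0])   # stand-in for the real metric stream

        avg_10m = (metric
                   | 'Win10' >> beam.WindowInto(window.SlidingWindows(size=600, period=60))
                   | 'Mean10' >> beam.CombineGlobally(
                         beam.combiners.MeanCombineFn()).without_defaults())
        avg_20m = (metric
                   | 'Win20' >> beam.WindowInto(window.SlidingWindows(size=1200, period=60))
                   | 'Mean20' >> beam.CombineGlobally(
                         beam.combiners.MeanCombineFn()).without_defaults())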