Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions about reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.

539 questions
0 votes • 2 answers

ParDo function in Google Dataflow not producing any output

I am trying to create my first pipeline in Dataflow. I have the same code running when I execute it using the interactive Beam runner, but on Dataflow I get all sorts of errors, which are not making much sense to…
0 votes • 1 answer

Apache Beam windowing by day

I would like to extract data in Apache Beam using a windowing function with a one-day timeframe. I worked in Python and used FixedWindows to capture the data, but I had a problem with the consistency of the data because this code works by counting duration…
zzob • 993 • 3 • 9 • 19
0 votes • 1 answer

Apache Beam: reading Avro files from GCS and writing to BigQuery

Running a Java job to read Avro files and have been getting errors. Looking for help on this. Here is the code: // Get Avro Schema String schemaJson = getSchema(options.getAvroSchema()); Schema schema = new Schema.Parser().parse(schemaJson); //…
0 votes • 2 answers

TextIO.Read().From() vs TextIO.ReadFiles() over withHintMatchesManyFiles()

In my use case I am getting a set of matching file patterns from Kafka: PCollection filepatterns = p.apply(KafkaIO.read()...); Here each pattern could match up to 300+ files. Q1. How can I use TextIO.Read() to match data from a PCollection, as…
0 votes • 2 answers

How to select a set of fields from input data as an array of repeated fields in beam SQL

Problem Statement: I have an input PCollection with the following fields: { firstname_1, lastname_1, dob, firstname_2, lastname_2, firstname_3, lastname_3, } then I execute a Beam SQL operation such that the output of…
0 votes • 1 answer

Beam: dynamically decode Avro records using Schema Registry

I've been trying to write a Beam pipeline that reads from a Kafka topic, where the topic consists of Avro records. The schema for these records can change rapidly, so I want to use the Confluent Schema Registry to fetch the schema and decode the…
0 votes • 1 answer

Apache Beam KafkaIO producer routing different messages to different topics

I have a use case where the incoming data has a key that identifies the type of the data. There is a single input Kafka topic to which all types of data are sent. The Beam pipeline reads all the messages from the input Kafka topic and has…
bigbounty • 16,526 • 5 • 37 • 65
0 votes • 1 answer

Google Cloud Security Account Context not working in IntelliJ application runner

I am unable to authenticate my Dataflow Beam application when I run it in IntelliJ IDEA. This worked for me at one point recently and now it doesn't. Auth is failing with 403 Forbidden: "Access Denied: Project [myProject]: User does not have…
user3205931
0 votes • 1 answer

Reading from Hive through Apache Beam

Can you please suggest how to read from Hive through Apache Beam and save the result as a PCollection of Row?
Syed Mohammed Mehdi • 183 • 2 • 5 • 15
0 votes • 1 answer

Writing TFRecords in Apache Beam with Java

How can I write the following code in Java? If I have a list of records/dicts in Java, how can I write the Beam code to write them as TFRecords in which tf.train.Examples are serialized? There are a lot of examples for doing this in Python; below is one…
Jayendra Parmar • 702 • 12 • 30
0 votes • 0 answers

Apache Beam: left outer join not emitting result

I am working on a use case where I have two unbounded streams and want to do a left join on them, using a fixed-size window of 5 minutes with no allowed lateness. For the join I am using the Java join-library extension, but after the join it's not…
deep • 31 • 5
0 votes • 1 answer

Apache Beam - adding a delay into a pipeline

I have a simple pipeline that reads from a Pub/Sub topic and writes to BigQuery. I would like to introduce a 5-minute delay between reading the message from the topic and writing it to BQ. I thought I could do this using a trigger, similarly to this…
0 votes • 1 answer

Kafka producer acknowledgement in Apache Beam

How do I get the records for which an acknowledgement was received in Apache Beam KafkaIO? Basically I want all the records for which I didn't get any acknowledgement to go to a BigQuery table so that I can retry them later. I used the following code…
bigbounty • 16,526 • 5 • 37 • 65
0 votes • 1 answer

Apache Beam: Reading in PCollection as PBegin for a pipeline

I'm debugging this Beam pipeline and my end goal is to write all of the strings in a PCollection to a text file. I've set a breakpoint after the PCollection I want to inspect is created, and what I've been trying to do is create a…
user8811409 • 499 • 1 • 5 • 13
0 votes • 1 answer

Is it possible to write without a line break for each PCollection element in Java?

I've just started using the Apache Beam Java SDK. Since I need to write files without line breaks between elements, I'm trying to find a way to do it. Looking at the options below, I found something similar, but I still can't find the equivalent option.…