Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.

539 questions
0 votes · 1 answer

How to create a Beam template with the current date as an input (updated daily) [create from GET request]

I am trying to create a Dataflow job that runs daily with Cloud Scheduler. I need to get the data from an external API using GET requests, so I need the current date as an input. However, when I export the Dataflow job as a template for scheduling, the…
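The core of this question is that anything computed at template-creation time is frozen into the template. The usual workaround is to evaluate "today" at run time instead, e.g. inside a DoFn. The date logic itself can be sketched in plain Python (hypothetical helper, not a Beam API):

```python
from datetime import date, timedelta

def request_date(days_back: int = 0) -> str:
    # Evaluate "today" when this function is *called* (i.e. at job run time),
    # not when the template is built, so a template launched daily by
    # Cloud Scheduler always sees the current date for its GET request.
    return (date.today() - timedelta(days=days_back)).isoformat()
```

In a Beam pipeline this call would live inside the DoFn's `process` method (or a `start_bundle`), so each daily launch recomputes it.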
0 votes · 1 answer

GCP Apache Beam Dataflow JDBC IO Connection Error

Problem When trying to deploy an Apache Beam pipeline on the Google Cloud Platform Dataflow service, which connects to an Oracle 11gR2 (11.2.0.4) database to retrieve rows, I received the following error when using the Apache Beam JdbcIO transform: Error…
WtzqCy · 189 · 4
0 votes · 1 answer

Apache Beam KinesisIO Java processing pipeline - application state, error handling & fault-tolerance?

I'm working on my first Apache Beam pipeline to process data streams from AWS Kinesis. I'm familiar with Kafka's concepts of how it handles consumer offsets/state and have experience implementing Apache Storm/Spark processing. After…
Neel · 1
0 votes · 0 answers

Apache Beam not properly receiving pub/sub messages from google-cloud-storage

I've been struggling with this problem for a while and can't quite find a fix. I'm building a pipeline that takes data from a public Google Cloud bucket and does some transformations on it. The thing I'm struggling with right now is getting Apache…
0 votes · 1 answer

Apache Beam Spark/Flink runner not getting executed in EMR (access files from GCS)

I have an Apache Beam pipeline to index some data to Elasticsearch. I was trying to use the Spark or Flink runner to run the job in AWS EMR. When I tried to run the job on stand-alone Spark in a local setup, the pipeline works with source files in the local…
joss · 695 · 1 · 5 · 16
0 votes · 2 answers

Is it possible to execute some code like logging and writing result metrics to GCS at the end of a batch Dataflow job?

I am using Apache Beam 2.22.0 (Java SDK) and want to log metrics and write them to a GCS bucket after a batch pipeline finishes execution. I have tried using result.waitUntilFinish() followed by the intended code: DirectRunner- GCS object is…
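The pattern asked about here is: block on the pipeline result, then run follow-up code in the launcher process. A minimal sketch of that control flow, with a stand-in for the pipeline result (hypothetical stub, not the Beam `PipelineResult` class):

```python
class FakeResult:
    """Stand-in for a pipeline result object (illustration only)."""
    def __init__(self):
        self.state = "RUNNING"

    def wait_until_finish(self):
        # Blocks until the job is done in the real API; here we just flip state.
        self.state = "DONE"
        return self.state

def run_and_report(run_pipeline, report):
    # Launch the job, block until it finishes, then execute follow-up code
    # (logging, writing metrics to GCS) on the launcher. This only works
    # when the launching process stays alive until the job completes.
    result = run_pipeline()
    result.wait_until_finish()
    report(result)
    return result
```

The caveat behind the question is that on Dataflow the launcher may exit after submission; the follow-up code runs only if the launcher actually waits.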
0 votes · 0 answers

Apache Beam CassandraIO conversion into a PCollection of Rows

I am trying to read data from a Cassandra DB using Apache Beam's CassandraIO; my requirement is creating a PCollection of Rows from the Cassandra DB. Currently my code looks like this: PTransform> transform = CassandraIO.read() …
bforblack · 65 · 5
0 votes · 1 answer

Does ElasticsearchIO for Apache Beam Java support templating and ValueProvider arguments? Error while invoking templates

I was trying to create a template for Apache Beam to index data to Elasticsearch. The template is getting created, but while invoking the template the pipeline failed with a "No protocol" error. It looks very odd, as the error is related to the URL…
0 votes · 0 answers

How does the concept of checkpointing/fault tolerance work in Apache Beam?

I am working on an Apache Beam streaming pipeline with a Kafka producer as input and a consumer for the output. Can anyone help me out with checkpointing in Apache Beam?
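The idea underlying checkpointing in Kafka-style sources can be shown with a toy model (illustration only, not the Beam API): an offset is committed only after the record's side effects are durable, so a restart replays uncommitted work instead of losing it.

```python
class OffsetCheckpoint:
    """Toy model of the consumer-offset idea behind checkpointing."""
    def __init__(self):
        self.committed = -1  # highest offset whose work is durably done

    def process(self, offset, record, sink):
        sink.append(record)      # side effect first ...
        self.committed = offset  # ... then advance the checkpoint

    def resume_from(self):
        # After a crash, processing restarts at the first uncommitted offset,
        # giving at-least-once semantics.
        return self.committed + 1
```

In Beam the equivalent bookkeeping lives in the source's checkpoint mark and is managed by the runner, not by user code.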
0 votes · 1 answer

How to use Runner v2 for an Apache Beam Dataflow job?

My Python code for the Dataflow job looks like the one below: import apache_beam as beam from apache_beam.io.external.kafka import ReadFromKafka from apache_beam.options.pipeline_options import…
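Runner v2 is enabled on Dataflow via the `use_runner_v2` experiment flag, which cross-language transforms such as `ReadFromKafka` in the Python SDK require. A sketch of the launch arguments (the checker function below is a hypothetical helper for this illustration only):

```python
# Typical pipeline arguments for a Dataflow job on Runner v2.
pipeline_args = [
    "--runner=DataflowRunner",
    "--experiments=use_runner_v2",
]

def has_runner_v2(args):
    # Tiny checker: is the Runner v2 experiment present in the arg list?
    return any(a.endswith("use_runner_v2") for a in args)
```

These arguments would normally be passed through `PipelineOptions` when constructing the pipeline.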
0 votes · 1 answer

Get worker ID in an Apache Beam job

Is it possible to get the worker ID from an Apache Beam job? Or any unique identifier that can tell about the current worker? I want to use it as a label for my metric. Thank you.
Xitrum · 7,765 · 26 · 90 · 126
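Beam does not expose a public "worker ID" API; a common workaround is to build a per-worker label from information the process does have. A sketch using only the standard library (hypothetical helper, not a Beam API):

```python
import os
import socket
import uuid

def worker_label() -> str:
    # Combine hostname and pid (stable per worker process) with a random
    # suffix, to produce a label usable on metrics emitted from each worker.
    return f"{socket.gethostname()}-{os.getpid()}-{uuid.uuid4().hex[:8]}"
```

Computed once per worker (e.g. in a DoFn's `setup`), this gives a stable per-process identifier; the random suffix only distinguishes restarts on the same host.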
0 votes · 1 answer

Does GCP Dataflow support Kafka IO in Python?

I am trying to read data from a Kafka topic using the kafka.ReadFromKafka() method in Python code. My code looks like below: from apache_beam.io.external import kafka import apache_beam as beam options = PipelineOptions() with…
0 votes · 2 answers

How to infer an Avro schema from a Kafka topic in Apache Beam KafkaIO

I'm using Apache Beam's KafkaIO to read from a topic that has an Avro schema in the Confluent schema registry. I'm able to deserialize the messages and write to files. But ultimately I want to write to BigQuery. My pipeline isn't able to infer the…
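When inference fails, one approach is to translate the Avro record schema into a BigQuery schema explicitly. A minimal sketch covering top-level primitive fields only (hypothetical helper; nested and union types are not handled):

```python
# Mapping from primitive Avro types to BigQuery standard-SQL types.
AVRO_TO_BQ = {
    "string": "STRING", "long": "INT64", "int": "INT64",
    "float": "FLOAT64", "double": "FLOAT64",
    "boolean": "BOOL", "bytes": "BYTES",
}

def avro_fields_to_bq_schema(avro_schema: dict) -> str:
    # Produce the "name:TYPE,name:TYPE" string form that WriteToBigQuery
    # accepts for its schema argument.
    return ",".join(
        f"{field['name']}:{AVRO_TO_BQ[field['type']]}"
        for field in avro_schema["fields"]
    )
```

The resulting string could be passed as the `schema=` argument of the BigQuery write, sidestepping inference entirely.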
0 votes · 1 answer

How to write to BigQuery with BigQuery IO in Apache Beam?

I'm trying to set up an Apache Beam pipeline that reads from Kafka and writes to BigQuery using Apache Beam. I'm using the logic from here to filter out some coordinates:…
0 votes · 2 answers

Apache Beam: Refreshing a side input which I am reading from MongoDB using MongoDbIO.read() - Part 2

Not sure how this GenerateSequence works for me, as I have to read values from Mongo periodically on an hourly or daily basis. I created a ParDo that reads MongoDB and also added a Window into GlobalWindows with a trigger (the trigger I will update as…