Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.

539 questions
4 votes, 0 answers

How can I retain domain objects after a BigQueryIO write?

My team has a Beam pipeline where we're writing an unbounded PCollection of domain objects to BigQuery using the BigQueryIO.write() function. We're transforming the domain objects into TableRow objects inside of the…
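
A common way to keep the domain objects available after the write is to branch the PCollection: one branch feeds BigQueryIO, the other continues with the original elements. A minimal sketch, assuming a hypothetical domain type MyEvent and a hypothetical toTableRow converter:

```java
PCollection<MyEvent> events = ...;  // the unbounded domain-object collection

// Branch 1: convert to TableRow and write to BigQuery.
events
    .apply("ToTableRow", MapElements
        .into(TypeDescriptor.of(TableRow.class))
        .via(e -> toTableRow(e)))              // toTableRow: hypothetical converter
    .setCoder(TableRowJsonCoder.of())
    .apply(BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")); // placeholder table spec

// Branch 2: the same PCollection is still usable downstream, unchanged.
events.apply("ContinueWithDomainObjects", ParDo.of(new DoFn<MyEvent, MyEvent>() {
  @ProcessElement
  public void process(@Element MyEvent e, OutputReceiver<MyEvent> out) {
    out.output(e);
  }
}));
```

If the continuation must run only after rows have landed, the WriteResult returned by the write combined with Wait.on can sequence the two branches.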
4 votes, 1 answer

ReadFromKafka stuck in beam process with Dataflow

I am trying to read from a Kafka topic using Apache Beam and Dataflow, print the data to the console, and finally write it to a Pub/Sub topic. But it seems to get stuck in the ReadFromKafka function. Plenty of data has been written to the Kafka topic,…
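
Two frequent culprits: the consumer group starts at the latest offsets (so a pipeline started after the data was produced sees nothing), and Dataflow workers cannot reach the brokers over the network. A minimal Java KafkaIO sketch of the same read, with the offset reset made explicit (broker and topic names are placeholders):

```java
pipeline.apply(KafkaIO.<String, String>read()
    .withBootstrapServers("broker-1:9092")   // must be reachable from the Dataflow workers
    .withTopic("input-topic")
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)
    // A new consumer group defaults to "latest"; start from the beginning instead.
    .withConsumerConfigUpdates(ImmutableMap.of("auto.offset.reset", "earliest"))
    .withoutMetadata());
```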
4 votes, 0 answers

How to debug this dropped record further in Apache Beam?

I am seeing intermittently dropped records (only for error messages, not for success ones). We have a test case that intermittently fails/passes because of a lost record. We are using "org.apache.beam.sdk.testing.TestPipeline.java" in the test…
Dean Hiller
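
Intermittently lost records in streaming tests are often late data dropped behind the watermark; TestStream makes that race deterministic so it can be debugged. A sketch, assuming p is the TestPipeline and the default zero allowed lateness:

```java
TestStream<String> events = TestStream.create(StringUtf8Coder.of())
    .addElements(TimestampedValue.of("success", new Instant(0)))
    .advanceWatermarkTo(new Instant(2000))                         // close the first window
    .addElements(TimestampedValue.of("error", new Instant(500)))   // now behind the watermark
    .advanceWatermarkToInfinity();

PCollection<String> out = p
    .apply(events)
    .apply(Window.<String>into(FixedWindows.of(Duration.standardSeconds(1))));

// Fails whenever the late "error" element is dropped, making the loss reproducible.
PAssert.that(out).containsInAnyOrder("success", "error");
p.run().waitUntilFinish();
```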
4 votes, 2 answers

How to perform checkpointing in Apache Beam while using the Flink runner?

I am reading from an unbounded source (Kafka) and writing its word count to another Kafka topic. Now I want to perform checkpointing in the Beam pipeline. I have followed all the instructions in the Apache Beam documentation, but the checkpoint directory is not…
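
With the Flink runner, checkpointing is enabled through FlinkPipelineOptions; it stays off unless an interval is set. Where the checkpoints go (state backend, checkpoint directory) is Flink cluster configuration, typically flink-conf.yaml, which is a likely reason the directory never appears. A minimal sketch:

```java
FlinkPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().as(FlinkPipelineOptions.class);
options.setRunner(FlinkRunner.class);
// Ask Flink to take a checkpoint every 10 seconds; without this, checkpointing is disabled.
options.setCheckpointingInterval(10_000L);

Pipeline p = Pipeline.create(options);
```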
4 votes, 2 answers

External API call in Apache Beam Dataflow

I have a use case where I read newline-delimited JSON elements stored in Google Cloud Storage and process each JSON record. While processing each record, I have to call an external API for de-duplication, checking whether that JSON element was discovered…
bigbounty
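
The usual shape is a DoFn that builds its HTTP client once in @Setup and reuses it per element; batching elements first (e.g. with GroupIntoBatches) cuts the number of calls. A sketch using the JDK 11 HttpClient against a hypothetical de-duplication endpoint:

```java
class DedupeFn extends DoFn<String, String> {
  private transient HttpClient client;

  @Setup
  public void setup() {
    client = HttpClient.newHttpClient();  // one client per DoFn instance, not per element
  }

  @ProcessElement
  public void process(@Element String json, OutputReceiver<String> out) throws Exception {
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://dedupe.example.com/check"))  // hypothetical API endpoint
        .POST(HttpRequest.BodyPublishers.ofString(json))
        .build();
    HttpResponse<String> response =
        client.send(request, HttpResponse.BodyHandlers.ofString());
    if (!"duplicate".equals(response.body())) {  // hypothetical response contract
      out.output(json);                          // keep only first-seen elements
    }
  }
}
```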
4 votes, 1 answer

How can we read CSV files with enclosures in Apache Beam using the Python SDK?

I am reading a comma-separated CSV file where the fields are enclosed in double quotes, and some of them also have commas within their values, like: "abc","def,ghi","jkl" Is there a way we can read this file into a PCollection using Apache Beam?
vaibhav v
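
In the Python SDK, apache_beam.dataframe.io.read_csv handles quoted fields natively. To keep the sketches in this list in one language, here is the same idea in Java: read lines with TextIO and parse each with a quote-aware parser (Apache Commons CSV assumed as a dependency). Caveat: a line-based read breaks if quoted fields contain newlines.

```java
p.apply(TextIO.read().from("gs://my-bucket/input.csv"))   // placeholder path
 .apply("ParseCsv", ParDo.of(new DoFn<String, String>() {
   @ProcessElement
   public void process(@Element String line, OutputReceiver<String> out) throws Exception {
     // A real CSV parser instead of line.split(","), so "def,ghi" stays one field.
     for (CSVRecord record : CSVFormat.DEFAULT.parse(new StringReader(line))) {
       out.output(record.get(1));   // e.g. emit the second column
     }
   }
 }));
```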
4 votes, 0 answers

Read protocol buffer files in Apache Beam

I have a bunch of protobuf files in GCS that I would like to process through Dataflow (Java SDK), and I am not sure how to do that. Apache Beam provides AvroIO to read Avro files: Schema schema = new Schema.Parser().parse(new…
Pari
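
There is no ProtoIO equivalent of AvroIO, but FileIO can match the files and a transform can parse each one with the generated parseFrom. A sketch, assuming one message per file and a hypothetical generated proto class MyMessage:

```java
p.apply(FileIO.match().filepattern("gs://my-bucket/protos/*.pb"))  // placeholder pattern
 .apply(FileIO.readMatches())
 .apply("ParseProto", MapElements
     .into(TypeDescriptor.of(MyMessage.class))
     .via((FileIO.ReadableFile f) -> {
       try {
         return MyMessage.parseFrom(f.readFullyAsBytes());  // whole file = one message
       } catch (IOException e) {
         throw new RuntimeException(e);
       }
     }))
 .setCoder(ProtoCoder.of(MyMessage.class));
```

For files holding a delimited stream of messages, parse with MyMessage.parseDelimitedFrom over the file's input stream instead.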
4 votes, 1 answer

Streaming MutationGroups into Spanner

I'm trying to stream MutationGroups into Spanner with SpannerIO. The goal is to write new MutationGroups every 10 seconds, as we will use Spanner to query near-real-time KPIs. When I don't use any windows, I get the following error: Exception in thread…
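
The error is Beam refusing to group an unbounded PCollection without a window or trigger; windowing the MutationGroups before the grouped Spanner write resolves it. A minimal sketch with 10-second fixed windows (instance and database IDs are placeholders):

```java
mutationGroups
    .apply("Window10s",
        Window.<MutationGroup>into(FixedWindows.of(Duration.standardSeconds(10))))
    .apply(SpannerIO.write()
        .withInstanceId("my-instance")
        .withDatabaseId("my-database")
        .grouped());   // the grouped variant accepts PCollection<MutationGroup>
```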
4 votes, 2 answers

Writing an unbounded collection to GCS

I have seen many questions on the same topic, but I am still having problems writing to GCS. I am reading a topic from Pub/Sub and trying to push it to GCS. I have referred to this link, but couldn't find IOChannelUtils in the latest…
Balu
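
IOChannelUtils was removed from the SDK (the FileSystems/FileIO APIs replaced it); the part that usually blocks unbounded writes to GCS is that TextIO needs windowed writes and an explicit shard count. A sketch:

```java
messages   // PCollection<String> read from Pub/Sub
    .apply("Window5m", Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
    .apply(TextIO.write()
        .to("gs://my-bucket/output/part")   // placeholder output prefix
        .withWindowedWrites()               // required for unbounded input
        .withNumShards(1));                 // windowed writes need a fixed shard count
```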
4 votes, 1 answer

Group elements in Apache Beam pipeline

I have a pipeline that parses records from Avro files. I need to split the incoming records into chunks of 500 items in order to call an API that takes multiple inputs at the same time. Is there a way to do this with the Python SDK?
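
In the Python SDK this is what beam.BatchElements (no key needed) or beam.GroupIntoBatches does. Keeping the sketches here in Java, the same shape uses GroupIntoBatches, which batches per key, so elements are spread across a few synthetic keys to retain parallelism (callBulkApi is a hypothetical helper):

```java
records   // PCollection<GenericRecord> parsed from the Avro files
    .apply("AddShardKey", MapElements
        .into(TypeDescriptors.kvs(TypeDescriptors.integers(),
                                  TypeDescriptor.of(GenericRecord.class)))
        .via(r -> KV.of(ThreadLocalRandom.current().nextInt(10), r)))
    .apply(GroupIntoBatches.<Integer, GenericRecord>ofSize(500))
    .apply("CallApi", ParDo.of(new DoFn<KV<Integer, Iterable<GenericRecord>>, String>() {
      @ProcessElement
      public void process(@Element KV<Integer, Iterable<GenericRecord>> batch,
                          OutputReceiver<String> out) {
        out.output(callBulkApi(batch.getValue()));  // one API call per <=500 records
      }
    }));
```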
3 votes, 1 answer

Apache Beam Write PubSub messages using Go

I'm new to Go and am trying to read a table from BigQuery and publish the rows as Pub/Sub messages. I searched online and came up with the code below: package main import ( "context" "flag" "reflect" …
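
The Go SDK does ship a writer for this (the pubsubio package in the Beam Go SDK). For consistency with the other sketches in this list, the equivalent flow in Java, with placeholder table and topic names:

```java
p.apply(BigQueryIO.readTableRows().from("my-project:my_dataset.my_table"))
 .apply("RowToJson", MapElements.into(TypeDescriptors.strings())
     .via(TableRow::toString))                // TableRow renders as JSON text
 .apply(PubsubIO.writeStrings().to("projects/my-project/topics/my-topic"));
```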
3 votes, 2 answers

Apache Beam KinesisIO Java - consume the data in a Kinesis stream from where it left off

First, I want to say that I am totally new to the Beam world. I'm working on an Apache Beam focused task, and my main data source is a Kinesis stream. When consuming the streaming data, I noticed that the same set of data comes back when I…
Prasad
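
KinesisIO re-reads from its configured initial position on every fresh pipeline start; resuming exactly where a previous run left off requires the runner's own checkpoint/savepoint mechanism, since the reader's position lives in runner state. What can be set in code is the starting point (names, keys, and region are placeholders):

```java
p.apply(KinesisIO.read()
    .withStreamName("my-stream")
    .withAWSClientsProvider("ACCESS_KEY", "SECRET_KEY", Regions.EU_WEST_1)
    // LATEST skips history on a fresh start; TRIM_HORIZON replays the stream.
    .withInitialPositionInStream(InitialPositionInStream.LATEST));
```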
3 votes, 1 answer

How to mock the results of BigQueryIO.read for unit testing?

I have an Apache Beam pipeline that reads data from BigQuery using a query that joins multiple tables. I want to test the entire pipeline locally using mock data (i.e. without connecting to BigQuery). Can I do this using…
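
A common pattern is to keep the BigQueryIO.read behind a seam (e.g. a PTransform passed into the pipeline) so the test injects an in-memory source with Create and asserts with PAssert. A sketch, where MyJoinAndFormat stands in for the transforms under test and p is a TestPipeline:

```java
List<TableRow> fakeRows = Arrays.asList(
    new TableRow().set("id", 1).set("name", "alice"),
    new TableRow().set("id", 2).set("name", "bob"));

PCollection<TableRow> input =
    p.apply(Create.of(fakeRows).withCoder(TableRowJsonCoder.of()));

PCollection<String> output = input.apply(new MyJoinAndFormat());  // hypothetical PTransform
PAssert.that(output).containsInAnyOrder("1,alice", "2,bob");
p.run().waitUntilFinish();
```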
3 votes, 2 answers

Google Cloud Dataflow unable to open flex template file

After running a deployment script to launch a Dataflow flex job, I get "failed to read the job file : gs://dataflow-staging-europe-west2/------/staging/template_launches/{JOBNAME}/job_object with error message: (7ea9e263ad5cddb5): Unable to open…
3 votes, 1 answer

Reading files from an SFTP location using Apache Beam

I just have a few questions on achieving the $subject. I have an FTP location, and I want to use a Beam pipeline to read these files and do some processing. I basically want to read the file list from the FTP location every minute and do the…
turingMan
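
Beam has no SFTP connector out of the box (FileIO.match().continuously() only covers registered Beam filesystems), so a workable pattern is a periodic impulse that lists the remote directory in a DoFn. A sketch, where SftpFiles.list is a hypothetical helper wrapping an SFTP client such as JSch:

```java
p.apply("TickEveryMinute",
        GenerateSequence.from(0).withRate(1, Duration.standardMinutes(1)))
 .apply("ListSftpDir", ParDo.of(new DoFn<Long, String>() {
   @ProcessElement
   public void process(OutputReceiver<String> out) throws Exception {
     // SftpFiles.list: hypothetical helper returning the paths currently in the directory.
     for (String path : SftpFiles.list("sftp.example.com", "/inbox")) {
       out.output(path);
     }
   }
 }));
```

Downstream, deduplicate already-seen paths (e.g. with keyed state) before fetching and parsing each file.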