Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions about reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.


539 questions
3 votes, 1 answer

Apache Beam: streaming and processing multiple files at the same time, with windowed joins?

I just read this article https://medium.com/bb-tutorials-and-thoughts/how-to-create-a-streaming-job-on-gcp-dataflow-a71b9a28e432 What I am truly missing here, though, is: if I drop 50 files and this is a streaming job like the article says (always…
Dean Hiller · 19,235
3 votes, 2 answers

Getting SEVERE Channel ManagedChannelImpl{logId=1, target=bigquerystorage.googleapis.com:443} was not shutdown properly

I have created a Beam script to get data from Kafka and push it to BigQuery using Apache Beam. For now I am using the java-direct-runner and just need to push data to BigQuery. This is my code: package com.knoldus.section8; import…
3 votes, 1 answer

How to generate GCS files one after another with Google Cloud Dataflow and Java?

I have a pipeline with one GCS file as input that generates two GCS output files. One output file contains error info and the other contains normal info. And I have a Cloud Function with a GCS trigger on the two output files. I want to do something…
3 votes, 1 answer

ReadFromKafka throws ValueError: Unsupported signal: 2

Currently I am trying to get the hang of Apache Beam together with Apache Kafka. The Kafka service is running (locally) and I write some test messages with the kafka-console-producer. First I wrote this Java code snippet to test Apache Beam with a language…
3 votes, 0 answers

KafkaIO GroupId after restart

I am using Apache Beam's KafkaIO to read from a Kafka topic. Everything is working as expected, but if my job is terminated and restarted, a new groupId is generated by the new job, so it ends up reading from the beginning of the…
user3693309
3 votes, 1 answer

Apache Beam Python Windowing and GroupByKey

LE: TL;DR: How do I create an unbounded data source in Python? Is it possible? I'm building a streaming dataflow which will continuously process float values coming from sensors, which have a timestamp, id, and a reading value, put the values in…
3 votes, 1 answer

How should I access AWS S3 buckets located in different regions from an Apache Beam pipeline?

I have to read data from two different buckets (bucket1 and bucket2) located in different regions (us-east-1 and us-east-2). The Apache Beam pipeline is as follows: AWSCredentials credentials = new BasicAWSCredentials("*********",…
3 votes, 1 answer

Apache Beam with DirectRunner (SUBPROCESS_SDK) uses only one worker; how do I force it to use all available workers?

The following code: def get_pipeline(workers): pipeline_options = PipelineOptions(['--direct_num_workers', str(workers)]) return beam.Pipeline(options=pipeline_options, runner=fn_api_runner.FnApiRunner( …
Chris Su
3 votes, 2 answers

open_file in beam.io FileBasedSource issue with Python 3

I am using CSVRecordSource to read CSVs in an Apache Beam pipeline; it uses open_file in its read_records function. With Python 2 everything worked fine, but when I migrated to Python 3 it complains with the error below: next(csv_reader) _csv.Error: iterator…
tank
3 votes, 1 answer

Using MatchFiles() in an Apache Beam pipeline to get file names and parse JSON in Python

I have a lot of JSON files in a bucket, and using Python 3 I want to get the file name and then create key-value pairs of the files and read them. MatchFiles is now working for Python, I believe, but I was wondering how I would implement this: files…
WIT
3 votes, 0 answers

WebSocket connector for Apache Beam (Java)?

I have an Apache Beam pipeline written in Java where I would like to read data that comes from a WebSocket. I have been looking for connectors, but so far my search has been unsuccessful.
Fernando
3 votes, 2 answers

Exception handling in Apache Beam pipelines when writing to a database using Java

When writing simple records to a table in Postgres (could be any DB) at the end of a pipeline, some of the potential records violate uniqueness constraints and trigger an exception. As far as I can tell, there's no straightforward way to handle…
3 votes, 2 answers

Join 2 unbounded PCollections on key

I am trying to join two unbounded PCollections, which I am getting from two different Kafka topics, on the basis of a key. As per the docs and other blogs, a join is only possible if we do windowing. The window collects the messages from both the streams…
3 votes, 1 answer

GCS file streaming using Dataflow (Apache Beam Python)

I have a GCS bucket where I get a file every minute. I have created a streaming Dataflow job using the Apache Beam Python SDK. I created Pub/Sub topics for the input GCS bucket and the output GCS bucket. My Dataflow job is streaming, yet my output is not getting stored in the…
user11118940
3 votes, 1 answer

Pipeline fails when adding ReadAllFromText transform

I am trying to run a very simple program in Apache Beam to try out how it works. import apache_beam as beam class Split(beam.DoFn): def process(self, element): return element with beam.Pipeline() as p: rows = (p |…
Raheel