Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.

539 questions
2 votes, 1 answer

Read Google Cloud Pubsub message and write to BigQuery using Golang

I am using this code to read data from Google Cloud Pub/Sub: pubsubmessage := pubsubio.Read(s, project, *input, &pubsubio.ReadOptions{Subscription: sub.ID()}) and this code to write to my BigQuery dataset: bigqueryio.Write(s, project, *output,…
2 votes, 1 answer

Efficient way to read a CSV in apache beam python

After reading some questions on StackOverflow, I have been using the below code to read CSV files on beam. Pipeline code: with beam.Pipeline(options=pipeline_options) as p: parsed_csv = (p | 'Create from CSV' >> beam.Create([input_file])) …
2 votes, 1 answer

elastic_enterprise_search.AppSearch client fails in python sdk on GCloud Dataflow with urllib3 certificate error

I'm working on a DoFn that writes to Elastic App Search (elastic_enterprise_search.AppSearch). It works fine when I run my pipeline using the DirectRunner, but when I deploy to Dataflow the Elasticsearch client fails because, I suppose, it…
2 votes, 2 answers

How to handle failures when publishing to pubsub using pubsub write in apache beam

I'm developing an Apache Beam pipeline to publish unbounded data to a Pub/Sub topic. Publishing is done using the Pub/Sub IO connector PubsubIO.writeMessages(). If the Pub/Sub connection fails while the pipeline is processing, I need to capture the…
asked by man
2 votes, 0 answers

"Can't pickle generator objects" in Apache Beam - Python

I'm getting the following error while running the DirectRunner: can't pickle generator objects [while running 'Assign Side Columns']. The error pops up when Beam writes the "mapped rows" to BigQuery. Any idea where this error is coming…
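A frequent cause of this error is an element that contains a generator, which cannot cross the runner's pickling boundary between fused stages; materializing it to a list fixes it. A minimal, Beam-free illustration:

```python
import pickle

def make_row_bad(values):
    # a generator-valued field: Beam fails with "can't pickle generator objects"
    return {"cols": (v * 2 for v in values)}

def make_row_good(values):
    # materialize to a list so the element pickles cleanly between stages
    return {"cols": [v * 2 for v in values]}
```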
2 votes, 1 answer

Apache Beam update current row values based on the values from previous row

I have grouped the values from a CSV file. In the grouped rows, a few values are missing and need to be filled in based on the values from the previous row. If the first…
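Since elements after a GroupByKey arrive as an iterable per key, the fill can be done with an ordinary forward-fill helper applied inside a Map; the key names below are placeholders:

```python
def forward_fill(rows, keys):
    """Replace None/empty values with the value seen in the previous row."""
    filled, prev = [], {}
    for row in rows:
        row = dict(row)  # copy so the input rows are not mutated
        for k in keys:
            if row.get(k) in (None, ""):
                row[k] = prev.get(k)
        filled.append(row)
        prev = row
    return filled

# usage inside a pipeline (sketch; "price" is a hypothetical column):
#   grouped | beam.MapTuple(
#       lambda key, rows: (key, forward_fill(list(rows), ["price"])))
```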
asked by User27854
2 votes, 1 answer

How to enable parallel reading of files in Dataflow?

I'm working on a Dataflow pipeline that reads 1000 files (50 MB each) from GCS and performs some computations on the rows across all files. Each file is a CSV with the same structure, just with different numbers in it, and I'm computing the average…
asked by Gaetan
2 votes, 1 answer

Apache Beam creating PCollection of Custom Entities/Models with Abstract Fields

I have a use case where we need to create a PCollection containing fields of an abstract data type. How do I define the schema and coder in such cases? This data is picked up from JSON files present in some data source (local/S3, etc.) for…
2 votes, 2 answers

How to override default metadata.lastModifiedMillis() of Apache beam's FileIO with actual file's last modified time?

Use Case: I have to filter files based on the lastModifiedTime using Apache beam (Java) My Code: PCollection readfile = pipeline .apply(FileIO.match().filepattern(path) …
asked by Durga
2 votes, 2 answers

Apache Beam: unable to read message from GCP Pub/Sub; AttributeError: 'SubscriberGrpcTransport' object has no attribute 'channel'

I am developing a POC required for an approach evaluation. I have Python, venv, Apache Beam, and gcloud installed on my Mac, and I am logged in to gcloud Pub/Sub. The following code creates a subscription on my Pub/Sub topic and reads the message…
asked by sen
2 votes, 1 answer

RuntimeValueProviderError when Running Dataflow Template Job

Trying to figure out why I'm getting these errors. A quick search only turned up answers referring to a broken version, but that doesn't seem to be the case here. Creating the template works fine, but when I run it (and as I pass the limit arg)…
2 votes, 1 answer

Using Apache Beam GCP DataflowRunner to write to BigQuery (Python)

Attempting to write a pipeline in Apache Beam (Python) that will read an input file from a GCP storage bucket, apply transformations then write to BigQuery. Here is the excerpt for the Apache Beam pipeline: import logging import apache_beam as…
2 votes, 1 answer

BigQuery Authorized Views from Apache Beam

I am trying to query a view in BigQuery using Apache Beam. The view has access to all of the datasets that it references. The Dataflow/GCE service account has access to the view, but not to its underlying datasets (this should not be a…
asked by Pablo
2 votes, 1 answer

Error while using KafkaIO in apache beam DirectRunner

I am using the Apache Beam DirectRunner to load data from a Kafka topic. My code is below: conf={'bootstrap.servers':'localhost:9092'} with beam.Pipeline() as pipeline: (pipeline | …
asked by Joseph N
2 votes, 1 answer

Which runners support KafkaIO in Apache Beam?

I am working with Apache Beam. My task is to pull data from a Kafka topic and process it in Dataflow. Does Dataflow support KafkaIO? Which runners are supported for KafkaIO?