Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.

539 questions
2 votes, 1 answer

Read Google Cloud Pubsub message and write to BigQuery using Golang

I am using this code to read data from Google Cloud Pub/Sub: pubsubmessage := pubsubio.Read(s, project, *input, &pubsubio.ReadOptions{Subscription: sub.ID()}) and this code to write to my BigQuery dataset: bigqueryio.Write(s, project, *output,…
2 votes, 1 answer

Efficient way to read a CSV in apache beam python

After reading some questions on StackOverflow, I have been using the below code to read CSV files on beam. Pipeline code: with beam.Pipeline(options=pipeline_options) as p: parsed_csv = (p | 'Create from CSV' >> beam.Create([input_file])) …
2 votes, 1 answer

elastic_enterprise_search.AppSearch client fails in python sdk on GCloud Dataflow with urllib3 certificate error

I'm working on a DoFn that writes to Elastic App Search (elastic_enterprise_search.AppSearch). It works fine when I run my pipeline using the DirectRunner, but when I deploy to Dataflow the Elasticsearch client fails because, I suppose, it…
2 votes, 2 answers

How to handle failures when publishing to pubsub using pubsub write in apache beam

I'm developing an Apache Beam pipeline to publish unbounded data to a Pub/Sub topic. Publishing is done using the Pub/Sub IO connector PubsubIO.writeMessages(). If the Pub/Sub connection fails while the pipeline is processing, I need to capture the…
asked by man
2 votes, 0 answers

"Can't pickle generator objects" in Apache Beam - Python

I'm getting the following error while running the DirectRunner: can't pickle generator objects [while running 'Assign Side Columns']. The error pops up when Beam writes the "mapped rows" to BigQuery. Any idea where this error is coming…
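A frequent cause of this error is an element that contains a generator, which cannot cross the runner's pickling boundary between fused stages; materializing it to a list fixes it. A minimal, Beam-free illustration:

```python
import pickle

def make_row_bad(values):
    # a generator-valued field: Beam fails with "can't pickle generator objects"
    return {"cols": (v * 2 for v in values)}

def make_row_good(values):
    # materialize to a list so the element pickles cleanly between stages
    return {"cols": [v * 2 for v in values]}
```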
2 votes, 1 answer

Apache Beam update current row values based on the values from previous row

I have grouped the values from a CSV file. In the grouped rows, a few values are missing and need to be filled in based on the values from the previous row. If the first…
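Since elements after a GroupByKey arrive as an iterable per key, the fill can be done with an ordinary forward-fill helper applied inside a Map; the key names below are placeholders:

```python
def forward_fill(rows, keys):
    """Replace None/empty values with the value seen in the previous row."""
    filled, prev = [], {}
    for row in rows:
        row = dict(row)  # copy so the input rows are not mutated
        for k in keys:
            if row.get(k) in (None, ""):
                row[k] = prev.get(k)
        filled.append(row)
        prev = row
    return filled

# usage inside a pipeline (sketch; "price" is a hypothetical column):
#   grouped | beam.MapTuple(
#       lambda key, rows: (key, forward_fill(list(rows), ["price"])))
```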
asked by User27854
2 votes, 1 answer

How to enable parallel reading of files in Dataflow?

I'm working on a Dataflow pipeline that reads 1000 files (50 MB each) from GCS and performs some computations on the rows across all files. Each file is a CSV with the same structure, just with different numbers in it, and I'm computing the average…
asked by Gaetan
2 votes, 1 answer

Apache Beam creating PCollection of Custom Entities/Models with Abstract Fields

I have a use case where we need to create a PCollection containing fields of an abstract data type. How do I define the schema and coder in such cases? This data is picked up from JSON files present in some data source (local/S3, etc.) for…
2 votes, 2 answers

How to override default metadata.lastModifiedMillis() of Apache beam's FileIO with actual file's last modified time?

Use Case: I have to filter files based on the lastModifiedTime using Apache beam (Java) My Code: PCollection readfile = pipeline .apply(FileIO.match().filepattern(path) …
asked by Durga
2 votes, 2 answers

Apache Beam: unable to read message from GCP Pub/Sub; AttributeError: 'SubscriberGrpcTransport' object has no attribute 'channel'

I am developing a POC required for an approach evaluation. I have Python, venv, Apache Beam, and gcloud installed on my Mac, and I am logged in to gcloud Pub/Sub. The following code creates a subscription on my Pub/Sub topic and reads the message…
asked by sen
2 votes, 1 answer

RuntimeValueProviderError when Running Dataflow Template Job

Trying to figure out why I'm getting these errors. A quick search only turned up answers referring to a broken version, but that doesn't seem to be the case here. Creating the template works fine, but when I run it (and as I pass the limit arg)…
2 votes, 1 answer

Using Apache Beam GCP DataflowRunner to write to BigQuery (Python)

Attempting to write a pipeline in Apache Beam (Python) that will read an input file from a GCP storage bucket, apply transformations then write to BigQuery. Here is the excerpt for the Apache Beam pipeline: import logging import apache_beam as…
2 votes, 1 answer

BigQuery Authorized Views from Apache Beam

I am trying to query a view in BigQuery using Apache Beam. The view has access to all of the datasets that it references. The Dataflow/GCE service account has access to the view, but not to its underlying datasets (this should not be a…
asked by Pablo
2 votes, 1 answer

Error while using KafkaIO in apache beam DirectRunner

I am using the Apache Beam DirectRunner to load data from a Kafka topic. My code is below: conf={'bootstrap.servers':'localhost:9092'} with beam.Pipeline() as pipeline: (pipeline | …
asked by Joseph N
2 votes, 1 answer

Which runners support KafkaIO in Apache Beam?

I am working with Apache Beam. My task is to pull data from a Kafka topic and process it in Dataflow. Does Dataflow support KafkaIO? Which runners are supported for KafkaIO?