Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.


539 questions
1
vote
1 answer

Unable to use KafkaIO with Flink Runner

I am trying to use KafkaIO read with the Flink Runner on Beam version 2.45.0, and I am seeing the following issue: org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: No translator known for…
1
vote
1 answer

Can I use a Google credential json file for authentication in a Dataflow job?

I want to use a credentials JSON file (or string) to authenticate a Beam job to read from a GCS bucket. Notably, the credentials are provided by a user (in an existing process, so I'm stuck using the JSON file rather than a service account in my own…
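As general background for this kind of question (not specific to this asker's user-credential constraint): Beam's GCS connectors resolve credentials through Application Default Credentials, which check the `GOOGLE_APPLICATION_CREDENTIALS` environment variable first. A minimal sketch, where the path is a placeholder:

```python
import os

def use_credentials_file(path: str) -> None:
    # Application Default Credentials (used by Beam's GCS IO) look at this
    # environment variable before falling back to ambient credentials, so
    # setting it before pipeline construction points the client at the file.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path
```

Note this only affects the process that constructs and submits the job; Dataflow workers run as the job's service account unless credentials are passed some other way.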
1
vote
0 answers

Periodically refresh side input from database

I have a use case to refresh a side input periodically. I've tried different possible ways, but no luck. final PCollectionView> userMap = pipeline // This is a trick for emitting a single long element in every N…
Suresh
1
vote
1 answer

Accessing pubsublite message attributes in beam pipeline - Java

We have been using PubSubLite in our Go program without any issues, and I just started using the Java library with Beam. Using the PubSubLite IO, we get a PCollection of SequencedMessage, specifically:…
1
vote
1 answer

Why is this Apache Beam pipeline that reads an Excel file and creates a .CSV from it not working?

I am pretty new to Apache Beam and I am experiencing the following problem with this simple task: I am trying to create a new .csv file starting from an .xlsx Excel file. To do this I am using Apache Beam with Python 3 and the Pandas library. I…
AndreaNobili
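As background for this kind of task: the conversion itself is plain Pandas, and only the file I/O needs Beam. A minimal sketch of the CSV-rendering step, where `frame_to_csv_lines` is a hypothetical helper and the DataFrame would come from `pd.read_excel` on the .xlsx input:

```python
import pandas as pd

def frame_to_csv_lines(df: pd.DataFrame) -> list:
    # Render the DataFrame as CSV text without the index column, then split
    # into lines so they can be written with beam.io.WriteToText (or plain
    # file I/O); the first line is the header row.
    return df.to_csv(index=False).strip().splitlines()
```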
1
vote
1 answer

Expansion service failed to build transform

I am trying to read from a MySQL database with the apache_beam.io.jdbc module (https://beam.apache.org/releases/pydoc/current/apache_beam.io.jdbc.html) ReadFromJdbc(). When I don't specify an expansion service I get the error ValueError: Unsupported…
1
vote
0 answers

BigQuery table on top of Bigtable taking too long to read in a Google Dataflow job

I have a Dataflow job that reads from a BigQuery table (created on top of Bigtable). The Dataflow job is created using a custom template in Java. I need to process around 500 million records from BigQuery. The issue I am facing is that even to read 1…
1
vote
2 answers

How to migrate file from on-prem to GCS?

I want to build an ETL pipeline that: reads files from the on-prem filesystem, then writes the files into a Cloud Storage bucket. Is it possible to import the files (regularly, every day) directly with the Storage Transfer Service? Let's suppose I want to…
1
vote
1 answer

Not able to connect to PulsarIO using Apache Beam Java sdk

While executing the below code to connect to Apache Pulsar using Apache Beam PulsarIO in the Java SDK, I get the below error when adding the Pulsar client to the Beam pipeline. Beam versions 2.40, 2.41, Java SE 1.8. import java.io.*; import org.apache.beam.sdk.*; import…
1
vote
1 answer

apache_beam, read data from GCS buckets during pipeline

I have a Pub/Sub topic which gets a message as soon as a file is created in the bucket; with the streaming pipeline I am able to get the object path. The created file is Avro. Now in my pipeline I want to read all the content of the different files,…
1
vote
1 answer

How to save logs from a C++ binary in Beam Python?

I have a C++ binary that uses glog. I run that binary within Beam Python on Cloud Dataflow. I want to save the C++ binary's stdout, stderr, and any log files for later inspection. What's the best way to do that? This guide gives an example for Beam Java.…
bill
1
vote
2 answers

Write to GCS using TextIO.write() from Postgres with a header

I have a pipeline running on GCP Dataflow where I read from a SQL instance, collect the data in a PCollection, and then write that PCollection to a CSV file. It seems that while writing to CSV I cannot pass the header at runtime (as a…
1
vote
1 answer

How to make BigQueryIO wait for some DoFn input

In Apache Beam, once you have some PCollection input you can do input.apply(new ParDo()); however BigQueryIO.read() can be applied only on the Pipeline instance, so my question is how can I make BigQueryIO.read() wait until some other DoFn finishes or…
1
vote
1 answer

apache beam with gcp cloud function

I am trying to create a GCP Dataflow job in a GCP Cloud Function. I have deployed a simple Apache Beam function which works fine, but I get a path error when I try to read an Avro file. The same script runs when I run it from my local machine with the parameter --runner…
1
vote
0 answers

Does Dataflow parallelize reading single file?

I am wondering if Dataflow is able to parallelize loading a single, potentially huge file. I know that if, for example, 10 files are loaded, parallelism is applied and those files are loaded in parallel. But what about loading a single huge file? Does…