Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.


539 questions
1
vote
1 answer

Unable to use KafkaIO with Flink Runner

I am trying to use KafkaIO read with the Flink Runner on Beam version 2.45.0, and I am seeing the following issue: org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: No translator known for…
1
vote
1 answer

Can I use a Google credential json file for authentication in a Dataflow job?

I want to use a credentials JSON file (or string) to authenticate a Beam job to read from a GCS bucket. Notably, the credentials are provided by a user (in an existing process, so I'm stuck using the JSON file rather than a service account in my own…
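As general background for this kind of question (not specific to this asker's user-credential constraint): Beam's GCS connectors resolve credentials through Application Default Credentials, which check the `GOOGLE_APPLICATION_CREDENTIALS` environment variable first. A minimal sketch, where the path is a placeholder:

```python
import os

def use_credentials_file(path: str) -> None:
    # Application Default Credentials (used by Beam's GCS IO) look at this
    # environment variable before falling back to ambient credentials, so
    # setting it before pipeline construction points the client at the file.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path
```

Note this only affects the process that constructs and submits the job; Dataflow workers run as the job's service account unless credentials are passed some other way.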
1
vote
0 answers

Periodically refresh side input from database

I have a use case to refresh a side input periodically. I've tried different possible ways, but no luck. final PCollectionView> userMap = pipeline // This is a trick for emitting a single long element in every N…
Suresh
1
vote
1 answer

Accessing pubsublite message attributes in beam pipeline - Java

We have been using PubSubLite in our Go program without any issues, and I just started using the Java library with Beam. Using the PubSubLite IO, we get a PCollection of SequencedMessage, specifically:…
1
vote
1 answer

Why is this Apache Beam pipeline that reads an Excel file and creates a .CSV from it not working?

I am pretty new to Apache Beam and I am experiencing the following problem with this simple task: I am trying to create a new .csv file starting from an .xlsx Excel file. To do this I am using Apache Beam with Python 3 and the Pandas library. I…
AndreaNobili
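As background for this kind of task: the conversion itself is plain Pandas, and only the file I/O needs Beam. A minimal sketch of the CSV-rendering step, where `frame_to_csv_lines` is a hypothetical helper and the DataFrame would come from `pd.read_excel` on the .xlsx input:

```python
import pandas as pd

def frame_to_csv_lines(df: pd.DataFrame) -> list:
    # Render the DataFrame as CSV text without the index column, then split
    # into lines so they can be written with beam.io.WriteToText (or plain
    # file I/O); the first line is the header row.
    return df.to_csv(index=False).strip().splitlines()
```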
1
vote
1 answer

Expansion service failed to build transform

I am trying to read from a MySQL database with the apache_beam.io.jdbc module (https://beam.apache.org/releases/pydoc/current/apache_beam.io.jdbc.html) ReadFromJdbc(). When I don't specify an expansion service I get the error ValueError: Unsupported…
1
vote
0 answers

BigQuery table on top of Bigtable taking too long to read in a Google Dataflow job

I have a Dataflow job that reads from a BigQuery table (created on top of Bigtable). The Dataflow job is created using a custom template in Java. I need to process around 500 million records from BigQuery. The issue I am facing is that even to read 1…
1
vote
2 answers

How to migrate file from on-prem to GCS?

I want to build an ETL pipeline that: reads files from the on-prem filesystem, then writes the files into a Cloud Storage bucket. Is it possible to import the files (regularly, every day) directly with the Storage Transfer Service? Let's suppose I want to…
1
vote
1 answer

Not able to connect to PulsarIO using Apache Beam Java sdk

While executing the below code to connect to Apache Pulsar using Apache Beam PulsarIO in the Java SDK, I get the below error when adding the Pulsar client to the Beam pipeline. Beam versions 2.40, 2.41, Java SE 1.8. import java.io.*; import org.apache.beam.sdk.*; import…
1
vote
1 answer

apache_beam, read data from GCS buckets during pipeline

I have a Pub/Sub topic which gets a message as soon as a file is created in the bucket; with the streaming pipeline I am able to get the object path. The created file is Avro. Now in my pipeline I want to read all the content of the different files,…
1
vote
1 answer

How to save logs from a C++ binary in Beam Python?

I have a C++ binary that uses glog. I run that binary within Beam Python on Cloud Dataflow. I want to save the C++ binary's stdout, stderr, and any log files for later inspection. What's the best way to do that? This guide gives an example for Beam Java.…
bill
1
vote
2 answers

Write to GCS using TextIO.write() from Postgres with a header

I have a pipeline running on GCP Dataflow where I read from a SQL instance, collect the data in a PCollection, and then write that PCollection to a CSV file. It seems that while writing to CSV I cannot pass the header at runtime (as a…
1
vote
1 answer

How to make BigQueryIO wait for some DoFn input

In Apache Beam, once you have some PCollection input you can do input.apply(new ParDo()); however BigQueryIO.read() can be applied only on the Pipeline instance, so my question is how can I make BigQueryIO.read() wait until some other DoFn finishes or…
1
vote
1 answer

apache beam with gcp cloud function

I am trying to create a GCP Dataflow job in a GCP Cloud Function. I have deployed a simple Apache Beam function which works fine, but I get a path error when I try to read an Avro file. The same script runs when I run it from my local machine with the parameter --runner…
1
vote
0 answers

Does Dataflow parallelize reading single file?

I am wondering if Dataflow is able to parallelize loading a single, potentially huge file. I know that if, for example, 10 files are loaded, parallelism is applied and those files are loaded in parallel. But what about loading a single huge file? Does…