Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.

539 questions
2
votes
1 answer

When using the Beam IO ReadFromPubSub module, can you pull messages with attributes in Python? It's unclear if it's supported

Trying to pull messages with attributes stored in Pub/Sub into a Beam pipeline. I'm wondering whether support is missing in Python and that's why I'm unable to read them. I can see that it exists in Java. pipeline_options =…
cloudpython
  • 173
  • 1
  • 7
2
votes
1 answer

How to execute stored procedure/routine with JDBCIO (apache beam)

I'm trying to execute a Postgres routine using JdbcIO for Apache Beam. So far I have tried: .apply(JdbcIO.write() .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create( …
Chimmy
  • 157
  • 2
  • 10
2
votes
2 answers

No filesystem found for scheme hdfs - org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:456)

I am using Cloudera Enterprise 6.1.0 and facing this issue with the Apache Beam 2.11 SDK when reading or writing any file on HDFS with SparkRunner. With plain Spark, however, it works. This issue appeared after upgrading the Cloudera version from…
Nikhil_Java
  • 81
  • 2
  • 9
2
votes
0 answers

Beam's filter by transform: overloaded method value cannot be applied to SimpleFunction

In the following code I am trying to read a CSV file located in dataFile using Beam's TextIO and filter its header line, but I am getting a compile error with this message: Error:(ROW, COLUMN) overloaded method value by with alternatives: [T,…
Mousa
  • 2,926
  • 1
  • 27
  • 35
2
votes
1 answer

cloud pub/sub with apache beam TypeError

I want to check for new files in cloud storage, for which I'm using Cloud Pub/Sub. After doing some analysis I want to save the result to another cloud storage bucket. From this cloud storage I will send files into BigQuery using another Pub/Sub, and template…
user11118940
2
votes
1 answer

How to add additional field to beam FileIO.matchAll() result?

I have a PCollection of KV where key is gcs file_patterns and value is some additional info of the files (e.g., the "Source" systems that generated the files). E.g., KV("gs://bucket1/dir1/*", "SourceX"), KV("gs://bucket1/dir2/*", "SourceY") I need…
2
votes
1 answer

default coder for pojo object in apache beam

As per the Apache Beam documentation, I can find data-type-specific coders and also custom coders. Beam allows creating custom coders by registering them with the coder registry. But I would like to know if there is any coder available for a POJO/bean.…
code tutorial
  • 554
  • 1
  • 5
  • 17
2
votes
0 answers

TensorFlow Transform Python using AWS S3 as data source

I am trying to run TensorFlow Transform in Python, with Apache Flink as the Beam runner. I noticed that Beam does not have an AWS S3 I/O connector, and would like to know of any workaround. Here is the list of supported I/O connectors, but…
Happy Gene
  • 502
  • 3
  • 7
2
votes
1 answer

No translator error while running Apache Beam job on Flink cluster

I created a very simple apache beam job for test, it is written in scala and looks like this: object Test { def main(args: Array[String]): Unit = { val options = PipelineOptionsFactory.fromArgs(args: _*).create() val p =…
Xiang Zhang
  • 2,831
  • 20
  • 40
2
votes
1 answer

What are the differences between `WriteToBigQuery` and `BigQuerySink`

Following this answer, I wonder what the principal differences (if any) are between WriteToBigQuery and BigQuerySink in the Apache Beam Python SDK. What are the considerations or limitations of using one over the other?
kuza
  • 2,761
  • 3
  • 22
  • 56
2
votes
0 answers

Dynamic folder name in dynamicWrite FileIO in data-flow pipeline

I have a PCollection of key/value pairs. I want to group the data by key K and write all values for a key K into file(s) on Google storage inside a folder named K. Suppose I have 2 entries after using GroupByKey to group the…
2
votes
1 answer

Reading from PubSubIO: fromTopic vs fromSubscription

I saw some example code that appears to read directly from a topic: PubsubIO.readStrings().fromTopic(fullTopic)) Are there differences between that and PubsubIO.readStrings().fromSubscription(fullTopic)) (I was always under the impression you…
2
votes
4 answers

Unable to provide a Coder for org.apache.hadoop.hbase.client.Mutation using HBaseIO with FlinkRunner

I am running into an issue that "Unable to provide a Coder for org.apache.hadoop.hbase.client.Mutation." using HbaseIO with FlinkRunner. The Exception is below: Exception in thread "main" java.lang.IllegalStateException: Unable to return a default…
K Fred
  • 81
  • 4
2
votes
1 answer

How to read large files from HTTP response in Apache Beam?

Apache Beam's TextIO can be used to read JSON files in some filesystems, but how can I create a PCollection out of a large JSON (InputStream) resulting from an HTTP response in the Java SDK?
d4nielfr4nco
  • 635
  • 1
  • 6
  • 17
2
votes
1 answer

In Apache Beam, how to handle exceptions/errors at the pipeline-IO level

I am using the Spark runner as the pipeline runner in Apache Beam and got an error, which raised this question. I know the error was due to an incorrect Column_name in the SQL query, but my question is how to handle an error/exception at the IO…
jithu
  • 33
  • 7