Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.

539 questions
2
votes
1 answer

When using the Beam IO ReadFromPubSub module, can you pull messages with attributes in Python? It's unclear if it's supported

Trying to pull messages with attributes stored in Pub/Sub into a Beam pipeline. I'm wondering whether support is missing in Python and that's why I'm unable to read them. I can see that it exists in Java. pipeline_options =…
cloudpython
  • 173
  • 1
  • 7
2
votes
1 answer

How to execute stored procedure/routine with JDBCIO (apache beam)

I'm trying to execute a Postgres routine using JdbcIO for Apache Beam. So far I have tried: .apply(JdbcIO.write() .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create( …
Chimmy
  • 157
  • 2
  • 10
2
votes
2 answers

No filesystem found for scheme hdfs - org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:456)

I am using Cloudera Enterprise 6.1.0 and facing this issue with the Apache Beam 2.11 SDK when reading or writing any file on HDFS with SparkRunner. With plain Spark, however, it works. This issue appeared after upgrading the Cloudera version from…
Nikhil_Java
  • 81
  • 2
  • 9
2
votes
0 answers

Beam's filter by transform: overloaded method value cannot be applied to SimpleFunction

In the following code I am trying to read a CSV file located in dataFile using Beam's TextIO and filter its header line, but I am getting a compile error with this message: Error:(ROW, COLUMN) overloaded method value by with alternatives: [T,…
Mousa
  • 2,926
  • 1
  • 27
  • 35
2
votes
1 answer

cloud pub/sub with apache beam TypeError

I want to check for new files in cloud storage, for which I'm using Cloud Pub/Sub. After doing some analysis I want to save the result to another cloud storage bucket. From this cloud storage I will send files into BigQuery using another Pub/Sub, and template…
user11118940
2
votes
1 answer

How to add additional field to beam FileIO.matchAll() result?

I have a PCollection of KV where key is gcs file_patterns and value is some additional info of the files (e.g., the "Source" systems that generated the files). E.g., KV("gs://bucket1/dir1/*", "SourceX"), KV("gs://bucket1/dir2/*", "SourceY") I need…
2
votes
1 answer

default coder for pojo object in apache beam

As per the Apache Beam documentation, I can find data-type-specific coders and also custom coders. Beam allows creating custom coders by registering them with the coder registry. But I would like to know if there is any coder available for a POJO/bean.…
code tutorial
  • 554
  • 1
  • 5
  • 17
2
votes
0 answers

TensorFlow Transform Python using AWS S3 as data source

I am trying to run TensorFlow Transform in Python, with Apache Flink as the Beam runner. I noticed that Beam does not have an AWS S3 I/O connector, and would like to know of any workaround. Here is the list of supported I/O connectors, but…
Happy Gene
  • 502
  • 3
  • 7
2
votes
1 answer

No translator error while running Apache Beam job on Flink cluster

I created a very simple apache beam job for test, it is written in scala and looks like this: object Test { def main(args: Array[String]): Unit = { val options = PipelineOptionsFactory.fromArgs(args: _*).create() val p =…
Xiang Zhang
  • 2,831
  • 20
  • 40
2
votes
1 answer

What are the differences between `WriteToBigQuery` and `BigQuerySink`

Following this answer, I wonder what the principal differences (if any) are between WriteToBigQuery and BigQuerySink in the Apache Beam Python SDK. What are the considerations or limitations of using one over the other?
kuza
  • 2,761
  • 3
  • 22
  • 56
2
votes
0 answers

Dynamic folder name in dynamicWrite FileIO in data-flow pipeline

I have a PCollection of key/value pairs. I want to group the data by key K and write all values for a key K into file(s) on Google storage inside a folder named K. Suppose I have 2 entries after using GroupByKey to group the…
2
votes
1 answer

Reading from PubSubIO: fromTopic vs fromSubscription

I saw some example code that appears to read directly from a topic: PubsubIO.readStrings().fromTopic(fullTopic)) Are there differences between that and PubsubIO.readStrings().fromSubscription(fullTopic)) (I was always under the impression you…
2
votes
4 answers

Unable to provide a Coder for org.apache.hadoop.hbase.client.Mutation using HBaseIO with FlinkRunner

I am running into an issue that "Unable to provide a Coder for org.apache.hadoop.hbase.client.Mutation." using HbaseIO with FlinkRunner. The Exception is below: Exception in thread "main" java.lang.IllegalStateException: Unable to return a default…
K Fred
  • 81
  • 4
2
votes
1 answer

How to read large files from HTTP response in Apache Beam?

Apache Beam's TextIO can be used to read JSON files in some filesystems, but how can I create a PCollection out of a large JSON (InputStream) resulting from an HTTP response in the Java SDK?
d4nielfr4nco
  • 635
  • 1
  • 6
  • 17
2
votes
1 answer

In Apache Beam, how to handle exceptions/errors at the pipeline-IO level

I am using the Spark runner as the pipeline runner in Apache Beam and got an error, which raised this question. I know the error was due to an incorrect Column_name in the SQL query, but my question is how to handle an error/exception at the IO…
jithu
  • 33
  • 7