Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.
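As a minimal illustration of both directions (Java SDK; the bucket paths are placeholders), here is a sketch of a pipeline that does nothing but I/O:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class MinimalIOPipeline {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        // Reading: load data into the pipeline from a source.
        p.apply("ReadLines", TextIO.read().from("gs://my-bucket/input.txt"))
         // Writing: emit the pipeline's output to a destination.
         .apply("WriteLines", TextIO.write().to("gs://my-bucket/output"));

        p.run().waitUntilFinish();
      }
    }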

539 questions
0 votes, 1 answer

Apache Beam KafkaIO offset management to external data stores

I am trying to read from multiple Kafka brokers using KafkaIO on Apache Beam. The default option for offset management is the Kafka partition itself (no longer using ZooKeeper as of Kafka >0.9). With this setup, when I restart the job/pipeline,…
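For reference, a KafkaIO read with Kafka-side offset tracking looks roughly like the sketch below (brokers, topic, and group id are placeholders; commitOffsetsInFinalize and withConsumerConfigUpdates come from recent Beam SDKs). Committing offsets back to Kafka when a checkpoint finalizes is the built-in alternative to managing them in an external store:

    import com.google.common.collect.ImmutableMap;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.common.serialization.LongDeserializer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class KafkaOffsetSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply(KafkaIO.<Long, String>read()
            .withBootstrapServers("broker1:9092,broker2:9092") // placeholder brokers
            .withTopic("my-topic")                             // placeholder topic
            .withKeyDeserializer(LongDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            // The consumer group id is what Kafka uses to track committed offsets.
            .withConsumerConfigUpdates(
                ImmutableMap.of(ConsumerConfig.GROUP_ID_CONFIG, "my-group"))
            // Commit consumed offsets back to Kafka when a checkpoint is
            // finalized, so a restarted pipeline resumes where it left off.
            .commitOffsetsInFinalize()
            .withoutMetadata());
        p.run().waitUntilFinish();
      }
    }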
0 votes, 1 answer

Dataflow GroupBy -> multiple outputs based on keys

Is there any simple way to redirect the output of GroupBy into multiple output files based on the group keys? Bin.apply(GroupByKey.<…>create()) .apply(ParDo.named("Print Bins").of( … )…
AmirCS
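One way to get per-key output files without manually handling the grouped values is FileIO.writeDynamic() from newer Beam SDKs (2.3+); a rough sketch, with the output path and file naming as placeholder choices:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Contextful;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.KV;

    public class WritePerKeySketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply(Create.of(KV.of("binA", "row1"), KV.of("binB", "row2"))
                .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())))
         .apply(FileIO.<String, KV<String, String>>writeDynamic()
             .by(KV::getKey)                                  // route by group key
             .withDestinationCoder(StringUtf8Coder.of())
             .via(Contextful.fn(KV::getValue), TextIO.sink()) // write the value as a line
             .to("gs://my-bucket/bins")                       // placeholder output dir
             .withNaming(key -> FileIO.Write.defaultNaming("bin-" + key, ".txt")));
        p.run().waitUntilFinish();
      }
    }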
0 votes, 0 answers

Using "DISTINCT" functionality in DataStoreIO.read with Apache Beam Java SDK

I am running a Dataflow job (Apache Beam SDK 2.1.0 Java, Google Dataflow runner) and I need to read from Google Datastore "distinctly" on one particular property (like the good old "DISTINCT" keyword in SQL). Here is my code snippet:…
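Neither Datastore nor DatastoreIO offers a server-side DISTINCT, so the usual workaround is deduplicating inside the pipeline, e.g. with Beam's Distinct transform keyed on the property. A sketch assuming an existing Pipeline p (project id, query, and the property name "myProp" are placeholders):

    import com.google.datastore.v1.Entity;
    import com.google.datastore.v1.Query;
    import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
    import org.apache.beam.sdk.transforms.Distinct;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    Query query = Query.newBuilder().build();  // a real query would filter by kind
    PCollection<Entity> entities = p.apply(
        DatastoreIO.v1().read().withProjectId("my-project").withQuery(query));

    // Keep one arbitrary representative entity per distinct value of "myProp".
    PCollection<Entity> distinct = entities.apply(
        Distinct.withRepresentativeValueFn(
            (Entity e) -> e.getPropertiesMap().get("myProp").getStringValue())
        .withRepresentativeType(TypeDescriptors.strings()));

Note that which entity survives per distinct value is arbitrary, which matches SQL's DISTINCT on a single column only if the other properties are not needed.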
0 votes, 1 answer

Apache Beam Template: Runtime Context Error

I'm currently trying to create a Dataflow template based on the Apache Beam SDK v2.1.0, following the Google tutorial. This is my main class: public static void main(String[] args) { // Initialize options DispatcherOptions options =…
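For context, runtime-context errors with templates typically come from evaluating pipeline options while the template is being built. Template parameters are expected to be ValueProviders, read only at execution time; a sketch reusing the DispatcherOptions name from the question (the parameter itself is hypothetical):

    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.ValueProvider;

    public interface DispatcherOptions extends PipelineOptions {
      @Description("Input path, supplied when the template is run, not when it is built")
      ValueProvider<String> getInputPath();
      void setInputPath(ValueProvider<String> value);
    }

At construction time the ValueProvider is passed through unevaluated, e.g. p.apply(TextIO.read().from(options.getInputPath())); calling options.getInputPath().get() in main() breaks template creation.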
0 votes, 1 answer

Apache Beam 2.1.0 with Google DatastoreIO calls a non-existent Guava Preconditions.checkArgument overload on GAE

When building a Dataflow template that should read from Datastore, I get the following error in the Stackdriver logs (from Google App Engine): java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;I)V …
0 votes, 1 answer

Apache Beam Program execution without using Maven

I want to run a simple example Beam program using the Apache Spark runner. 1) I was able to compile the program locally. 2) I want to push the JAR file to a QA box where Maven is not installed. 3) I see the example with a Maven command…
VIjay
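Maven is only needed to build; once a bundled (shaded) JAR is produced on a machine that has Maven, the QA box can start the pipeline with plain java. A sketch that pins the Spark runner in code (class name and flag values are placeholder choices):

    import org.apache.beam.runners.spark.SparkPipelineOptions;
    import org.apache.beam.runners.spark.SparkRunner;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RunOnSpark {
      public static void main(String[] args) {
        // Launch on the QA box without Maven, e.g.:
        //   java -cp beam-example-bundled.jar RunOnSpark --sparkMaster=local[4]
        SparkPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
        options.setRunner(SparkRunner.class);
        Pipeline p = Pipeline.create(options);
        // ... build the pipeline here ...
        p.run().waitUntilFinish();
      }
    }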
0 votes, 1 answer

BigtableIO Read keys with a given prefix

I'm looking for the best way of reading all the rows with a given prefix. I see that there is a withKeyRange method in BigtableIO.Read, but it requires you to specify a start key and an end key. Is there a way to specify reading from a prefix?
Narek
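There is no prefix option in BigtableIO, but a prefix is equivalent to the range [prefix, prefix-with-last-byte-incremented). A sketch assuming an existing Pipeline p (project, instance, table, and prefix are placeholders; withProjectId/withInstanceId are from recent SDKs, older ones use withBigtableOptions):

    import java.nio.charset.StandardCharsets;
    import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
    import org.apache.beam.sdk.io.range.ByteKey;
    import org.apache.beam.sdk.io.range.ByteKeyRange;

    // Build a [start, end) key range covering exactly the rows under a prefix.
    static ByteKeyRange prefixRange(String prefix) {
      byte[] start = prefix.getBytes(StandardCharsets.UTF_8);
      byte[] end = start.clone();
      end[end.length - 1]++;  // assumes the prefix does not end in 0xFF
      return ByteKeyRange.of(ByteKey.copyFrom(start), ByteKey.copyFrom(end));
    }

    // ... then, while building the pipeline:
    p.apply(BigtableIO.read()
        .withProjectId("my-project")
        .withInstanceId("my-instance")
        .withTableId("my-table")
        .withKeyRange(prefixRange("user#42#")));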
0 votes, 1 answer

Apache Beam MongoDB source

I have a Beam pipeline with MongoDB as the source, but when I try to run it, it throws an exception: An exception occurred while executing the Java class. null: InvocationTargetException:…
guru107
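For comparison, a minimal MongoDbIO read looks like the sketch below (connection string, database, and collection are placeholders). The beam-sdks-java-io-mongodb artifact and the MongoDB Java driver must be on the classpath; a missing dependency is one common way to end up with a wrapped InvocationTargetException:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.mongodb.MongoDbIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;
    import org.bson.Document;

    public class MongoReadSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        // The read produces a PCollection of org.bson.Document.
        PCollection<Document> docs = p.apply(MongoDbIO.read()
            .withUri("mongodb://localhost:27017")  // placeholder connection string
            .withDatabase("my_db")
            .withCollection("my_collection"));
        p.run().waitUntilFinish();
      }
    }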
0 votes, 1 answer

Google dataflow only partly uncompressing files compressed with pbzip2

seq 1 1000000 > testfile
bzip2 -kz9 testfile
mv testfile.bz2 testfile-bzip2.bz2
pbzip2 -kzb9 testfile
mv testfile.bz2 testfile-pbzip2.bz2
gsutil cp testfile gs://[bucket]
gsutil cp testfile-bzip2.bz2 gs://[bucket]
gsutil cp testfile-pbzip2…
Fernet
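Background that may help interpret this: pbzip2 writes a concatenation of independent bzip2 streams, so a decompressor that stops at the first end-of-stream marker yields only part of the data. The decompression mode can at least be made explicit; a sketch against newer SDKs, assuming an existing Pipeline p (the Compression enum is Beam 2.2+, the path is a placeholder, and whether multi-stream bzip2 is fully read still depends on the SDK version):

    import org.apache.beam.sdk.io.Compression;
    import org.apache.beam.sdk.io.TextIO;

    // Force bzip2 decompression instead of relying on .bz2 extension sniffing.
    p.apply(TextIO.read()
        .from("gs://my-bucket/testfile-pbzip2.bz2")
        .withCompression(Compression.BZIP2));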
0 votes, 1 answer

Error streaming from Pub/Sub into BigQuery (Python)

I am having trouble creating a DataflowRunner job that connects a Pub/Sub source to a BigQuery sink by plugging these two: apache_beam.io.gcp.pubsub.PubSubSource and apache_beam.io.gcp.bigquery.BigQuerySink into lines 59 and 74, respectively, in the…
0 votes, 2 answers

How do I run a Beam class in Dataflow which accesses a Google Cloud SQL instance?

When I run my pipeline from my local machine, I can update the table which resides in the Cloud SQL instance. But when I moved this to run using DataflowRunner, the same fails with the exception below. To connect from my Eclipse, I created the…
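Dataflow workers have no fixed IPs to whitelist, so the usual route is the Cloud SQL JDBC socket factory, which authorizes through the Cloud SQL API instead of an IP allowlist. A sketch with JdbcIO, assuming an existing Pipeline p (instance name, credentials, and query are placeholders; the mysql-socket-factory artifact must be on the classpath):

    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.values.KV;

    p.apply(JdbcIO.<KV<String, String>>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "com.mysql.jdbc.Driver",
                "jdbc:mysql://google/mydb"
                    + "?cloudSqlInstance=my-project:us-central1:my-instance"
                    + "&socketFactory=com.google.cloud.sql.mysql.SocketFactory")
            .withUsername("user")
            .withPassword("password"))
        .withQuery("SELECT id, name FROM my_table")
        .withRowMapper((JdbcIO.RowMapper<KV<String, String>>)
            rs -> KV.of(rs.getString(1), rs.getString(2)))
        // An explicit coder; JdbcIO cannot infer one from the lambda above.
        .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));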
0 votes, 1 answer

Apache Beam throws Cannot setCoder(null): Java

I am new to Apache Beam and I am trying to connect to a Google Cloud instance of a MySQL database. When I run the code snippet below, it throws the exception below. Logger logger = LoggerFactory.getLogger(GoogleSQLPipeline.class); …
Balu
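"Cannot setCoder(null)" generally means Beam could not infer a coder for a PCollection's element type and a transform then forwarded the null inference result to setCoder. The fix is to supply a coder explicitly, either via the source's withCoder(...) (JdbcIO, for instance, expects one, as in the sketch above) or directly on the PCollection. A minimal sketch with a hypothetical Serializable POJO:

    import java.io.Serializable;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.SerializableCoder;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.PCollection;

    public class CoderSketch {
      // Hypothetical record type; Serializable so SerializableCoder applies.
      static class MyRecord implements Serializable {
        final String name;
        MyRecord(String name) { this.name = name; }
      }

      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
        PCollection<MyRecord> records =
            p.apply(Create.of(new MyRecord("a"), new MyRecord("b")))
             // Explicitly pin the coder instead of relying on inference.
             .setCoder(SerializableCoder.of(MyRecord.class));
        p.run().waitUntilFinish();
      }
    }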
0 votes, 1 answer

Sharding BigQuery output tables

I read both from the documentation and from this answer that it is possible to determine the table destination dynamically. I used essentially the approach below: PCollection foos = ...; foos.apply(BigQueryIO.write().to(new…
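For reference, the dynamic-destination overload of BigQueryIO takes a function from each windowed element to a TableDestination. A sketch assuming an existing Pipeline p (project, dataset, and the routing field "shard" are placeholders):

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import java.util.Collections;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
    import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.SerializableFunction;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.ValueInSingleWindow;

    TableSchema schema = new TableSchema().setFields(Collections.singletonList(
        new TableFieldSchema().setName("shard").setType("STRING")));

    PCollection<TableRow> rows = p.apply(
        Create.of(new TableRow().set("shard", "a"))
              .withCoder(TableRowJsonCoder.of()));

    rows.apply(BigQueryIO.writeTableRows()
        // One TableDestination per element, derived from the "shard" field.
        .to((SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>)
            input -> new TableDestination(
                "my-project:my_dataset.table_" + input.getValue().get("shard"),
                "table sharded by the 'shard' field"))
        .withSchema(schema)  // one schema shared by all destination tables
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));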
0 votes, 3 answers

Example of reading and writing transforms with PubSub using the Apache Beam Python SDK

I see examples here https://cloud.google.com/dataflow/model/pubsub-io#reading-with-pubsubio for Java, but when I look here https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/pubsub.py it says: def reader(self): raise…
0 votes, 1 answer

Apache Beam HBase row count blocks and does not return

I'm starting to try out Apache Beam, using it to read and count an HBase table. When I try to read the table without Count.globally, it can read the rows, but when I try to count the number of rows, the process hangs and never exits. Here is the very…
David Wang
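For reference, the read-and-count combination would look roughly like this sketch (ZooKeeper quorum and table id are placeholders; requires the beam-sdks-java-io-hbase artifact). Since HBaseIO.read() is a bounded source, a hang like the one described may point to a connectivity or HBase client/version mismatch surfacing during the scan rather than to the count itself:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.hbase.HBaseIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;

    public class HBaseCountSketch {
      public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host");  // placeholder quorum

        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        PCollection<Result> rows = p.apply(
            HBaseIO.read().withConfiguration(conf).withTableId("my-table"));
        // A bounded read, so the global count terminates once the scan completes.
        rows.apply(Count.<Result>globally());
        p.run().waitUntilFinish();
      }
    }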