Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.

539 questions
3 votes, 1 answer

Kafka: exactly-once semantics configuration using Apache Beam

I'm trying to configure exactly-once semantics in Kafka (Apache Beam). Here are the changes I'm going to introduce: Producer: enable.idempotence = true, transactional.id = uniqueTransactionalId. Consumer: set enable.auto.commit = false //…
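For the Beam side of this setup, here is a minimal sketch assuming a recent Beam Java SDK; broker address, topic names, shard count, and the sink group id are placeholders. KafkaIO manages idempotence and transactional.id itself when its exactly-once sink is enabled, so the reader only needs to consume committed records:

    import java.util.Collections;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class KafkaEosSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        p.apply(KafkaIO.<String, String>read()
                .withBootstrapServers("broker:9092")              // placeholder
                .withTopic("input-topic")                         // placeholder
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                // Consume only committed records; Beam tracks offsets itself,
                // so enable.auto.commit is left off.
                .withConsumerConfigUpdates(
                    Collections.singletonMap("isolation.level", "read_committed"))
                .withoutMetadata())
         .apply(KafkaIO.<String, String>write()
                .withBootstrapServers("broker:9092")
                .withTopic("output-topic")                        // placeholder
                .withKeySerializer(StringSerializer.class)
                .withValueSerializer(StringSerializer.class)
                // Exactly-once sink: uses Kafka transactions internally; only
                // supported on runners that implement it (e.g. Dataflow, Flink).
                .withEOS(5, "eos-sink-group"));                   // placeholders
        p.run();
      }
    }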
3 votes, 1 answer

Use Dataflow failed insert WriteResult to handle table not found exception

Hi, I want to dynamically create a table on the fly in a Dataflow pipeline. First, I capture the BigQueryIO WriteResult, then use it to create the table: WriteResult writeResult = incomingRecords.apply( "WriteToBigQuery", …
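A hedged sketch of the pattern with the Java SDK (tableSpec and the downstream handling are placeholders): with streaming inserts, a retry policy that gives up on non-transient errors, and extended error info, a missing table surfaces in getFailedInsertsWithErr(), where it can be acted on:

    WriteResult writeResult = incomingRecords.apply("WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to(tableSpec)                                        // placeholder
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
            // Without a retry policy, "table not found" is treated as
            // transient and retried indefinitely instead of failing.
            .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
            .withExtendedErrorInfo());

    writeResult.getFailedInsertsWithErr()
        .apply("HandleFailedInserts", ParDo.of(new DoFn<BigQueryInsertError, Void>() {
          @ProcessElement
          public void process(@Element BigQueryInsertError err) {
            // err.getError() carries the insert error details; a "notFound"
            // reason means the target table is missing, so the table could
            // be created here and err.getRow() re-inserted.
          }
        }));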
3 votes, 2 answers

How to speed up bulk importing into Google Cloud Datastore with multiple workers?

I have an apache-beam based Dataflow job that reads from a single text file (stored in Google Cloud Storage) using the vcf source, transforms the text lines into Datastore entities, and writes them to the Datastore sink. The workflow works fine, but the cons I…
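The usual cause here is fusion: everything downstream of a single-file read gets fused onto the one worker that read the file. A minimal sketch of breaking fusion with a reshuffle before the Datastore write (the path, project id, and parsing DoFn are placeholders):

    p.apply("ReadVcf", TextIO.read().from("gs://my-bucket/input.vcf"))   // placeholder
     // Redistribute lines across workers so the parse/write steps are not
     // fused onto the single worker that performed the read.
     .apply("BreakFusion", Reshuffle.viaRandomKey())
     .apply("ToEntity", ParDo.of(new LineToEntityFn()))                  // hypothetical DoFn
     .apply("WriteToDatastore", DatastoreIO.v1().write().withProjectId("my-project"));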
3 votes, 1 answer

'module' object has no attribute 'WriteToBigQuery' when running Apache Beam on Google App Engine Flex

I have a Google App Engine app triggering a Cloud Dataflow pipeline. This pipeline is supposed to write the final PCollection to Google BigQuery, but I can't find a way to install the right apache_beam.io dependency. I'm running Apache Beam version…
3 votes, 3 answers

Reading bulk data from a database using Apache Beam

I would like to know how JdbcIO would execute a query in parallel if my query returns millions of rows. I have referred to https://issues.apache.org/jira/browse/BEAM-2803 and the related pull requests, but I couldn't understand it completely. ReadAll…
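The rough idea behind ReadAll is that parallelism comes from the input PCollection: you split the key space yourself, and each range becomes one query. A sketch under that assumption (driver, URL, table, the range split, and the MyRow type are placeholders):

    // Each KV is one (inclusive, exclusive) id range; JdbcIO.readAll() runs the
    // parameterized query once per range, so the ranges execute in parallel.
    PCollection<KV<Integer, Integer>> ranges = p.apply(Create.of(
        KV.of(0, 1_000_000), KV.of(1_000_000, 2_000_000)));        // placeholder split

    PCollection<MyRow> rows = ranges.apply(JdbcIO.<KV<Integer, Integer>, MyRow>readAll()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "com.mysql.cj.jdbc.Driver", "jdbc:mysql://host/db"))   // placeholders
        .withQuery("SELECT id, name FROM t WHERE id >= ? AND id < ?")
        .withParameterSetter((range, stmt) -> {
          stmt.setInt(1, range.getKey());
          stmt.setInt(2, range.getValue());
        })
        .withRowMapper(rs -> new MyRow(rs.getInt("id"), rs.getString("name")))
        .withCoder(SerializableCoder.of(MyRow.class)));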
3 votes, 1 answer

Apache Beam: Transform an object holding a list of objects into multiple TableRows to write to BigQuery

I am working on a Beam pipeline to process a JSON and write it to BigQuery. The JSON is like this: { "message": [{ "name": "abc", "itemId": "2123", "itemName": "test" }, { "name": "vfg", "itemId": "56457", "itemName":…
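One way to do this, sketched with hypothetical Message and Item POJOs matching the JSON above: a DoFn that emits one TableRow per element of the inner list, so a single input object fans out into multiple BigQuery rows:

    static class ExplodeItemsFn extends DoFn<Message, TableRow> {
      @ProcessElement
      public void process(@Element Message msg, OutputReceiver<TableRow> out) {
        // One output row per entry in the "message" array.
        for (Item item : msg.getMessage()) {
          out.output(new TableRow()
              .set("name", item.getName())
              .set("itemId", item.getItemId())
              .set("itemName", item.getItemName()));
        }
      }
    }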
3 votes, 2 answers

Using CoGroupByKey with a custom type results in a Coder error

I want to join two PCollections (each from a different input) by following the steps described in the "Joins with CoGroupByKey" section here: https://cloud.google.com/dataflow/model/group-by-key. In my case, I want to join GeoIP's…
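The error usually means Beam cannot infer a coder for the custom value type used in the join. Two hedged options, shown with a hypothetical GeoIpEntry class:

    // Option 1: let Avro derive a coder from the class definition.
    @DefaultCoder(AvroCoder.class)
    public class GeoIpEntry implements Serializable {
      // fields...
    }

    // Option 2: register a coder for the class explicitly on the pipeline.
    pipeline.getCoderRegistry().registerCoderForClass(
        GeoIpEntry.class, SerializableCoder.of(GeoIpEntry.class));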
3 votes, 1 answer

Apache Beam - org.apache.beam.sdk.util.UserCodeException: java.sql.SQLException: Cannot create PoolableConnectionFactory (Method not supported)

I am trying to connect to a Hive instance installed in a cloud instance using Apache Beam on Dataflow. When I run this, I get the exception below. It happens whenever I access this DB using Apache Beam. I have seen many related questions which…
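The exception typically originates in the connection pool rather than Beam itself: dbcp validates new connections with JDBC calls (for example Connection#isValid) that Hive's driver leaves unimplemented, hence "Method not supported", so the driver and pooling layer need to agree on what is supported. A minimal hedged read sketch, assuming hive-jdbc is on the classpath; the host and query are placeholders:

    PCollection<String> names = p.apply(JdbcIO.<String>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "org.apache.hive.jdbc.HiveDriver",
            "jdbc:hive2://1.2.3.4:10000/default"))     // placeholder host/db
        .withQuery("SELECT name FROM employee")        // placeholder query
        .withRowMapper((JdbcIO.RowMapper<String>) rs -> rs.getString(1))
        .withCoder(StringUtf8Coder.of()));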
3 votes, 2 answers

Programmatically generating BigQuery schema in Beam pipeline

I have a collection of homogeneous dicts; how do I write them to BigQuery without knowing the schema? The BigQuerySink requires that I specify the schema when I construct it. But I don't know the schema: it's defined by the keys of the dicts I'm…
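The question targets the Python SDK's BigQuerySink; as an illustration of the same idea in the Java SDK used elsewhere on this page, a TableSchema can be assembled at pipeline-construction time from whatever field names are discovered. The field map below is a hard-coded stand-in for keys derived from the data:

    Map<String, String> fields = Map.of("name", "STRING", "itemId", "INTEGER");  // stand-in
    List<TableFieldSchema> fieldSchemas = new ArrayList<>();
    fields.forEach((name, type) ->
        fieldSchemas.add(new TableFieldSchema().setName(name).setType(type)));
    TableSchema schema = new TableSchema().setFields(fieldSchemas);
    // ... BigQueryIO.writeTableRows().withSchema(schema) ...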
3 votes, 1 answer

Simple Apache Beam manipulations run very slowly

I'm very new to Apache Beam and my Java skills are quite low, but I'd like to understand why my simple record manipulations run so slowly with Apache Beam. What I'm trying to do is the following: I have a CSV file with 1 million records…
3 votes, 1 answer

Apache Beam Bigtable Iterable mutation

I'm migrating my Google Dataflow Java 1.9 pipeline to Beam 2.0 and trying to use BigtableIO.Write: .... .apply("", BigtableIO.write() .withBigtableOptions(bigtableOptions) .withTableId("twoSecondVitals")); In the…
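BigtableIO.write() consumes KV<ByteString, Iterable<Mutation>>, so each element has to be converted into a row key plus a list of mutations before the sink. A sketch with a hypothetical Vital input type and a placeholder column family:

    static class ToBigtableMutationsFn
        extends DoFn<Vital, KV<ByteString, Iterable<Mutation>>> {
      @ProcessElement
      public void process(@Element Vital v,
          OutputReceiver<KV<ByteString, Iterable<Mutation>>> out) {
        Mutation setCell = Mutation.newBuilder()
            .setSetCell(Mutation.SetCell.newBuilder()
                .setFamilyName("stats")                               // placeholder family
                .setColumnQualifier(ByteString.copyFromUtf8("value"))
                .setTimestampMicros(System.currentTimeMillis() * 1000L)
                .setValue(ByteString.copyFromUtf8(v.getValue())))
            .build();
        // The sink takes an Iterable so several mutations can target one row.
        out.output(KV.of(ByteString.copyFromUtf8(v.getKey()),
            ImmutableList.of(setCell)));
      }
    }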
3 votes, 1 answer

Sharing schema definition between BigQuery Client Libraries and Beam IO

Background: We are using the Cloud Dataflow runner in Beam 2.0 to ETL our data into our warehouse in BigQuery. We would like to use the BigQuery Client Libraries (Beta) to create the schema of our data warehouse prior to the Beam pipelines populating…
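There is no built-in bridge between the two schema types, so one hedged approach is a small converter from the client library's Schema to the TableSchema that BigQueryIO expects; the schema definition then lives in one place. Accessor names may vary with the client-library version:

    static TableSchema toTableSchema(com.google.cloud.bigquery.Schema schema) {
      List<TableFieldSchema> fields = new ArrayList<>();
      for (com.google.cloud.bigquery.Field f : schema.getFields()) {
        fields.add(new TableFieldSchema()
            .setName(f.getName())
            .setType(f.getType().name()));   // e.g. "STRING", "INTEGER"
      }
      return new TableSchema().setFields(fields);
    }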
3 votes, 2 answers

Why are increments not supported in the Dataflow-Bigtable connector?

We have a use case in streaming mode where we want to keep track of a counter in Bigtable from the pipeline (e.g. the number of items that have finished processing), for which we need the increment operation. From looking at…
2 votes, 0 answers

How to create an unbounded input for Beam in Go?

I'm trying to use the Go Beam SDK to create a pipeline processing Pub/Sub messages. github.com/apache/beam/sdks/v2/go/pkg/beam I understand that the pubsubio connector makes external calls and works only on the Dataflow runner. What if I want to test my…
2 votes, 1 answer

Same Apache Beam code works with the DirectRunner but not the Dataflow runner

I have a piece of Apache Beam pipeline code that reads from a file in a GCS bucket and prints it. It works perfectly with the DirectRunner and prints the file output, but with the Dataflow runner it prints nothing and raises no errors either. Do…
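A common explanation: the DirectRunner executes DoFns in the local JVM, so System.out.println appears in your console, while on Dataflow the code runs on remote workers and stdout never reaches you. A sketch that logs through SLF4J instead, so the messages show up in the worker logs in Cloud Logging:

    static class PrintLineFn extends DoFn<String, String> {
      private static final Logger LOG = LoggerFactory.getLogger(PrintLineFn.class);

      @ProcessElement
      public void process(@Element String line, OutputReceiver<String> out) {
        LOG.info("Read line: {}", line);   // visible in Dataflow worker logs
        out.output(line);
      }
    }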