Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.

539 questions
0
votes
0 answers

Error trying to implement a mongoDB IO connector Sink

We started implementing a MongoDB IO connector for a first Apache Beam test, and the Source part seems to work properly. Concerning the Sink part, the execution leads to an error... We used these guidelines for the implementation:…
Pascal Gula
  • 1,143
  • 1
  • 8
  • 18
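Later releases of the Python SDK ship a built-in apache_beam.io.mongodbio connector, but a lightweight alternative to implementing a full Sink is a DoFn that writes each bundle with the pymongo client. A minimal sketch, assuming pymongo is installed on the workers; the URI, database, and collection names are placeholders:

    import apache_beam as beam

    class WriteToMongoFn(beam.DoFn):
        """Writes elements to MongoDB; a DoFn-based stand-in for a full Sink."""

        def __init__(self, uri, db, collection):
            self._uri = uri
            self._db = db
            self._collection = collection

        def start_bundle(self):
            # Create the client per bundle so it is never pickled.
            import pymongo
            self._client = pymongo.MongoClient(self._uri)
            self._coll = self._client[self._db][self._collection]

        def process(self, element):
            self._coll.insert_one(dict(element))

        def finish_bundle(self):
            self._client.close()

    with beam.Pipeline() as p:
        (p
         | beam.Create([{'name': 'a'}, {'name': 'b'}])
         | beam.ParDo(WriteToMongoFn('mongodb://localhost:27017', 'db', 'out')))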
0
votes
1 answer

Apache Beam Input from Ports

[Python - Beam SDK] I would like to be able to test timing issues in integration tests, so I want to build a generator system that pipes messages into my Beam application with timestamps I specify. My current idea is to have an application write to…
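For injecting elements with controlled timestamps, the Python SDK's TestStream is built for exactly this kind of timing test. A minimal sketch (runner support is limited; the DirectRunner handles it):

    import apache_beam as beam
    from apache_beam.testing.test_stream import TestStream
    from apache_beam.transforms.window import TimestampedValue

    # Emit elements with explicit event timestamps and advance the
    # watermark between them, so timing behavior is deterministic.
    stream = (TestStream()
              .add_elements([TimestampedValue(b'msg1', 0)])
              .advance_watermark_to(10)
              .add_elements([TimestampedValue(b'msg2', 15)])
              .advance_watermark_to_infinity())

    with beam.Pipeline() as p:
        p | stream | beam.Map(print)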
0
votes
1 answer

How to read multiple datastore kinds in cloud dataflow python pipeline

I'm trying to read multiple datastore kinds from the default namespace in my Python pipeline and want to work on them. The functions that I wrote work fine locally with DirectRunner, but when I run the pipeline on cloud using DataflowRunner, one of the…
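One workable pattern is a separate ReadFromDatastore per kind, merged with Flatten. A sketch against the v1new Datastore API; the project id and kind names are placeholders:

    import apache_beam as beam
    from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
    from apache_beam.io.gcp.datastore.v1new.types import Query

    PROJECT = 'my-project'          # placeholder project id
    KINDS = ['Kind1', 'Kind2']      # placeholder kind names

    with beam.Pipeline() as p:
        # One read per kind; Flatten merges them into a single PCollection.
        per_kind = [
            p | 'read %s' % kind >> ReadFromDatastore(
                Query(kind=kind, project=PROJECT))
            for kind in KINDS
        ]
        merged = per_kind | beam.Flatten()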
0
votes
2 answers

Scio all saveAs txt file methods output a txt file with part prefix

If I want to output a SCollection of TableRow or String to google cloud storage (GCS) I'm using saveAsTableRowJsonFile or saveAsTextFile, respectively. Both of these methods ultimately use private[scio] def pathWithShards(path: String) =…
0
votes
1 answer

Error executing Apache BEAM sql query - Use a Window.into or Window.triggering transform prior to GroupByKey

How do I include a Window.into or Window.triggering transform prior to GroupByKey in Beam SQL? I have the following 2 tables: 1st table CREATE TABLE table1( field1 varchar, field2 varchar ) 2nd table CREATE TABLE table2( field1 varchar, field3…
Ritesh Sinha
  • 820
  • 5
  • 22
  • 50
0
votes
1 answer

Dataflow: How to create a pipeline from an already existing PCollection spewed by another pipeline

I am trying to split my pipeline into many smaller pipelines so they execute faster. I am partitioning a PCollection of Google Cloud Storage blobs (PCollection) so that I get a PCollectionList collectionList; from there I would love to be…
user9773014
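In the Python SDK, the counterpart of a PCollectionList is the tuple returned by beam.Partition; each partition is an independent branch of the same pipeline rather than a new pipeline. A minimal sketch with placeholder routing logic:

    import apache_beam as beam

    NUM_SHARDS = 4  # placeholder shard count

    def by_bucket(blob_path, num_partitions):
        # Route each element to a partition by a hash of its path.
        return hash(blob_path) % num_partitions

    with beam.Pipeline() as p:
        blobs = p | beam.Create(['gs://bucket/a', 'gs://bucket/b'])
        shards = blobs | beam.Partition(by_bucket, NUM_SHARDS)
        # Each shard gets its own downstream transforms.
        for i, shard in enumerate(shards):
            shard | 'process %d' % i >> beam.Map(lambda x: x.upper())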
0
votes
0 answers

Setting Environment variables in apache beam while creating a datasource for cloudsql/mysql for python apache beam sdk

Hi, I am trying to create a datasource for Apache Beam in Python. I know that with Java you can connect to Cloud SQL using the JDBC library. Similarly, I am trying to create a source for Dataflow (Apache Beam) on Google Cloud Platform. I have inherited from…
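The Python SDK has no JdbcIO; a common workaround is a DoFn that opens the database connection itself, with credentials passed as constructor arguments (or pipeline options) rather than environment variables, since Dataflow workers do not inherit the launcher's environment. A sketch assuming the PyMySQL client and placeholder credentials:

    import apache_beam as beam

    class ReadFromMySQL(beam.DoFn):
        """Queries MySQL inside a DoFn; a Python-SDK stand-in for JdbcIO."""

        def __init__(self, host, user, password, db, query):
            self._params = dict(host=host, user=user, password=password, db=db)
            self._query = query

        def process(self, _):
            import pymysql  # assumed installed on the workers
            conn = pymysql.connect(**self._params)
            try:
                with conn.cursor() as cur:
                    cur.execute(self._query)
                    for row in cur.fetchall():
                        yield row
            finally:
                conn.close()

    with beam.Pipeline() as p:
        rows = (p
                | beam.Create(['seed'])  # single element triggers one read
                | beam.ParDo(ReadFromMySQL('10.0.0.1', 'user', 'secret',
                                           'mydb', 'SELECT * FROM mytable')))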
0
votes
1 answer

triggering at fixed intervals in apache beam streaming

I am using Apache Beam to write some streaming pipelines. One requirement for my use case is that I want to trigger every X minutes relative to window start or end time. How can I achieve this? The current trigger…
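Beam's processing-time triggers fire relative to when the first element arrives in a pane, not the absolute window boundary; a Repeatedly-wrapped AfterProcessingTime is the usual way to fire every X minutes. A sketch, where `events` stands in for an unbounded PCollection:

    import apache_beam as beam
    from apache_beam.transforms import trigger, window

    # Fire every 5 minutes of processing time within a 1-hour window,
    # discarding elements already emitted in earlier panes.
    windowed = (events
                | beam.WindowInto(
                    window.FixedWindows(60 * 60),
                    trigger=trigger.Repeatedly(
                        trigger.AfterProcessingTime(5 * 60)),
                    accumulation_mode=trigger.AccumulationMode.DISCARDING))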
0
votes
1 answer

Apache Beam Python SDK with Pub/Sub source stuck at runtime

I am writing a program in Apache Beam using Python SDK to read from Pub/Sub the contents of a JSON file, and do some processing on the received string. This is the part in the program where I pull contents from Pub/Sub and do the processing: with…
Arjun Kay
  • 308
  • 1
  • 12
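A frequent cause of an apparently stuck Pub/Sub read is running the pipeline in batch mode; Pub/Sub is an unbounded source, so streaming must be enabled explicitly. A sketch with a placeholder topic:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import (PipelineOptions,
                                                      StandardOptions)

    options = PipelineOptions()
    # Pub/Sub is an unbounded source: without streaming mode the read
    # can appear to hang at runtime.
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (p
         | beam.io.ReadFromPubSub(topic='projects/my-project/topics/my-topic')
         | beam.Map(lambda msg: msg.decode('utf-8'))  # payloads are bytes
         | beam.Map(print))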
0
votes
1 answer

How to create dependency between tasks in Apache beam python

I am new to Apache Beam and exploring the Python version of Apache Beam Dataflow. I want to execute my dataflow tasks in a certain order but it executes all tasks in parallel mode. How do I create task dependencies in Apache Beam Python? Sample code: (in this…
MJK
  • 1,381
  • 3
  • 15
  • 22
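Beam schedules steps by data dependencies, not declaration order; to force step B to wait for step A, make B consume A's output, for example as a side input. A minimal sketch:

    import apache_beam as beam
    from apache_beam.pvalue import AsIter

    with beam.Pipeline() as p:
        first = (p
                 | 'seed A' >> beam.Create([1, 2, 3])
                 | 'step A' >> beam.Map(lambda x: x * 10))

        # 'step B' cannot start until 'step A' has produced output,
        # because it consumes that output as a side input.
        second = (p
                  | 'seed B' >> beam.Create(['go'])
                  | 'step B' >> beam.FlatMap(
                      lambda _, a_out: a_out, AsIter(first)))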
0
votes
0 answers

Slow BigQuery load job when data comes from Apache Beam JdbcIO

I'm trying to add rows to BigQuery from my Apache Beam pipeline using a BigQuery load job. The initial data I'm processing comes from a PostgreSQL database and is read into Beam with the JdbcIO datasource: @Override public PCollection
0
votes
1 answer

Issues while using Snappy for tensorflow preprocessing using BeamIO

While using Apache Beam IO for preprocessing data, the snappy library was a good-to-have module for compression, but the file transformation doesn't seem to work, as it cannot find the crc32 compress function in the library! I'm using…
0
votes
1 answer

Apache Beam Dataflow Reading big CSV with splittable=True causing duplicate entries

I used the code snippet below to read CSV files into the pipeline as Dicts. class MyCsvFileSource(beam.io.filebasedsource.FileBasedSource): def read_records(self, file_name, range_tracker): self._file = self.open_file(file_name) …
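The duplicates typically come from read_records ignoring the range_tracker: with splittable=True, every split re-reads the whole file. The simplest fix is to declare the source unsplittable. A sketch, with the CSV parsing kept deliberately minimal:

    import csv
    import apache_beam as beam
    from apache_beam.io import filebasedsource

    class MyCsvFileSource(filebasedsource.FileBasedSource):
        def __init__(self, file_pattern):
            # splittable=False stops multiple bundles from each
            # re-reading the same records.
            super(MyCsvFileSource, self).__init__(
                file_pattern, splittable=False)

        def read_records(self, file_name, range_tracker):
            # Claim the single unsplittable range before emitting.
            if not range_tracker.try_claim(range_tracker.start_position()):
                return
            f = self.open_file(file_name)
            try:
                lines = (line.decode('utf-8') for line in f)
                for row in csv.DictReader(lines):
                    yield row
            finally:
                f.close()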
0
votes
2 answers

Read a pickle from another pipeline in Beam?

I'm running batch pipelines in Google Cloud Dataflow. I need to read objects in one pipeline that another pipeline has previously written. The easiest way to serialize objects is pickle / dill. The writing works well, writing a number of files, each with a…
Maximilian
  • 7,512
  • 3
  • 50
  • 63
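Reading the other pipeline's output back is mostly a matter of expanding the file pattern and unpickling each match through FileSystems. A sketch assuming one pickled object per file; the pattern is a placeholder:

    import pickle
    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems

    def load_pickles(pattern):
        # Expand the glob, then unpickle each matched file.
        for metadata in FileSystems.match([pattern])[0].metadata_list:
            with FileSystems.open(metadata.path) as f:
                yield pickle.load(f)

    with beam.Pipeline() as p:
        objs = (p
                | beam.Create(['gs://my-bucket/out/part-*'])  # placeholder
                | beam.FlatMap(load_pickles))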
0
votes
0 answers

apache hive integration with apache beam

I am doing a POC to connect to Apache Hive in an Apache Beam pipeline, and I am getting an exception similar to the one in the SO link below. I did change the version of the JDBC driver to the latest, but am still facing the issue. As mentioned in the below link, it…