Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions about reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.

539 questions
0 votes, 1 answer

How to ensure an insert rate of 1 insert per second when using ClickhouseIO

I'm using the Apache Beam Java SDK to process events and write them to a ClickHouse database. Luckily there is a ready-to-use ClickhouseIO. ClickhouseIO accumulates elements and inserts them in batches, but because of the parallel nature of the pipeline…
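A minimal Java sketch of the connector the question refers to, assuming a local ClickHouse instance and a table named events (both hypothetical). withMaxInsertBlockSize bounds the batch size, but Beam itself gives no hard one-insert-per-second guarantee; a keyed GroupIntoBatches stage upstream is the usual way to further throttle inserts:

    import org.apache.beam.sdk.io.clickhouse.ClickHouseIO;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.Row;

    public class ClickHouseWriteSketch {
      static void write(PCollection<Row> events) {
        // Larger blocks mean fewer INSERT statements overall; the sink
        // still runs in parallel across workers.
        events.apply(ClickHouseIO.<Row>write(
            "jdbc:clickhouse://localhost:8123/default", // hypothetical URL
            "events")                                   // hypothetical table
            .withMaxInsertBlockSize(100_000));
      }
    }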
0 votes, 1 answer

Google Cloud Dataflow MySQL I/O connector using Python

What is an efficient way to insert streaming records into MySQL from Google Dataflow using Python? Is there an I/O connector, as in the case of BigQuery? I see that BigQuery has beam.io.WriteToBigQuery. How can we use a similar I/O connector in…
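At the time of these questions the built-in JDBC sink lived in the Java SDK; a hedged Java sketch using JdbcIO, where the driver class, host, credentials, and table are all hypothetical:

    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    public class MySqlWriteSketch {
      static void write(PCollection<KV<Integer, String>> rows) {
        rows.apply(JdbcIO.<KV<Integer, String>>write()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "com.mysql.cj.jdbc.Driver",        // MySQL Connector/J driver
                "jdbc:mysql://10.0.0.5:3306/mydb") // hypothetical host/db
                .withUsername("user")
                .withPassword("secret"))
            .withStatement("INSERT INTO events (id, payload) VALUES (?, ?)")
            .withPreparedStatementSetter((element, stmt) -> {
              // Bind one element of the PCollection to the statement.
              stmt.setInt(1, element.getKey());
              stmt.setString(2, element.getValue());
            }));
      }
    }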
0 votes, 0 answers

Apache Beam: How to read from HDFS with delegation token

hdfs_options = { "hdfs_host": "...", "hdfs_port": 50070, "hdfs_user": "..." } opts = PipelineOptions(**hdfs_options) token = run_shell_cmd('curl -s --negotiate -u : "http://nn:50070/webhdfs/v1/?op=GETDELEGATIONTOKEN"') p =…
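The question uses the Python SDK; for comparison, a minimal Java sketch of pointing Beam at HDFS via HadoopFileSystemOptions. The namenode address is hypothetical, and any Kerberos or delegation-token properties would be set on the same Hadoop Configuration:

    import java.util.Collections;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.hadoop.conf.Configuration;

    public class HdfsReadSketch {
      public static void main(String[] args) {
        HadoopFileSystemOptions options =
            PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical namenode
        // Security-related properties (Kerberos, tokens) go here as well.
        options.setHdfsConfiguration(Collections.singletonList(conf));

        Pipeline p = Pipeline.create(options);
        p.apply(TextIO.read().from("hdfs://namenode:8020/data/input*"));
        p.run();
      }
    }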
0 votes, 1 answer

An Apache Beam source that uses time as an input

I'm looking to create a Beam input that fires every second and just outputs the current time. I know I can create a PCollection from numbers like this: p.apply(Create.of(1, 2, 3, 4, 5)).setCoder(VarIntCoder.of()) and I could…
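GenerateSequence covers exactly this case; a minimal sketch, assuming each tick should carry the wall-clock time at which it is processed:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.GenerateSequence;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptors;
    import org.joda.time.Duration;
    import org.joda.time.Instant;

    public class TickSourceSketch {
      static void build(Pipeline p) {
        // Emits one Long per second as an unbounded source; each tick is
        // then mapped to the current processing time.
        p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)))
         .apply(MapElements.into(TypeDescriptors.strings())
             .via(n -> Instant.now().toString()));
      }
    }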
0 votes, 1 answer

How to reserve a public (static) IP for a Google Dataflow job, so that I can whitelist the IP in the source application?

I want to extract data from an on-premise SQL Server database using a Google Dataflow job, so I wanted to whitelist the Dataflow VMs' IP. To whitelist the IP at the on-premise SQL Server, I need a static IP. Please let me know if more details are required.
0 votes, 2 answers

Is there any way I can limit records while performing TextIO?

I have a use case where I'm reading billions of records, but I need to limit them to see the data's behaviour. I have a ParDo where I'm analysing the limited data and performing some functionality based on that. But I'm reading the entire…
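TextIO has no built-in row limit, but Sample.any(n) caps the collection before downstream ParDos run; a sketch with a hypothetical file pattern. Note the full file set is still read, only downstream processing is limited:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Sample;
    import org.apache.beam.sdk.values.PCollection;

    public class LimitedReadSketch {
      static PCollection<String> readSample(Pipeline p) {
        // Sample.any(n) passes through at most n elements, chosen arbitrarily.
        return p.apply(TextIO.read().from("gs://my-bucket/input*.txt")) // hypothetical
                .apply(Sample.any(10_000));
      }
    }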
0 votes, 1 answer

Using `add_value_provider_argument` with apache beam io functions creates "RuntimeValueProviderError"

I'm trying to create a dataflow template which takes the input parameter as a RuntimeValue. Following the example from the docs import re import apache_beam as beam from apache_beam.io import ReadFromText from apache_beam.io import…
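The question concerns the Python SDK's add_value_provider_argument, but the underlying rule is the same in Java: a runtime template parameter must stay wrapped in a ValueProvider and be handed to an IO that accepts one, since calling .get() at construction time raises this kind of runtime-value error. A hedged Java sketch of the pattern:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.options.ValueProvider;

    public class TemplateOptionsSketch {
      public interface MyOptions extends PipelineOptions {
        @Description("Input file pattern, resolved at template run time")
        ValueProvider<String> getInputFile();
        void setInputFile(ValueProvider<String> value);
      }

      public static void main(String[] args) {
        MyOptions options = PipelineOptionsFactory.fromArgs(args).as(MyOptions.class);
        Pipeline p = Pipeline.create(options);
        // TextIO.read().from(ValueProvider<String>) defers resolution to runtime.
        p.apply(TextIO.read().from(options.getInputFile()));
        p.run();
      }
    }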
0 votes, 1 answer

Can Apache Beam detect the schema (column names) of a Parquet file like Spark and Pandas?

I am new to Apache Beam and I came from the Spark world, where the API is very rich. How can I get the schema of a Parquet file using Apache Beam without loading the data into memory? It can sometimes be huge, and I am interested only in knowing the…
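Beam's ParquetIO expects the schema up front rather than inferring it; one approach is to read just the file footer with the parquet-mr API outside the pipeline. A sketch, with a hypothetical path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.util.HadoopInputFile;
    import org.apache.parquet.schema.MessageType;

    public class ParquetSchemaSketch {
      public static void main(String[] args) throws Exception {
        // Reads only the footer metadata, not the row groups,
        // so memory use stays small regardless of file size.
        try (ParquetFileReader reader = ParquetFileReader.open(
            HadoopInputFile.fromPath(new Path("/data/events.parquet"), // hypothetical
                                     new Configuration()))) {
          MessageType schema = reader.getFooter().getFileMetaData().getSchema();
          System.out.println(schema);
        }
      }
    }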
0 votes, 0 answers

Increasing workers causes Dataflow job to hang on TextIO.Write - executes quickly using DirectRunner - Apache Beam

This program ingests records from a file, parses and saves the records to the database, and writes failure records to a Cloud Storage bucket. The test file I'm using only creates 3 failure records; when run locally, the final step…
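A hedged guess at the symptom: with many workers and a tiny failure collection, the default sharding of TextIO.write() can behave poorly on Dataflow, and pinning the shard count is a common mitigation (output prefix hypothetical):

    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.values.PCollection;

    public class FailureSinkSketch {
      static void writeFailures(PCollection<String> failures) {
        // A fixed shard count avoids fanning a 3-element collection
        // out across every worker at write time.
        failures.apply(TextIO.write()
            .to("gs://my-bucket/failures/part") // hypothetical prefix
            .withNumShards(1));
      }
    }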
0 votes, 1 answer

Is it possible to connect Cloud Dataflow to Compute Engine in a different region?

We have Kafka deployed on Compute Engine in the asia-southeast1 region, and we need to do streaming processing in Apache Beam (Cloud Dataflow). Based on my research, the only way to connect to it is via a VPC network, but unfortunately Dataflow…
0 votes, 1 answer

Apache Beam not removing an invalid element from a subscription

Just realized that my pipeline is wrong when it comes to erroneous events: they keep being processed and are never removed from the subscription. Basically I have a simple pipeline which contains a trigger that would pull those events out in a…
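The usual fix is a dead-letter pattern: catch the failure inside the DoFn and route the bad element to a side output, so every element is emitted and therefore acknowledged instead of being redelivered forever. A sketch; parse() is a hypothetical stand-in for the real processing:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionTuple;
    import org.apache.beam.sdk.values.TupleTag;
    import org.apache.beam.sdk.values.TupleTagList;

    public class DeadLetterSketch {
      static final TupleTag<String> OK = new TupleTag<String>() {};
      static final TupleTag<String> FAILED = new TupleTag<String>() {};

      static PCollectionTuple process(PCollection<String> messages) {
        return messages.apply(ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void process(@Element String msg, MultiOutputReceiver out) {
            try {
              out.get(OK).output(parse(msg));
            } catch (Exception e) {
              // Routed to a dead-letter sink downstream; the element
              // is still acked on the subscription.
              out.get(FAILED).output(msg);
            }
          }
        }).withOutputTags(OK, TupleTagList.of(FAILED)));
      }

      static String parse(String msg) { return msg.trim(); } // placeholder logic
    }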
0 votes, 1 answer

Java Apache Beam PCollections and how to make them work?

First of all, let me describe the scenario. Step 1: I have to read from a file, line by line. The file is a .json and each line has the following format: { "schema":{Several keys that are to be…
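A minimal sketch of the read-and-parse step, assuming one JSON document per line and using Gson for parsing (the path and field handling are hypothetical):

    import com.google.gson.JsonObject;
    import com.google.gson.JsonParser;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class JsonLinesSketch {
      static PCollection<String> readSchemas(Pipeline p) {
        // TextIO yields one element per line; each line is parsed
        // independently and the "schema" object is extracted.
        return p.apply(TextIO.read().from("/data/input.json")) // hypothetical path
            .apply(MapElements.into(TypeDescriptors.strings())
                .via(line -> {
                  JsonObject obj = JsonParser.parseString(line).getAsJsonObject();
                  return obj.get("schema").toString();
                }));
      }
    }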
0 votes, 1 answer

Google Dataflow batch file processing poor performance

I'm trying to build a pipeline using Apache Beam 2.16.0 for processing a large amount of XML files. The average count is seventy million per 24 hrs, and at peak load it can go up to half a billion. File sizes vary from ~1 KB to 200 KB (sometimes it can…
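For millions of small files, FileIO.match()/readMatches() hands whole files to a DoFn and tends to scale better than per-file reads; a sketch with a hypothetical file pattern:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    public class SmallFilesSketch {
      static void build(Pipeline p) {
        p.apply(FileIO.match().filepattern("gs://my-bucket/xml/*.xml")) // hypothetical
         .apply(FileIO.readMatches())
         .apply(ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
           @ProcessElement
           public void process(@Element FileIO.ReadableFile file,
                               OutputReceiver<String> out) throws Exception {
             // Small files can be read whole; XML parsing happens downstream.
             out.output(file.readFullyAsUTF8String());
           }
         }));
      }
    }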
0 votes, 0 answers

NullPointerException when using PubsubIO with Spark Runner in Apache Beam Pipeline

I have a very small illustrative Apache Beam pipeline that I'm trying to run with SparkRunner. Below is the pipeline code public class SparkMain { public static void main(String[] args) { PipelineOptions options =…
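A hedged guess at one common cause: running an unbounded PubsubIO source without streaming mode enabled, which on some runners surfaces as an opaque NullPointerException rather than a clear error. A sketch of setting the option explicitly:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.options.StreamingOptions;

    public class SparkMainSketch {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        // Unbounded sources such as PubsubIO require streaming mode.
        options.as(StreamingOptions.class).setStreaming(true);
        Pipeline p = Pipeline.create(options);
        // ... PubsubIO read and the rest of the pipeline go here.
        p.run().waitUntilFinish();
      }
    }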
0 votes, 1 answer

Cloud Dataflow to Cloud SQL: Dataflow runner giving NullPointerException

I'm trying to process a considerable number of records using Cloud Dataflow. My source is Google Cloud Storage and my sink is Cloud SQL (MySQL). I have the following code to write to the sink (Cloud…