Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.


539 questions
2 votes, 1 answer

Better approach to call an external API in Apache Beam

I have 2 approaches to initializing the HttpClient in order to make an API call from a ParDo in Apache Beam. Approach 1: initialize the HttpClient object in @StartBundle and close the HttpClient in @FinishBundle. The code is as follows: public class…
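
A minimal sketch of a third option, assuming Java 11's java.net.http.HttpClient (the class name CallApiFn and the String element type are placeholders): because the client is thread-safe, it can be created once per DoFn instance in @Setup and reused across bundles.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import org.apache.beam.sdk.transforms.DoFn;

    public class CallApiFn extends DoFn<String, String> {
      // transient: the client is rebuilt on each worker, never serialized with the DoFn
      private transient HttpClient client;

      @Setup
      public void setup() {
        client = HttpClient.newHttpClient();
      }

      @ProcessElement
      public void processElement(@Element String url, OutputReceiver<String> out) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        out.output(response.body());
      }
    }

@StartBundle/@FinishBundle would only be needed if the client held per-bundle state or required explicit closing.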
2 votes, 0 answers

BigQueryIO write fails to add new fields even though ALLOW_FIELD_ADDITION is set

I'm using Apache Beam's BigQueryIO to load into BigQuery, but the load job fails with the error: "message": "Error while reading data, error message: JSON parsing error in row starting at position 0: No such field: Field_name.", The below is the…
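
For reference, a hedged sketch of how the schema update option is usually wired (the table name is a placeholder; rows and updatedSchema stand in for the asker's collection and schema). Note that ALLOW_FIELD_ADDITION applies to load jobs with WRITE_APPEND or WRITE_TRUNCATE, and the schema passed to withSchema must itself already contain the new field:

    import java.util.EnumSet;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.SchemaUpdateOption;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;

    rows.apply(BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")      // placeholder table spec
        .withSchema(updatedSchema)                 // must already include the new field
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND)
        .withSchemaUpdateOptions(EnumSet.of(SchemaUpdateOption.ALLOW_FIELD_ADDITION)));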
2 votes, 2 answers

I am trying to write to Amazon S3 using assumeRole via FileIO with ParquetIO

Step 1: AssumeRole public static AWSCredentialsProvider getCredentials() { if (roleARN.length() > 0) { STSAssumeRoleSessionCredentialsProvider credentialsProvider = new STSAssumeRoleSessionCredentialsProvider …
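
A hedged sketch of wiring assumed-role credentials into the pipeline options, assuming Beam's AWS SDK v1 connector (org.apache.beam.sdk.io.aws); the session name is arbitrary and roleARN comes from the question:

    import com.amazonaws.auth.STSAssumeRoleSessionCredentialsProvider;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.aws.options.AwsOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    AwsOptions options = PipelineOptionsFactory.fromArgs(args).as(AwsOptions.class);
    options.setAwsCredentialsProvider(
        new STSAssumeRoleSessionCredentialsProvider.Builder(roleARN, "beam-session").build());
    Pipeline p = Pipeline.create(options);
    // A FileIO.write().via(ParquetIO.sink(schema)).to("s3://my-bucket/out/") step then
    // resolves the assumed-role credentials through the S3 filesystem.

One caveat: Beam serializes the credentials provider into the pipeline options, and only provider types supported by its AwsModule survive the round trip, so treat this as a sketch rather than a guarantee.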
2 votes, 2 answers

Reading an XML file in Apache Beam using XmlIO

Problem statement: I am trying to read and print the contents of an XML file in Beam using the direct runner. Here is the code snippet: public class BookStore { public static void main(String[] args) { BookOptions options =…
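
A hedged sketch of the usual XmlIO setup (the file name, element names, and the Book class are placeholders taken from the question's domain); the record class must be JAXB-annotated:

    import org.apache.beam.sdk.io.xml.XmlIO;
    import org.apache.beam.sdk.values.PCollection;

    // Book must carry JAXB annotations, e.g. @XmlRootElement(name = "book")
    PCollection<Book> books = p.apply(XmlIO.<Book>read()
        .from("books.xml")
        .withRootElement("bookstore")   // outermost element of the file
        .withRecordElement("book")      // element mapped to one record
        .withRecordClass(Book.class));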
2 votes, 1 answer

How to run Apache Beam in batches on bounded data?

I am trying to understand how Apache Beam works and I'm not quite sure if I do. So, I want someone to tell me if my understanding is right: Beam is a layer of abstraction over big data frameworks like Spark, Hadoop, Google Dataflow, etc. Now quite…
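
For the batching part of the question, a minimal Java sketch using Beam's GroupIntoBatches (the key type, batch size, and upstream collection `keyed` are arbitrary placeholders); it buffers up to a fixed number of values per key, on bounded or unbounded input:

    import org.apache.beam.sdk.transforms.GroupIntoBatches;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // keyed: PCollection<KV<String, String>> produced upstream, e.g. by WithKeys
    PCollection<KV<String, Iterable<String>>> batched =
        keyed.apply(GroupIntoBatches.ofSize(100));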
2 votes, 1 answer

Is it possible to use Apache Beam JdbcIO over SSH tunneling?

I need to fetch data from a MySQL server through SSH tunneling. I am using Apache Beam 2.19.0 Java JdbcIO on Google Dataflow to connect to the database. But as the database is inside a private network, I need to reach the database through one in…
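
A hedged sketch of the JdbcIO side, assuming a tunnel already forwards localhost:3307 to the private MySQL host (credentials and query are placeholders); the open problem on Dataflow is establishing that tunnel on every worker, e.g. from a @Setup method or a JvmInitializer:

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.values.PCollection;

    PCollection<String> names = p.apply(JdbcIO.<String>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "com.mysql.cj.jdbc.Driver",
                "jdbc:mysql://localhost:3307/mydb")   // port forwarded by the SSH tunnel
            .withUsername("user")
            .withPassword("secret"))
        .withQuery("SELECT name FROM customers")
        .withRowMapper(rs -> rs.getString(1))
        .withCoder(StringUtf8Coder.of()));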
2 votes, 2 answers

Apache Beam CloudBigtableIO read/write error handling

We have a Java-based Dataflow pipeline which reads from Bigtable and, after some processing, writes data back to Bigtable. We use CloudBigtableIO for these purposes. I am trying to wrap my head around failure handling in CloudBigtableIO. I haven't found any…
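
For context, a minimal sketch of a CloudBigtableIO write (project, instance, and table IDs are placeholders); as far as its public API goes, the connector exposes no dead-letter output, so write failures surface as exceptions that fail the bundle and are retried by the runner:

    import com.google.cloud.bigtable.beam.CloudBigtableIO;
    import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration;

    CloudBigtableTableConfiguration config = new CloudBigtableTableConfiguration.Builder()
        .withProjectId("my-project")
        .withInstanceId("my-instance")
        .withTableId("my-table")
        .build();
    // mutations: a PCollection<Mutation> (HBase API) built upstream
    mutations.apply(CloudBigtableIO.writeToTable(config));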
2 votes, 1 answer

Facing a performance issue while reading files from GCS using Apache Beam

I was trying to read data using a wildcard from a GCS path. My files are in bzip2 format and there are around 300k files residing in the GCS path matching the same wildcard expression. I'm using the below code snippet to read the files. PCollection val =…
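
A hedged sketch of one commonly suggested tweak (the bucket and pattern are placeholders): withHintMatchesManyFiles() tells TextIO the glob matches very many files, shifting expansion and reading into a more scalable code path:

    import org.apache.beam.sdk.io.Compression;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.values.PCollection;

    PCollection<String> lines = p.apply(TextIO.read()
        .from("gs://my-bucket/path/part-*.bz2")   // placeholder wildcard
        .withCompression(Compression.BZIP2)
        .withHintMatchesManyFiles());             // optimize for ~300k matched files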
2 votes, 1 answer

What to return from an Apache Beam PCollection to write to BigQuery

I am reading the Beam documentation and some Stack Overflow questions/answers in order to understand how I would write a Pub/Sub message to BigQuery. As of now, I have a working example of getting protobuf messages and am able to decode them. The code looks…
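
One common pattern, sketched under assumptions (decoded, schema, and the getId/getPayload getters are hypothetical stand-ins for the asker's decoded protobuf type): map each message to a TableRow and hand the PCollection<TableRow> to BigQueryIO.writeTableRows():

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptor;

    // decoded: PCollection<MyProto> from the existing decoding step
    PCollection<TableRow> rows = decoded.apply(MapElements
        .into(TypeDescriptor.of(TableRow.class))
        .via(msg -> new TableRow()
            .set("id", msg.getId())            // hypothetical getters
            .set("payload", msg.getPayload())));
    // If coder inference complains, add rows.setCoder(TableRowJsonCoder.of())
    rows.apply(BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")  // placeholder table
        .withSchema(schema));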
2 votes, 1 answer

CassandraIO: Inserting a date field doesn't work

Following is a sample definition of the Cassandra table I have: CREATE TABLE IF NOT EXISTS test_table ( id int, data_date date, score double, PRIMARY KEY (id) ); I have created a TestTable class which extends Serializable and the…
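
One hedged observation for this kind of failure with CassandraIO's DataStax mapper: a CQL `date` column maps to the driver's LocalDate, not java.util.Date. A minimal sketch of the mapped class (keyspace name and field visibility are placeholders):

    import com.datastax.driver.core.LocalDate;
    import com.datastax.driver.mapping.annotations.Column;
    import com.datastax.driver.mapping.annotations.PartitionKey;
    import com.datastax.driver.mapping.annotations.Table;
    import java.io.Serializable;

    @Table(keyspace = "my_keyspace", name = "test_table")
    public class TestTable implements Serializable {
      @PartitionKey @Column(name = "id") private int id;
      // CQL `date` expects com.datastax.driver.core.LocalDate
      @Column(name = "data_date") private LocalDate dataDate;
      @Column(name = "score") private double score;
      // getters/setters omitted
    }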
2 votes, 1 answer

Move files to another GCS folder and perform actions after an Apache Beam pipeline has been executed

I created a streaming Apache Beam pipeline that reads files from GCS folders and inserts them into BigQuery. It works perfectly, but it re-processes all the files when I stop and rerun the job, so all the data will be replicated again. So my idea is to move…
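
A hedged sketch of the move itself using Beam's FileSystems API (bucket and folder names are placeholders); on GCS, rename is implemented as copy plus delete:

    import java.util.Collections;
    import org.apache.beam.sdk.io.FileSystems;
    import org.apache.beam.sdk.io.fs.ResourceId;

    ResourceId src = FileSystems.matchNewResource("gs://my-bucket/incoming/file.csv", false);
    ResourceId dst = FileSystems.matchNewResource("gs://my-bucket/processed/file.csv", false);
    // rename(...) throws IOException; on GCS it performs copy + delete
    FileSystems.rename(Collections.singletonList(src), Collections.singletonList(dst));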
2 votes, 2 answers

HttpForbiddenError when trying to access Google Cloud Storage from Apache Beam

I'm trying simple access to Google Cloud Storage with Apache Beam from a Compute Engine VM. Sure, I've set up application-default login with the command gcloud auth application-default login and added access to the storage for the Compute Engine service…
2 votes, 1 answer

Does the Apache Beam Python SDK discard late data, or is it just impossible to configure lateness params?

My use case is that I'm trying to aggregate data from a Google PubSub subscription using the Apache Beam Python SDK with 1-hour windows. I've configured my pipeline windowing like so: beam.WindowInto( window.FixedWindows(60 * 60, 0), …
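
For comparison, allowed lateness and late firings in the Java SDK are configured on the Window transform; a minimal sketch (input, durations, and trigger choices are placeholders, shown only to contrast with the Python excerpt above):

    import org.apache.beam.sdk.transforms.windowing.AfterPane;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    PCollection<String> windowed = input.apply(
        Window.<String>into(FixedWindows.of(Duration.standardHours(1)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withLateFirings(AfterPane.elementCountAtLeast(1)))  // emit each late element
            .withAllowedLateness(Duration.standardMinutes(30))
            .accumulatingFiredPanes());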
2 votes, 1 answer

How to combine data in a PCollection - Apache Beam

I am looking to combine data in a PCollection. The input is a CSV file: customer id,customer name,transaction amount,transaction type cust123,ravi,100,D cust123,ravi,200,D cust234,Srini,200,C cust444,shaker,500,D cust123,ravi,100,C
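
A hedged sketch of one way to combine these rows, assuming the goal is a per-customer total (lines is a placeholder PCollection of CSV rows; treating credits as negative is one possible convention, not given in the question):

    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    // lines: PCollection<String> of CSV rows, header removed upstream
    PCollection<KV<String, Integer>> totals = lines
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
            .via(line -> {
              String[] f = line.split(",");
              int amount = Integer.parseInt(f[2]);
              // credits (C) subtract, debits (D) add -- an assumed convention
              return KV.of(f[0], "C".equals(f[3]) ? -amount : amount);
            }))
        .apply(Sum.integersPerKey());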
2 votes, 1 answer

Is there an Apache Beam + Cloud Bigtable connector in Golang?

Is there a way to access data stored in Cloud Bigtable as the input source for running Apache Beam pipelines?