Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions about reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.

539 questions
0 votes · 1 answer

Why is CustomOptions in Apache Beam not inheriting DataflowPipelineOptions default properties?

I am new to Apache Beam and trying to run a sample read-and-write program using the DirectRunner and DataflowRunner. In my use case there are a few CLI args, and to achieve this I created an interface "CustomOptions.java" which extends…
0 votes · 1 answer

Read Parquet file using Apache Beam Java SDK without providing schema

It seems that the org.apache.beam.sdk.io.parquet.ParquetIO.readFiles method requires a schema to be passed in. Is there a way to avoid passing the schema? Isn't the schema included in the Parquet file itself? What if I am trying to read…
3thanZ · 133 · 1 · 1 · 4
0 votes · 1 answer

Apache Beam Python SDK - Read GZIP compressed Parquet file from GCS

I would like to read a GZIP-compressed Parquet file from GCS into BigQuery using the Python SDK for Apache Beam. However, the apache_beam.io.parquetio.ReadFromParquet method doesn't seem to support reading from compressed files. According to the…
3thanZ · 133 · 1 · 1 · 4
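One detail worth knowing here: Parquet normally carries compression *inside* the file (each column chunk has its own codec, e.g. SNAPPY or GZIP), so an externally gzipped `.parquet.gz` blob must be decompressed before any Parquet reader sees it. A minimal sketch of that staging step, using only the standard library (the payload bytes below are a placeholder, not a real Parquet file):

```python
import gzip
import io

def decompress_gzip_bytes(data: bytes) -> bytes:
    """Decompress an externally gzip-compressed blob (e.g. file.parquet.gz)
    so a Parquet reader can then parse the raw Parquet bytes inside."""
    with gzip.GzipFile(fileobj=io.BytesIO(data)) as f:
        return f.read()

# Round-trip demonstration with stand-in payload bytes:
payload = b"PAR1...parquet bytes...PAR1"  # placeholder, not a real file
compressed = gzip.compress(payload)
assert decompress_gzip_bytes(compressed) == payload
```

In a pipeline this would run as a pre-processing step (or inside a DoFn that reads the raw bytes) before handing the decompressed bytes to a Parquet reader.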
0 votes · 2 answers

Apache Beam: executing SQL on Kafka data fails with "Cannot call getSchema when there is no schema"

I feed data from multiple tables into Kafka, and Beam executes SQL after receiving the data, but I now get the following error: Exception in thread "main" java.lang.IllegalStateException: Cannot call getSchema when there is no schema …
smarctor · 3 · 1
0 votes · 1 answer

Japanese characters getting corrupted when running Apache Beam on Google Dataflow

I am running an Apache Beam pipeline on Google Dataflow. It reads data from a GCS bucket and, after processing, writes back to a GCS bucket. The pipeline processes Japanese data. In the Stackdriver log, Japanese characters display properly, but when I see the…
Aditya · 207 · 2 · 13
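Corruption like this is typically an encoding mismatch: bytes written or read with a platform default charset instead of UTF-8. A small sketch of the symptom (the mismatched charset here is an assumption for illustration):

```python
text = "こんにちは、世界"  # Japanese sample text

# An explicit UTF-8 round-trip preserves the characters:
encoded = text.encode("utf-8")
assert encoded.decode("utf-8") == text

# Decoding the same bytes with a mismatched charset corrupts them
# (mojibake) -- the usual symptom when a default charset sneaks in:
garbled = encoded.decode("latin-1")
assert garbled != text
```

The practical takeaway is to pin the encoding to UTF-8 at every read and write boundary rather than relying on environment defaults.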
0 votes · 1 answer

Splittable DoFn causing "Shuffle key too large" problem

I am trying to implement a ListFlatten function. I implemented it with a simple DoFn, which works fine, but to parallelize it I am converting it to a Splittable DoFn. I managed to get a unit test running locally with 5000…
0 votes · 1 answer

Use pipeline data to query BigQuery with apache_beam

I want to use data flowing through my pipeline to generate a query and execute it on BigQuery. Let's say I have a Python SQL template like this: template = ''' SELECT email FROM `project_id.dataset_id.table_id` WHERE email =…
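A common pattern is to build the query text per element inside a DoFn and hand it to a BigQuery client. The query construction itself is plain Python; a hedged sketch (project/dataset/table names below are hypothetical), which formats in the table identifiers — these cannot be bound as query parameters — but leaves the value as a named parameter for the client to bind, avoiding SQL injection:

```python
TEMPLATE = (
    "SELECT email FROM `{project}.{dataset}.{table}` "
    "WHERE email = @email"
)

def build_query(project: str, dataset: str, table: str) -> str:
    # Table identifiers are formatted in; the email value stays a named
    # parameter (@email) to be bound by the BigQuery client at run time.
    return TEMPLATE.format(project=project, dataset=dataset, table=table)

query = build_query("my-project", "my_dataset", "users")
# -> SELECT email FROM `my-project.my_dataset.users` WHERE email = @email
```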
0 votes · 2 answers

Ordering Google Cloud Pub/Sub messages - Java sample program

I'm trying to write a simple consumer Java program which consumes messages from Google Cloud Pub/Sub and performs de-duplication and ordering of the messages. I could not find a simple sample program that does this. I've read the Google documentation and they…
Eliyahu Machluf · 1,251 · 8 · 17
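The core de-duplication and ordering logic is independent of Pub/Sub itself: drop repeats by message ID, then order by a sequence attribute. Sketched here in Python for brevity (the question asks for Java, and the message shape with `id`/`seq` keys is an assumption); in Pub/Sub proper this maps onto ordering keys and exactly-once delivery:

```python
def dedupe_and_order(messages):
    """messages: iterable of dicts with 'id' and 'seq' keys (hypothetical
    shape). Drops duplicate ids, then sorts by the sequence number."""
    seen = set()
    unique = []
    for msg in messages:
        if msg["id"] in seen:
            continue  # already processed this message id
        seen.add(msg["id"])
        unique.append(msg)
    return sorted(unique, key=lambda m: m["seq"])

msgs = [{"id": "a", "seq": 2}, {"id": "b", "seq": 1}, {"id": "a", "seq": 2}]
# dedupe_and_order(msgs) -> [{'id': 'b', 'seq': 1}, {'id': 'a', 'seq': 2}]
```

In a real consumer the `seen` set would need to be bounded (e.g. expired by time window), since message IDs accumulate indefinitely otherwise.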
0 votes · 1 answer

How to pass side inputs/extra input to a JdbcIO RowMapper in Java

I am trying to read a Cloud SQL table in Java Beam using JdbcIO.Read. I want to convert each row in the ResultSet into a GenericData.Record using the .withRowMapper(ResultSet resultSet) method. Is there a way I can pass a JSON schema String as input in…
Onkar · 297 · 5 · 9
0 votes · 1 answer

NotImplementedError in Apache Beam Python

I'm writing JSON to GCS using Apache Beam, but encountered the following error: NotImplementedError: offset: 0, whence: 0, position: 50547, last: 50547 [while running 'Writing new data to gcs/write data…
bigbounty · 16,526 · 5 · 37 · 65
0 votes · 1 answer

Apache Beam with Redis - select database and read from hash?

I am starting out with Apache Beam, and I would like to read from a hash that I have stored in Redis; I will also need to select the database (by number). I looked at the source of RedisIO, but it does not seem to include the ability to do…
Steve Storck · 793 · 6 · 25
0 votes · 3 answers

Delete a BigQuery table using Apache Beam Java

Is it possible to delete a table available in BigQuery using Apache Beam in Java? p.apply("Delete Table name", BigQueryIO.readTableRows().fromQuery("DELETE FROM Table_name where condition"));
0 votes · 1 answer

How do I use the output of beam.io.ReadFromText as an input in my Python class?

I am reading one JSON file in a Dataflow pipeline using beam.io.ReadFromText. When I pass its output to any of my classes (ParDo), it becomes an element. I want to use this JSON file's content in my class; how do I do this? Content in JSON…
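Part of the friction here is that ReadFromText emits one element per line, so a multi-line JSON document arrives fragmented. A common workaround is to read the whole file content at once and parse it with `json.loads`, then pass the resulting dict to the class (in Beam, typically as a side input). The parsing step itself, sketched in plain Python (the config shape below is hypothetical):

```python
import json

def parse_config(raw_text: str) -> dict:
    """Parse the whole JSON document at once. With ReadFromText the file
    arrives line by line, so either keep the JSON on a single line or
    read the full file contents before calling json.loads."""
    return json.loads(raw_text)

raw = '{"name": "pipeline-config", "threshold": 10}'
config = parse_config(raw)
assert config["threshold"] == 10
```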
0 votes · 1 answer

Beam: apply a PTransform to values while preserving the key

I seem to be struggling with this pattern in Beam. This is a streaming pipeline. At a high level: a message comes in via Rabbit; the message contents include an ID and N S3 file paths. I want to produce some aggregation across all S3 files listed, but the…
drobert · 1,230 · 8 · 21
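The usual shape of this pattern is to transform the value while carrying the key through unchanged — in Beam, e.g. `beam.MapTuple(lambda k, v: (k, fn(v)))` or a DoFn that emits `(key, result)`. A minimal plain-Python sketch of the key-preserving mapping (the message data below is made up):

```python
def map_values(pairs, fn):
    """Apply fn to each value while carrying the key through unchanged --
    the same shape as Beam's MapTuple(lambda k, v: (k, fn(v)))."""
    return [(k, fn(v)) for k, v in pairs]

pairs = [("msg-1", ["s3://a", "s3://b"]), ("msg-2", ["s3://c"])]
# Aggregate per key, e.g. count the S3 paths listed in each message:
counts = map_values(pairs, len)
assert counts == [("msg-1", 2), ("msg-2", 1)]
```

When the per-value work is itself a multi-step PTransform rather than a single function, grouping by the key afterwards (GroupByKey/CombinePerKey) is the usual way to reassociate results.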
0 votes · 2 answers

Apache Beam Java SDK SparkRunner write to parquet error

I'm using Apache Beam with Java. I'm trying to read a CSV file and write it to Parquet format using the SparkRunner on a pre-deployed Spark env, using local mode. Everything worked fine with the DirectRunner, but the SparkRunner simply won't work. I'm…
ivanm · 138 · 1 · 8