Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.

539 questions
0 votes · 1 answer

How to create Bigtable tables and column families in a Dataflow job if they don't exist

I have a Cloud Dataflow job which writes everything to a single table and a single column family. How can I modify this job to write to multiple tables and column families which may or may not exist? E.g., if a table or column family doesn't exist,…
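
One possible approach, sketched below: create the missing table or column family with the Cloud Bigtable admin client before (or alongside) the BigtableIO write. The project, instance, table and family names are placeholders.

    import com.google.cloud.bigtable.admin.v2.BigtableTableAdminClient;
    import com.google.cloud.bigtable.admin.v2.models.CreateTableRequest;
    import com.google.cloud.bigtable.admin.v2.models.ModifyColumnFamiliesRequest;

    // Run once before the pipeline (e.g. in main()) or lazily from a DoFn @Setup.
    try (BigtableTableAdminClient admin =
            BigtableTableAdminClient.create("my-project", "my-instance")) {
      if (!admin.exists("my-table")) {
        // Table is missing: create it together with the column family.
        admin.createTable(CreateTableRequest.of("my-table").addFamily("my-family"));
      } else if (admin.getTable("my-table").getColumnFamilies().stream()
          .noneMatch(cf -> cf.getId().equals("my-family"))) {
        // Table exists but the column family does not: add the family.
        admin.modifyFamilies(
            ModifyColumnFamiliesRequest.of("my-table").addFamily("my-family"));
      }
    }
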
0 votes · 0 answers

Connect Beam JDBC IO with Cassandra

Unable to connect to Cassandra using the JDBC driver; getting the error java.sql.SQLException: Cannot create PoolableConnectionFactory (isValid() returned false). Apache Beam JdbcIO is not working with Cassandra. I tried with a working cassandra-jdbc-1.2.5.jar. Here is my…
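
For comparison, a minimal JdbcIO read sketch; the driver class name and JDBC URL below are assumptions for a generic Cassandra JDBC driver, so substitute whatever the jar on the classpath actually provides. Beam also ships a native CassandraIO connector, which does not go through a JDBC connection pool at all.

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.values.PCollection;

    // `pipeline` is an assumed Pipeline; driver class and URL are assumptions.
    PCollection<String> names = pipeline.apply(JdbcIO.<String>read()
        .withDataSourceConfiguration(
            JdbcIO.DataSourceConfiguration.create(
                "org.apache.cassandra.cql.jdbc.CassandraDriver",     // assumed driver class
                "jdbc:cassandra://cassandra-host:9042/my_keyspace")  // assumed URL format
                .withUsername("user")
                .withPassword("password"))
        .withQuery("SELECT name FROM my_table")
        .withRowMapper((JdbcIO.RowMapper<String>) rs -> rs.getString("name"))
        .withCoder(StringUtf8Coder.of()));
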
0 votes · 1 answer

Dataflow IO to BigTable [2.9.0]

I found this Bigtable with Dataflow example: https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/blob/master/java/dataflow-connector-examples/src/main/java/com/google/cloud/bigtable/dataflow/example/HelloWorldWrite.java However, it uses…
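
A rough Beam 2.9-style equivalent of that sample using BigtableIO rather than the old CloudBigtableIO classes; the ids, the column family, and the `input` PCollection<String> are placeholders/assumptions.

    import com.google.bigtable.v2.Mutation;
    import com.google.common.collect.ImmutableList;
    import com.google.protobuf.ByteString;
    import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.SimpleFunction;
    import org.apache.beam.sdk.values.KV;

    // Build one SetCell mutation per element and hand the KV<rowKey, mutations>
    // pairs to BigtableIO.write().
    input
        .apply(MapElements.via(
            new SimpleFunction<String, KV<ByteString, Iterable<Mutation>>>() {
              @Override
              public KV<ByteString, Iterable<Mutation>> apply(String value) {
                Mutation cell = Mutation.newBuilder()
                    .setSetCell(Mutation.SetCell.newBuilder()
                        .setFamilyName("cf1")
                        .setColumnQualifier(ByteString.copyFromUtf8("value"))
                        .setValue(ByteString.copyFromUtf8(value)))
                    .build();
                Iterable<Mutation> mutations = ImmutableList.of(cell);
                return KV.of(ByteString.copyFromUtf8(value), mutations);
              }
            }))
        .apply(BigtableIO.write()
            .withProjectId("my-project")
            .withInstanceId("my-instance")
            .withTableId("my-table"));
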
0 votes · 0 answers

Beam pipeline: Kafka to HDFS by time buckets

I am trying to build a very simple pipeline that reads a stream of events from Kafka (KafkaIO.read) and writes the very same events to HDFS, bucketing the events together by hour (the hour is read from a timestamp field of the event, not processing…
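
A sketch of the shape this usually takes: read from Kafka, window into hourly buckets, and use windowed file writes. The broker, topic and HDFS path are placeholders, and the step that re-assigns each element's timestamp from the event's own timestamp field is deliberately elided, since that part depends on the event format.

    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.transforms.Values;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.joda.time.Duration;

    // Requires the HDFS filesystem to be registered via HadoopFileSystemOptions.
    pipeline
        .apply(KafkaIO.<String, String>read()
            .withBootstrapServers("broker:9092")
            .withTopic("events")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())
        .apply(Values.<String>create())
        // Re-assign timestamps from the event payload here, then window by hour.
        .apply(Window.<String>into(FixedWindows.of(Duration.standardHours(1))))
        .apply(TextIO.write()
            .to("hdfs://namenode:8020/events/part")
            .withWindowedWrites()
            .withNumShards(4));
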
0 votes · 2 answers

Access File inside the dataflow pipeline

I want to download certain files to a temp location before the pipeline starts. The files are .mmdb files which are to be read in the ParDo function. The files are stored on Google Cloud Storage, but the method consuming the .mmdb files requires them to…
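
One way to sketch this: copy the object from GCS to a local temp file once per DoFn instance in @Setup, then hand the local path to the library that insists on a real file. The bucket and object name are placeholders.

    import java.nio.channels.Channels;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import org.apache.beam.sdk.io.FileSystems;
    import org.apache.beam.sdk.transforms.DoFn;

    class GeoLookupFn extends DoFn<String, String> {
      private transient Path localDb;

      @Setup
      public void setup() throws Exception {
        // Copy gs://my-bucket/geo.mmdb to a worker-local temp file.
        localDb = Files.createTempFile("geo", ".mmdb");
        Files.copy(
            Channels.newInputStream(FileSystems.open(
                FileSystems.matchNewResource("gs://my-bucket/geo.mmdb", false))),
            localDb,
            StandardCopyOption.REPLACE_EXISTING);
      }

      @ProcessElement
      public void process(ProcessContext c) {
        // Open localDb.toFile() with the .mmdb reader here.
        c.output(c.element());
      }
    }
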
0 votes · 0 answers

Fetching records based on batch size in apache beam

I have 100k records to be processed and I need to fetch 10k at a time, process them, and fetch another 10k until all 100k records are processed. I call this the batch size; it avoids the processing overhead of fetching all the records at…
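
If the records are already being read as a PCollection, Beam's GroupIntoBatches can do the 10k chunking; a minimal sketch, assuming `records` is a PCollection<String>:

    import org.apache.beam.sdk.transforms.GroupIntoBatches;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // GroupIntoBatches works on keyed input, so attach a key first. A single
    // constant key funnels every batch through one worker; use several keys
    // if the batches should be processed in parallel.
    PCollection<KV<String, Iterable<String>>> batches = records
        .apply(WithKeys.<String, String>of("all"))
        .apply(GroupIntoBatches.ofSize(10_000));
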
0 votes · 2 answers

Considering the total max records from the user and processing them based on the batch size in Apache Beam

I am trying to read records from the source based on a total max-record count to be processed, which is given by the user. E.g.: total records in the source table is 1 million; total max records to process is 100K. I need to process…
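
A sketch of one way to wire the user-supplied cap in: expose it as a pipeline option and push it into the source query, then batch the results as in the previous sketch. The option interface, driver, URL and query below are assumptions, not a fixed Beam API.

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptions;

    // Hypothetical options interface carrying the user's limit.
    public interface MyOptions extends PipelineOptions {
      long getMaxRecords();
      void setMaxRecords(long value);
    }

    // ... inside the pipeline-building method:
    MyOptions options = pipeline.getOptions().as(MyOptions.class);
    pipeline.apply(JdbcIO.<String>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "org.postgresql.Driver", "jdbc:postgresql://db-host/mydb"))
        // Let the database stop after the user-specified number of rows.
        .withQuery("SELECT name FROM records LIMIT " + options.getMaxRecords())
        .withRowMapper((JdbcIO.RowMapper<String>) rs -> rs.getString("name"))
        .withCoder(StringUtf8Coder.of()));
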
0 votes · 1 answer

Populate TextIO write with API call

My question revolves around kicking off an API call to get the file prefix for TextIO output. Here is what I have now (and…
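
For a non-templated job, the prefix is baked into the graph when the pipeline is constructed, so one option is simply to make the API call in main() first. fetchPrefixFromApi() below is a hypothetical helper, and `lines` an assumed PCollection<String>.

    import org.apache.beam.sdk.io.TextIO;

    // Call the external API at graph-construction time, then use the result.
    String prefix = fetchPrefixFromApi();  // hypothetical helper, e.g. returns "gs://my-bucket/out/run-42"
    lines.apply(TextIO.write().to(prefix).withSuffix(".txt"));
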
0 votes · 1 answer

I know that ORDER BY is not presently supported in BeamSQL; is there any workaround for it?

This is what is present: PCollection rec = rec_out.apply(BeamSql.query( "SELECT bnk_name,state_name,val from PCOLLECTION order by val desc limit 2")); but I need PCollection rec = rec_out.apply(BeamSql.query( "SELECT…
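
One workaround while BeamSQL lacks ORDER BY: run the query without the ordering and emulate "order by val desc limit 2" with Top.of, which returns the two largest elements under a comparator. This assumes the query result is a schema'd PCollection<Row> (here `rows`) with an integer field named val.

    import java.io.Serializable;
    import java.util.Comparator;
    import java.util.List;
    import org.apache.beam.sdk.transforms.Top;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.Row;

    // Comparator on the "val" field; Top.of(2, ...) keeps the two largest rows.
    class ByVal implements Comparator<Row>, Serializable {
      @Override
      public int compare(Row a, Row b) {
        return Integer.compare(a.getInt32("val"), b.getInt32("val"));
      }
    }

    PCollection<List<Row>> top2 = rows.apply(Top.of(2, new ByVal()));
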
0 votes · 1 answer

Avro "not open" exception when writing generic records using Apache Beam

I am using AvroIO.writeCustomTypeToGenericRecords() for writing generic records to GCS inside a streaming Dataflow job. For the first few minutes all seems to be working fine; however, after around 10 minutes, the job starts throwing…
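
For reference, a minimal windowed Avro write for a streaming job; unbounded collections need explicit windowing and a fixed shard count in front of a file sink. This sketch uses writeGenericRecords rather than writeCustomTypeToGenericRecords, and `records` and `schema` are assumed to exist already.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // Window the unbounded stream, then write one set of Avro files per window.
    PCollection<GenericRecord> windowed = records.apply(
        Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(5))));
    windowed.apply(AvroIO.writeGenericRecords(schema)
        .to("gs://my-bucket/avro/output")
        .withWindowedWrites()
        .withNumShards(4)
        .withSuffix(".avro"));
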
0 votes · 1 answer

Apache Beam - Exception caught and thrown even though the program keeps executing. How to stop that process or handle it in the pipeline?

I have a pipeline which fetches data from MySQL and transfers it to MongoDB. After running this pipeline with the below code, the data is fetched from MySQL but cannot be loaded into MongoDB: noSqlresult.apply(MongoDbIO.write().withUri(mongoUri) …
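
For comparison, the minimal MongoDbIO write looks like the sketch below; it expects a PCollection<Document> (here the assumed `docs`), and the URI, database and collection names are placeholders.

    import org.apache.beam.sdk.io.mongodb.MongoDbIO;
    import org.bson.Document;

    // MySQL rows must first be mapped to BSON Documents upstream of this apply.
    docs.apply(MongoDbIO.write()
        .withUri("mongodb://mongo-host:27017")
        .withDatabase("my_db")
        .withCollection("my_collection"));
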
0 votes · 1 answer

Read multiple files at runtime (dataflow template)

I am trying to build a Dataflow template. The goal is to read a ValueProvider that will tell me what files to read. Then, for each file, I need to read and enrich the data with the object. I have tried this: …
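
A template-friendly sketch: take the file pattern as a ValueProvider, match it at run time with FileIO, and read each file inside a DoFn so the file name stays available for enrichment. The options interface and field names are assumptions.

    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.ValueProvider;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    // Hypothetical template options carrying the runtime file pattern.
    public interface TemplateOptions extends PipelineOptions {
      ValueProvider<String> getInputPattern();
      void setInputPattern(ValueProvider<String> value);
    }

    // ... inside the pipeline-building method:
    TemplateOptions options = pipeline.getOptions().as(TemplateOptions.class);
    pipeline
        .apply(FileIO.match().filepattern(options.getInputPattern()))
        .apply(FileIO.readMatches())
        .apply(ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
          @ProcessElement
          public void process(ProcessContext c) throws Exception {
            FileIO.ReadableFile file = c.element();
            // The file name is available here for per-file enrichment.
            String name = file.getMetadata().resourceId().getFilename();
            c.output(name + ": " + file.readFullyAsUTF8String());
          }
        }));
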
0 votes · 1 answer

Apache Beam Python wordcount example errors on Windows 10

I am running Anaconda (a conda virtual env) with Python 2.7. I have followed the Apache Beam Python SDK Quickstart. When I run 'python -m apache_beam.examples.wordcount --input C:\Users\simon_6dagkya\OneDrive\ProgrammingCore\Apache…
0 votes · 2 answers

Google Dataflow - How to specify the TextIO in java if writing to an On-prem server?

Google Dataflow - How to specify the TextIO if writing to an On-prem server from Dataflow? (Provided that the On-prem server is connected to GCP with Cloud…
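Assuming the on-prem destination is an HDFS cluster reachable over the VPN/Interconnect, one sketch is to register Beam's Hadoop filesystem and write with an hdfs:// path; the namenode address is a placeholder and `lines` an assumed PCollection<String>.

    import java.util.Collections;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.hadoop.conf.Configuration;

    // Point Beam's HDFS filesystem at the on-prem cluster.
    HadoopFileSystemOptions options =
        PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://onprem-namenode:8020");
    options.setHdfsConfiguration(Collections.singletonList(conf));

    Pipeline p = Pipeline.create(options);
    // ... build the rest of the pipeline, then:
    lines.apply(TextIO.write().to("hdfs://onprem-namenode:8020/exports/output"));
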
0 votes · 1 answer

Reading bulk data from BigQuery using joins

I have a use case wherein I have to read selected data from BigQuery by applying left joins on 20 different BQ tables, apply transformations on that data, and then finally dump it into a final BQ table. I had two approaches in mind for achieving this…
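
A sketch of the first kind of approach: push the joins down to BigQuery with a query-based read, transform the rows in Beam, and write the result back. Project, dataset and table names are placeholders, and the destination table is assumed to already exist.

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.values.PCollection;

    // BigQuery executes the joins; Beam only sees the joined rows.
    PCollection<TableRow> joined = pipeline.apply(
        BigQueryIO.readTableRows()
            .fromQuery("SELECT a.id, b.value FROM `proj.ds.a` a "
                + "LEFT JOIN `proj.ds.b` b ON a.id = b.id")
            .usingStandardSql());

    // Apply transformations to `joined` here, then write to the final table.
    joined.apply(BigQueryIO.writeTableRows()
        .to("proj:ds.final_table")
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
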