Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.

539 questions
0 votes · 1 answer

How to create Bigtable tables and column families in a Dataflow job if they don't exist

I have a Cloud Dataflow job which writes everything to a single table and a single column family. How can I modify this job to write to multiple tables and column families which may or may not exist? E.g., if a table or column family doesn't exist,…
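
One possible approach, sketched below: create the missing table or column family with the Cloud Bigtable admin client before (or alongside) the BigtableIO write. The project, instance, table and family names are placeholders.

    import com.google.cloud.bigtable.admin.v2.BigtableTableAdminClient;
    import com.google.cloud.bigtable.admin.v2.models.CreateTableRequest;
    import com.google.cloud.bigtable.admin.v2.models.ModifyColumnFamiliesRequest;

    // Run once before the pipeline (e.g. in main()) or lazily from a DoFn @Setup.
    try (BigtableTableAdminClient admin =
            BigtableTableAdminClient.create("my-project", "my-instance")) {
      if (!admin.exists("my-table")) {
        // Table is missing: create it together with the column family.
        admin.createTable(CreateTableRequest.of("my-table").addFamily("my-family"));
      } else if (admin.getTable("my-table").getColumnFamilies().stream()
          .noneMatch(cf -> cf.getId().equals("my-family"))) {
        // Table exists but the column family does not: add the family.
        admin.modifyFamilies(
            ModifyColumnFamiliesRequest.of("my-table").addFamily("my-family"));
      }
    }
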
0 votes · 0 answers

Connect Beam JDBC IO with Cassandra

Unable to connect to Cassandra using the JDBC driver; getting the error java.sql.SQLException: Cannot create PoolableConnectionFactory (isValid() returned false). Apache Beam JdbcIO is not working with Cassandra. I tried with a working cassandra-jdbc-1.2.5.jar. Here is my…
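
For comparison, a minimal JdbcIO read sketch; the driver class name and JDBC URL below are assumptions for a generic Cassandra JDBC driver, so substitute whatever the jar on the classpath actually provides. Beam also ships a native CassandraIO connector, which does not go through a JDBC connection pool at all.

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.values.PCollection;

    // `pipeline` is an assumed Pipeline; driver class and URL are assumptions.
    PCollection<String> names = pipeline.apply(JdbcIO.<String>read()
        .withDataSourceConfiguration(
            JdbcIO.DataSourceConfiguration.create(
                "org.apache.cassandra.cql.jdbc.CassandraDriver",     // assumed driver class
                "jdbc:cassandra://cassandra-host:9042/my_keyspace")  // assumed URL format
                .withUsername("user")
                .withPassword("password"))
        .withQuery("SELECT name FROM my_table")
        .withRowMapper((JdbcIO.RowMapper<String>) rs -> rs.getString("name"))
        .withCoder(StringUtf8Coder.of()));
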
0 votes · 1 answer

Dataflow IO to BigTable [2.9.0]

I found this Bigtable with Dataflow example: https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/blob/master/java/dataflow-connector-examples/src/main/java/com/google/cloud/bigtable/dataflow/example/HelloWorldWrite.java However, it uses…
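
A rough Beam 2.9-style equivalent of that sample using BigtableIO rather than the old CloudBigtableIO classes; the ids, the column family, and the `input` PCollection<String> are placeholders/assumptions.

    import com.google.bigtable.v2.Mutation;
    import com.google.common.collect.ImmutableList;
    import com.google.protobuf.ByteString;
    import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.SimpleFunction;
    import org.apache.beam.sdk.values.KV;

    // Build one SetCell mutation per element and hand the KV<rowKey, mutations>
    // pairs to BigtableIO.write().
    input
        .apply(MapElements.via(
            new SimpleFunction<String, KV<ByteString, Iterable<Mutation>>>() {
              @Override
              public KV<ByteString, Iterable<Mutation>> apply(String value) {
                Mutation cell = Mutation.newBuilder()
                    .setSetCell(Mutation.SetCell.newBuilder()
                        .setFamilyName("cf1")
                        .setColumnQualifier(ByteString.copyFromUtf8("value"))
                        .setValue(ByteString.copyFromUtf8(value)))
                    .build();
                Iterable<Mutation> mutations = ImmutableList.of(cell);
                return KV.of(ByteString.copyFromUtf8(value), mutations);
              }
            }))
        .apply(BigtableIO.write()
            .withProjectId("my-project")
            .withInstanceId("my-instance")
            .withTableId("my-table"));
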
0 votes · 0 answers

Beam pipeline: Kafka to HDFS by time buckets

I am trying to build a very simple pipeline that reads a stream of events from Kafka (KafkaIO.read) and writes the very same events to HDFS, bucketing the events together by hour (the hour is read from a timestamp field of the event, not processing…
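
A sketch of the shape this usually takes: read from Kafka, window into hourly buckets, and use windowed file writes. The broker, topic and HDFS path are placeholders, and the step that re-assigns each element's timestamp from the event's own timestamp field is deliberately elided, since that part depends on the event format.

    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.transforms.Values;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.joda.time.Duration;

    // Requires the HDFS filesystem to be registered via HadoopFileSystemOptions.
    pipeline
        .apply(KafkaIO.<String, String>read()
            .withBootstrapServers("broker:9092")
            .withTopic("events")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())
        .apply(Values.<String>create())
        // Re-assign timestamps from the event payload here, then window by hour.
        .apply(Window.<String>into(FixedWindows.of(Duration.standardHours(1))))
        .apply(TextIO.write()
            .to("hdfs://namenode:8020/events/part")
            .withWindowedWrites()
            .withNumShards(4));
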
0 votes · 2 answers

Access File inside the dataflow pipeline

I want to download certain files to a temp location before the pipeline starts. The files are .mmdb files which are to be read in the ParDo function. The files are stored on Google Cloud Storage, but the method consuming the .mmdb files requires them to…
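
One way to sketch this: copy the object from GCS to a local temp file once per DoFn instance in @Setup, then hand the local path to the library that insists on a real file. The bucket and object name are placeholders.

    import java.nio.channels.Channels;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import org.apache.beam.sdk.io.FileSystems;
    import org.apache.beam.sdk.transforms.DoFn;

    class GeoLookupFn extends DoFn<String, String> {
      private transient Path localDb;

      @Setup
      public void setup() throws Exception {
        // Copy gs://my-bucket/geo.mmdb to a worker-local temp file.
        localDb = Files.createTempFile("geo", ".mmdb");
        Files.copy(
            Channels.newInputStream(FileSystems.open(
                FileSystems.matchNewResource("gs://my-bucket/geo.mmdb", false))),
            localDb,
            StandardCopyOption.REPLACE_EXISTING);
      }

      @ProcessElement
      public void process(ProcessContext c) {
        // Open localDb.toFile() with the .mmdb reader here.
        c.output(c.element());
      }
    }
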
0 votes · 0 answers

Fetching records based on batch size in apache beam

I have 100k records to be processed and I need to fetch 10k at a time, process them, and fetch another 10k until all 100k records are processed. I call this the batch size; it avoids the processing overhead of fetching all the records at…
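
If the records are already being read as a PCollection, Beam's GroupIntoBatches can do the 10k chunking; a minimal sketch, assuming `records` is a PCollection<String>:

    import org.apache.beam.sdk.transforms.GroupIntoBatches;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // GroupIntoBatches works on keyed input, so attach a key first. A single
    // constant key funnels every batch through one worker; use several keys
    // if the batches should be processed in parallel.
    PCollection<KV<String, Iterable<String>>> batches = records
        .apply(WithKeys.<String, String>of("all"))
        .apply(GroupIntoBatches.ofSize(10_000));
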
0 votes · 2 answers

Considering the total max records from the user and processing them based on the batch size in Apache Beam

I am trying to read records from the source based on a total max-record count to be processed, which is given by the user. E.g.: total records in the source table is 1 million; total max records to process is 100K. I need to process…
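
A sketch of one way to wire the user-supplied cap in: expose it as a pipeline option and push it into the source query, then batch the results as in the previous sketch. The option interface, driver, URL and query below are assumptions, not a fixed Beam API.

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.options.PipelineOptions;

    // Hypothetical options interface carrying the user's limit.
    public interface MyOptions extends PipelineOptions {
      long getMaxRecords();
      void setMaxRecords(long value);
    }

    // ... inside the pipeline-building method:
    MyOptions options = pipeline.getOptions().as(MyOptions.class);
    pipeline.apply(JdbcIO.<String>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "org.postgresql.Driver", "jdbc:postgresql://db-host/mydb"))
        // Let the database stop after the user-specified number of rows.
        .withQuery("SELECT name FROM records LIMIT " + options.getMaxRecords())
        .withRowMapper((JdbcIO.RowMapper<String>) rs -> rs.getString("name"))
        .withCoder(StringUtf8Coder.of()));
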
0 votes · 1 answer

Populate TextIO write with API call

My question revolves around kicking off an API call to get the file prefix for TextIO output. Here is what I have now (and…
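
For a non-templated job, the prefix is baked into the graph when the pipeline is constructed, so one option is simply to make the API call in main() first. fetchPrefixFromApi() below is a hypothetical helper, and `lines` an assumed PCollection<String>.

    import org.apache.beam.sdk.io.TextIO;

    // Call the external API at graph-construction time, then use the result.
    String prefix = fetchPrefixFromApi();  // hypothetical helper, e.g. returns "gs://my-bucket/out/run-42"
    lines.apply(TextIO.write().to(prefix).withSuffix(".txt"));
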
0 votes · 1 answer

I know that ORDER BY is not presently supported in BeamSQL; is there any workaround for it?

This is what is present: PCollection rec = rec_out.apply(BeamSql.query( "SELECT bnk_name,state_name,val from PCOLLECTION order by val desc limit 2")); but I need PCollection rec = rec_out.apply(BeamSql.query( "SELECT…
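
One workaround while BeamSQL lacks ORDER BY: run the query without the ordering and emulate "order by val desc limit 2" with Top.of, which returns the two largest elements under a comparator. This assumes the query result is a schema'd PCollection<Row> (here `rows`) with an integer field named val.

    import java.io.Serializable;
    import java.util.Comparator;
    import java.util.List;
    import org.apache.beam.sdk.transforms.Top;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.Row;

    // Comparator on the "val" field; Top.of(2, ...) keeps the two largest rows.
    class ByVal implements Comparator<Row>, Serializable {
      @Override
      public int compare(Row a, Row b) {
        return Integer.compare(a.getInt32("val"), b.getInt32("val"));
      }
    }

    PCollection<List<Row>> top2 = rows.apply(Top.of(2, new ByVal()));
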
0 votes · 1 answer

Avro "not open" exception when writing generic records using Apache Beam

I am using AvroIO.writeCustomTypeToGenericRecords() for writing generic records to GCS inside a streaming Dataflow job. For the first few minutes all seems to be working fine; however, after around 10 minutes, the job starts throwing…
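
For reference, a minimal windowed Avro write for a streaming job; unbounded collections need explicit windowing and a fixed shard count in front of a file sink. This sketch uses writeGenericRecords rather than writeCustomTypeToGenericRecords, and `records` and `schema` are assumed to exist already.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // Window the unbounded stream, then write one set of Avro files per window.
    PCollection<GenericRecord> windowed = records.apply(
        Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(5))));
    windowed.apply(AvroIO.writeGenericRecords(schema)
        .to("gs://my-bucket/avro/output")
        .withWindowedWrites()
        .withNumShards(4)
        .withSuffix(".avro"));
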
0 votes · 1 answer

Apache Beam - Exception caught and thrown even though the program keeps executing. How to stop that process or handle it in the pipeline?

I have a pipeline which fetches data from MySQL and transfers it to MongoDB. After running this pipeline with the below code, the data is fetched from MySQL but cannot be loaded into MongoDB: noSqlresult.apply(MongoDbIO.write().withUri(mongoUri) …
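
For comparison, the minimal MongoDbIO write looks like the sketch below; it expects a PCollection<Document> (here the assumed `docs`), and the URI, database and collection names are placeholders.

    import org.apache.beam.sdk.io.mongodb.MongoDbIO;
    import org.bson.Document;

    // MySQL rows must first be mapped to BSON Documents upstream of this apply.
    docs.apply(MongoDbIO.write()
        .withUri("mongodb://mongo-host:27017")
        .withDatabase("my_db")
        .withCollection("my_collection"));
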
0 votes · 1 answer

Read multiple files at runtime (dataflow template)

I am trying to build a Dataflow template. The goal is to read a ValueProvider that will tell me what files to read. Then, for each file, I need to read and enrich the data with the object. I have tried this: …
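
A template-friendly sketch: take the file pattern as a ValueProvider, match it at run time with FileIO, and read each file inside a DoFn so the file name stays available for enrichment. The options interface and field names are assumptions.

    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.ValueProvider;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    // Hypothetical template options carrying the runtime file pattern.
    public interface TemplateOptions extends PipelineOptions {
      ValueProvider<String> getInputPattern();
      void setInputPattern(ValueProvider<String> value);
    }

    // ... inside the pipeline-building method:
    TemplateOptions options = pipeline.getOptions().as(TemplateOptions.class);
    pipeline
        .apply(FileIO.match().filepattern(options.getInputPattern()))
        .apply(FileIO.readMatches())
        .apply(ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
          @ProcessElement
          public void process(ProcessContext c) throws Exception {
            FileIO.ReadableFile file = c.element();
            // The file name is available here for per-file enrichment.
            String name = file.getMetadata().resourceId().getFilename();
            c.output(name + ": " + file.readFullyAsUTF8String());
          }
        }));
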
0 votes · 1 answer

Apache Beam Python wordcount example errors on Windows 10

I am running Anaconda (a conda virtual env) with Python 2.7. I have followed the Apache Beam Python SDK Quickstart. When I run 'python -m apache_beam.examples.wordcount --input C:\Users\simon_6dagkya\OneDrive\ProgrammingCore\Apache…
0 votes · 2 answers

Google Dataflow - How to specify the TextIO in java if writing to an On-prem server?

Google Dataflow - How to specify the TextIO if writing to an On-prem server from Dataflow? (Provided that the On-prem server is connected to GCP with Cloud…
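Assuming the on-prem destination is an HDFS cluster reachable over the VPN/Interconnect, one sketch is to register Beam's Hadoop filesystem and write with an hdfs:// path; the namenode address is a placeholder and `lines` an assumed PCollection<String>.

    import java.util.Collections;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.hadoop.conf.Configuration;

    // Point Beam's HDFS filesystem at the on-prem cluster.
    HadoopFileSystemOptions options =
        PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://onprem-namenode:8020");
    options.setHdfsConfiguration(Collections.singletonList(conf));

    Pipeline p = Pipeline.create(options);
    // ... build the rest of the pipeline, then:
    lines.apply(TextIO.write().to("hdfs://onprem-namenode:8020/exports/output"));
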
0 votes · 1 answer

Reading bulk data from BigQuery using joins

I have a use case wherein I have to read selected data from BigQuery by applying left joins on 20 different BQ tables, apply transformations on that data, and then finally dump it into a final BQ table. I had two approaches in mind for achieving this…
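
A sketch of the first kind of approach: push the joins down to BigQuery with a query-based read, transform the rows in Beam, and write the result back. Project, dataset and table names are placeholders, and the destination table is assumed to already exist.

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.values.PCollection;

    // BigQuery executes the joins; Beam only sees the joined rows.
    PCollection<TableRow> joined = pipeline.apply(
        BigQueryIO.readTableRows()
            .fromQuery("SELECT a.id, b.value FROM `proj.ds.a` a "
                + "LEFT JOIN `proj.ds.b` b ON a.id = b.id")
            .usingStandardSql());

    // Apply transformations to `joined` here, then write to the final table.
    joined.apply(BigQueryIO.writeTableRows()
        .to("proj:ds.final_table")
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
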