Questions tagged [apache-beam-io]

Apache Beam is a unified SDK for batch and stream processing. This tag should be used for questions related to reading data into an Apache Beam pipeline, or writing the output of a pipeline to a destination.

Apache Beam is an open source, unified model for defining and executing both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

Apache Beam I/O refers to the process of loading data into a Beam pipeline, or writing the output of a pipeline to a destination.
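
For orientation, here is a minimal sketch of both halves in the Java SDK, with placeholder file paths; TextIO stands in for any of Beam's I/O connectors:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class MinimalIoSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        // Read: load data into the pipeline from a source.
        p.apply("ReadLines", TextIO.read().from("gs://my-bucket/input/*.txt"))
         // Write: emit the pipeline's output to a destination.
         .apply("WriteLines", TextIO.write().to("gs://my-bucket/output/result"));
        p.run().waitUntilFinish();
      }
    }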

539 questions
0 votes, 1 answer

Module object has no attribute BigqueryV2 - Local Apache Beam

I am trying to run a pipeline locally (on macOS Sierra) with Apache Beam, using the Beam-provided I/O APIs for Google BigQuery. I set up my environment with Virtualenv as suggested by the Beam Python quickstart, and I can run the wordcount.py example. I can also…
0 votes, 1 answer

Using MySQL as input source and writing into Google BigQuery

I have an Apache Beam task that reads from a MySQL source using JDBC, and it's supposed to write the data as-is to a BigQuery table. No transformation is performed at this point; that will come later. For the moment I just want the database…
MC. • 481 • 7 • 15
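
A minimal Java sketch of such a pass-through job, assuming placeholder connection settings, query, and table names (the beam-sdks-java-io-jdbc module and a MySQL driver must be on the classpath):

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;

    public class MySqlToBigQuerySketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        p.apply("ReadFromMySQL", JdbcIO.<TableRow>read()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "com.mysql.cj.jdbc.Driver", "jdbc:mysql://host:3306/mydb")  // placeholders
                .withUsername("user")
                .withPassword("password"))
            .withQuery("SELECT id, name FROM source_table")
            // Map each JDBC result row to a BigQuery TableRow, unchanged.
            .withRowMapper((JdbcIO.RowMapper<TableRow>) rs ->
                new TableRow().set("id", rs.getLong("id")).set("name", rs.getString("name")))
            .withCoder(TableRowJsonCoder.of()))
         .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")  // placeholder; table assumed to exist
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
        p.run().waitUntilFinish();
      }
    }
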
0 votes, 2 answers

Apache Beam Maven dependencies: JDBC package is not downloaded in SDK jar file

Downloaded Maven dependencies in Eclipse using org.apache.beam beam-runners-direct-java 0.3.0-incubating. Only org.apache.beam.sdk.io, only…
naga • 25 • 1 • 6
0 votes, 1 answer

NullPointerException caught when writing to Bigtable using Apache Beam's Dataflow SDK

I'm using Apache Beam SDK version 0.2.0-incubating-SNAPSHOT and trying to write data to a Bigtable table with the Dataflow runner. Unfortunately, I'm getting a NullPointerException when executing my Dataflow pipeline, where I'm using BigtableIO.Write as my…
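
For comparison, a hedged sketch of a fully configured BigtableIO.Write on a recent Beam release (the SNAPSHOT-era API in the question may differ); all IDs below are placeholders, and an incompletely configured sink is one thing worth ruling out:

    import com.google.bigtable.v2.Mutation;
    import com.google.protobuf.ByteString;
    import java.util.Collections;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.IterableCoder;
    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.extensions.protobuf.ByteStringCoder;
    import org.apache.beam.sdk.extensions.protobuf.ProtoCoder;
    import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.KV;

    public class BigtableWriteSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        // A single row key with one SetCell mutation, just to exercise the sink.
        Mutation cell = Mutation.newBuilder()
            .setSetCell(Mutation.SetCell.newBuilder()
                .setFamilyName("cf")
                .setColumnQualifier(ByteString.copyFromUtf8("col"))
                .setValue(ByteString.copyFromUtf8("value")))
            .build();
        p.apply(Create.of(KV.of(ByteString.copyFromUtf8("row-1"),
                (Iterable<Mutation>) Collections.singletonList(cell)))
            .withCoder(KvCoder.of(ByteStringCoder.of(),
                IterableCoder.of(ProtoCoder.of(Mutation.class)))))
         .apply("WriteToBigtable", BigtableIO.write()
            .withProjectId("my-project")    // placeholder
            .withInstanceId("my-instance")  // placeholder
            .withTableId("my-table"));      // placeholder
        p.run().waitUntilFinish();
      }
    }
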
0 votes, 1 answer

Dataflow pipeline details for BigQuery source/sinks not displaying

According to this announcement by the Google Dataflow team, we should be able to see the details of our BigQuery sources and sinks in the console if we use the 1.6 SDK. However, although the new "Pipeline Options" do indeed show up, the details of…
Graham Polley • 14,393 • 4 • 44 • 80
-1 votes, 0 answers

Flatten of Unbounded PCollections with Dataflow Runner v2 works only once

I'm generating the same type of objects in multiple transforms (they are events of my process). The input of the pipeline is FileIO.MatchAll, so the PCollections are unbounded. Then I create a PCollectionList and Flatten them so I can apply…
skalski • 111 • 1 • 6
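
A hedged sketch of that shape, using FileIO.match().continuously() (a close sibling of MatchAll) with placeholder filepatterns; each match polls for new files, so both collections are unbounded:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.fs.MatchResult.Metadata;
    import org.apache.beam.sdk.transforms.Flatten;
    import org.apache.beam.sdk.transforms.Watch;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;
    import org.joda.time.Duration;

    public class FlattenUnboundedSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        PCollection<Metadata> eventsA = p.apply("MatchA", FileIO.match()
            .filepattern("gs://bucket/a/*.json")  // placeholder
            .continuously(Duration.standardMinutes(1), Watch.Growth.never()));
        PCollection<Metadata> eventsB = p.apply("MatchB", FileIO.match()
            .filepattern("gs://bucket/b/*.json")  // placeholder
            .continuously(Duration.standardMinutes(1), Watch.Growth.never()));
        // Flatten merges the unbounded collections into a single PCollection.
        PCollection<Metadata> all =
            PCollectionList.of(eventsA).and(eventsB).apply(Flatten.pCollections());
        p.run();
      }
    }
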
-1 votes, 1 answer

How to call the Apache Beam DoFn class from another general class, or vice versa?

How can I inherit the ParDo class (which is in beam.py) into a generic class (which is in the generic.py file), or vice versa? Example: beam.py: class rejected_records(beam.DoFn): def process(self, element): """ Transformation """ return…
-1 votes, 1 answer

apache_beam: No matching signature for operator = for argument types: DATE, INT64. Supported signature: ANY = ANY

I did some Python coding with BigQuery SQL using apache_beam.io.gcp.bigquery_tools. What I am confused about is that the SQL works perfectly when I run it in BigQuery, but it hits an error when I implement it in Python with the above apache_beam library. I also…
-1 votes, 1 answer

Apache Beam TextIO writer is not writing an unbounded source to file

The following code runs without any issues on the Beam direct runner. The SQS messages are consumed, but the messages aren't written to the destination location. Options options =…
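
With an unbounded input, TextIO generally needs the stream to be windowed and windowed writes enabled with an explicit shard count; a hedged fragment, where messages is an assumed unbounded PCollection<String>:

    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // Window the stream so TextIO can finalize a file per window.
    PCollection<String> windowed = messages.apply(
        Window.into(FixedWindows.of(Duration.standardMinutes(5))));
    windowed.apply(TextIO.write()
        .to("s3://my-bucket/out/events")  // placeholder output prefix
        .withWindowedWrites()             // required for unbounded input
        .withNumShards(1));               // unbounded writes need an explicit shard count
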
-1 votes, 1 answer

Apache Beam KafkaIO: specify topic partition instead of topic name

Apache Beam KafkaIO has support for Kafka consumers to read only from specified partitions. I have the following code: KafkaIO.read() .withCreateTime(Duration.standardMinutes(1)) .withReadCommitted()…
bigbounty • 16,526 • 5 • 37 • 65
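
KafkaIO.Read does accept explicit partitions via withTopicPartitions in place of a topic name; a hedged fragment with placeholder broker and topic values:

    import java.util.Arrays;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.joda.time.Duration;

    KafkaIO.Read<String, String> read = KafkaIO.<String, String>read()
        .withBootstrapServers("broker:9092")  // placeholder
        // Read only partitions 0 and 2 of "my-topic" instead of the whole topic.
        .withTopicPartitions(Arrays.asList(
            new TopicPartition("my-topic", 0),
            new TopicPartition("my-topic", 2)))
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withCreateTime(Duration.standardMinutes(1))
        .withReadCommitted();
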
-1 votes, 1 answer

BigQueryIO.writeTableRows writes to BigQuery with very high delay

The following code snippet shows the method that writes to BigQuery (it picks up data from Pub/Sub). The "Write to BigQuery" Dataflow step receives the TableRow data, but it writes to BigQuery with a very high delay (more than 3-4 hours) or doesn't even…
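
Latency here usually hinges on the insertion method; as a hedged fragment (rows is an assumed PCollection<TableRow>, names are placeholders), streaming inserts land within seconds, while file loads batch rows up between load jobs:

    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

    rows.apply("Write to BigQuery", BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")  // placeholder
        // STREAMING_INSERTS writes rows as they arrive; the default for an
        // unbounded input, but worth setting explicitly while debugging.
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

    // Alternatively, FILE_LOADS on an unbounded input batches rows and needs a
    // triggering frequency, which directly bounds how stale the table can be:
    //   .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
    //   .withTriggeringFrequency(org.joda.time.Duration.standardMinutes(5))
    //   .withNumFileShards(10)
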
-1 votes, 1 answer

Apache Beam - What happens with Windows/Triggers after multiple GroupByKey?

The windowing section of the Beam programming model guide (section 7.1.1) shows a window defined and used in the GroupByKey transform after a ParDo. How long does a window remain in scope for an element? Let's imagine a pipeline like…
Pablo • 10,425 • 1 • 44 • 67
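
For intuition: a window assigned once stays attached to each element through downstream transforms, so a second GroupByKey regroups within the same windows unless the collection is re-windowed. A hedged fragment, where input is an assumed PCollection<KV<String, Long>>:

    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.joda.time.Duration;

    input
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply("FirstGrouping", GroupByKey.create())
        // The one-minute windows assigned above still apply here, so this
        // groups per key within the same windows rather than globally.
        .apply("SecondGrouping", GroupByKey.create());
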
-2 votes, 1 answer

Apache Beam FileIO match - what's the better/more efficient way to match files?

I'm just wondering: does the use of a wildcard have an impact on how Beam matches files? For instance, if I want to match a file with Apache Beam, is there an advantage to specifying a direct path to the file (i.e. folder/subfolder/file.txt)? Or, if…
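
As a hedged illustration, a literal path and a glob go through the same FileIO.match() API; the difference is mainly how much listing work the underlying filesystem has to do (p is an assumed Pipeline, paths are placeholders):

    import org.apache.beam.sdk.io.FileIO;

    // Literal path: resolves to exactly one file, minimal listing work.
    p.apply("MatchOne", FileIO.match()
        .filepattern("folder/subfolder/file.txt"));

    // Wildcard: the filesystem must expand the glob across the directory.
    p.apply("MatchGlob", FileIO.match()
        .filepattern("folder/subfolder/*.txt"));
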
-2 votes, 1 answer

Convert this Weight/Score into a list of column names sorted according to their Weight/Score, in matrix format, using Python

Convert this Weight/Score, read from an input .csv file, into a list of column names sorted by their descending Weight/Score in matrix format using Python Apache Beam, and write it into another .csv file. Input .csv file: user_id,…