Questions tagged [apache-beam-internals]

29 questions
0
votes
2 answers

Exception while writing a multipart empty CSV file from Apache Beam into NetApp Storage Grid

Problem statement: we are consuming multiple CSV files into PCollections -> applying Beam SQL to transform the data -> writing the resulting PCollection. This works absolutely fine if we have some data in all the source PCollections and Beam SQL generates new…
0
votes
0 answers

Apache Beam Python - SQL Transform with named PCollection Issue

I am trying to execute the code below, in which I am using a NamedTuple for the PCollection and SqlTransform for doing a simple select. As per the video link (4:06): https://www.youtube.com/watch?v=zx4p-UNSmrA. Instead of using PCOLLECTION in…
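For the named-PCollection case, one approach is to apply SqlTransform to a dict so the query can reference the dict key instead of the default PCOLLECTION table name. A minimal sketch follows; the Order schema and values are illustrative, and SqlTransform needs a local Java runtime to start its expansion service.

    import typing
    import apache_beam as beam
    from apache_beam.transforms.sql import SqlTransform

    class Order(typing.NamedTuple):
        id: int
        amount: float

    # Register a RowCoder so the NamedTuple becomes a schema'd PCollection.
    beam.coders.registry.register_coder(Order, beam.coders.RowCoder)

    with beam.Pipeline() as p:
        orders = p | beam.Create(
            [Order(1, 10.0), Order(2, 25.5)]).with_output_types(Order)
        # Applying SqlTransform to a dict names the input, so the query can
        # say "FROM orders" rather than "FROM PCOLLECTION".
        result = {'orders': orders} | SqlTransform(
            'SELECT id, amount FROM orders WHERE amount > 15')
        result | beam.Map(print)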
0
votes
0 answers

Apache Beam - Multiple PCollections - DataframeTransform issue

I am running the below sample in Apache Beam: import apache_beam as beam from apache_beam import Row from apache_beam import Pipeline from apache_beam.options.pipeline_options import PipelineOptions from apache_beam.options.pipeline_options import…
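For feeding more than one PCollection into DataframeTransform, one documented pattern is to pass a dict whose keys match the callable's argument names. The sketch below is illustrative only (the row schemas and the merge on 'id' are assumptions, not the asker's code):

    import apache_beam as beam
    from apache_beam.dataframe.transforms import DataframeTransform

    with beam.Pipeline() as p:
        orders = p | 'Orders' >> beam.Create(
            [beam.Row(id=1, amount=10.0), beam.Row(id=2, amount=25.5)])
        customers = p | 'Customers' >> beam.Create(
            [beam.Row(id=1, name='a'), beam.Row(id=2, name='b')])
        # Dict keys bind to the callable's argument names; each argument is a
        # deferred dataframe built from the corresponding PCollection.
        joined = {'orders': orders, 'customers': customers} | DataframeTransform(
            lambda orders, customers: orders.merge(customers, on='id'))
        joined | beam.Map(print)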
0
votes
1 answer

How to extract error records while inserting into a DB table using JdbcIO in Apache Beam (Java)

I am creating an in-memory PCollection and writing it into PostgreSQL. Now, when I insert data into the table, a few records may throw an exception and will not be inserted. How can I extract such failed insert records when I start the pipeline? Below is the code I…
0
votes
1 answer

How to append new rows or perform a union on two PCollections

In the following CSV, I need to append new row values to it.

ID  date        balance
01  31/01/2021  100
01  28/02/2021  200
01  31/03/2021  200
01  30/04/2021  200
01  31/05/2021  500
01  30/06/2021  600

Expected…
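Appending the rows of one PCollection to another is what beam.Flatten does (the Beam counterpart of a SQL UNION ALL). A minimal sketch, with made-up rows in the question's ID/date/balance shape:

    import apache_beam as beam

    with beam.Pipeline() as p:
        existing = p | 'Existing' >> beam.Create([
            {'ID': '01', 'date': '31/01/2021', 'balance': 100},
            {'ID': '01', 'date': '28/02/2021', 'balance': 200},
        ])
        new_rows = p | 'New' >> beam.Create([
            {'ID': '01', 'date': '31/07/2021', 'balance': 700},  # illustrative
        ])
        # Flatten merges both inputs into a single PCollection (UNION ALL).
        combined = (existing, new_rows) | beam.Flatten()
        combined | beam.Map(print)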
0
votes
1 answer

Beam DirectRunner Calcite can't specify name

I'm running a simplified version of this Beam tutorial, but using the DirectRunner on my local machine. import apache_beam as beam from apache_beam.transforms.sql import SqlTransform import os with beam.Pipeline() as p: rows = (p | …
0
votes
1 answer

Equivalent of repartition in apache beam

In Spark, if we have to reshuffle the data, we can call repartition on a DataFrame. What's the way to do the same in Apache Beam for a PCollection? In PySpark: new_df = df.repartition(4)
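Beam has no direct repartition(n) on a PCollection, but beam.Reshuffle() forces a redistribution of elements across workers (and breaks fusion), which is the usual stand-in. A minimal sketch:

    import apache_beam as beam

    with beam.Pipeline() as p:
        _ = (p
             | beam.Create(range(100))
             | beam.Reshuffle()   # redistributes elements and breaks fusion
             | beam.Map(lambda x: x * x)
             | beam.Map(print))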
0
votes
1 answer

I see Apache Beam scales with the # of CSV files easily, but what about the # of lines in one CSV?

I am currently reading this article and the Apache Beam docs: https://medium.com/@mohamed.t.esmat/apache-beam-bites-10b8ded90d4c Everything I have read is about N files. In our use case, we receive a Pub/Sub event for ONE new file each time to kick off a…
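For the one-file-per-event case, a hedged sketch (the subscription path is a placeholder): textio.ReadAllFromText consumes file names from a PCollection and splits each matched file into byte-range bundles, so even a single large CSV can be read in parallel rather than by one worker.

    import apache_beam as beam
    from apache_beam.io.textio import ReadAllFromText
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        lines = (p
                 | beam.io.ReadFromPubSub(
                     subscription='projects/p/subscriptions/s')  # placeholder
                 | beam.Map(lambda msg: msg.decode('utf-8'))  # body = file path
                 | ReadAllFromText())  # splits each file into parallel ranges
        lines | beam.Map(print)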
0
votes
1 answer

File-to-DB load using Apache Beam

I need to load a file into my database, but before that I have to verify whether data is present in the database based on some of the file data. For instance, suppose I have 5 records in a file; then I have to check the database 5 times for the separate records. So…
0
votes
1 answer

beam.Create() with list of dicts is extremely slow compared to a list of strings

I am using Dataflow to process a Shapefile with about 4 million features (about 2GB total) and load the geometries into BigQuery, so before my pipeline starts, I extract the shapefile features into a list, and initialize the pipeline using…
0
votes
1 answer

Speed and memory tradeoffs splitting Apache Beam PCollection in two

I've got a PCollection where each element is a (key, values) tuple like this: (key, (value1, .., value_n)). I need to split this PCollection into two processing branches. As always, I need the whole pipeline to be as fast and use as little RAM as…
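Splitting in a single pass, rather than filtering the input twice, is usually done with tagged outputs. A minimal sketch (the routing predicate is illustrative):

    import apache_beam as beam

    def route(element):
        key, values = element
        # Routing rule is illustrative; replace with the real split condition.
        if len(values) > 1:
            yield beam.pvalue.TaggedOutput('large', element)
        else:
            yield element  # goes to the main ('small') output

    with beam.Pipeline() as p:
        pairs = p | beam.Create([('a', (1,)), ('b', (1, 2, 3))])
        results = pairs | beam.FlatMap(route).with_outputs('large', main='small')
        results.small | 'PrintSmall' >> beam.Map(print)
        results.large | 'PrintLarge' >> beam.Map(print)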
0
votes
2 answers

Know the number of threads running in the Apache Beam direct runner

I have an Apache Beam program in Java running with the direct runner. Apache Beam uses threads in order to achieve distributed processing. At run time, how can I know the number of threads spawned by Apache Beam? How can I set the maximum number of…
0
votes
1 answer

Getting many warnings while using a List with a custom POJO Java class in Apache Beam (Java)

I am new to Apache Beam. I am using Apache Beam with Dataflow on GCP as the runner. I am getting the following error while executing the pipeline: coder of type class org.apache.beam.sdk.coders.ListCoder has a #structuralValue method which does not return…
0
votes
2 answers

TextIO.Read().From() vs TextIO.ReadFiles() over withHintMatchesManyFiles()

In my use case I am getting a set of matching file patterns from Kafka: PCollection filepatterns = p.apply(KafkaIO.read()...); Here each pattern could match up to 300+ files. Q1. How can I use TextIO.Read() to match data from the PCollection, as…