Questions tagged [apache-beam-internals]
29 questions
0
votes
2 answers
Exception while writing multipart empty csv file from Apache Beam into netApp Storage Grid
Problem Statement
We read multiple CSV files into PCollections -> apply Beam SQL to transform the data -> write the resulting PCollection.
This works fine if all the source PCollections contain some data and Beam SQL generates new…

Jaysukh Kalasariya
- 73
- 1
- 7
0
votes
0 answers
Apache Beam Python - SQL Transform with named PCollection Issue
I am trying to run the code below, in which I use a NamedTuple for the PCollection and a SQL transform to do a simple select.
As per the video (4:06): https://www.youtube.com/watch?v=zx4p-UNSmrA.
Instead of using PCOLLECTION in…

Murli Krishnan
- 35
- 5
0
votes
0 answers
Apache Beam - Multiple PCollection - DataframeTransform Issue
I am running the sample below in Apache Beam:
import apache_beam as beam
from apache_beam import Row
from apache_beam import Pipeline
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import…

Murli Krishnan
- 35
- 5
0
votes
1 answer
How to extract error records while inserting into a DB table using JdbcIO in Apache Beam (Java)
I am creating an in-memory PCollection and writing it to PostgreSQL. When I insert data into the table, a few records may throw an exception and will not be inserted. How can I extract such failed records when I start the pipeline?
Below is the code I…

Sachin Rane
- 76
- 6
0
votes
1 answer
How to append new rows or perform a union on two PCollections
I need to append new rows to the following CSV:
ID  date        balance
01  31/01/2021  100
01  28/02/2021  200
01  31/03/2021  200
01  30/04/2021  200
01  31/05/2021  500
01  30/06/2021  600
Expected…

Ron Santis
- 45
- 8
0
votes
1 answer
Beam DirectRunner Calcite can't specify name
I'm running a simplified version of this beam tutorial, but running it using the DirectRunner on my local machine.
import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform
import os
with beam.Pipeline() as p:
    rows = (p |
        …

steeling
- 183
- 1
- 1
- 9
0
votes
1 answer
Equivalent of repartition in Apache Beam
In Spark, if we have to reshuffle the data, we can call repartition on a DataFrame. What is the equivalent in Apache Beam for a PCollection?
In PySpark:
new_df = df.repartition(4)

bigbounty
- 16,526
- 5
- 37
- 65
0
votes
1 answer
I see Apache Beam scales easily with # of CSV files, but what about # of lines in one CSV?
I am currently reading this article and the Apache Beam docs: https://medium.com/@mohamed.t.esmat/apache-beam-bites-10b8ded90d4c
Everything I have read is about N files. In our use case, we receive a Pub/Sub event for ONE new file each time to kick off a…

Dean Hiller
- 19,235
- 25
- 129
- 212
0
votes
1 answer
File to DB load using Apache Beam
I need to load a file into my database, but before that I have to verify that data is present in the database based on some of the file's data. For instance, if the file has 5 records, I have to check the database 5 times, once for each record.
So…

Gaurav Khandelwal
- 360
- 2
- 12
0
votes
1 answer
beam.Create() with list of dicts is extremely slow compared to a list of strings
I am using Dataflow to process a Shapefile with about 4 million features (about 2GB total) and load the geometries into BigQuery, so before my pipeline starts, I extract the shapefile features into a list, and initialize the pipeline using…

Travis Webb
- 14,688
- 7
- 55
- 109
0
votes
1 answer
Speed and memory tradeoffs splitting Apache Beam PCollection in two
I've got a PCollection where each element is a (key, values) tuple like this: (key, (value1,..,value_n) )
I need to split this PCollection into two processing branches.
As always, I need the whole pipeline to be as fast and use as little RAM as…

Iñigo González
- 3,735
- 1
- 11
- 27
0
votes
2 answers
Know the number of threads running in the Apache Beam direct runner
I have an Apache Beam program in Java running with the direct runner. Apache Beam uses threads in order to achieve distributed processing.
At run time, how can I know the number of threads spawned by Apache Beam?
How can I set the maximum number of…

bigbounty
- 16,526
- 5
- 37
- 65
0
votes
1 answer
Getting so many warnings while using a List with a custom POJO Java class in Apache Beam (Java)
I am new to Apache Beam. I am using Apache Beam with Dataflow in GCP as the runner. I am getting the following error while executing the pipeline.
coder of type class org.apache.beam.sdk.coders.ListCoder has a #structuralValue method which does not return…

akash kumar
- 35
- 9
0
votes
2 answers
TextIO.Read().From() vs TextIO.ReadFiles() over withHintMatchesManyFiles()
In my use case, I get a set of matching file patterns from Kafka:
PCollection filepatterns = p.apply(KafkaIO.read()...);
Here each pattern could match 300+ files.
Q1. How can I use TextIO.Read() to match data from PCollection, as…

Prakhar Mishra
- 3
- 3