Questions tagged [apache-spark-sql-repartition]
23 questions
0
votes
1 answer
Spark repartition issue for file size
Need to merge small parquet files.
I have multiple small parquet files in HDFS.
I would like to combine those parquet files into files of nearly 128 MB each.
So I read all the files using spark.read(),
did repartition() on that, and wrote the result to HDFS…

pavan kumar
- 1
- 1
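A common approach to this is to read everything and shrink the partition count so each output file lands near the HDFS block size. A minimal PySpark sketch, assuming hypothetical HDFS paths and that the total input size is already known (e.g. from hdfs dfs -du -s):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("merge-small-parquet").getOrCreate()

    src = "hdfs:///data/small_files"   # hypothetical input directory
    dst = "hdfs:///data/merged"        # hypothetical output directory

    df = spark.read.parquet(src)

    # Aim for ~128 MB per output file; the total size here is an assumed figure.
    total_size_bytes = 10 * 1024 ** 3          # e.g. 10 GB of small files
    target_file_bytes = 128 * 1024 ** 2
    num_files = max(1, total_size_bytes // target_file_bytes)

    # coalesce() avoids a full shuffle when only reducing the partition count;
    # repartition(num_files) would rebalance more evenly at the cost of a shuffle.
    df.coalesce(num_files).write.mode("overwrite").parquet(dst)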
0
votes
0 answers
Join two large tables (50 GB and 1 billion records)
I have two very large tables which I am loading as DataFrames in parquet format, with one join key. The issues I need help with:
I need to tune the job, as I am getting OOM errors due to Java heap space.
I have to apply a left join.
There will not be any…
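For a join at this scale, the usual levers are the shuffle partition count and repartitioning both sides on the join key before joining. A hedged sketch in which the paths, key name, and numbers are assumptions rather than values from the question:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("large-left-join")
        .config("spark.sql.shuffle.partitions", "2000")   # size to your cluster
        .config("spark.sql.adaptive.enabled", "true")
        .getOrCreate()
    )

    left = spark.read.parquet("hdfs:///data/left_50gb")        # hypothetical paths
    right = spark.read.parquet("hdfs:///data/right_1b_rows")

    # Repartition both sides on the join key so matching rows meet in the same
    # shuffle partition and no single task has to buffer an oversized chunk.
    joined = (
        left.repartition(2000, "join_key")
            .join(right.repartition(2000, "join_key"), on="join_key", how="left")
    )

    joined.write.mode("overwrite").parquet("hdfs:///data/joined")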
0
votes
1 answer
How to improve the performance of Spark repartition with column expressions
I have a performance problem with the repartition and partitionBy operations in Spark.
My DataFrame contains monthly data, and I am partitioning it by day using the dailyDt column. My code is like below.
First attempt
This takes 3 minutes to finish, but many…

gurbux
- 29
- 8
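When writing with partitionBy, repartitioning on the same column first usually reduces both the shuffle pressure and the number of small output files. A sketch under the assumption that df and the dailyDt column are as described in the question (the output path is made up):

    # Repartition by the column that partitionBy will use, so each dailyDt
    # value is written by a small, predictable set of tasks.
    (
        df.repartition("dailyDt")          # or repartition(n, "dailyDt") to cap tasks
          .write
          .partitionBy("dailyDt")
          .mode("overwrite")
          .parquet("hdfs:///data/daily")   # hypothetical output path
    )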
0
votes
1 answer
How to read parquet files using only one thread on a worker/task node?
In Spark, if we execute the following command:
spark.sql("select * from parquet.`/Users/MyUser/TEST/testcompression/part-00009-asdfasdf-e829-421d-b14f-asdfasdf.c000.snappy.parquet`")
.show(5,false)
Spark distributes the read across all threads on a…

sojim2
- 1,245
- 2
- 15
- 38
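One way to keep the read on a single task is to collapse to one partition right after reading; because coalesce(1) is a narrow dependency, the whole scan should be planned as a single task. A PySpark sketch of that idea (the question's own snippet is Scala):

    # Read the single parquet file, then collapse to one partition so the
    # following action executes as a single task.
    df = spark.read.parquet(
        "/Users/MyUser/TEST/testcompression/"
        "part-00009-asdfasdf-e829-421d-b14f-asdfasdf.c000.snappy.parquet"
    )
    df.coalesce(1).show(5, truncate=False)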
0
votes
1 answer
How to choose the optimal repartition value in Spark
I have 3 input files
File1 - 27 GB
File2 - 3 GB
File3 - 12 MB
My cluster configuration
2 executors
Each executor has 2 cores
Executor memory - 13 GB (2 GB overhead)
The transformation that I'm going to perform is a left join, in which the left table is…
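A back-of-the-envelope way to size partitions is to divide each input by the ~128 MB default partition target and sanity-check against the 4 cores available. A sketch of that arithmetic (the target and rounding are assumptions, not a fixed rule):

    # Rough partition counts at ~128 MB per partition.
    sizes_mb = {"File1": 27 * 1024, "File2": 3 * 1024, "File3": 12}
    target_mb = 128

    for name, mb in sizes_mb.items():
        print(name, max(1, round(mb / target_mb)), "partitions")
    # File1 ~216, File2 ~24, File3 ~1

    # With 2 executors x 2 cores = 4 parallel tasks, pick a shuffle partition
    # count that is a multiple of 4 and keeps each task's slice comfortably
    # below executor memory (13 GB minus overhead).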
0
votes
0 answers
Is it still necessary to repartition a Spark DataFrame after enabling AQE?
As I understand it, Spark AQE (Adaptive Query Execution) takes care of DataFrame partitioning dynamically at runtime (when shuffling).
Therefore, do we still need to worry about "manual" repartitioning?
And does the processed DataFrame…

QPeiran
- 1,108
- 1
- 8
- 18
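AQE generally adjusts the number of shuffle partitions after an exchange; it does not change how the input is split on read or how many files a final write produces, so an explicit repartition/coalesce can still matter at those two edges. A sketch of the relevant settings (df and the grouping key are placeholders):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("aqe-demo")
        .config("spark.sql.adaptive.enabled", "true")
        .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
        .getOrCreate()
    )

    # After a shuffle (join/groupBy), AQE coalesces partitions at runtime,
    # so a manual repartition between shuffles is usually unnecessary.
    out = df.groupBy("some_key").count()

    # AQE does not control the final file count; coalesce before writing if
    # output file sizes matter.
    out.coalesce(10).write.mode("overwrite").parquet("hdfs:///data/out")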
0
votes
0 answers
Insert a large amount of data into SQL using the PySpark SQL connector
I have a PySpark job which reads about 1M records from an upstream data source and tries to add them to SQL. I am using PySpark 3.1 with the PySpark SQL connector, and when writing anything over 2K records into SQL it returns an error that the connection…

Eats
- 477
- 1
- 6
- 13
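Connection errors on bulk writes are often caused by too many parallel connections or oversized batches, so one thing to try is fewer write partitions plus an explicit batch size. A sketch assuming the Microsoft Spark connector (com.microsoft.sqlserver.jdbc.spark) and made-up connection details; the option values are guesses to tune, not known-good settings for the question's environment:

    (
        df.repartition(8)                         # fewer concurrent connections
          .write
          .format("com.microsoft.sqlserver.jdbc.spark")
          .mode("append")
          .option("url", "jdbc:sqlserver://myserver:1433;databaseName=mydb")
          .option("dbtable", "dbo.target_table")
          .option("user", "my_user")
          .option("password", "my_password")
          .option("batchsize", "10000")           # rows per bulk batch
          .save()
    )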
0
votes
0 answers
Using repartition in PySpark for a huge data set
I have a huge amount of data in a few Oracle tables (the total size of data in these tables is around 50 GB). I have to perform joins and some calculations, and these tables don't have any partitions created. I need to read this data in…

Sidhant Gupta
- 139
- 14
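Since the Oracle tables have no partitions, the read itself can be parallelised with Spark's JDBC partitioning options, and the data repartitioned on the join key afterwards. A sketch with entirely hypothetical connection details, column names, and bounds:

    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")
        .option("dbtable", "MY_SCHEMA.BIG_TABLE")
        .option("user", "my_user")
        .option("password", "my_password")
        # Split the read into parallel queries over a numeric column.
        .option("partitionColumn", "ID")
        .option("lowerBound", "1")
        .option("upperBound", "100000000")
        .option("numPartitions", "50")            # ~1 GB per slice for ~50 GB
        .load()
    )

    # Repartition on the join key before joining so the shuffle is balanced.
    result = df.repartition(200, "join_key").join(other_df, "join_key", "inner")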