Questions tagged [apache-spark-sql-repartition]
23 questions
0
votes
1 answer
Spark repartition issue for file size
Need to merge small parquet files.
I have multiple small parquet files in HDFS.
I would like to combine those parquet files into files of nearly 128 MB each.
So I read all the files using spark.read(),
did repartition() on that, and wrote the result to HDFS…

pavan kumar
- 1
- 1
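A common approach to this is to read everything and shrink the partition count so each output file lands near the HDFS block size. A minimal PySpark sketch, assuming hypothetical HDFS paths and that the total input size is already known (e.g. from hdfs dfs -du -s):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("merge-small-parquet").getOrCreate()

    src = "hdfs:///data/small_files"   # hypothetical input directory
    dst = "hdfs:///data/merged"        # hypothetical output directory

    df = spark.read.parquet(src)

    # Aim for ~128 MB per output file; the total size here is an assumed figure.
    total_size_bytes = 10 * 1024 ** 3          # e.g. 10 GB of small files
    target_file_bytes = 128 * 1024 ** 2
    num_files = max(1, total_size_bytes // target_file_bytes)

    # coalesce() avoids a full shuffle when only reducing the partition count;
    # repartition(num_files) would rebalance more evenly at the cost of a shuffle.
    df.coalesce(num_files).write.mode("overwrite").parquet(dst)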
0
votes
0 answers
Join two large tables (50 GB and 1 billion records)
I have two very large tables which I am loading as DataFrames in parquet format, with one join key. The issues I need help with:
I need to tune the job, as I am getting OOM errors due to Java heap space.
I have to apply a left join.
There will not be any…
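For a join at this scale, the usual levers are the shuffle partition count and repartitioning both sides on the join key before joining. A hedged sketch in which the paths, key name, and numbers are assumptions rather than values from the question:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("large-left-join")
        .config("spark.sql.shuffle.partitions", "2000")   # size to your cluster
        .config("spark.sql.adaptive.enabled", "true")
        .getOrCreate()
    )

    left = spark.read.parquet("hdfs:///data/left_50gb")        # hypothetical paths
    right = spark.read.parquet("hdfs:///data/right_1b_rows")

    # Repartition both sides on the join key so matching rows meet in the same
    # shuffle partition and no single task has to buffer an oversized chunk.
    joined = (
        left.repartition(2000, "join_key")
            .join(right.repartition(2000, "join_key"), on="join_key", how="left")
    )

    joined.write.mode("overwrite").parquet("hdfs:///data/joined")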
0
votes
1 answer
How to improve the performance of Spark repartition with column expressions
I have a performance problem with the repartition and partitionBy operations in Spark.
My DataFrame contains monthly data, and I am partitioning it by day using the dailyDt column. My code is like below.
First attempt
This takes 3 minutes to finish, but many…

gurbux
- 29
- 8
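When writing with partitionBy, repartitioning on the same column first usually reduces both the shuffle pressure and the number of small output files. A sketch under the assumption that df and the dailyDt column are as described in the question (the output path is made up):

    # Repartition by the column that partitionBy will use, so each dailyDt
    # value is written by a small, predictable set of tasks.
    (
        df.repartition("dailyDt")          # or repartition(n, "dailyDt") to cap tasks
          .write
          .partitionBy("dailyDt")
          .mode("overwrite")
          .parquet("hdfs:///data/daily")   # hypothetical output path
    )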
0
votes
1 answer
How to read parquet files using only one thread on a worker/task node?
In Spark, if we execute the following command:
spark.sql("select * from parquet.`/Users/MyUser/TEST/testcompression/part-00009-asdfasdf-e829-421d-b14f-asdfasdf.c000.snappy.parquet`")
.show(5,false)
Spark distributes the read across all threads on a…

sojim2
- 1,245
- 2
- 15
- 38
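One way to keep the read on a single task is to collapse to one partition right after reading; because coalesce(1) is a narrow dependency, the whole scan should be planned as a single task. A PySpark sketch of that idea (the question's own snippet is Scala):

    # Read the single parquet file, then collapse to one partition so the
    # following action executes as a single task.
    df = spark.read.parquet(
        "/Users/MyUser/TEST/testcompression/"
        "part-00009-asdfasdf-e829-421d-b14f-asdfasdf.c000.snappy.parquet"
    )
    df.coalesce(1).show(5, truncate=False)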
0
votes
1 answer
How to choose the optimal repartition value in Spark
I have 3 input files
File1 - 27 GB
File2 - 3 GB
File3 - 12 MB
My cluster configuration
2 executors
Each executor has 2 cores
Executor memory - 13 GB (2 GB overhead)
The transformation that I'm going to perform is a left join, in which the left table is…
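A back-of-the-envelope way to size partitions is to divide each input by the ~128 MB default partition target and sanity-check against the 4 cores available. A sketch of that arithmetic (the target and rounding are assumptions, not a fixed rule):

    # Rough partition counts at ~128 MB per partition.
    sizes_mb = {"File1": 27 * 1024, "File2": 3 * 1024, "File3": 12}
    target_mb = 128

    for name, mb in sizes_mb.items():
        print(name, max(1, round(mb / target_mb)), "partitions")
    # File1 ~216, File2 ~24, File3 ~1

    # With 2 executors x 2 cores = 4 parallel tasks, pick a shuffle partition
    # count that is a multiple of 4 and keeps each task's slice comfortably
    # below executor memory (13 GB minus overhead).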
0
votes
0 answers
Is it still necessary to repartition a Spark DataFrame after enabling AQE?
As I understand it, Spark AQE (Adaptive Query Execution) takes care of DataFrame partitioning dynamically at runtime (when shuffling).
Therefore, do we still need to worry about "manual" repartitioning?
And does the processed DataFrame…

QPeiran
- 1,108
- 1
- 8
- 18
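AQE generally adjusts the number of shuffle partitions after an exchange; it does not change how the input is split on read or how many files a final write produces, so an explicit repartition/coalesce can still matter at those two edges. A sketch of the relevant settings (df and the grouping key are placeholders):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("aqe-demo")
        .config("spark.sql.adaptive.enabled", "true")
        .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
        .getOrCreate()
    )

    # After a shuffle (join/groupBy), AQE coalesces partitions at runtime,
    # so a manual repartition between shuffles is usually unnecessary.
    out = df.groupBy("some_key").count()

    # AQE does not control the final file count; coalesce before writing if
    # output file sizes matter.
    out.coalesce(10).write.mode("overwrite").parquet("hdfs:///data/out")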
0
votes
0 answers
Insert a large amount of data into SQL using the PySpark SQL connector
I have a PySpark job which reads about 1M records from an upstream data source and tries to add them to SQL. I am using PySpark 3.1 with the PySpark SQL connector, and when writing anything over 2K records into SQL it returns an error that the connection…

Eats
- 477
- 1
- 6
- 13
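Connection errors on bulk writes are often caused by too many parallel connections or oversized batches, so one thing to try is fewer write partitions plus an explicit batch size. A sketch assuming the Microsoft Spark connector (com.microsoft.sqlserver.jdbc.spark) and made-up connection details; the option values are guesses to tune, not known-good settings for the question's environment:

    (
        df.repartition(8)                         # fewer concurrent connections
          .write
          .format("com.microsoft.sqlserver.jdbc.spark")
          .mode("append")
          .option("url", "jdbc:sqlserver://myserver:1433;databaseName=mydb")
          .option("dbtable", "dbo.target_table")
          .option("user", "my_user")
          .option("password", "my_password")
          .option("batchsize", "10000")           # rows per bulk batch
          .save()
    )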
0
votes
0 answers
Using repartition in PySpark for a huge data set
I have a huge amount of data in a few Oracle tables (the total size of data in these tables is around 50 GB). I have to perform joins and some calculations, and these tables don't have any partitions created. I need to read this data in…

Sidhant Gupta
- 139
- 14
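Since the Oracle tables have no partitions, the read itself can be parallelised with Spark's JDBC partitioning options, and the data repartitioned on the join key afterwards. A sketch with entirely hypothetical connection details, column names, and bounds:

    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")
        .option("dbtable", "MY_SCHEMA.BIG_TABLE")
        .option("user", "my_user")
        .option("password", "my_password")
        # Split the read into parallel queries over a numeric column.
        .option("partitionColumn", "ID")
        .option("lowerBound", "1")
        .option("upperBound", "100000000")
        .option("numPartitions", "50")            # ~1 GB per slice for ~50 GB
        .load()
    )

    # Repartition on the join key before joining so the shuffle is balanced.
    result = df.repartition(200, "join_key").join(other_df, "join_key", "inner")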