Questions tagged [spark-shuffle]
20 questions
0 votes, 1 answer
Repartition on non-deterministic expression
I want to write code like this:
df.repartition(42, monotonically_increasing_id() / lit(10000))
Is this code going to break something because of the non-deterministic expression in repartition? I understand that this code will turn into HashPartitioning…

evalgor
0 votes, 1 answer
How wide transformations are influenced by shuffle partition config
How do wide transformations actually work based on the shuffle partitions configuration?
If I have the following program:
spark.conf.set("spark.sql.shuffle.partitions", "5")
val df = spark
.read
.option("inferSchema", "true")
.option("header",…

Mandroid
0 votes, 2 answers
Spark NullPointerException: Cannot invoke invalidateSerializedMapOutputStatusCache() because "shuffleStatus" is null
I'm running a simple little Spark 3.3.0 pipeline on Windows 10 using Java 17 and UDFs. I hardly do anything interesting, and now when I run the pipeline on only 30,000 records I'm getting this:
[ERROR] Error in removing shuffle…

Garret Wilson
0 votes, 1 answer
How to decide the number of executors for 1 billion rows in Spark
We have a table which has one billion three hundred and fifty-five million rows.
The table has 20 columns.
We want to join this table with another table which has more or less the same number of rows.
How to decide number of…

Surendiran Balasubramanian
0 votes, 0 answers
How to clear Spark temporary shuffle files between stages to avoid "no space left on device" error?
I am running a Spark job on AWS EMR 6.6 (Spark 3.2.0); however, it seems that Spark is writing a lot of data to disk. I always thought that Spark was all in memory, but it appears that Spark writes temporary files to disk each time there is a wide…

Mattreex