
For a shuffle operation, I see that the data processed by the cores of the same executor is not balanced, and of course the core that takes the longest slows down the whole job.

So I would like to know whether some change is possible so that the data is shared equally between the cores.

I use Spark 2.4 on AWS EMR with S3.
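To make the imbalance concrete, here is a minimal pure-Python sketch (not Spark code) of how hash partitioning sends every record of one frequent key to the same partition, and how appending a salt to the key can spread those records out. The key names, the partition count, the salting factor, and the use of `zlib.crc32` as a stand-in for the partitioner's hash are all assumptions for illustration, not taken from my job.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # assumed number of shuffle partitions (one per core)
NUM_SALTS = 8       # assumed salting factor


def partition_of(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable stand-in for a hash partitioner: crc32(key) mod num_partitions."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys) -> list:
    """Count how many records land in each partition."""
    counts = Counter(partition_of(k) for k in keys)
    return [counts.get(p, 0) for p in range(NUM_PARTITIONS)]


# Hypothetical skewed data: one hot key plus a few rare keys.
keys = ["hot"] * 1000 + [f"rare{i}" for i in range(40)]

# Without salting, all 1000 "hot" records share one partition.
plain = partition_sizes(keys)

# With salting, "hot" becomes "hot#0" .. "hot#7", which hash independently.
salted = partition_sizes(f"{k}#{i % NUM_SALTS}" for i, k in enumerate(keys))

print("unsalted:", plain)
print("salted:  ", salted)
```

In an actual Spark job, the salt would have to be removed again in a second aggregation step (or the other side of a join replicated per salt value), so this only illustrates the idea and is not a drop-in fix.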


thebluephantom
mingzhao.pro
  • There are plenty of solutions or workarounds, depending on what you're trying to do. Can you post a code snippet so we can discuss it? Basically, you're dealing with skew caused by some shuffle operation. – baitmbarek Nov 23 '19 at 21:21
  • @baitmbarek Hi, thank you for your response. I'm new to the project and it is a huge one. I'm not sure I can post part of the code, nor am I sure which part causes the shuffle and the skew; many transformations are chained together. Could you suggest some solutions that I can try on my side and then get back to you? – mingzhao.pro Nov 24 '19 at 20:31

0 Answers