I have a use case where the input stream data is skewed: the volume can range from 0 to 50,000 events per batch. Each record is independent of the others, so to avoid the shuffle caused by a fixed repartitioning I want some kind of dynamic repartitioning based on the batch size. However, I cannot get the size of the current batch using DStream count (it returns a new DStream rather than a number).
My use case is simple: an unknown volume of data comes into the Spark Streaming process, and I want to process it in parallel and save it to a text file. To get that parallelism I am currently using repartition, which has introduced a shuffle, and I want to avoid the shuffle that repartition causes.
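The kind of sizing logic I have in mind is sketched below. This is only an illustration, not working code: the `partitions_for` helper, the 5,000 records-per-partition target, and the cap of 32 partitions are all placeholder choices of mine.

```python
def partitions_for(batch_size, target_per_partition=5000, max_partitions=32):
    """Pick a partition count roughly proportional to the batch size.

    batch_size           -- number of records in the current micro-batch
    target_per_partition -- desired records per partition (placeholder value)
    max_partitions       -- upper bound so tiny batches don't create many empty files
    """
    if batch_size <= 0:
        return 1
    # ceil(batch_size / target_per_partition), clamped to [1, max_partitions]
    wanted = -(-batch_size // target_per_partition)
    return min(max(wanted, 1), max_partitions)

# How this would plug into the stream (PySpark pseudocode, untested):
#   def save_batch(rdd):
#       n = rdd.count()  # an action on this batch's RDD, unlike dstream.count()
#       if n > 0:
#           rdd.coalesce(partitions_for(n)).saveAsTextFile(output_path)
#   dstream.foreachRDD(save_batch)
#
# Note: coalesce (with shuffle=False, the default) only merges partitions and so
# avoids a full shuffle, but it cannot increase the partition count beyond the
# RDD's current number of partitions.

print(partitions_for(0))       # 1
print(partitions_for(50000))   # 10
```

The trade-off here is that `rdd.count()` itself triggers a Spark job per batch, so the batch is evaluated twice unless the RDD is cached first.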
What is the recommended approach for handling skewed input volume in a Spark Streaming application?