I have a use case where the input stream data is skewed: the volume can range from 0 to 50,000 events per batch. Each record is independent of the others, so to avoid the shuffle caused by a fixed repartitioning I want some kind of dynamic repartitioning based on the batch size. However, I cannot get the size of the current batch using DStream count (it returns a new DStream rather than a number).
My use case is simple: an unknown volume of data comes into the Spark Streaming process, and I want to process it in parallel and save it to a text file. To get that parallelism I am currently using repartition, which has introduced a shuffle, and I want to avoid the shuffle that repartition causes.
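The kind of sizing logic I have in mind is sketched below. This is only an illustration, not working code: the `partitions_for` helper, the 5,000 records-per-partition target, and the cap of 32 partitions are all placeholder choices of mine.

```python
def partitions_for(batch_size, target_per_partition=5000, max_partitions=32):
    """Pick a partition count roughly proportional to the batch size.

    batch_size           -- number of records in the current micro-batch
    target_per_partition -- desired records per partition (placeholder value)
    max_partitions       -- upper bound so tiny batches don't create many empty files
    """
    if batch_size <= 0:
        return 1
    # ceil(batch_size / target_per_partition), clamped to [1, max_partitions]
    wanted = -(-batch_size // target_per_partition)
    return min(max(wanted, 1), max_partitions)

# How this would plug into the stream (PySpark pseudocode, untested):
#   def save_batch(rdd):
#       n = rdd.count()  # an action on this batch's RDD, unlike dstream.count()
#       if n > 0:
#           rdd.coalesce(partitions_for(n)).saveAsTextFile(output_path)
#   dstream.foreachRDD(save_batch)
#
# Note: coalesce (with shuffle=False, the default) only merges partitions and so
# avoids a full shuffle, but it cannot increase the partition count beyond the
# RDD's current number of partitions.

print(partitions_for(0))       # 1
print(partitions_for(50000))   # 10
```

The trade-off here is that `rdd.count()` itself triggers a Spark job per batch, so the batch is evaluated twice unless the RDD is cached first.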
What is the recommended approach for handling skewed input volume in a Spark Streaming application?