I have a Spark Streaming program that uses updateStateByKey.
When I run it on a cluster of 3 machines, all the updateStateByKey tasks (which are heavy) run on a single machine. This causes a scheduling delay on the input batches while the other machines have idle CPUs.
I changed the number of workers, but all updateStateByKey tasks still run on workers on that one machine. I also tried mapWithState instead of updateStateByKey, but the problem is the same.
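For reference, the mapWithState attempt looked roughly like this. It is a simplified sketch: the mapping function body is a placeholder for my real per-key logic, and the imports assume Spark 2.x (on 1.6 the Optional type is Guava's, not org.apache.spark.api.java.Optional):

import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.streaming.State;
import org.apache.spark.streaming.StateSpec;
import org.apache.spark.streaming.api.java.JavaMapWithStateDStream;
import scala.Tuple2;

// Placeholder for my real per-key update (this is where the heavy work happens)
Function3<String, Optional<String>, State<String>, Tuple2<String, String>> mappingFunc =
        (key, value, state) -> {
            String updated = value.isPresent() ? value.get() : "";
            state.update(updated);
            return new Tuple2<>(key, updated);
        };

// Same partition count (36) as the updateStateByKey version below
JavaMapWithStateDStream<String, String, String, Tuple2<String, String>> stateStream =
        newKeys.mapWithState(StateSpec.function(mappingFunc).numPartitions(36));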
How can I tell Spark to spread these tasks across all of the machines?
// Direct Kafka stream of (key, value) string pairs
JavaPairInputDStream<String, String> streamEvents = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, eventTopic);

// KeyGenerator emits the keys to track; HeavyKeysFunc is the heavy per-key state update
JavaPairDStream<String, String> newKeys = streamEvents.flatMapToPair(new KeyGenerator(zookeeperHosts));
JavaPairDStream<String, String> scenarioKeys = newKeys.updateStateByKey(new HeavyKeysFunc(), 36);
The jobs queue up while the cluster's CPUs stay mostly idle.
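One thing I have not yet confirmed is whether passing an explicit partitioner, instead of just a partition count, changes where the state tasks are scheduled. The sketch below is what I mean (HeavyKeysFunc is my existing update function, 36 matches the count above):

import org.apache.spark.HashPartitioner;

// Sketch: explicit partitioner instead of a bare partition count,
// hoping the 36 state partitions end up spread across all executors
JavaPairDStream<String, String> scenarioKeysPartitioned =
        newKeys.updateStateByKey(new HeavyKeysFunc(), new HashPartitioner(36));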