I have a Spark Streaming program that uses updateStateByKey. When I run it on a cluster of 3 machines, all of the updateStateByKey tasks (which are heavy) run on a single machine. This causes scheduling delay on the input batches while the CPUs of the other machines sit idle.

I changed the number of workers, but all updateStateByKey tasks still run on the workers of one specific machine. I also tried using mapWithState instead of updateStateByKey, but I have the same problem.
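Roughly, the mapWithState variant was wired like this (a sketch only, assuming the Spark 2.x Java API with org.apache.spark.api.java.Optional; the state type is simplified to String and the heavy per-key logic is elided):

    import org.apache.spark.api.java.Optional;
    import org.apache.spark.api.java.function.Function3;
    import org.apache.spark.streaming.State;
    import org.apache.spark.streaming.StateSpec;
    import org.apache.spark.streaming.api.java.JavaMapWithStateDStream;
    import scala.Tuple2;

    // Mapping function: updates the per-key state and emits (key, newState).
    // The heavy logic that lives in HeavyKeysFunc would go where the comment is.
    Function3<String, Optional<String>, State<String>, Tuple2<String, String>> mappingFunc =
            (key, value, state) -> {
                String newState = value.isPresent() ? value.get() : ""; // ...heavy update logic here
                state.update(newState);
                return new Tuple2<>(key, newState);
            };

    JavaMapWithStateDStream<String, String, String, Tuple2<String, String>> scenarioKeysWithState =
            newKeys.mapWithState(StateSpec.function(mappingFunc).numPartitions(36));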

How can I tell Spark to distribute these tasks across all of the machines?

    // Direct Kafka stream of (key, message) pairs from the event topic
    JavaPairInputDStream<String, String> streamEvents = KafkaUtils.createDirectStream(
            jssc, String.class, String.class, StringDecoder.class, StringDecoder.class, kafkaParams, eventTopic
    );

    // Generate the (key, value) pairs that feed the stateful step
    JavaPairDStream<String, String> newKeys = streamEvents.flatMapToPair(new KeyGenerator(zookeeperHosts));

    // Heavy per-key state update; 36 is the number of partitions for the state RDD
    JavaPairDStream<String, String> scenarioKeys =
            newKeys.updateStateByKey(new HeavyKeysFunc(), 36);

The jobs queue up while the CPUs stay mostly idle.
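For concreteness, this is the kind of change I am asking about (a sketch only; spark.locality.wait and the Partitioner overload of updateStateByKey are standard Spark APIs, but I don't know whether they actually address this skew):

    import org.apache.spark.HashPartitioner;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.api.java.JavaPairDStream;

    // Relax data-locality waiting so the scheduler is willing to run tasks
    // on executors that do not hold the cached state partitions.
    SparkConf conf = new SparkConf()
            .set("spark.locality.wait", "0s")
            .set("spark.default.parallelism", "36");

    // Pass an explicit partitioner so the state is hash-partitioned into 36
    // partitions instead of inheriting the input stream's partitioning.
    JavaPairDStream<String, String> scenarioKeys =
            newKeys.updateStateByKey(new HeavyKeysFunc(), new HashPartitioner(36));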
