I have a Spark Streaming program that uses updateStateByKey.
When I run it on a cluster of 3 machines, all the updateStateByKey tasks (which are heavy) run on a single machine. This causes a scheduling delay on the input batches while the other machines have idle CPUs.
I changed the number of workers, but all updateStateByKey tasks still run on workers on that one machine. I also tried mapWithState instead of updateStateByKey, but the problem is the same.
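For reference, the mapWithState attempt looked roughly like this. It is a simplified sketch: the mapping function body is a placeholder for my real per-key logic, and the imports assume Spark 2.x (on 1.6 the Optional type is Guava's, not org.apache.spark.api.java.Optional):

import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.streaming.State;
import org.apache.spark.streaming.StateSpec;
import org.apache.spark.streaming.api.java.JavaMapWithStateDStream;
import scala.Tuple2;

// Placeholder for my real per-key update (this is where the heavy work happens)
Function3<String, Optional<String>, State<String>, Tuple2<String, String>> mappingFunc =
        (key, value, state) -> {
            String updated = value.isPresent() ? value.get() : "";
            state.update(updated);
            return new Tuple2<>(key, updated);
        };

// Same partition count (36) as the updateStateByKey version below
JavaMapWithStateDStream<String, String, String, Tuple2<String, String>> stateStream =
        newKeys.mapWithState(StateSpec.function(mappingFunc).numPartitions(36));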
How can I tell Spark to spread these tasks across all of the machines?
// Direct Kafka stream of (key, value) string pairs
JavaPairInputDStream<String, String> streamEvents = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, eventTopic);

// KeyGenerator emits the keys to track; HeavyKeysFunc is the heavy per-key state update
JavaPairDStream<String, String> newKeys = streamEvents.flatMapToPair(new KeyGenerator(zookeeperHosts));
JavaPairDStream<String, String> scenarioKeys = newKeys.updateStateByKey(new HeavyKeysFunc(), 36);
The jobs queue up while the cluster's CPUs stay mostly idle.
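One thing I have not yet confirmed is whether passing an explicit partitioner, instead of just a partition count, changes where the state tasks are scheduled. The sketch below is what I mean (HeavyKeysFunc is my existing update function, 36 matches the count above):

import org.apache.spark.HashPartitioner;

// Sketch: explicit partitioner instead of a bare partition count,
// hoping the 36 state partitions end up spread across all executors
JavaPairDStream<String, String> scenarioKeysPartitioned =
        newKeys.updateStateByKey(new HeavyKeysFunc(), new HashPartitioner(36));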