combineByKey on a DStream throws an error

Question

I have a DStream of (String, Int) tuples.

When I try combineByKey on it, I am told to specify a parameter: Partitioner

my_dstream.combineByKey(
      (v) => (v,1),
      (acc:(Int, Int), v) => (acc._1 + v, acc._2 + 1),
      (acc1:(Int, Int), acc2:(Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
    )

However, when I use it on an RDD, it works correctly:

 my_dstream.foreachRDD( rdd =>
      rdd.combineByKey(
        (v) => (v,1),
        (acc:(Int, Int), v) => (acc._1 + v, acc._2 + 1),
        (acc1:(Int, Int), acc2:(Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
      ))

Where can I get this partitioner?

side question: any particular reason why you would like to do this instead of `dstream.map(e => (e,1)).reduceByKey(_+_)` ? — maasg, Apr 03 '16 at 16:03
Specifically, I want to calculate multiple values grouped by key. So I need to use `combineByKey` instead of `reduceByKey` — Vadym B., Apr 04 '16 at 17:09

Yuval Itzchakov · Accepted Answer · 2016-04-03T11:25:49.587


Where can I get this partitioner?

You can create it yourself. Spark ships with two partitioners: HashPartitioner and RangePartitioner. The default is the former. Instantiate it via its constructor, passing the desired number of partitions:

import org.apache.spark.HashPartitioner

val numOfPartitions: Int = ??? // specify the amount you want
val hashPartitioner = new HashPartitioner(numOfPartitions)

// The DStream overload takes the partitioner as a fourth argument
my_dstream.combineByKey(
  (v) => (v,1),
  (acc:(Int, Int), v) => (acc._1 + v, acc._2 + 1),
  (acc1:(Int, Int), acc2:(Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2),
  hashPartitioner
)
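To make it clearer what the three functions passed to combineByKey compute, here is a minimal sketch in plain Scala, with no Spark involved. It only imitates the semantics for one batch split into two "partitions"; the object name and the helpers `combinePartition` and `mergePartitions` are hypothetical names introduced for illustration and are not part of Spark.

```scala
// Plain Scala sketch (no Spark): imitates what combineByKey computes per key.
object CombineByKeySketch {
  type Acc = (Int, Int) // (running sum, running count)

  // The same three functions as in the question
  val createCombiner: Int => Acc = v => (v, 1)
  val mergeValue: (Acc, Int) => Acc = (acc, v) => (acc._1 + v, acc._2 + 1)
  val mergeCombiners: (Acc, Acc) => Acc = (a, b) => (a._1 + b._1, a._2 + b._2)

  // Hypothetical helper: fold the records of one partition into per-key accumulators
  def combinePartition(records: Seq[(String, Int)]): Map[String, Acc] =
    records.foldLeft(Map.empty[String, Acc]) { case (m, (k, v)) =>
      m.updated(k, m.get(k).map(mergeValue(_, v)).getOrElse(createCombiner(v)))
    }

  // Hypothetical helper: merge per-partition results, as would happen after a shuffle
  def mergePartitions(parts: Seq[Map[String, Acc]]): Map[String, Acc] =
    parts.flatten.foldLeft(Map.empty[String, Acc]) { case (m, (k, acc)) =>
      m.updated(k, m.get(k).map(mergeCombiners(_, acc)).getOrElse(acc))
    }

  def main(args: Array[String]): Unit = {
    val partition1 = Seq("a" -> 1, "b" -> 4, "a" -> 3)
    val partition2 = Seq("a" -> 5, "b" -> 6)
    val combined =
      mergePartitions(Seq(combinePartition(partition1), combinePartition(partition2)))
    // Several values per key can be derived from the one accumulator, e.g. the average
    val averages = combined.map { case (k, (sum, count)) => k -> sum.toDouble / count }
    println(combined)
    println(averages)
  }
}
```

For the sample data, key "a" ends up with the accumulator (9, 3) and key "b" with (10, 2), so the averages are 3.0 and 5.0. In Spark, the partitioner decides which partition each key's accumulators are sent to before the final merge.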

edited Apr 03 '16 at 11:25

answered Apr 01 '16 at 15:53


I don't know how many partitions I should use. So: 1) can I avoid using a partitioner on a DStream? 2) if not, what should I base the number of partitions on? – Vadym B. Apr 04 '16 at 17:05

@Vadym No, you can't avoid the partitioner. [This blog](http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/) has a nice ballpark figure of how to partition your data – Yuval Itzchakov Apr 04 '16 at 17:16
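On the partition-count question in the comments above, one possible starting point (an assumption, not a tuning recommendation) is to size the HashPartitioner from the SparkContext's `defaultParallelism`. As far as I can tell, this mirrors what the DStream by-key operations such as `reduceByKey` do internally when no partitioner is passed. The object and method name `SumAndCount.sumAndCount` below are hypothetical, chosen for illustration; the sketch assumes a Spark Streaming dependency on the classpath.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.dstream.DStream

object SumAndCount {
  // Sketch: per-key (sum, count) for each batch of a DStream[(String, Int)]
  def sumAndCount(stream: DStream[(String, Int)]): DStream[(String, (Int, Int))] = {
    // Assumption: the application's default parallelism is an acceptable partition count
    val partitioner =
      new HashPartitioner(stream.context.sparkContext.defaultParallelism)

    stream.combineByKey[(Int, Int)](
      (v: Int) => (v, 1),
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),
      (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2),
      partitioner
    )
  }
}
```

Usage would be `SumAndCount.sumAndCount(my_dstream)`. If the default turns out to be a poor fit for the data volume, the partition count can be replaced with a tuned value along the lines of the linked blog post.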
