
I have a DStream containing tuples of type (String, Int).

When I try to call combineByKey on it, the compiler tells me to specify an additional parameter, a Partitioner:

my_dstream.combineByKey(
      (v) => (v,1),
      (acc:(Int, Int), v) => (acc._1 + v, acc._2 + 1),
      (acc1:(Int, Int), acc2:(Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
    )

However, when I use it on an rdd, it works correctly:

 my_dstream.foreachRDD( rdd =>
      rdd.combineByKey(
        (v) => (v,1),
        (acc:(Int, Int), v) => (acc._1 + v, acc._2 + 1),
        (acc1:(Int, Int), acc2:(Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
      ))

Where can I get this partitioner?

Vadym B.
  • Side question: is there any particular reason why you would like to do this instead of `dstream.map(e => (e,1)).reduceByKey(_+_)`? – maasg Apr 03 '16 at 16:03
  • Specifically, I want to calculate multiple values grouped by key. So I need to use `combineByKey` instead of `reduceByKey` – Vadym B. Apr 04 '16 at 17:09

1 Answer


Where can I get this partitioner?

You can create it yourself. Spark ships with two partitioners out of the box: HashPartitioner and RangePartitioner. The former is the default. You can instantiate it via its constructor, passing the desired number of partitions:

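import org.apache.spark.HashPartitioner

// Example value only; specify the number of partitions you want
val numOfPartitions = 4
val hashPartitioner = new HashPartitioner(numOfPartitions)

my_dstream.combineByKey(
  (v) => (v, 1),                                    // first value for a key: (sum, count)
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1), // fold another value into (sum, count)
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2),
  hashPartitioner                                   // required by the DStream overload
)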
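
If you have no specific partition count in mind, one option is to derive it from the SparkContext's default parallelism, or to hand each batch to the RDD API through `transform` and let the RDD overload of `combineByKey` pick a default partitioner. The following is only a sketch under assumptions: the helper names `sumAndCount` and `sumAndCountViaTransform` are hypothetical, the input is assumed to be a `DStream[(String, Int)]` as in the question, and Spark 1.3 or later is assumed so that the pair-function implicits are found without extra imports.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper: per-key (sum, count) for every batch,
// with an explicit HashPartitioner sized from defaultParallelism.
def sumAndCount(pairs: DStream[(String, Int)]): DStream[(String, (Int, Int))] = {
  // Assumption: the cluster's default parallelism is an acceptable partition count.
  val partitioner =
    new HashPartitioner(pairs.context.sparkContext.defaultParallelism)

  pairs.combineByKey[(Int, Int)](
    (v: Int) => (v, 1),                                        // create combiner
    (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),     // merge a value
    (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2), // merge combiners
    partitioner
  )
}

// Hypothetical alternative: reuse the RDD overload that takes no partitioner.
// A partitioner is still used; Spark chooses it for each batch RDD.
def sumAndCountViaTransform(pairs: DStream[(String, Int)]): DStream[(String, (Int, Int))] =
  pairs.transform { (rdd: RDD[(String, Int)]) =>
    rdd.combineByKey[(Int, Int)](
      (v: Int) => (v, 1),
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),
      (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)
    )
  }
```

Either function could be called as `sumAndCount(my_dstream)`. From the resulting (sum, count) pairs, an average per key could then be computed with something like `mapValues { case (sum, count) => sum.toDouble / count }`.
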
Yuval Itzchakov
  • I don't know how many partitions I should use. So: 1) can I avoid using a partitioner on a DStream? 2) if not, what should I base the number of partitions on? – Vadym B. Apr 04 '16 at 17:05
  • @Vadym No, you can't avoid the partitioner. [This blog](http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/) has a nice ballpark figure of how to partition your data – Yuval Itzchakov Apr 04 '16 at 17:16