
I can calculate the mean word length per starting letter for the Spark collection

val animals23 = sc.parallelize(List(("a","ant"), ("c","crocodile"), ("c","cheetah"), ("c","cat"), ("d","dolphin"), ("d","dog"), ("g","gnu"), ("l","leopard"), ("l","lion"), ("s","spider"), ("t","tiger"), ("w","whale")), 2)

either with

animals23.
    aggregateByKey((0,0))(
        (x, y) => (x._1 + y.length, x._2 + 1),
        (x, y) => (x._1 + y._1, x._2 + y._2)
    ).
    map(x => (x._1, x._2._1.toDouble / x._2._2.toDouble)).
    collect

or with

animals23.
    combineByKey(
        (x:String) => (x.length,1),
        (x:(Int, Int), y:String) => (x._1 + y.length, x._2 + 1),
        (x:(Int, Int), y:(Int, Int)) => (x._1 + y._1, x._2 + y._2)
    ).
    map(x => (x._1, x._2._1.toDouble / x._2._2.toDouble)).
    collect

each resulting in

Array((a,3.0), (c,6.333333333333333), (d,5.0), (g,3.0), (l,5.5), (w,5.0), (s,6.0), (t,5.0))

What I do not understand: Why am I required to explicitly state the types in the functions in the second example while the first example's functions can do without?

I am talking about

(x, y) => (x._1 + y.length, x._2 + 1),
(x, y) => (x._1 + y._1, x._2 + y._2)

vs

(x:(Int, Int), y:String) => (x._1 + y.length, x._2 + 1),
(x:(Int, Int), y:(Int, Int)) => (x._1 + y._1, x._2 + y._2)

This might be more of a Scala question than a Spark question.

Yuval Itzchakov
Make42

1 Answer


Why am I required to explicitly state the types in the functions in the second example while the first example's functions can do without?

Because in the first example, the compiler is able to infer the type of seqOp from the first argument list supplied. aggregateByKey is curried, i.e. it takes several parameter lists:

def aggregateByKey[U](zeroValue: U)
                     (seqOp: (U, V) ⇒ U, 
                      combOp: (U, U) ⇒ U)
                     (implicit arg0: ClassTag[U]): RDD[(K, U)]

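Scala's type inference proceeds one parameter list at a time, so types fixed by the first argument list are available when the compiler checks the second. In the first example, the zero value (0,0) fixes U as (Int, Int), and V is already known to be String from the RDD. The compiler therefore knows that seqOp is a function ((Int, Int), String) => (Int, Int) and that combOp is a function ((Int, Int), (Int, Int)) => (Int, Int).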
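The same behaviour can be reproduced without Spark. Below is a minimal sketch on a plain List; `aggregateByKeyLocal` and `CurriedInferenceSketch` are hypothetical names made up for illustration and are not part of Spark's API. Because the zero value has its own parameter list, the lambda needs no annotations:

```scala
object CurriedInferenceSketch {
  // Hypothetical local stand-in for aggregateByKey (not a Spark method).
  // zeroValue sits in its own parameter list, so U is fixed before
  // the compiler type-checks seqOp in the next list.
  def aggregateByKeyLocal[K, V, U](pairs: List[(K, V)])(zeroValue: U)(seqOp: (U, V) => U): Map[K, U] =
    pairs.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
      acc.updated(k, seqOp(acc.getOrElse(k, zeroValue), v))
    }

  val animals = List(("c", "crocodile"), ("c", "cheetah"), ("c", "cat"), ("d", "dolphin"), ("d", "dog"))

  // No annotations on x and y: U = (Int, Int) comes from (0, 0),
  // V = String comes from the element type of animals.
  val sums = aggregateByKeyLocal(animals)((0, 0))((x, y) => (x._1 + y.length, x._2 + 1))

  def main(args: Array[String]): Unit =
    println(sums) // (total length, word count) per starting letter
}
```

Dividing the total length by the count for "c" gives 19 / 3, which matches the 6.333... in the question's output.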

In contrast, combineByKey has only a single argument list:

combineByKey[C](createCombiner: (V) ⇒ C, 
                mergeValue: (C, V) ⇒ C, 
                mergeCombiners: (C, C) ⇒ C): RDD[(K, C)] 

Without explicit types, the compiler cannot infer the type parameter C from within that single list, so it cannot determine the types of x and y.

To help the compiler, you can specify the type argument explicitly:

animals23
  .combineByKey[(Int, Int)](x => (x.length,1), 
                           (x, y) => (x._1 + y.length, x._2 + 1),
                           (x, y) => (x._1 + y._1, x._2 + y._2))
  .map(x => (x._1, x._2._1.toDouble / x._2._2.toDouble))
  .collect
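For comparison, here is a Spark-free sketch of the single-parameter-list case. `LocalValues` and its `combine` method are hypothetical, shaped loosely like combineByKey (without keys or mergeCombiners) purely for illustration. It shows the two workarounds discussed here: annotating the lambda parameters, or supplying the type argument C:

```scala
object SingleListInferenceSketch {
  // Hypothetical class: V is fixed by the receiver (as with an RDD),
  // while C appears only inside one shared parameter list.
  final class LocalValues[V](values: List[V]) {
    def combine[C](createCombiner: V => C,
                   mergeValue: (C, V) => C): Option[C] =
      values match {
        case Nil          => None
        case head :: tail => Some(tail.foldLeft(createCombiner(head))(mergeValue))
      }
  }

  val words = new LocalValues(List("crocodile", "cheetah", "cat"))

  // Workaround 1: annotate the lambda parameters, as in the question.
  val annotated = words.combine(
    (w: String) => (w.length, 1),
    (acc: (Int, Int), w: String) => (acc._1 + w.length, acc._2 + 1))

  // Workaround 2: pass C explicitly, so the lambdas need no annotations.
  val explicit = words.combine[(Int, Int)](
    w => (w.length, 1),
    (acc, w) => (acc._1 + w.length, acc._2 + 1))

  def main(args: Array[String]): Unit = {
    println(annotated)
    println(explicit)
  }
}
```

Both calls compute the same (total length, count) pair; the only difference is where the compiler gets C from.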
Yuval Itzchakov
  • Couldn't the compiler work out from `animals23` that the `x` of `createCombiner` is a String, and thus that its result (with `x.length`) must be an Int, and infer the types for the other two functions from there? How is currying (from the compiler's perspective, not the programmer's) different from a single argument list? I don't see why the compiler has been limited this way. – Make42 May 18 '16 at 21:38
  • @Make42 Theoretically yes, the compiler could infer type `C` from the previously provided function, but that's not how Scala's local type inference works. Type information flows across parameter lists and not within parameter lists. You can read a bit on that in [Programming in Scala](https://books.google.co.il/books?id=MFjNhTjeQKkC&pg=PA325&lpg=PA325&dq=curried+type+inference+scala&source=bl&ots=FMrhYEMPmv&sig=jGB32Uu-VTdrJ1hziUOmu-IsPTo&hl=iw&sa=X&ved=0ahUKEwjIu6PW0uTMAhXMJMAKHUYBB5EQ6AEIMzAH#v=onepage&q=curried%20type%20inference%20scala&f=false). – Yuval Itzchakov May 18 '16 at 22:31
  • @Make42 Also see [this ticket](https://issues.scala-lang.org/plugins/servlet/mobile#issue/SI-3293) for a rather weak explanation. The usual answer is "it's complicated". – Yuval Itzchakov May 18 '16 at 22:32
  • Well, for now I guess your comment "Type information flows across parameter lists and not within parameter lists" is good enough for me. – Make42 May 19 '16 at 11:07