I am able to calculate the mean word length per starting letter for the Spark RDD
val animals23 = sc.parallelize(List(("a", "ant"), ("c", "crocodile"), ("c", "cheetah"), ("c", "cat"), ("d", "dolphin"), ("d", "dog"), ("g", "gnu"), ("l", "leopard"), ("l", "lion"), ("s", "spider"), ("t", "tiger"), ("w", "whale")), 2)
either with
animals23.
  aggregateByKey((0, 0))(
    (x, y) => (x._1 + y.length, x._2 + 1),   // seqOp: fold a word into the (total length, count) accumulator
    (x, y) => (x._1 + y._1, x._2 + y._2)     // combOp: merge two accumulators
  ).
  map(x => (x._1, x._2._1.toDouble / x._2._2.toDouble)).   // total length / count = mean
  collect
or with
animals23.
  combineByKey(
    (x: String) => (x.length, 1),                                   // createCombiner
    (x: (Int, Int), y: String) => (x._1 + y.length, x._2 + 1),      // mergeValue
    (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2)    // mergeCombiners
  ).
  map(x => (x._1, x._2._1.toDouble / x._2._2.toDouble)).   // total length / count = mean
  collect
each resulting in
Array((a,3.0), (c,6.333333333333333), (d,5.0), (g,3.0), (l,5.5), (w,5.0), (s,6.0), (t,5.0))
What I do not understand is why I have to state the parameter types explicitly in the functions of the second example, while the functions in the first example can do without them.
I am talking about
  (x, y) => (x._1 + y.length, x._2 + 1),
  (x, y) => (x._1 + y._1, x._2 + y._2)
vs
  (x: (Int, Int), y: String) => (x._1 + y.length, x._2 + 1),
  (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2)
This might be more of a Scala question than a Spark one.
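For reference, these are the simplified signatures as I read them in Spark's PairRDDFunctions (paraphrased; ClassTag bounds and the partitioner/numPartitions overloads are left out), so please correct me if I misquote them:

// Paraphrased from org.apache.spark.rdd.PairRDDFunctions; K and V are the
// RDD's key and value types, i.e. String and String for animals23.

// aggregateByKey: the zero value sits in its own parameter list,
// the two functions come in a second, separate parameter list.
def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]

// combineByKey: all three functions share a single parameter list.
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]

The most obvious difference I can see is how the parameter lists are split, but I do not know whether that is what matters for type inference.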