
Consider the following situation in Scalding. Let's say I have the following tuples in a Scalding TypedPipe[(Int, Int)]:

(1, 2)
(1, 3)
(2, 1)
(2, 2)

On this pipe I can call groupBy(t => t._1) to generate a Grouped[Int, (Int, Int)], which still represents the same data, but grouped by the first element of the tuple.

Now, let's say I sum the resulting object, so the total flow looks like this:

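import com.twitter.scalding.typed.{Grouped, TypedPipe}

// Group by the first tuple element, then sum the tuples within each group.
def sumGroup(a: TypedPipe[(Int, Int)]): Grouped[Int, (Int, Int)] =
  a.groupBy(t => t._1).sum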

Running this on the initial example would produce the following tuples:

(1, (2, 5))
(2, (4, 3))
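
To double-check the arithmetic, the same computation can be reproduced with plain Scala collections. This is only a sketch: a List stands in for the TypedPipe and a Map for the grouped result, and the object name SumGroupSketch is made up for illustration, so it says nothing about how Scalding executes the job.

```scala
// Plain Scala stand-in for the Scalding flow (assumption: List instead of TypedPipe).
object SumGroupSketch {
  val input: List[(Int, Int)] = List((1, 2), (1, 3), (2, 1), (2, 2))

  // Group by the first element, then add the tuples of each group component-wise,
  // which mirrors what summing (Int, Int) values does.
  def sumGroup(a: List[(Int, Int)]): Map[Int, (Int, Int)] =
    a.groupBy(_._1).map { case (k, vs) =>
      (k, vs.reduce((x, y) => (x._1 + y._1, x._2 + y._2)))
    }

  def main(args: Array[String]): Unit =
    // Sort by key only to make the printed order deterministic.
    sumGroup(input).toList.sortBy(_._1).foreach(println)
}
```

Each key ends up with exactly one summed tuple, which is the property the return type does not express.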

Now we know for sure that there is only one item per key (for the key 1, we only have one resulting tuple), because this is how sum behaves. However, the type returned by sum is still Grouped[Int, (Int, Int)], which doesn't convey the fact that there can only be one item per key.

Is there a specific type, similar to Grouped[K, V], that would convey that there is only one V value for a given K value? If not, why is that?

It seems this could be useful for optimizing joins when we can be sure that both sides have exactly one value per key.
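
To make the idea concrete, here is a rough sketch of the kind of type I have in mind, written with plain Scala collections. OnePerKey, reduceByKey and join are hypothetical names I made up for this illustration; none of them are Scalding APIs, and an in-memory Map is only a stand-in for a distributed pipe.

```scala
// Hypothetical wrapper, NOT a Scalding type: backed by a Map,
// so each key has exactly one value by construction.
final case class OnePerKey[K, V](underlying: Map[K, V]) {
  // Joining two one-per-key collections yields at most one pair per key,
  // so no per-key cross product of values is needed.
  def join[W](that: OnePerKey[K, W]): OnePerKey[K, (V, W)] =
    OnePerKey(underlying.collect {
      case (k, v) if that.underlying.contains(k) => (k, (v, that.underlying(k)))
    })
}

object OnePerKey {
  // Collapse all values of a key with `combine`; the result type records
  // that only one value per key remains.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(combine: (V, V) => V): OnePerKey[K, V] =
    OnePerKey(pairs.groupBy(_._1).map { case (k, kvs) =>
      (k, kvs.map(_._2).reduce(combine))
    })
}
```

With something like this, OnePerKey.reduceByKey(pairs)(_ + _) would return a value whose type already says "one value per key", and join could rely on that guarantee instead of handling multiple values on each side.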

lezebulon