I have the following data and want to map each entry to the first item of its value (the list). So for:

11 -> List((1,11))
1 -> List((1,1), (1,111))

I want:

(1,11)
(1,1)

When this data is in an RDD I can do the following:

scala> val m = sc.parallelize(Seq(11 -> List((1,11)), 1 -> List((1,1),(1,111))))
m: org.apache.spark.rdd.RDD[(Int, List[(Int, Int)])] = ParallelCollectionRDD[198] at parallelize at <console>:47

scala> m.map(_._2.head).collect.foreach(println)
(1,11)
(1,1)

However, when it is in a Map object (result of a groupBy) I get the following:

scala> val m = Map(11 -> List((1,11)), 1 -> List((1,1),(1,111)))
m: scala.collection.immutable.Map[Int,List[(Int, Int)]] = Map(11 -> List((1,11)), 1 -> List((1,1), (1,111)))

scala> m.map(_._2.head)
res1: scala.collection.immutable.Map[Int,Int] = Map(1 -> 1)

When I map to the whole list I get what I would expect, but not when I call head on it:

scala> m.map(_._2)
res2: scala.collection.immutable.Iterable[List[(Int, Int)]] = List(List((1,11)), List((1,1), (1,111)))

I can also get the result I want if I do either of the following:

scala> m.map(_._2).map(_.head)
res4: scala.collection.immutable.Iterable[(Int, Int)] = List((1,11), (1,1))

scala> m.values.map(_.head)
res5: Iterable[(Int, Int)] = List((1,11), (1,1))

Could someone please explain what is going on here?

Mikel San Vicente
Breandán

2 Answers

5

This is a bit tricky and depends on the implicit CanBuildFrom parameter of map. Depending on the return type of your function f, a different CanBuildFrom is resolved implicitly, and that instance decides which collection is built.

m.map(_._2.head) // The passed function returns a pair (Int, Int)

There is an implicit CanBuildFrom that builds a Map[A, B] from elements of type (A, B). That object is passed to your map implicitly, which is why in this case the returned object is a Map[Int, Int].

In the other case you have

m.map(_._2) // The passed function returns a List[(Int, Int)]

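A List[(Int, Int)] is not a pair, so the Map-specific CanBuildFrom does not apply and the generic one for Iterable is used instead. The elements are of type List[(Int, Int)], so the result is an Iterable[List[(Int, Int)]].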
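The two cases can be sketched with the result types written out, so the compiler checks them. This is a minimal illustration, not part of the original answer: the object and val names are made up, and it assumes Scala 2.12 (where CanBuildFrom drives the choice). The same annotations should also compile on 2.13, where Map.map is overloaded instead.

```scala
object MapResultTypes {
  // Same data as in the question.
  val m: Map[Int, List[(Int, Int)]] =
    Map(11 -> List((1, 11)), 1 -> List((1, 1), (1, 111)))

  // The function returns a pair, so the result is rebuilt as a Map.
  val asMap: Map[Int, Int] = m.map(_._2.head)

  // The function returns a List (not a pair), so the result is a plain Iterable.
  val asIterable: Iterable[List[(Int, Int)]] = m.map(_._2)

  def main(args: Array[String]): Unit = {
    println(asMap)
    println(asIterable)
  }
}
```

Annotating the expected type like this is a cheap way to see which collection map builds without relying on the REPL output.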

Mikel San Vicente
3

The map operation on a scala.collection.immutable.Map behaves differently depending on the return type of the function passed to it.

When the return type is Tuple2[T, P]:

The map operation produces another Map, with the first element of each tuple (_1) as the key and the second element (_2) as the value.

For example:

scala> m.map(_ => 10 -> 1)
res14: scala.collection.immutable.Map[Int,Int] = Map(10 -> 1) // note the return type is Map.

When the return type is anything other than Tuple2:

The map operation produces an Iterable (a List at runtime).

scala> m.map(_ => 10 )
res15: scala.collection.immutable.Iterable[Int] = List(10, 10) // note that the result is now an Iterable backed by a List.

With that established, for the map Map(11 -> List((1,11)), 1 -> List((1,1), (1,111))) the operation m.map(_._2.head) produces the Tuple2 values (1,11) and (1,1). Since the first element (_1) of each tuple is 1, both end up under the key 1, so (1,1) overwrites (1,11) and we are left with the single entry 1 -> 1.

In the other cases the function passed to map does not return a Tuple2, so the result is a List rather than a Map, hence the difference in results.
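The overwrite can be reproduced in isolation. The sketch below is illustrative only (the object and val names are not from the question): it first converts the Map to a List so that both pairs survive, then turns those pairs back into a Map to show the same collapse on the shared key.

```scala
object KeyCollision {
  // Same data as in the question.
  val m: Map[Int, List[(Int, Int)]] =
    Map(11 -> List((1, 11)), 1 -> List((1, 1), (1, 111)))

  // Mapping over a List builds a List, so pairs with equal first elements are all kept.
  val heads: List[(Int, Int)] = m.toList.map(_._2.head)

  // Building a Map from the same pairs keeps only the last pair for each key.
  val collapsed: Map[Int, Int] = heads.toMap

  def main(args: Array[String]): Unit = {
    println(heads)
    println(collapsed)
  }
}
```

Calling toList (or values, as in the question) before map is one way to avoid building a Map when the function happens to return pairs.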

rogue-one